

Universitì degli Studi di Padova DIPARTIMENTO DI INGEGNERIA DELL'INFORMAZIONE

Dottorato di Ricerca in Ingegneria

Elettronica e delle Telecomunicazioni

XXI Ciclo

# Design of low-power analog circuits for analog decoding and wireless sensors nodes 

Dottorando:
Silvia Solda'

SUPERVISORE:
Prof. Andrea Neviani

Coordinatore:
Ch.mo Prof. Matteo Bertocco

Padova, 31 maggio 2009

Posta elettronica : solda@dei.unipd.it

Numero telefonico : +39 0498277664

Indirizzo :
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Padova
via Gradenigo 6/B
35131 Padova
Italia

## Sommario

La prima parte di questo lavoro di tesi e' dedicata alla decodifica analogica e presenta la progettazione di un'interfaccia di I/O per un decodificatore iterattivo completamente analogico per un codice convoluzionale concatenato in serie e di un decoder analogico per Trellis Coded Modulation (TCM) per la correzione degli errori in memorie Flash multilivello. Il decodificatore iterattivo rappresenta un grosso passo avanti nell'evoluzione dei decodificatori analogici in quanto e' possibile riconfigurarne sia la lunghezza di blocco che il rate del codice. Per di piu', con un'efficienza di $2.1 \mathrm{~nJ} / \mathrm{bit}$, migliora fino a 50 volte le prestazioni in termini di efficienza dei decodificatori digitali con la stessa lunghezza di blocco. Le potenziali prestazioni e le limitazioni dell'approccio analogico per un decodificatore per TCM sono state investigate considerando due diversi decodificatori, uno a 4 stati ed uno ad 8 stati, entrambi sviluppati in un processo CMOS standard con una lunghezza di canale di $0.18 \mu \mathrm{~m}$.

Nella seconda parte della tesi viene presentato il design di un transciver per una radio ad impulsi a banda larga (UWB-IR), con particolare enfasi sulla progettazione del trasmettitore. Il trasmettitore utilizza una nuova combinazione di mixer e amplificatore di potenza per generare un impulso gaussiano con una larghezza di banda di 1.25 GHz ed una frequenza centrale di 7.875 GHz . Il nuovo circuito, inoltre, include un trasformatore monolitico in modo tale da generare una tensione di uscita di $3.2 V_{p p}$, necessaria per garantire la distanza di connessione richiesta di almeno 10 metri. Il trasformatore e' stato progettato in modo da massimizzare l'efficienza in termini di potenza e, allo stesso tempo, realizzare un filtro ladder del quarto ordine al fine di ridurre le emissioni fuori banda del trasmettitore stesso. Confrontando l'efficienza di questo design con trasmettori per UWB-IR allo stato dell'arte si e' visto come la soluzione da noi proposta porti ad un miglioramento dell'efficienza del trasmettitore di un fattore pari a 10.

## Abstract

The first part of this work concerns analog decoding. It presents the design of the I/O interface for a fully analog iterative decoder for a serially concatenated convolutional code and of a fully analog Trellis Coded Modulation (TCM) decoder for error correction in multi-level (ML) flash memories. The iterative decoder represents a significant step ahead in the evolution of analog decoders due to its reconfigurability in both block length and code rate. Moreover, with an efficiency of $2.1 \mathrm{~nJ} / \mathrm{bit}$, it outperforms digital decoders with the same block length of a factor up to 50 . The potential performance and limitations of the analog approach for a TCM decoder have been investigated considering a 4 -state and an 8 -state decoder, both developed in a $0.18 \mu \mathrm{~m}$ standard CMOS process.

In the second part of the thesis, the design of a low-power transceiver chipset for ultra wideband impulse radio (UWB-IR) is presented, with particular emphasis on the transmitter design. In particular, the transmitter uses a novel combined mixer and power amplifier to generate a Gaussian pulse with 1.25 GHz bandwith and center frequency of 7.875 GHz . The combined MRX-PA includes a monolithic transformer to reach a maximum output voltage swing of $3.2 V_{p p}$, necessary to ensure the required link distance of 10 meters. The transformer has been designed in order to maximize the power efficiency and at the same time to realize a fourth-order ladder filter, so as to reduce the transmitter outof band emissions. The efficiency of this design has been compared with state-of-the-art UWB-IR transmitters, showing how the proposed solution leads to an improvement in the transmitter efficiency of a factor of almost 10.

## Introduction

The density and speed of integrated circuit computing elements has increased roughly exponentially for a period of several decades, following a trend described by Moore's Law. While it is generally accepted that this exponential improvement trend will end, it is unclear exactly how dense and fast integrated circuits will get by the time this point is reached. Working devices have been demonstrated that were fabricated with a MOSFET transistor channel length of 6.3 nanometers using conventional semiconductor materials, and devices have been built that used carbon nanotubes as MOSFET gates, giving a channel length of approximately one nanometer.

The density and computing power of integrated circuits are limited primarily by power dissipation concerns. Even if several techniques have been developed to reduce it, however, if current trends continue, "Energy costs, now about $10 \%$ of the average IT budget, could rise to $50 \% \ldots$ by $2010 "$ [1].

In this thesis, we cope with the problem of reducing power consumption in two different key application fields: the design of decoders for both Turbo codes and multilevel Flash memories error correcting codes and the realization of low-power transceivers for ultra-wideband (UWB) impulse radio (IR).

Turbo codes, first proposed by Berroux and Glavieux in 1993 [2], have become extremely popular in the last few years till the point to be adopted as standard codes for a wide range of telecommunication applications, such as 3G/UMTS cellular phones [3] and satellite digital video broadcasting [4]. At the same time, Flash memory market growth has been explosive in the past decade, driven by cellular phones and other types of electronic portable equipment, such as palm top, portable PC, mp3 audio player, digital camera and so on. Thus, reducing the decoders power consumption for such applications is already crucial and will become more and more decisive in the next few years.

UWB-IR have become an active research area with proliferation of portable electronics, as it promises unprecedent data rates for short-range commercial radios, combined with precise locationing and high energy efficiency. These benefits stem from the use of wide bandwidths and impulse signaling, implying high channel capacity and precise time resolution [5]. A critical specification for energy efficient short-range radio is, of course, the energy per bit, that is the energy spent to transmit and receive a single information bit.

This thesis consists of two parts: in the first we address the decoding problem, while in the second the design of a chipset for UWB-IR is presented.

The pioneering work of Loeliger's [6] and Hagenauer's [7] groups led to the first successful implementations of analog iterative decoders [8-10] in BiCMOS technology, and demonstrated the potential advantages of this approach with respect to the digital one both in terms of decoding speed and power efficiency. However, the limitations of the analog implementation, due to the fully parallel decoding process, have soon become clear. The only way to make the analog decoder circuitry complexity independent on the codeword length is to reduce the fully parallelism of data processing by introducing the concept of sliding window decoding [11, 12]. Following this decoding strategy, an hybrid analog decoder for Turbo codes has been designed, which combines the advantages offered by the analog approach in terms of power consumption together with those typical of the digital implementations, such as a reduced area occupation and a greater versatility.

Using the sliding window decoding approach, the decoder interface circuitry becomes the bottleneck in terms of power consumption ad area occupation, as we will see in Chapter 1. In fact, in order to store the channel data and to exchange information during the iterative decoding process, two large power hungry analog memories are needed.

Thus, as a significant effort in the design of complex analog decoder is spent in the realization of the interface circuitry, analog decoders could be successfully used in all those applications where a memory is already implemented, such as within stand-alone Flash memory chips. Following this consideration, a novel TCM analog decoder for next generation multilevel Flash memories has been designed, with encouraging performance both in terms of area occupation and power consumption with respect to state-of-the-art digital decoders.

In Chapter 1, the architecture and main features of an iterative fully analog decoder for a serially concatenated convolutional code, which implements the sliding window concept are presented. In particular, the specifications for the decoder input interface circuitry, which consists of an analog memory, a voltage to probability converter and a digital to analog converter, are drawn and its design and optimization are then analyzed in details.

Chapter 2, after a brief introduction to multilevel Flash memories, analyzes the advantages and drawbacks of commonly used error correction codes in order to derive a novel ECC scheme based on Trellis Coded Modulation.

In Chapter 3, the complete design of an analog decoder for the TCM ECC proposed
in Chapter 2 is presented. In particular, the potential performance and limitations of the analog approach are investigated considering a 4 -state and an 8 -state analog TCM decoder, both designed for an effective storage density of 3 information bits/cell. Transistorlevel simulations of the overall decoders, including the circuit interface between the Flash memory cells and the decoder core, show how the proposed approach can achieve a decoding speed comparable with the state-of-the-art linear block codes occupying a small area, with almost no loss in terms of BER with respect to the ideal decoding algorithm.

In order not to lengthen the thesis too much, some basic concepts of coding theory are introduced in the Appendix A. The idea is to give the reader only the background information necessary to understand the projects described in this work. Thus, particular emphases is placed on the analog decoding approach for Trellis Coded Modulation and Turbo codes. Interested readers can refer to the bibliography for a more detailed discussion on coding theory.

The second part of the thesis is devoted to the design of a low-power transceiver chipset for UWB-IR.

Chapter 4 briefly introduces UWB signalling systems together with some legal aspects due to the Federal Communications Commission restriction on the transmitted power spectral density. Then the system analysis of our chipset for UWB-IR sensor networks is carried out in order to draw the specifications for both receiver and transmitter. For this purpose, a behavioral model of the receiver implemented in UMC $0.13-\mu \mathrm{m}$ RFCMOS process has been developed, which also allowed to successfully test a synchronization algorithm.

The design of the transmitter is then presented in details in Chapter 5, where the comparison with state-of-the-art UWB-IR transmitters shows how the proposed solution leads to an improvement in the transmitter efficiency of a factor of almost 10.

## Contents

Introduction ..... vii
1 Input Interface for Analog DecodersInterfaccia di ingresso per decodificatori analogici1
1.1 Hybrid Turbo Decoder
Il decodificatore Turbo ibrido ..... 1
1.1.1 Serially Concatenated Code Structure
Struttura del codice concatenato in serie ..... 1
1.1.2 Analog Decoding Procedure
Schema di decodifica analogico ..... 4
1.1.3 Hybrid Decoding Procedure
Schema di decodifica ibrido ..... 4
1.1.4 Analog Decoder Architecture
Architettura del decodificatore analogico ..... 6
1.1.5 CMOS Technology
La tecnologia CMOS ..... 7
1.2 Input Interface Circuitry
Circuiteria di interfaccia di ingresso ..... 8
1.2.1 Specifications
Specifiche ..... 8
1.2.2 Voltage-to-Probability Converter
Convertitore tensione-probabilita' ..... 10
1.2.3 Input Memory
Memoria di ingresso ..... 15
1.2.4 DAC
DAC ..... 29
1.3 Simulations Results
Risultati delle simulazioni ..... 32
1.3.1 Read Phase Simulations
Simulazioni della fase di lettura ..... 32
1.3.2 Write Phase Simulations Simulazioni della fase di scrittura ..... 37
1.3.3 Power Consumption Consumi ..... 40
1.4 Conclusions
Conclusioni ..... 40
2 Analog Decoding for Data Storage Applications Decodifica analogica per applicazioni data storage ..... 43
2.1 Flash Memories
Memorie Flash ..... 43
2.1.1 Floating Gate Transistor Celle Flash ..... 44
2.1.2 NOR and NAND Flash
Memorie Flash NAND e NOR ..... 44
2.1.3 Multilevel Flash Cell
Memorie Flash multilivello ..... 47
2.1.4 Reliability Issues
Affidabilita' ..... 48
2.2 Error Correcting Codes for Multilevel Flash Memories Codici a correzione d'errore per memorie Flash multilivello ..... 50
2.3 q-ary Hamming Codes
Codici di Hamming $q$-ari ..... 51
2.3.1 Analog Decoding
Decodifica analogica ..... 52
2.4 Convolutional Punctured Codes
Codici convoluzionali punturati ..... 55
2.4.1 Analog Decoder Implementation Implementazione del decoder analogico ..... 56
2.4.2 Performance Analysis
Analisi delle prestazioni ..... 59
2.5 Trellis Coded Modulation for Multilevel Flash Memories
Trellis Coded Modulation per memorie Flash multilivello ..... 60
2.5.1 Multilevel Flash Memories as Signal Constellations Memorie Flash multilivello come costellazioni di segnali ..... 62
2.5.2 TCM for Multilevel Flash Memories
TCM per memorie Flash multilivello ..... 63
2.5.3 Analog Decoding Algorithm
Algoritmo di decodifica analogico ..... 66
2.5.4 Performance Analysis
Analisi delle prestazioni ..... 73
3 CMOS Analog TCM Decoders: Design and Performance Analysis Decodificatori analogici per TCM in tecnologia CMOS: progettazione e prestazioni ..... 77
3.1 Cells Design
Progettazione delle celle ..... 77
3.1.1 Preliminary Considerations
Considerazioni preliminari ..... 77
3.1.2 Weak Inversion Devices
Dispositivi in debole inversione ..... 80
3.1.3 Bias Current
Corrente di polarizzazione ..... 80
3.1.4 Bias Transistor
Transistor di polarizzazione ..... 83
3.1.5 pMOS Current Mirrors
Specchi pMOS ..... 84
3.1.6 Bias Voltages
Tensioni di polarizzazione ..... 84
3.1.7 Single Cell Characterization
Caratterizzazione della singola cella ..... 85
3.2 Voltage to Probability CellConvertitore tensione-probabilita'86
3.2.1 Approximated Normalization
Normalizzazione approssimata ..... 88
3.2.2 Voltage to Probability Module Design
Progetto del modulo tensione-probabilita, ..... 90
3.3 Decoder Optimization: the Reset
Ottimizzazione del decoder: il reset ..... 95
3.4 Overall Decoder Performance
Prestazioni del decoder analogico ..... 97
3.5 Conclusions
Conclusioni ..... 99
4 UWB-IR Transceiver Chipset for Sensor Network Applications UWB-IR transceiver chipset per reti di sensori wireless ..... 101
4.1 UWB Definition
Definizione di UWB ..... 101
4.2 UWB Sensor Networks
Reti di sensori UWB ..... 103
4.3 UWB Signal Choice
Scelta del segnale UWB ..... 104
4.3.1 Bandwidth
Larghezza di banda ..... 104
4.3.2 Modulation
Modulazione ..... 105
4.3.3 Pulse Shape Analysis
Analisi della forma dell'impulso ..... 107
4.3.4 Channel Model
Modello del canale ..... 109
4.4 CMOS Technology
La tecnologia CMOS ..... 109
4.5 Receiver
Il ricevitore ..... 110
4.5.1 Behavioral Model
Modello comportamentale ..... 112
4.5.2 BER and Sensitivity BER e Sensitivity ..... 113
4.6 Synchronization Algorithm
Algoritmo di sincronizzazione ..... 114
5 Transmitter
Trasmettitore ..... 121
5.1 Specifications
Specifiche ..... 121
5.2 Architecture
Architettura ..... 127
5.3 Gaussian Pulse Generator
Generatore dell'impulso Gaussiano ..... 128
5.3.1 Transformer Design
Progetto del trasformatore ..... 130
5.3.2 Simulation Results
Risultati delle simulazioni ..... 142
5.4 Conclusions and Future Work
Conclusioni e lavoro futuro ..... 144
Conclusions ..... 147
A Fundamentals of Error Correcting Coding Introduzione alla codifica ..... 1
A. 1 The Shannon Limit
Il limite di Shannon ..... 1
A. 2 Types of Codes
Tipologie di codici ..... 2
A.2.1 Block Codes
Codici a blocco ..... 2
A.2.2 Hamming Codes
Codici di Hamming ..... 3
A.2.3 Convolutional Codes
Codici convoluzionali ..... 4
A.2.4 Convolutional Codes Trellis Diagram
Traliccio di codici convoluzionali ..... 6
A.2.5 Trellis Coded Modulation
Trellis Coded Modulation ..... 7
A.2.6 Turbo CodesTurbo codici11
A. 3 Decoding
Decodifica ..... 12
A.3.1 Viterbi AlgorithmAlgoritmo di Viterbi12
A.3.2 MAP Decision Rule
Rivelazione MAP ..... 13
A.3.3 Sum-Product Algorithm
Algoritmo Sum-Product ..... 14
A.3.4 Iterative Decoding
Decodifica iterattiva ..... 17
A.3.5 Soft-Input Soft-Output Algorithm
Algoritmo Soft-Input Soft-Output ..... 20
A. 4 Analog Decoding
Decodifica analogica ..... 23
A.4.1 Sum-Product Module
Moduli Sum-Product ..... 24
A.4.2 Soft-gates
Soft-gates ..... 25
Bibliography ..... 29

## List of Tables

1.1 Matching parameters ..... 15
1.2 Voltage to probability transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 15
1.3 Switches size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 25
1.4 OTA transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 25
1.5 Buffer transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 28
$1.6 \quad V_{\text {in }}=162 \mathrm{mV}$ and $V_{\text {in pre }}=-162 \mathrm{mV}$ ..... 34
$1.7 \quad V_{\text {in }}=-162 \mathrm{mV}$ and $V_{\text {in pre }}=162 \mathrm{mV}$ ..... 34
$1.8 \quad V_{\text {in }}=0 \mathrm{mV}$ and $V_{\text {in }}^{\text {pre }}$ $=162 \mathrm{mV}$ ..... 35
2.1 $\quad G F(4) \leftrightarrow$ binary notation ..... 53
2.2 Sum and multiplication over $G F(4)$ ..... 53
3.1 4-state decoder transistors count ..... 79
$3.2 \quad 8$-state decoder transistors count ..... 79
3.3 Transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 85
3.4 Error statistic for the 4-state decoder cells ..... 85
3.5 Error statistic for the 8 -state decoder cells ..... 86
3.6 Error statistic with MonteCarlo simulations for the 4 -state decoder cells ..... 86
3.7 Transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 92
3.8 Output currents error statistic for V2P ..... 95
4.1 FCC Mask Limits ..... 103
4.2 Comparison of different pulse shapes ..... 109
4.3 System level parameters ..... 110
4.4 LNA + VGA parameters ..... 112
4.5 Energy detector parameters ..... 112
4.6 Parameters of the receiver behavioral model ..... 114
4.7 Parameters of the receiver behavioral model ..... 115
5.1 Main transmitter parameters ..... 126
5.2 MXR-PA transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 130
5.3 Passive elements values and buffer transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$ ..... 130
5.4 Transformer parameters ..... 138
5.5 Primary inductor lumped model parameters ..... 139
5.6 Secondary inductor lumped model parameters ..... 140
5.7 Transformer lumped model parameters ..... 142
5.8 Performance summary of UWB-IR transmitters ..... 144

## List of Figures

1.1 Block diagram of the serially constituent code scheme ..... 2
1.2 Constituent encoder scheme of the SCCC Turbo code ..... 3
1.3 SCCC Turbo code decoding scheme ..... 4
1.4 Analog decoder block diagram ..... 6
1.5 SISO simplified diagram with switch between inner and outer configuration ..... 7
1.6 Input interface block diagram ..... 9
1.7 Voltage to probability converter for a binary alphabet ..... 11
1.8 Voltage to probability schematic ..... 12
1.9 Input memory block diagram ..... 17
1.10 Memory cell structure ..... 18
1.11 Read and write phase timing diagram ..... 19
1.12 OTA schematic ..... 21
1.13 Countour lines for $\omega_{C L}\left(\beta, g_{m}\right)$ ..... 22
$1.14 \beta$ as a function of $C_{S}$ ..... 23
1.15 Countour lines for $\omega_{C L}\left(C_{S}, g_{m}\right)$ ..... 24
1.16 OTA frequency responce ..... 26
1.17 Input buffer schematic ..... 27
1.18 Input memory buffer frequency responce ..... 28
1.19 Simplified DAC schematic ..... 29
1.20 DAC transient output waveforms ..... 30
1.21 Pre-layout simulations error on 600 random codewords ..... 31
1.22 Post-layout simulations error on 600 random codewords ..... 31
1.23 Post-layout simulations error on 600 random codewords ..... 32
1.24 Memory transient output waveforms ..... 36
1.25 MonteCarlo simulation results with $V_{i n}=-162 \mathrm{mV}$ and $V_{i n_{p r e}}=-162 \mathrm{mV}$ ..... 37
1.26 MonteCarlo write phase transient simulation results ..... 38
1.27 MonteCarlo DAC output waveforms ..... 38
1.28 Pre-layout simulations error on 600 random codewords ..... 39
1.29 Post-layout simulations error on 600 random codewords ..... 39
1.30 Hybrid analog decoder layout ..... 41
2.1 Flash cell cross section ..... 44
2.2 I-V curves of a floating-gate MOS without (curve A) and with (curve B) electrons stored in the floating gate ..... 45
2.3 NOR flash array ..... 45
2.4 NAND flash array ..... 46
2.5 Conceptual representation of bilevel and multilevel threshold voltage dis- tributions ..... 48
2.6 4-ary soft-XOR circuit implementation ..... 54
2.7 4-ary equal gate circuit implementation ..... 55
2.8 BER curves for Hamming $(36,32)$ code over $G F(4)$ ..... 56
2.9 Convolutional code trellis section ..... 57
2.10 Merge of two consecutive trellis sections according to the $\mathrm{B}_{\mathrm{a}}$ scheme ..... 58
2.11 Convolutional code SISO block diagram ..... 59
2.12 Trellis sections according to the merging scheme $|\times 1 \times 1|$ ..... 60
2.13 BER of rate $15 / 16$ punctured (diamonds) convolutional code, soft-decoded (stars) and hard-decoded (crosses) Hamming $(36,32)$ over GF (4) ..... 61
2.14 BER of rate $15 / 16$ punctured (diamonds) and non-punctured (circles) con- volutional code ..... 61
2.15 Analogy between memory cell threshold voltage distributions and signal constellations ..... 62
2.16 TCM encoder block diagram ..... 64
2.17 Cell mapping ..... 65
2.18 4-state TCM trellis state diagram ..... 66
2.19 8-state TCM trellis state diagram ..... 67
2.20 Block diagram of the analog decoder core ..... 68
2.21 B module ..... 70
2.22 C module ..... 71
2.23 D module ..... 72
2.24 E module ..... 72
2.25 F1 and F2 module ..... 73
2.26 BER vs SNR with analog TCM and soft Viterbi decoder ..... 74
3.1 B cell schematic for the 8 -state analog decoder ..... 78
3.2 Current-voltage characteristic for a $L=0.6 \mu \mathrm{~m} W=0.8 \mu \mathrm{~m} \mathrm{nMOS}$ transistor ..... 81
3.3 BER as a function of the decoding time ..... 83
3.4 Analog circuit for conditional probabilities generation ..... 87
3.5 Conceptual schematic for approximated conditional probabilities generation ..... 88
3.6 Voltage to probability schematic ..... 91
3.7 Conditional probabilities calculated with the approximated normalization method and ideal at $\mathrm{SNR}=24 \mathrm{~dB}$ and $\mathrm{SNR}=30 \mathrm{~dB}$ ..... 93
3.8 Conditional probabilities error at $\mathrm{SNR}=24 \mathrm{~dB}$ and $\mathrm{SNR}=30 \mathrm{~dB}$ ..... 94
3.9 Conditional probability error ..... 94
3.10 Output probabilities transient with and without reset ..... 96
3.11 BER vs SNR for a memory cell with $q=8$ without ECC (circle), a mem- ory cell with $q=16$ and 4 -state TCM C++ model (diamonds) and with TCM analog decoder (squares) ..... 97
3.12 BER vs SNR for a memory cell with $q=8$ without ECC (circle), a mem- ory cell with $q=16$ and 8 -state TCM C++ model (diamonds) and with TCM analog decoder (squares) ..... 98
3.13 4-state decoder MonteCarlo simulations at $\mathrm{SNR}=27 \mathrm{~dB}$ ..... 99
4.1 FCC emissions limit for indoor (dashed) and outdoor (solid) UWB com- munication ..... 102
4.2 Pulse position modulation ..... 105
4.3 Pulse amplitude modulation ..... 105
4.4 Binary frequency-shift keying modulation ..... 106
4.5 Binary phase-shift keying modulation ..... 106
4.6 Receiver block diagram ..... 111
4.7 Matlab receiver equivalent model ..... 113
4.8 Simulated BER as a function of the input signal power at the antenna, using the model parameters given in Table 4.6 ..... 116
4.9 Data packet structure ..... 116
4.10 Block diagram of synchronization correlator banks ..... 117
4.11 PPM signaling scheme with: (a) perfect synchronization; (b) half $T_{\text {int }}$ misalignment; (c) modified PPM scheme with half $T_{\text {int }}$ misalignment ..... 117
4.12 Simulated BER as a function of the input signal power at the antenna with $T_{i n t}=5 \mathrm{~ns}($ star $), T_{i n t}=10 \mathrm{~ns}$ (circle), $T_{i n t}=15 \mathrm{~ns}$ (square) and $T_{i n t}=30 \mathrm{~ns}$ (diamond) ..... 118
4.13 Probability of detection ..... 119
4.14 Simulated BER as a function of the input signal power at the antenna with perfect synchronization (square), original PPM synchronization (circle) and modified PPM synchronization (diamond) ..... 119
5.1 Maximum single pulse amplitude allowed by the FCC average power spectral mask as a function of the PRF ..... 122
5.2 Peak power of a maximum amplitude single pulse as a function of the PRF 123 ..... 123
5.3 Peak power as a function of the single pulse voltage amplitude for a $100 \mathrm{~kb} / \mathrm{s}$ data rate transmission ..... 123
5.4 Maximum output voltage as a function of the pulses number per bit ..... 124
5.5 Link distance as a function of the pulses number per bit with the pulses amplitude given by Fig.5.4 ..... 125
5.6 Energy/pulse and receiver (diamond), transmitter (circle) and overall (square) energy/bit as a function of the number of pulses per bit ..... 126
5.7 Transmitter block diagram ..... 127
5.8 Gaussian pulse generator ..... 128
5.9 Simplified buffer schematic ..... 129
5.10 Transformer equivalent models ..... 130
5.11 $L_{p}$ optimum values as a function of the coupling coefficient $k$ ..... 132
5.12 $L_{S}$ optimum values as a function of the coupling coefficient $k$ ..... 133
$5.13 C_{s}$ optimum values as a function of the coupling coefficient $k$ ..... 133
5.14 Transformer efficiency as a function of the coupling coefficient $k$ ..... 134
5.15 Forth-order bandpass ladder filter ..... 134
5.16 $L_{p}$ as a function of the filter $Q$ factor for different filter types ..... 135
$5.17 C_{p}$ as a function of the filter $Q$ factor for different filter types ..... 136
$5.18 C_{s}$ as a function of the filter $Q$ factor for different filter types ..... 136
$5.19 L_{S}$ values with $Q=2$ as a function of the coupling coefficient $k$ for differ- ent filter types ..... 137
5.20 Transformer layout ..... 138
5.21 Transformer primary coil lumped model ..... 139
5.22 Primary equivalent inductance, resistance and quality factor from the EM- simulation (solid) an the lumped model (dashed) ..... 140
5.23 Transformer secondary coil lumped model ..... 141
5.24 Secondary equivalent inductance, resistance and quality factor from the EM-simulation (solid) an the lumped model (dashed) ..... 141
5.25 Transformer equivalent inductance, resistance and quality factor from the EM-simulation (solid) an the lumped model (dashed) ..... 142
5.26 Transient output waveform ..... 143
5.27 Output spectrum ..... 143
A. 1 SISO simplified diagram with switch between inner and outer configuration ..... 1
A. 2 4-state rate $1 / 2$ binary convolutional encoder ..... 5
A. 3 Systematic version of code of Fig.A. 2 ..... 5
A. 4 4-state transition diagram of code of Fig.A. 2 ..... 6
A. 5 4-state trellis diagram of code of Fig.A. 2 ..... 6
A. 6 4-state terminated trellis diagram of code of Fig.A. 2 ..... 7
A. 7 Trellis section of code of Fig.A. 2 ..... 7
A. 8 Signal constellations ..... 8
A. 9 Trellis section of code of Fig.A. 2 ..... 9
A. 10 TCM encoder ..... 10
A. 11 Parallel concatenated convolutional code block diagram ..... 11
A. 12 Serial concatenated convolutional code block diagram ..... 12
A. $13(7,4,2)$ code tanner graph ..... 14
A. 14 Example of weights propagation through a function node ..... 15
A. 15 Example of weights propagation through a variable node ..... 16
A. 16 Iterative decoding scheme ..... 17
A. 17 Soft-Input Soft-Output module ..... 20
A. 18 An edge of the trellis section ..... 20
A. 19 Analog decoders building block ..... 24
A. 20 Basic circuit for Sum-Product module implementation ..... 25
A. 21 Soft-XOR trellis and circuit implementation ..... 26
A. 22 Transistor network implementation of a general Sum-Product module ..... 27

## 1

## Input Interface for Analog Decoders

Although analog decoders demonstrated large improvements in terms of power consumption, chip area and throughput with respect to their digital counterparts [8,9,13, 14], they are still far from being used in realistic applications due mainly to the fact that their circuitry complexity increases linearly with the codeword length. The only possible solution to this limitation is to reduce the fully parallelism of data processing by introducing the concept of sliding window decoding [11,12].

In this chapter, the architecture and main features of an iterative fully analog decoder for a serially concatenated convolutional code, which implements the sliding window concept, are presented. In particular, the specifications for the decoder input interface circuitry, which consists of an analog memory, a voltage to probability converter and a digital to analog converter, are drawn and then its design and optimization are analyzed in details.

### 1.1 Hybrid Turbo Decoder

The input interface circuitry whose design will be described in Sec.1.2.3, is part of an hybrid Turbo decoder which combines the advantages of the analog approach, by implementing a fully analog decoding core, together with some features typical of the digital implementations, such as the iterative use of the same computational hardware unit and an increased flexibility both in terms of code rate and frame length.

### 1.1.1 Serially Concatenated Code Structure

The hybrid Turbo decoder is designed to decode in the analog domain a novel serially concatenated convolutional code recently proposed in literature for satellite applications


Figure 1.1: Block diagram of the serially constituent code scheme
[15] due to its high performance, simplicity and versatility.
The code block diagram is depicted in Fig.1.1. It consists of the concatenation of two identical rate-1/2 4-state systematic recursive convolutional codes with generator polynomials:

$$
\begin{equation*}
G(D)=\left[D^{2}+1, D^{2}+D+1\right] \tag{1.1}
\end{equation*}
$$

The two encoders scheme, together with a code trellis section, are shown in Fig.1.2.
A way to increase any code rate is offered by "puncturization". A punctured code is a higher rate code formed by discarding or puncturing specific codeword symbols of the output of a low-rate code. Given the original low-rate code, the resulting punctured code depends on the pattern of codeword symbols being discarded. This pattern is called the perforation pattern of the punctured code and it can be conveniently defined by a perforation matrix $P$ with elements

$$
p_{i, j}= \begin{cases}0 & \text { if symbol } i \text { of the branch } j \text { is punctured } \\ 1 & \text { if symbol } i \text { of the branch } j \text { is retained }\end{cases}
$$

In our case, the outer code is punctured to rate $2 / 3$ through the puncturing matrix

$$
P_{o}=\left(\begin{array}{llll}
1 & 1 & 1 & 1  \tag{1.2}\\
1 & 1 & 0 & 0
\end{array}\right)
$$

while the inner code systematic bits and parity bits are punctured separately through puncturers $P_{s}$ and $P_{p}$, as described in detail in [15]. By changing the perforation patter of the two puncturers $P_{s}$ and $P_{p}$ in a rate-compatible fashion, the overall code rate can be varied between $1 / 3$ and (virtually) 1 .

Let's define $K, N$ and $L$ the information block size, the interleaver size and the codeword size respectively. The binary information data, collected in a vector $\mathbf{u}$ of length


Figure 1.2: Constituent encoder scheme of the SCCC Turbo code
$K$, is first encoded by the outer encoder, then punctured by the fixed puncturer $P_{o}$ and interleaved before being forwarded to the inner encoder.

The output of the inner encoder is then punctured by the puncturers $P_{s}$ and $P_{p}$, so as to generate the two output streams $\mathbf{c}_{s}$ and $\mathbf{c}_{p}$. These vectors are finally multiplexed to form the $L$-bit codeword sequence $\mathbf{c}$.

The coded sequence $\mathbf{c}$ is sent to the channel as the signal sequence $\mathbf{x}$ obtained at the output of a binary pulse-amplitude modulation PAM or, equivalently, a BPSK modulator. The relationship between the modulated signal and the coded bits is:

$$
\begin{equation*}
x_{k}=\sqrt{E_{S}}\left(2 c_{k}-1\right) \tag{1.3}
\end{equation*}
$$

having denoted by $E_{S}$ the signal energy. The channel is assumed an additive white Gaussian noise (AWGN) channel with two-sided noise power spectral density $N_{0} / 2$.

Since the rate of the outer code punctured through $P_{o}$ is $2 / 3$, the interleaver length is given by $N=3 K / 2$. We assume the interleaver length to be a multiple of a base interleaver size $N^{\prime}$, which, as described in [15], is fixed to 300 bit. Thus, for larger interleaver lengths, 4 multiples of $N^{\prime}$ are considered, that is interleaver size of $300,600,1200$ and 2400 bits. The corresponding information block sizes can be deduced as:

$$
\begin{equation*}
K=\frac{2}{3} N-2=\frac{2}{3} M \cdot N^{\prime}-2 \tag{1.4}
\end{equation*}
$$

where $M=1,2,3,4$ and the two termination bits, added to reset the encoder memory to the original state, are also taken into account.

The codeword length $L$ can be calculated by recalling that two parity bits $c_{k, 1}$ and $c_{k, 2}$ are generated for each information bit $u_{k}$ which, together with the two termination bits, gives:

$$
\begin{equation*}
L=M \cdot\left(N^{\prime}+2\right) \tag{1.5}
\end{equation*}
$$

which leads to $L=604,1208,2416,4832$ for the 4 interleaver sizes considered.
The channel date-rate is fixed at $100 \mathrm{Mb} / \mathrm{s}$.


Figure 1.3: SCCC Turbo code decoding scheme

### 1.1.2 Analog Decoding Procedure

The fully parallel analog Turbo decoding procedure is illustrated in Fig.1.3.
The two SISO modules, each corresponding to one of the two encoders reported in Fig.1.2, perform the decoding by exchanging extrinsic soft-information. These softinformation can be regarded as an index of the confidence level reached by the a posteriori estimate of the original information bits $u_{k}$. It is worth to highlight that the two SISO units, the inner and the outer SISO, which correspond to the inner and outer encoder respectively, work on input data streams with different length and typology.

The inner SISO soft-inputs are:

- the channel output symbols $y_{k}^{s}$ transition probabilities $p_{c h}\left(y_{k}^{s}\right)=p\left(y_{k}^{s} \mid x_{k}^{s}\right)$ with $s=$ $u, c_{1}, c_{2}$;
- the additional information on $u_{k}$ and $c_{1}$ given by the outer SISO as extrinsic probability $\tilde{p}\left(y_{k}\right)_{\text {outer }}$.

The outer SISO receives from the inner one the bits $u_{k}$ and $c_{1}$ extrinsic information $\tilde{p}\left(y_{k}\right)_{\text {inner }}$ and produces two outputs:

- an a posteriori $\hat{u}_{k}$ estimate of the user bits $u_{k}$;
- new extrinsic information on the $u_{k}$ and $c_{1}$ symbols which constitute the input for the inner SISO.


### 1.1.3 Hybrid Decoding Procedure

A fully-parallel analog implementation of the decoding algorithm described in Sec.1.1.2 would require two SISO modules with length 2400 and 1600 respectively, so as to handle the maximum codeword input length of 4800 bits. This would lead to a prohibitively large chip area.

To reduce the decoder complexity, a non fully-parallel approach is adopted. The basic idea is to implement only one single SISO unit, which is used to decode both the outer and the inner code. In addition, the SISO unit implements only a window of the overall code trellis of $N^{\prime}$ trellis sections. Thus, for larger block lengths the module is reused several times to decode each of the constituent codes.

The choice of the window dimension is a tradeoff between circuit complexity and speed. Increasing $N^{\prime}$ increases the circuit complexity but, at the same time, allows higher decoding parallelism. As the time required by the SISO to converge to stable values is roughly independent of the block size, as proved by simulations, the decoding speed increases linearly with the size of the window. As a good tradeoff between circuit speed and complexity, we fixed $N^{\prime}=300$ trellis sections, which corresponds to the shortest codeword considered with length $L=604$ bits.

To explain the hybrid decoding process, an interleaver size of 2400 bits, corresponding to 8 subblocks of minimum length $N^{\prime}=300$, is assumed. The decoding procedure can be summarized in the following few steps:

1. the decoder loads the channel information relative to the intere frame onto an analog input memory;
2. during the first half iteration, the SISO unit works as inner SISO. For each subblock of $N^{\prime}=300$ bits the SISO is fed with the corresponding channel transition probabilities $p_{c h}\left(y_{k}^{s}\right)$ and the extrinsic information generated by the "outer SISO" during the previous half iteration on the inner code input bits $\tilde{p}\left(y_{k}\right)_{\text {outer }}$. At its output it generates the extrinsic information $\tilde{p}\left(y_{k}\right)_{\text {inner }}$ on the 300 input bits which are then stored following the natural order in the first row of an extrinsic memory. This process is repeated 8 times, one for each subblock, until all the 8 rows of the extrinsic memory are loaded with the corresponding extrinsic values.
3. during the second half iteration, the SISO unit works as outer SISO. Thus, for each subblock, it is fed with the extrinsic information $\tilde{p}\left(y_{k}\right)_{\text {inner }}$ on the outer code output bits, properly deinterleaved and punctured, generated during the previous half iteration by the "inner SISO" and with the extrinsic information on the outer code input bits stored in the extrinsic memory. At its output the SISO unit generates the a posteriori probabilities of the user bits on which decisions on the transmitted symbols are taken. The SISO also computes the extrinsic information $\tilde{p}\left(y_{k}\right)_{\text {outer }}$ on the outer code output bits which will be stored, properly interleaved and punctured,


Figure 1.4: Analog decoder block diagram
in the extrinsic analog memory. This process is repeated for each subblock until the extrinsic values for the entire block are stored in the memory.
4. Step 2 and 3 are repeated as many times as the maximum iterations number $N_{t}=10$. At the last iteration, decisions on the transmitted bits are taken.

### 1.1.4 Analog Decoder Architecture

A block diagram of the analog decoder implementing the hybrid decoding algorithm described in Sec.1.1.3 is shown in Fig.1.4. As already pointed out, it consists of a single SISO unit, which has to match both the inner and the outer code.

More in details, the SISO analog network has been designed following the approach described by Loeliger in [6] and successfully adopted in several analog decoder prototypes [ $8,13,14,16]$. It consists of several sum-product cells which operate on two current input vectors $I_{x}$ and $I_{y}$ representing the probability distributions $p(x)$ and $p(y)$ of discrete random variables and yield one output current vector $I_{z}$, according to (A.53). Thus, each component of the output vector is the sum of products of input vectors component pairs.

As already pointed out, the iterative nature of the proposed decoder requires the SISO to match both the outer and the inner code. Since the trellis length is different in the two cases, as it consists of 200 sections for the outer code which become 300 for the inner code, an appropriate combination of switches is added to the basic scheme as reported


Figure 1.5: SISO simplified diagram with switch between inner and outer configuration
in Fig.1.5 in order to move the SISO termination in the proper position when the shorter length is needed.

Furthermore, to allow the exchange of the extrinsic information during the iterative decoding process, few multiplexers and an analog extrinsic memory are added. This memory must be able to store up to 2400 couples of soft values, corresponding to the extrinsic probabilities $\tilde{p}(0)$ and $\tilde{p}(1)$. The extrinsic memory is organized as a bank of 8 rows, that is the maximum subblocks number, and 300 columns, which correspond to the subblock length.

The input memory is needed to store the channel information during the entire decoding process. Its sizing, design and optimization will be described in Sec.1.2.3.

### 1.1.5 CMOS Technology

The hybrid analog decoder has been designed in the UMC $0.18-\mu \mathrm{m}$ CMOS process. The key features of this technology are:

- minimum channel length: $0.18-\mu \mathrm{m}$;
- dual supply voltage: 1.8 V and 3.3 V ;
- P-substrate;
- single poly, six metal layers (1P6M);
- Twin-Well to realize nMOS with isolated substrate;
- high performance mixed-mode signal capabilities;
- radio frequency MOS transistors.

In order to optimize the decoder power consumption, we used only devices working at the lower supply voltage, that is $V_{d d}=1.8 \mathrm{~V}$.

### 1.2 Input Interface Circuitry

The input interface, whose block diagram is shown in Fig.1.6, has to store a frame of channel output symbols $y_{k}$ on the input analog memory and then to convert each of them into a pairs of currents representing the corresponding channel transition probabilities $p_{c h}\left(y_{k}\right)=p\left(y_{k} \mid x_{k}\right)$. In order to ease the testing phase, the channel output is fed to the memory in a 7 -bit digital representation and it is converted to a differential voltage by a digital-to-analog converter.

### 1.2.1 Specifications

The input interface specifications have been drawn in order to guarantee the proper functioning of the overall analog decoder and also to make the whole system performance competitive with respect to the corresponding digital implementations. In particular, the memory write and access time together with its accuracy have been set so as to ensure a correct decoding process, while the specifications regarding the power consumption and area occupation are deduced from performance comparison with digital decoders.

Write and access time As the channel data rate is equal to $100 \mathrm{Mb} / \mathrm{s}$, the demodulator output is sampled at a frequency of 100 MHz , which leads to a maximum memory write time of 10 ns . The access time specification can be set by recalling that every decoding iteration lasts for 300ns and that, to speed up the decoding process, the correct channel transition probabilities values must be available at the SISO input in the shortest time. Thus, a reasonable value for the access time would be in the range of $10 \div 30 \mathrm{~ns}$, that is at least one order of magnitude smaller than the decoding time.


Figure 1.6: Input interface block diagram

Precision The accuracy specification has been drawn from the precision requirements for fixed point implementations of iterative decoders for concatenated codes with interleaver [17]. In particular, given the total number of bits $n_{b}$ used for the log-likelihood ratio LLR numeric representation, where the LLR for the binary alphabet $\{0,1\}$ is defined as:

$$
\begin{equation*}
\lambda_{k}=\log \frac{p\left(y_{k} \mid c_{k}=1\right)}{p\left(y_{k} \mid c_{k}=0\right)} \tag{1.6}
\end{equation*}
$$

the required number of bits of precision depends on $n_{b}$. For realistic applications, two cases have been considered, that is $n_{b}=5$ and $n_{b}=4$, leading to the following design hints:

- for $n_{b}=4$, the best choice is $(4,1)$ using one bit of precision and three for the dynamic. It yields performance less than 0.1 dB worse than the ideal ones for low $E_{b} / N_{0}$;
- for $n_{b}=5$, the best choice is $(5,2)$ using two bits of precision and three for the dynamic. It yields performance almost identical to the ideal one.

Power consumption One of the key element to judge the performance of the hybrid decoder is definitely its efficiency, that is measured in terms of energy per decoded bit. This latter is defined as:

$$
\begin{equation*}
\text { Energy } / \text { dec }=\frac{\text { Power consumption }}{\text { Input bit rate } \times \text { Code rate }}=\frac{\text { Power consumption }}{\text { Out put bit rate }} \tag{1.7}
\end{equation*}
$$

In the UMTS Turbo decoder presented in [14], the energy per decoded bit has a minimum value of $12.6 \mathrm{~nJ} / \mathrm{bit}$, that is 10 times smaller than the equivalent digital implementations.

In order to further improve this efficiency, our target is to reach for the hybrid decoder an energy per decoded bit of $2 \mathrm{~nJ} / \mathrm{bit}$. This leads, according to (1.7), to a maximum power consumption for the overall analog decoder of 60 mW .

As the extrinsic memory and the SISO power consumption has been estimated around 6.5 mW and 13 mW respectively, and 10 mW should be enough for digital control unit, this leaves a power budget of 30 mW for the input memory.

Area occupation Another important element in evaluating the goodness of the hybrid solution we propose, is the area occupation. As already stressed out, one of the main limitations of analog decoder is represented by the fact that their chip size increases linearly with the codeword length. The hybrid decoder has been introduced to overtake this very limitation.

Due to the large number of memory locations required for the analog memory, this block could be the most demanding in terms of area occupation, especially if we decide to use a capacitor as storage element.

As the die area for the analog decoder implementation is fixed at $5 \mathrm{~mm} \times 10 \mathrm{~mm}$ and the room required for I/O pads and ESD protections as also to be taken into account, we assume the area available for the decoder circuitry to be equal to $4 \mathrm{~mm} \times 9 \mathrm{~mm}$. It is reasonable to allocate for the input memory up to a quarter of this area.

### 1.2.2 Voltage-to-Probability Converter

The interface between the memory array and the decoder core has the task to compute the channel conditional probabilities $p_{c h}\left(y_{k}\right)=p\left(y_{k} \mid x_{k}\right)=p\left(y_{k} \mid c_{k}\right)$ required by the SISO. These probabilities are defined as:

$$
\begin{equation*}
p\left(y_{k} \mid x_{k}=x_{j}\right)=\frac{1}{\sqrt{2 \pi} \sigma_{n}} \exp \left(-\frac{\left(y_{k}-x_{j}\right)^{2}}{2 \sigma_{n}^{2}}\right) \tag{1.8}
\end{equation*}
$$



Figure 1.7: Voltage to probability converter for a binary alphabet
where $\sigma_{n}^{2}$ is AWGN channel noise variance, $y_{k}$ is the channel output, $x_{k}$ is the transmitted signal and $\left\{x_{j} \mid 0 \leq j \leq 1\right\}$ denotes the set of all possible signal amplitudes. In our case, as we use a binary alphabet $\{0,1\}$ with a BPSK modulation $\{-a,+a\}$, the (1.8) can be implemented by a simple differential pair, as described in [18]. A simplified schematic of the voltage to probability converter is shown in Fig.1.7.

The differential pair output currents are proportional to the conditional probabilities according to:

$$
\begin{align*}
& I_{D 1}=I_{N} \frac{p\left[y_{k} \mid x_{k}=-a\right]}{p\left[y_{k} \mid x_{k}=-a\right]+p\left[y_{k} \mid x_{k}=+a\right]}  \tag{1.9}\\
& I_{D 2}=I_{N} \frac{p\left[y_{k} \mid x_{k}=+a\right]}{p\left[y_{k} \mid x_{k}=-a\right]+p\left[y_{k} \mid x_{k}=+a\right]}
\end{align*}
$$

This follows from the fact that, if the nMOS transistor are working under weak inversion, their drain currents are equal to:

$$
\begin{align*}
I_{0} & =\frac{I_{N}}{1+\exp \left(\frac{-2 \alpha V_{c h}}{n U_{T}}\right)}  \tag{1.10}\\
I_{1} & =\frac{I_{N}}{1+\exp \left(\frac{2 \alpha V_{c h}}{n U_{T}}\right)}
\end{align*}
$$

where $V_{c h}$ is the voltage received from the channel, $\alpha$ is the amplifier gain and $n U_{T}$ is the


Figure 1.8: Voltage to probability schematic
temperature equivalent voltage. The two conditional probabilities (1.9) can be written as:

$$
\begin{align*}
I_{N} \cdot \frac{p\left[y_{k} \mid x_{k}=-a\right]}{p\left[y_{k} \mid x_{k}=-a\right]+p\left[y_{k} \mid x_{k}=+a\right]} & =I_{N} \cdot \frac{e^{-\frac{\left(V_{c h}+a\right)^{2}}{2 \sigma_{n}^{2}}}}{e^{-\frac{\left(V_{c h}+a\right)^{2}}{2 \sigma_{n}^{2}}}+e^{-\frac{\left(V_{c h}-a\right)^{2}}{2 \sigma_{n}^{2}}}} \\
& =I_{N} \cdot \frac{1}{1+e^{-\frac{\left(V_{c h}-a\right)^{2}-\left(V_{c h}+a\right)^{2}}{2 \sigma_{n}^{2}}}}  \tag{1.11}\\
& =I_{N} \cdot \frac{1}{1+e^{\frac{2 a \cdot V_{c h}}{2 \sigma_{n}^{2}}}} \\
I_{N} \cdot \frac{1}{p\left[y_{k} \mid x_{k}=-a\right]+p\left[y_{k} \mid x_{k}=+a\right]} & =I_{N} \cdot \frac{1}{1+e^{-\frac{2 a \cdot V_{c h}}{2 \sigma_{n}^{2}}}}
\end{align*}
$$

Thus, if the gain $\alpha$ is set equal to:

$$
\begin{equation*}
\alpha=\frac{n U_{T} a}{\sigma_{n}^{2}} \tag{1.12}
\end{equation*}
$$

it is immediate to deduce the (1.9) from (1.11).

Implementation The voltage to probability converter V-to-P has been implemented by means of a pMOS differential pair, as shown in Fig.1.8, where the two nMOS transistors constitute the SISO module input stage.

The two output currents $I_{0}$ and $I_{1}$, proportional to the conditional probabilities according to (1.9), are given by:

$$
\begin{align*}
& I_{0}=\frac{I_{b}}{1+e^{\left(\frac{V_{i n}}{n_{p} U_{T}}\right)}}  \tag{1.13}\\
& I_{1}=\frac{I_{b}}{1+e^{\left(-\frac{V_{i n}}{n_{p} U_{T}}\right)}}
\end{align*}
$$

The module bias current $I_{b}$ has been set equal to the bias current of the sum-product cells of the SISO unit, that is $I_{b}=1 \mu \mathrm{~A}$. In this way, we have no loss of performance due to the fact that the sum of the input currents of the sum-product module is less than its bias current.

As the voltage to probability conversion relies on the exponential voltage to current characteristic of the input pair pMOSs, they have to be sized so as to work under weak inversion for currents up to $1 \mu \mathrm{~A}$, that is the maximum current flowing through one pMOS.

In order to define the transistors dimensions, precision issues must also be carefully considered. The precision requirement for the voltage to probability module can be deduced by considering the equivalence between the log-likelyhood ratio defined by equation (1.6) and the normalized conditional probabilities. For example, let's consider the probability represented by the V-to-P current $I_{0}$. In terms of log-likelyhood ratio, this can be written as:

$$
\begin{align*}
\frac{p\left[y_{k} \mid c_{k}=0\right]}{p\left[y_{k} \mid c_{k}=0\right]+p\left[y_{k} \mid c_{k}=1\right]} & =\frac{e^{-\frac{y_{k}^{2}}{2 \sigma_{n}^{2}}}}{e^{-\frac{y_{k}^{2}}{2 \sigma_{n}^{2}}}+e^{-\frac{\left(y_{k}-1\right)^{2}}{2 \sigma_{n}^{2}}}} \\
& =\frac{1}{1+e^{\left[-\frac{\left(y_{k}-1\right)^{2}}{2 \sigma_{n}^{2}}+\frac{y_{k}^{2}}{2 \sigma_{n}^{2}}\right]}}  \tag{1.14}\\
& =\frac{1}{1+e^{\frac{2 y_{k}-1}{2 \sigma_{n}^{2}}}} \\
& =\frac{1}{1+e^{\lambda_{k}}}
\end{align*}
$$

By comparing this last expression with equation (1.13), we have:

$$
\begin{equation*}
\lambda_{k}=\frac{V_{i n}}{n_{p} U_{T}} \tag{1.15}
\end{equation*}
$$

From the precision requirements for the numeric representation of the log-likelyhood ratio $\lambda_{k}$ in the digital domain reported in Sec.1.2.1, we can derive the accuracy required
for $V_{i n}$. In particular, if we chose to represent the LLR with $n_{b}=4$ bits, the best results are obtained with $(4,1)$. The use of 1 bit for the precision leads to:

$$
\begin{equation*}
\frac{V_{i n}}{n_{p} U_{T}}=1 / 2^{1}=0.5 \tag{1.16}
\end{equation*}
$$

in the analog domain, while the 3 bits for the dynamic become:

$$
\begin{equation*}
\frac{V_{i n}}{n_{p} U_{T}}=2^{3}=8 \tag{1.17}
\end{equation*}
$$

$n_{p}$ is a technology dependent paramiter, which has been estimated for the UMC $0.18-\mu \mathrm{m}$ process through simulations, giving $n_{p}=1.56$ for pMOS transistors and $n_{n}=1.31$ for nMOS. This leads to a precision requirement for $V_{i n}$ of 20 mV with a dynamic range of $\pm 162 m V=324 m V_{p p}$.

As the input interface contains three main blocks, the DAC, the input memory and the voltage to probability converter, the error introduced by each of these blocks can be defined as:

$$
\begin{equation*}
\tilde{g}(x)=g(x) \cdot(1+\varepsilon) \tag{1.18}
\end{equation*}
$$

where $g(x)$ describes the ideal input-output relation for each block and $\varepsilon$ represents the error introduced by the block non-idealities. We assume the error statistic of each block to be a second order Gaussian distribution, that is the error can be fully described by means of its standard deviation $\sigma_{\varepsilon}$ and its mean $m_{\varepsilon}$.

As the input memory contains the memory cells plus several additional buffers, we divided the error introduced by this block between these two components. Thus the overall input chain contains four sources of error $\varepsilon_{D A C}, \varepsilon_{B U F}, \varepsilon_{O T A}, \varepsilon_{V 2 P}$, each described by a Gaussian distribution. Even if it is a simplification, we assume the four error sources to be independent.

The sum of independent Gaussian variables is still a Gaussian variable whose mean and standard deviation are given by:

$$
\begin{align*}
& m_{\varepsilon_{t}}=\sum_{i=1}^{4} m_{\varepsilon_{i}} \\
& \sigma_{\varepsilon_{t}}=\sqrt{\sum_{i=1}^{4} \sigma_{\varepsilon_{i}}} \tag{1.19}
\end{align*}
$$

If we assume the contribution of each block to be equal, equation (1.19) leads to $m_{\varepsilon_{t}}=4 m_{\varepsilon}$ and $\sigma_{\varepsilon_{t}}=2 \sigma_{\varepsilon}$. Thus, in order to fullfill the precision requirement given by

| Parameter | $A_{V t h}$ | $C_{o}$ |
| :--- | :---: | :---: |
| Unit | $[m V \cdot \mu m]$ | $[\mathrm{mV}]$ |
| nMOS | 4.787 | 0.5328 |
| pMOS | 4.6899 | 0.1894 |

Table 1.1: Matching parameters

| $V$-to- $P$ |  |
| :---: | :---: |
| $M_{p}$ | $\frac{20}{0.4}$ |
| $M_{n}$ | $\frac{4}{0.3}$ |
| $M_{b}$ | $\frac{2}{4}$ |

Table 1.2: Voltage to probability transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$
(1.16), we need $m_{\varepsilon_{t}} \pm 3 \sigma_{\varepsilon_{t}} \leq \pm 20 \mathrm{mV}$. This equation can be satisfied if we impose $4 m_{\varepsilon}+$ $3 \cdot \sigma_{\varepsilon} \leq 20 \mathrm{mV}$ which, for each input interface block, becomes:

$$
\begin{equation*}
m_{\varepsilon}+3 \sigma_{\varepsilon} \leq \frac{20}{4} \mathrm{mV}=5 \mathrm{mV} \tag{1.20}
\end{equation*}
$$

For the voltage to probability module, equation (1.20) imposes some constraints on the pMOS transistors dimensions. In particular, as the standard deviation of the threshold voltage error can be modeled by equation:

$$
\begin{equation*}
\sigma\left(\Delta V_{t h}\right)=\frac{A_{V t h}}{\sqrt{W \cdot L}}+C_{o} \tag{1.21}
\end{equation*}
$$

where $A_{V_{T}}$ and $C_{o}$ are technology dependent parameters whose values for the UMC 0.18$\mu \mathrm{m}$ CMOS process are given in Table 1.1, the pMOS width and length have to be chosen so as to keep $\sigma\left(\Delta V_{t h}\right) \leq 1.8 \mathrm{mV}$. At the same time, the transistor dimensions have to ensure they work in their exponential region for currents up to $1 \mu \mathrm{~A}$.

The transistors size for the voltage to probability converter are reported in Table 1.2, where the dimensions of the SISO input stage nMOS are given too.

With this choice, $\sigma\left(\Delta V_{t h}\right) \simeq 1.85 \mathrm{mV}$.

### 1.2.3 Input Memory

The input memory is needed to store the channel output symbols $y_{k}$ during the entire decoding process. In order to maximaze the decoder throughput, the input memory consists
of two identical banks, so that one can be loaded with the channel information while the other is fed to the decoder.

As the maximum codeword length to be handled by the decoder is equal to $L=4832$, each memory bank consists of 4832 pseudo-differential memory cells. Thus, the overall memory counts $4832 \times 2$ memory locations organized in 604 columns and 8 rows. When the minimum codeword is selected, just the first memory row is used, while the whole memory is needed with codewords of maximum length.

In order to meet the need for low power consumption and high density, we chose an architecture based on switched-capacitor circuits to implement the input memory [19], as shown in the simplified memory scheme of Fig.1.9. The main advantage of this switchedcapacitor based solution with respect to conventional buffered sample-and-hold circuit topology is that it requires only one active element per column, instead of one per capacitor, with a significant power saving.

During write operation, the input signal path is fed to 8 buffers, one for each row, in order to cope with the large parasitic capacitance of the long interconnections. The control signals $W_{a}, W_{b}, R_{a}$ and $R_{b}$, with $W_{a}=\bar{W}_{b}$ and $R_{a}=\bar{R}_{b}$, select which bank is being loaded, the other one being fed to the decoder core.

The memory is read one row at a time by the SISO while working in the inner mode by placing the capacitors in feedback configuration across the corresponding OTAs. The writing operation is sequential, so one capacitor couple is accessed per clock cycle.

Memory cell Each memory cell is pseudo-differential and consists of two replicas of the same basic structure formed by five switches, a capacitor and an operational transconductance amplifier, as described in [20]. A single OTA is used for all the cells belonging to the same memory column and is shared between both memory banks A and B. Thus, considering the pseudo-differential nature of the memory, a total amount of $604 \times 2$ OTA are needed. A simplified schematic of the memory cell is shown in Fig.1.10. The top plate of the sampling capacitor can be shorted to the memory input or output by means of the two switches $S_{w}$ and $S_{r}$, while the bottom plate is connected to the reference voltage $V_{r e f}$ or to the OTA input by the two switches $S_{w a}$ and $S_{r a}$ respectively. The additional switch $S_{\text {reset }}$ serves to configure the operational amplifier as a voltage follower in order to force the input and output node to the bias voltage $V_{b} \simeq V_{\text {ref }}$ between two consecutive read phases.

The operation of the circuit can be described by dividing the data acquisition process


Figure 1.9: Input memory block diagram


Figure 1.10: Memory cell structure
into write and read cycles. While a memory bank is being written, the data previously stored in the other one are fed to the decoder core. A simplified timing diagram for the memory read and write phase is shown in Fig.1.11, where T indicates the nominal clock period of 10 ns .

During the write phase, switches $S_{w}$ and $S_{w a}$ are closed, while $S_{r}, S_{r a}$ and $S_{\text {reset }}$ are open, as the OTA is busy reading the data stored in the other memory bank. The input voltage $V_{i n}$ applied to the memory input is sampled and the voltage $V_{i n}-V_{b}$ is stored across the capacitor $C_{S}$. During the read phase, switches $S_{r}, S_{r a}$ are closed while $S_{w}, S_{w a}$ and $S_{\text {reset }}$ are open, placing the OTA in feedback configuration. Due to the amplifier high gain, the voltage at node $X, V_{x}$ is held constant at a value almost equal to $V_{b}$. Thus no charge can be injected into the capacitor $C_{S}$ leading $V_{\text {out }}=V_{\text {in }}$.

Switched-capacitors circuits suffer from charge injection errors. The bottom-plate sampling principle [21], can greatly help to reduce this error source. This means to adopt the switching sequence $S_{w a} \rightarrow S_{w} \rightarrow S_{r a} \rightarrow S_{r}$. However, using this switching sequence, the circuit precision can be further increased by:

- increasing the sampling capacitance $C_{S}$ value. This, however, reduces the sampling bandwidth proportionally;
- making the charge $Q_{w a}$ injected by switch $S_{w a}$ equal to $-Q_{r a}$, where $-Q_{r a}$ is the charge injected by $S_{r a}$. This is only possible if both switches $S_{w a}$ and $S_{r a}$ are of the same type, nMOS or pMOS;
- minimizing both $Q_{w a}$ and $Q_{r a}$. Therefore, the area of both switches $S_{w a}$ and $S_{r a}$


Figure 1.11: Read and write phase timing diagram
should be as small as possible. However, decreasing the width of the switches reduces the sampling frequency bandwidth accordingly.

Another source of error in the proposed circuit is due to the OTA non-idealities. In particular, the OTA finite gain causes the voltage at node $X$ to differ from $V_{b}$ while the finite input capacitance $C_{i n}$ at the same node determines a charge sharing phenomena between $C_{i n}$ and $C_{S}$ during the read phase. The circuit analysis gives:

$$
\begin{equation*}
V_{\text {out }}=\frac{V_{\text {in }}}{1+\frac{1}{A_{v}}\left(1+\frac{C_{\text {in }}}{C_{S}}\right)} \simeq V_{\text {in }}\left[1-\frac{1}{A_{v}}\left(1+\frac{C_{\text {in }}}{C_{S}}\right)\right]=V_{\text {in }}\left[1-\frac{1}{A_{v} \beta}\right] \tag{1.22}
\end{equation*}
$$

where $\beta=C_{S} /\left(C_{i n}+C_{S}\right)$ indicates the feedback factor.
Moreover, as each memory cell has to be read several times, that is once per Turbo iteration, before being overwritten by a new channel information value, each read opera-
tion adds a data-dependent error on the value stored in the memory cell, whose magnitude is given by:

$$
\begin{equation*}
\varepsilon=-\frac{V_{i n}}{A_{v}}\left(1+\frac{C_{i n}}{C_{S}}\right)=-\frac{V_{i n}}{A_{v} \beta} \tag{1.23}
\end{equation*}
$$

As a results, in order to keep the error on the output voltage $V_{\text {out }}$ within acceptable bounds, we need to carefully design the OTA so as to ensure an high low frequency gain $A_{v}$. The input capacitance $C_{\text {in }}$ should also be minimized, even if the speed is not critical, so as to ensure a high feedback factor $\beta$.

The circuit speed during the write phase depends on the time constant:

$$
\begin{equation*}
\tau_{\text {write }}=\left(R_{o n, w}+R_{o n, w a}\right) C_{S} \tag{1.24}
\end{equation*}
$$

where $R_{o n, w}$ and $R_{o n, w a}$ are the on-resistance of switches $S_{w}$ and $S_{w a}$ respectively. Thus, the write time increases linearly with the sampling capacitance $C_{S}$ and the switches onresistance.

During the read phase, the circuit analysis leads to:

$$
\begin{equation*}
\tau_{\text {read }}=\frac{C_{L} C_{i n}+C_{i n} C_{S}+C_{S} C_{L}}{G_{m} C_{S}}=\frac{1}{\beta} \frac{C_{L}+(1-\beta) C_{S}}{G_{m}} \tag{1.25}
\end{equation*}
$$

where $G_{m}$ is the amplifier transconductance. The read time decreases with the amplifier transconductance while its relation with the feedback factor depends on the capacitance values. If $C_{L}$ is dominant with respect to $C_{S}$, an increase of the feedback factor $\beta$ leads to a reduction of the read time $\tau_{\text {read }}$.

The slewing behavior of the circuit has also to be taken into account when designing the OTA. Upon entering the amplification mode, the circuit may experience a large step at the inverting input. As the input capacitance $C_{i n}$ is usually small, the voltage at the output node and at node $X$ does not change immediately, but at the beginning of the read phase $V_{x}=-V_{i n}$. This can force the amplifier into a slewing condition. The slewing factor depends on the chosen amplifier topology. However, for a common source amplifier it is approximately equal to $I_{B} / C_{L}$, where $I_{B}$ is the common source bias current.

OTA design As already pointed out, the OTA has to be carefully designed so as to satisfy the input memory precision and speed requirements. First of all, as we need a high low frequency gain to keep the error on the read voltage within acceptable bounds, we decided to use a regulated cascode topology. The schematic of the OTA is depicted in Fig.1.12. The amplifier voltage gain $A_{v}$ is given by:


Figure 1.12: OTA schematic

$$
\begin{equation*}
A_{v}=g_{m 1} R_{\text {out }} \tag{1.26}
\end{equation*}
$$

where $g_{m 1}$ is transistor $M_{1}$ transconductance and the output resistance $R_{\text {out }}$ is given by:

$$
\begin{equation*}
R_{\text {out }}=\left[\left(g_{m 2} r_{d s 2} r_{d s 1}\right) g_{m 3} r_{d s 3}\right] / /\left(g_{m 6} r_{d s 6} r_{d s 4}\right) \tag{1.27}
\end{equation*}
$$

while for the unit gain frequency we have:

$$
\begin{equation*}
\omega_{u}=g_{m 1} C_{\text {out }} \tag{1.28}
\end{equation*}
$$

as we assume the dominant pole to be at the output node. The relationship between the closed loop bandwidth $\omega_{C L}=\frac{g_{m} \beta}{C_{L}+(1-\beta) C_{S}}$ and the feedback factor $\beta=C_{S} /\left(C_{i n}+C_{S}\right)$ depends on the values of the load capacitance $C_{L}$ and the input capacitance $C_{i n}$. This latter is given by the sum of the long interconnection parasitic capacitance, which has been estimated around 30fF, and the gate capacitance $C_{g 1}$ of $M_{1}$. Due to the tight precision requirements, $M_{1}$ will not be vary small, thus the contribute given by $C_{g 1}$ can not be neglected. We assume $C_{i n}=C_{w}+C_{g 1}=60 \mathrm{fF}$. The contribution of the wire parasitic capacitance can be neglect in the evaluation of $C_{L}$, which is given by the input capacitance of the voltage to probability module, leading to $C_{L}=50 \mathrm{fF}$.


Figure 1.13: Countour lines for $\omega_{C L}\left(\beta, g_{m}\right)$

The contour lines for $\omega_{C L}$ as a function of $\beta$ and $g_{m}$ are shown in Fig.1.13. The specification for the read time of $\tau_{\text {read }}=10 \mathrm{~ns}$ given in Sec.1.2.1 leads to $\omega_{C L}=600 \mathrm{Mrad} / \mathrm{s}$. As the contour lines do not vary linearly with $\beta$ and $g_{m}$, the optimal region for each curve is the central one. Moving towards the sides, a small save in terms of area requires a large increase in power consumption and vice versa.
$\beta$ as a function of $C_{S}$ is plotted in Fig.1.14 while the contour lines for $\omega_{C L}$ as a function of $C_{S}$ and $g_{m}$ are shown in Fig.1.15.

Thus, in order to optimize our design, the feedback factor $\beta$ has to vary between 0.3 and 0.7 , which leads to a sampling capacitance $C_{S}$ between 40 fF and 150 fF .

The choice of the sampling capacitance $C_{S}$ depends also on other factors, such as:

- the write time specification;
- the OTA precision requirements;
- the charge injection phenomena;


Figure 1.14: $\beta$ as a function of $C_{S}$

- the leakage currents;
- the area occupation, as the memory will count up to $4832 \times 2$ capacitors.

In particular, for what it concerns the accuracy requirements, from (1.20) and (1.23) and recalling that the maximum input voltage is equal to 162 mV and that each memory location is read up to 10 times, we can derive the open loop gain specification for the OTA as:

$$
\begin{equation*}
A_{\nu} \beta \geq \frac{10 \cdot 162 \cdot 10^{-3}}{1.8 \cdot 10^{-3}} \tag{1.29}
\end{equation*}
$$

which leads to an open loop gain of at least 60 dB .
As we used the bottom-sampling switching sequence, the error induced by charge injection is minimized and thus does not influence the choice of the sampling capacitance value.

The leakage current may be a non neglectable source of error, due to the long hold time, especially for the first data written in the memory.


Figure 1.15: Countour lines for $\omega_{C L}\left(C_{S}, g_{m}\right)$

In order to estimate this error, we considered the worst case hold time of $48 \mu \mathrm{~s}+48 \mu \mathrm{~s}$, where the first $48 \mu$ s are the time it takes to store a maximum length codeword into the input memory, while the additional $48 \mu$ s are the time required to decode it.

Worst case simulations at a temperature of $80^{\circ} \mathrm{C}$ with a minimum size switch give a leakage current of $\simeq 0.34 \mathrm{pA}$, which leads to a voltage loss of less than 1 mV for a sampling capacitance of 40fF. Moreover, as the data are stored in differential form and the leakage current is almost independent from the input voltage, the error induced by the leakage is mainly a commom mode error, as will be proved by overall memory simulations. As the memory has to allocate $4832 \times 2$ capacitors, the area occupation is the most critical factor in the choice of the sampling capacitance value $C_{S}$. The memory is organized in 8 rows, each containing 1208 capacitors, and it has to be fitted into a maximum area of $9 \mathrm{~mm} \times$ $200 \mu \mathrm{~m}$. Thus, the area available for each memory cell, which includes the sampling capacitor, the five switches and the additional room for the wiring, can be estimated in $\sim$ $7.5 \mu \mathrm{~m} \times 25 \mu \mathrm{~m}$. As the capacitance per unit area for the UMC $0.18-\mu \mathrm{m}$ CMOS technology is $1 \mathrm{fF} / \mu \mathrm{m}^{2}$, we can consider a maximum capacitance value around 75 fF .

| Switches |  |
| :---: | :---: |
| $S_{w}$ | $\frac{0.86}{0.18}$ |
| $S_{w a}$ | $\frac{0.28}{0.18}$ |
| $S_{r}$ | $\frac{0.24}{0.18}$ |
| $S_{r a}$ | $\frac{0.24}{0.18}$ |
| $S_{\text {reset }}$ | $\frac{0.24}{0.18}$ |

Table 1.3: Switches size in $\mu \mathrm{m} / \mu \mathrm{m}$

| OTA |  |
| :---: | :---: |
| $M_{1}$ | $\frac{10}{1}$ |
| $M_{2}$ | $\frac{5}{0.25}$ |
| $M_{3}$ | $\frac{2}{0.18}$ |
| $M_{4}$ | $\frac{10}{2}$ |
| $M_{5}$ | $\frac{0.5}{2}$ |
| $M_{6}$ | $\frac{6}{0.18}$ |

Table 1.4: OTA transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$

Following these considerations, the final value for the sampling capacitance $C_{S}$ has been set to 37.8 fF , which corresponds to an area of $\sim 6 \mu \mathrm{~m} \times 6 \mu \mathrm{~m}$.

The switches $S_{w}$ and $S_{w a}$ have thus been sized so as to satisfy the write time specifications. Recalling that the on resistance of a MOS transistor is given by:

$$
\begin{equation*}
R_{o n}=\frac{1}{\mu C_{o x} \frac{W}{L}\left(V_{D D}-V_{t h}-V_{i n}\right)} \tag{1.30}
\end{equation*}
$$

and that the write specification requires a $\tau_{\text {write }} \leq 1.5 \mathrm{~ns}$, minimum size transistors can not be used. The memory cell switches dimensions are reported in Table 1.3. All switches are implemented with nMOS transistors.

Once the capacitor sampling value has been set, the OTA design follows straightforward. Its transistors size are reported in Table 1.4, while in Fig.1.16 the frequency response of the OTA in typical conditions is shown. We can see how the OTA exhibits an open loop frequency gain $A_{v} \beta \simeq 76 \mathrm{~dB}$ with a unit gain frequency $f_{u} \simeq 100 \mathrm{MHz}$, which corresponds to $\omega_{u} \simeq 630 \mathrm{Mrad} / \mathrm{s}$. The OTA bias current $I_{B}$ has been set to $10 \mu \mathrm{~A}$ while the cascode bias voltage $V_{c p}=600 \mathrm{mV}$.

The corner analysis gives as worst result an open loop frequency gain $A_{\nu} \beta \simeq 64.5 \mathrm{~dB}$, an unit gain frequency $f_{u} \simeq 89.8 \mathrm{MHz}$ with a phase margin of 75.5 deg , which is still well within specifications.

Input buffer To drive the long interconnection parasitic capacitance, a buffer employing a cascode amplifier in unit feedback configuration has been inserted for each memory row at the DAC output, as depicted in Fig.1.9. The wiring capacitance has been estimated by means of the well-known formula:

$$
\begin{equation*}
C_{w}=\left(C_{a} W+2 C_{f}+2 C_{c}\right) L \tag{1.31}
\end{equation*}
$$



Figure 1.16: OTA frequency responce
where W and L are the interconnection wire width and length respectively, $C_{a}$ is the area capacitance between the metal wire and the substrate, $C_{f}$ the fringing capacitance and $C_{c}$ the coupling capacitance with adjacent metal lines. Substituting the UMC technology paramiter into (1.31), we obtain $C_{w}=850 \mathrm{fF}$. The capacitance due to the memory cell switches, $C_{s w}=1.5 \mathrm{pF}$, has also to be taken into account.

The wire resistance is equal to $R_{w}=1 \mathrm{k} \Omega$.
Due to the large parasitic capacitance and resistance, we decide to split the memory into two symmetrical parts and to use a pair of buffers to drive the cells of each memory row. Thus, each buffer has to be sized so as to drive a capacitance of $\left(C_{w}+\right.$ $\left.C_{S w}\right) / 2 \simeq 1.2 \mathrm{pF}$ with a time constant given by $\tau=R_{w}\left(C_{w}+C_{s w}\right) / 8=0.3 \mathrm{~ns}$ instead of $\tau=R_{w}\left(C_{w}+C_{s w}\right) / 2=1.175 \mathrm{~ns}$, where a distributed RC network has been used to model the interconnection wire [22].

The buffer specifications can be easily deduced from the input memory ones given in Sec.1.2.1. In particular, a settling time $t_{s} \leq 9 \mathrm{~ns}$ is required in order to meet the write time specification while a low frequency gain $A_{v} \geq 50 \mathrm{~dB}$ meets the precision requirements.

The schematic of the cascode amplifier is shown in Fig.1.17, while its transistors size are reported in Table 1.5.


Figure 1.17: Input buffer schematic

The low frequency gain is given by:

$$
\begin{equation*}
A_{v}=g_{m 1} R_{\text {out }} \tag{1.32}
\end{equation*}
$$

where

$$
\begin{equation*}
R_{\text {out }}=\left(g_{m 4} r_{d s 4} r_{d s 2}\right) / /\left(g_{m 7} r_{d s 7} r_{d s 6}\right) \tag{1.33}
\end{equation*}
$$

Due to the large load capacitance, we expect the dominant pole to be at the output node. Thus the unit gain frequency can be written as:

$$
\begin{equation*}
\omega_{u}=2 g_{m 1} /\left(C_{w}+C_{s w}\right) \tag{1.34}
\end{equation*}
$$

The buffer bias current $I_{B}$ has been set to $I_{B}=260 \mu \mathrm{~A}$ while $V_{c p}=600 \mathrm{mV}$ and $V_{c n}=1.2 \mathrm{~V}$ are the pMOS and nMOS cascode voltage references respectively. As the pMOS of the input differential pair are biased so as to work between strong and moderate inversion in

| Buffer |  |
| :---: | :---: |
| $M_{1}$ | $\frac{272}{0.25}$ |
| $M_{3}$ | $\frac{40}{0.18}$ |
| $M_{5}$ | $\frac{20}{0.25}$ |
| $M_{7}$ | $\frac{4}{1}$ |
| $M_{b}$ | $\frac{260}{2}$ |

Table 1.5: Buffer transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$


Figure 1.18: Input memory buffer frequency responce
order to pump up the amplifier gain, the input offset voltage can be calculated by:

$$
\begin{equation*}
V_{O S}=\sqrt{n_{p}^{2} U_{T}^{2}\left[\left(\frac{\Delta \frac{W}{L}}{\frac{W}{L}}\right)^{2}+\left(\frac{\Delta I_{B}}{I_{B}}\right)^{2}\right]+\Delta V_{t h}^{2}} \tag{1.35}
\end{equation*}
$$

where $\frac{\Delta I_{B}}{I_{B}}$ can be deduced from the empiric formula:

$$
\begin{equation*}
\sigma_{\Delta I_{d s}}(\%)=C_{n t}(W L)^{-0.5 x} \tag{1.36}
\end{equation*}
$$

where $C_{n t}$ and $x$ are constant deduced by devices measurements. In our case, $C_{n t} \simeq 1.6503$ and $x \simeq 0.8592$, which gives $\sigma_{\Delta I_{B}}=0.9 \%$. Recalling the UMC technology matching paramiters reported in Table 1.1, this leads to $V_{O S} \simeq 0.85 \mathrm{mV}$, well within the required value.


Figure 1.19: Simplified DAC schematic
The buffer frequency responce in typical conditions is shown in Fig.1.18. The buffer exhibits a low frequency gain $A_{v}=58 \mathrm{~dB}$ with a unit gain frequency of $f_{u}=200 \mathrm{MHz}$.

Corner analysis gives a worst case gain $A_{v}=56.8 \mathrm{~dB}$ with a unit gain frequency of $f_{u}=199 \mathrm{MHz}$ and a phase margin of 78.3deg.

### 1.2.4 DAC

The DAC has to convert the digital representation of the channel output into the differential voltage that is fed to the input analog memory. We estimated a 7-bit resolution, corresponding to a $V_{L S B}=2.5 \mathrm{mV}$, to be high enough to meet the precision requirements of (1.20). To ensure the DAC accuracy, we used for the 3 most significant bits a thermometercode representation. The DAC is implemented using the differential switched-current architecture reported in Fig.1.19. The two resistors $R_{D N}$ and $R_{D P}$ convert the output current to a differential voltage according to:

$$
\begin{align*}
V_{\text {OUTP }}-V_{\text {OUTN }} & =\left(R_{D P}+R_{D N}\right) \cdot\left(\frac{N}{64}-\frac{127}{128}\right) \cdot I_{B D A C} \\
& =2 \cdot I_{B D A C} \cdot R_{D} \cdot\left(\frac{N}{64}-\frac{127}{128}\right) \tag{1.37}
\end{align*}
$$



Figure 1.20: DAC transient output waveforms
where $I_{B D A C}$ indicates the DAC reference current and we assume the two resistors to be equal, that is $R_{D N}=R_{D P}=R_{D}$.

The DAC reference current has been chosen equal to $I_{B D A C}=102.25 \mu \mathrm{~A}$ so as to satisfy the write time specification, considering a load capacitance $C_{L D A C} \simeq 7 \mathrm{pF}$. According to (1.37), this leads to $R_{D}=50 \Omega$ and to an overall DAC current consumption of:

$$
\begin{equation*}
I_{T O T}=7 \cdot 4 \cdot I_{B D A C}+2 \cdot I_{B D A C}+I_{B D A C}+\frac{I_{B D A C}}{2}+\frac{I_{B D A C}}{4}=3.246 \mathrm{~mA} \tag{1.38}
\end{equation*}
$$

where the first term is due to the thermometer-code most significant bits while the others are due to the last four significant bits.

The DAC performance have been estimated by means of MonteCarlo simulations considering both process and mismatch variations. The DAC has been fed with a sequence of input words switching between the maximum and minimum channel signal amplitude, which corresponds to $V_{\text {in }}=+162 \mathrm{mV}$ and $V_{\text {in }}=-162 \mathrm{mV}$ respectively. An example of the corresponding transient DAC output waveform is shown in Fig.1.20.

The results of 500 MonteCarlo iterations are shown in Fig.1.21, Fig.1.22 and Fig.1.23, where the differential voltage has been sampled at the farest memory cell input after a settling time of 9 ns , thus considering also the error due to the input buffers. The maximum error mean is equal to $m_{\varepsilon_{D A C}}+m_{\varepsilon_{B U F}} \simeq 0.3 \mathrm{mV}$ with a standard deviation of $\sqrt{\sigma_{\varepsilon_{D A C}}^{2}+\sigma_{\varepsilon_{B U F}}^{2}} \simeq 2.2 \mathrm{mV}$, thus well within the required precision.

The DAC layout has been drawn following the scheme proposed in [23], so as to


Figure 1.21: Pre-layout simulations error on 600 random codewords


Figure 1.22: Post-layout simulations error on 600 random codewords


Figure 1.23: Post-layout simulations error on 600 random codewords
ensure an high accuracy.

### 1.3 Simulations Results

The performace of the decoder interface have been evaluated by simulating the memory read and write phase separately, the overall precision being calculated as:

$$
\begin{align*}
m_{\varepsilon_{t}} & =m_{\varepsilon_{r}}+m_{\varepsilon_{w}} \\
\sigma_{\varepsilon_{t}} & =\sqrt{\sigma_{\varepsilon_{r}}^{2}+\sigma_{\varepsilon_{w}}^{2}} \tag{1.39}
\end{align*}
$$

where $\varepsilon_{r}$ and $\varepsilon_{w}$ are the errors introduced by the read and write process respectively.

### 1.3.1 Read Phase Simulations

In order to evaluate the input memory performance during the read phase, extensive MonteCarlo simulations with 500 iterations and considering both process and mismatch variations have been performed. As the error depends on the input voltage $V_{i n}$ being read (1.23) but also on the data read by means of the OTA during the previous read cycle $V_{i n_{p r e}}$, we have considered two different cases:
(a) $V_{\text {in }}=162 \mathrm{mV}$ and $V_{\text {in }}$ pre $=-162 \mathrm{mV}$;
(b) $V_{\text {in }}=-162 \mathrm{mV}$ and $V_{\text {in }_{\text {pre }}}=162 \mathrm{mV}$.

In both cases, the memory cell has been read up to 10 times and the corresponding output voltage $V_{\text {out }}$, together with the two voltage to probability converter drain currents $I_{0}$ and $I_{1}$, has been sampled after 10 ns and 30 ns from the beginning of each read phase.

As 3 bits of dynamic are required for the log-likelyhood ratio representation, the range of all possible values for $I_{0} / I_{1}$ can be deduced rewriting (1.6) as:

$$
\begin{equation*}
\lambda_{k}=\log \frac{I_{0}}{I_{1}}=\left[-2^{(3-1)}, 2^{(3-1)}-1\right] \simeq[-4,4] \tag{1.40}
\end{equation*}
$$

from which we obtain:

$$
\begin{equation*}
\frac{I_{0}}{I_{1}}=\left[e^{-4}, e^{4}\right]=\left[\frac{1}{54}, 54\right] \tag{1.41}
\end{equation*}
$$

Thus, when the differential input voltage $V_{\text {in }}=162 \mathrm{mV}, I_{0} / I_{1} \geq 54$ while when $V_{\text {in }}=$ $-162 \mathrm{mV}, I_{1} / I_{0} \geq 54$. The 1 bit of precision requires:

$$
\begin{equation*}
\sigma_{\frac{I_{0}}{I_{1}}} \leq e^{0.5}=1.65 \tag{1.42}
\end{equation*}
$$

The MonteCarlo simulations results are summarized in Table 1.6 for case (a) and in Table 1.7 for case (b). In both cases, two voltage references $V_{b x}=300 \mathrm{mV}$ and $V_{b y}=$ 700 mV are considered for the nMOS of the SISO input stage. The common mode output voltage error is also reported, where the nominal common mode value is set to 900 mV . As an example, the transient output waveform corresponding to case (a) are shown in Fig.1.24, while the MonteCarlo results for the tenth reading with $V_{i n}=-162 \mathrm{mV}, V_{i n_{p r e}}=$ -162 mV and $V_{b x}=300 \mathrm{mV}$ are shown in Fig.1.25.

As expected, the error on the differential output voltage worsen with the number of reading to the point of not meeting the precision requirements. However, even if the simulated conditions represent the worst case for the error, when the input voltage $V_{\text {in }}=$ $\pm 162 \mathrm{mV}$ we are far from the most critical situation for the SISO decoding process, that is represented by the equalprobability condition of two symbols.

|  | $R_{1,10 n s} V_{b x}$ | $R_{1,10 n s} V_{b y}$ | $R_{1,30 n s} V_{b x}$ | $R_{1,30 n s} V_{b y}$ | $R_{10,10 n s} V_{b x}$ | $R_{10,10 n s} V_{b y}$ | $R_{10,30 n s} V_{b x}$ | $R_{10,30 n s} V_{b y}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $m_{\varepsilon_{r}}[\mathrm{mV}]$ | -0.202 | -2.066 | 0.209 | -0.709 | 5.709 | -5.582 | 6.321 | -4.292 |
| $\sigma_{\varepsilon_{r}}[\mathrm{mV}]$ | 3.266 | 3.406 | 3.269 | 3.393 | 4.588 | 5.092 | 4.621 | 5.083 |
| $m_{I_{1}}$ | 2.107 | 17.417 | 63.394 | 60 | 2.616 | 22.278 | 71.667 | 55.098 |
| $\sigma_{I_{1}}$ | 0.329 | 3.569 | 7.568 | 7.357 | 0.480 | 4.469 | 9.849 | 8.95 |
| $I_{I_{0}}$ |  |  | 45.92 | 44.206 |  |  | 25.686 | 5.927 |
| $m_{\varepsilon_{r c m}}[\mathrm{mV}]$ |  |  | 6.229 | 6.216 |  |  | 12.43 | 12.213 |
| $\sigma_{\varepsilon_{r c m}[\mathrm{mV}]}$ |  |  |  |  |  |  |  |  |

Table 1.6: $V_{i n}=162 \mathrm{mV}$ and $V_{\text {in }_{\text {pre }}}=-162 \mathrm{mV}$

|  | $R_{1,10 n s} V_{b x}$ | $R_{1,10 n s} V_{b y}$ | $R_{1,30 n s} V_{b x}$ | $R_{1,30 n s} V_{b y}$ | $R_{10,10 n s} V_{b x}$ | $R_{10,10 n s} V_{b y}$ | $R_{10,30 n s} V_{b x}$ | $R_{10,30 n s} V_{b y}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $m_{\mathcal{E}_{r}}[\mathrm{mV}]$ | 0.169 | -1.71 | 0.565 | -0.353 | 4.22 | -5.049 | 4.828 | -3.755 |
| $\sigma_{\varepsilon_{r}}[\mathrm{mV}]$ | 3.273 | 3.437 | 3.275 | 3.4133 | 4.477 | 5.159 | 4.516 | 5.132 |
| $m_{I_{1}}$ | 2.085 | 17.05 | 64.07 | 60.627 | 2.615 | 22.285 | 72.716 | 55.981 |
| $T_{0}$ |  |  |  |  |  |  |  |  |
| $\sigma_{I_{1}}$ | 0.331 | 3.571 | 7.579 | 7.408 | 0.482 | 4.679 | 9.729 | 9.204 |
| $m_{\varepsilon_{r c m}}[\mathrm{mV}]$ |  |  | 26.52 | 24.187 |  |  | 11.424 | 6.661 |
| $\sigma_{\varepsilon_{r c m}}[\mathrm{mV}]$ |  |  | 6.229 | 6.216 |  |  | 16.251 | 9.010 |

Table 1.7: $V_{\text {in }}=-162 \mathrm{mV}$ and $V_{\text {in }}$ pre $=162 \mathrm{mV}$

|  | $R_{1,10 n s} V_{b x}$ | $R_{1,10 n s} V_{b y}$ | $R_{1,30 n s} V_{b x}$ | $R_{1,30 n s} V_{b y}$ | $R_{10,10 n s} V_{b x}$ | $R_{10,10 n s} V_{b y}$ | $R_{10,30 n s} V_{b x}$ | $R_{10,30 n s} V_{b y}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $m_{\varepsilon_{r}}[\mathrm{mV}]$ | 0.959 | 0.063 | 0.973 | 0.064 | 1.833 | 0.192 | 1.904 | 0.195 |
| $\sigma_{\mathcal{E}_{r}}[\mathrm{mV}]$ | 3.257 | 3.340 | 3.267 | 3.393 | 4.261 | 4.854 | 4.285 | 4.948 |
| $m_{\frac{I_{1}}{}}$ | 0.997 | 1.002 | 0.981 | 1.003 | 0.962 | 1.002 | 0.812 | 1.004 |
| $\sigma_{I_{1}}$ | 0.014 | 0.085 | 0.997 | 0.105 | 0.022 | 0.120 | 0.089 | 0.138 |
| $m_{\mathcal{I}_{r c c}}[\mathrm{mV}]$ |  |  | 26.52 | 24.187 |  |  | 25.599 | 5.865 |
| $\sigma_{\mathcal{E}_{\text {rcm }}}[\mathrm{mV}]$ |  |  | 6.229 | 6.216 |  |  | 12.238 | 12.081 |

Table 1.8: $V_{i n}=0 \mathrm{mV}$ and $V_{i n_{p r e}}=162 \mathrm{mV}$


Figure 1.24: Memory transient output waveforms

It is thus useful to analyze the input memory error when $V_{i n}=0 \mathrm{~V}$. From the results reported in Table 1.8 with $V_{\text {in }}$ pre $=162 \mathrm{mV}$, we see how in this case the error mean is reduced with respect to both case (a) and (c) as expected from (1.23), while its standard deviation is comparable with the one obtained with $V_{i n}= \pm 162 \mathrm{mV}$.

However, we will see how the error introduced by the memory cell during the read phase is dominant with respect to the one introduced by the memory write process, leading to the fact that the obtained resultsare good enough to meet the overall input interface specifications, as we will see in Sec.1.4.

It is important to notice how the voltage to probability output currents $I_{0}$ and $I_{1}$ require a longer time to settle with respect to the memory output voltage $V_{\text {out }}$ when the full dynamic is required. In fact, after a read time of 30 ns , they exhibit the desired dynamic but with a large distribution, due to process variations. However, this has a negligible impact on the overall decoder performance as a read time of 40 ns brings the standard deviation value back within the required limits given by (1.42).

The error on the common mode voltage is well within the required bounds, as a maximum error of $\pm 50 \mathrm{mV}$ can be tolerated [24].


Figure 1.25: MonteCarlo simulation results with $V_{i n}=-162 \mathrm{mV}$ and $V_{i n_{p r e}}=-162 \mathrm{mV}$

### 1.3.2 Write Phase Simulations

The performance of the input interface during the write phase have been evaluated by means of MonteCarlo simulations, considering both mismatch and process variations. The DAC has been fed with 6007 -bit random codewords, which has been given a time of 9 ns to be stored into the farest memory cells from the DAC output, that are the first and last cell of the eighth memory row. For each codeword, 100 instances have been simulated.

The MonteCarlo simulations results reported in Fig.1.26 show a maximum error mean $m_{\varepsilon_{w}}=523 \mu \mathrm{~V}$ with a standard deviation $\sigma_{\varepsilon_{w}}=2.221 \mathrm{mV}$. Thus, considering the worst case error during read phase and the one introduced by the voltage to probability converter, from equation (1.39) we obtain:

$$
\begin{align*}
m_{\varepsilon_{t}} & =6.321+0.523=6.884 \mathrm{mV} \\
\sigma_{\varepsilon_{t}} & =\sqrt{5^{2}+2^{2}+1.85^{2}} \simeq 5.7 \mathrm{mV} \tag{1.43}
\end{align*}
$$

where no systematic offset has been considered for the voltage to probability module. The results obtained are in line with the precision specifications.

The corresponding transient waveforms are shown in Fig.1.27, together with the DAC input value.


Figure 1.26: MonteCarlo write phase transient simulation results


Figure 1.27: MonteCarlo DAC output waveforms


Figure 1.28: Pre-layout simulations error on 600 random codewords



Figure 1.29: Post-layout simulations error on 600 random codewords

Post-layout simulations show how the influence of parasitic on the precision performance is negligible. In Fig.1.28 and Fig.1.29 the error at the DAC and buffer output on 600 random codewords is reported for pre- and post-layout simulations, respectively.

### 1.3.3 Power Consumption

As the input memory consists of $604 \cdot 2$ OTAs, two for each memory column, and 8.4 buffers, four for each memory row, the total power consumption can be easily computed as:

$$
\begin{aligned}
P_{\text {memory }} & =604 \cdot 2 \cdot P_{\text {OTA }}+8 \cdot 4 \cdot P_{\text {buffer }} \\
& =\left(604 \cdot 2 \cdot I_{B_{\_} \text {OTA }}+8 \cdot 4 \cdot I_{\text {B_BUFFER }}\right) V_{D D} \\
& =(604 \cdot 2 \cdot 10.5 \mu \mathrm{~A}+8 \cdot 4 \cdot 260 \mu \mathrm{~A}) 1.8 \mathrm{~V} \\
& \simeq 38 \mathrm{~mW}
\end{aligned}
$$

This result slightly exceeds the power consumption specification given in Sec.1.2.1, leading to an overall decoder power consumption of $\simeq 70 \mathrm{~mW}$. According to (1.7), this translates into an energy per decoded bit of 2.1 nJ , which is very close to the original target of 2nJ/bit.

### 1.4 Conclusions

The input interface for an hybrid analog decoder has been designed in the UMC 0.18$\mu \mathrm{m}$ CMOS process. Simulations results show how the interface performance meet the required specifications, both in terms of precision and speed. The overall circuitry area occupation is $\simeq 36 \mathrm{~mm}^{2}$ with a power consumption of $\simeq 40 \mathrm{~mW}$.

The interface is part of a fully analog iterative decoder for a serially concatenated convolutional code, reconfigurable in both block length and code rate. The decoder exhibits an efficiency of $2.1 \mathrm{~nJ} /$ bit which outperforms digital decoders with the same block length, that is around 5000, of a factor up to 50 [25]. The chip, whose layout is shown in Fig.1.30, has been fabricated in UMC $0.18-\mu \mathrm{m}$ CMOS process and is now under test.

The solution of an hybrid decoder implemented with an analog core and memory seems a promising strategy to tackle the limits of traditional analog implementations. In fact, this structure shows the advantages of the analog approach without its drawbacks, which are mainly due to the linear dependence between code block length and decoder dimensions.


Figure 1.30: Hybrid analog decoder layout

## $\sim$

## Analog Decoding for Data Storage Applications

The demand of high density, high throughput solid-state nonvolatile memories has been constantly increasing in the past decade, due to tendency to convert and store images, videos and music in digital format. The core of all nonvolatile semiconductors memory devices is a matrix of single memory cells, which maintain their state even without supply voltage. In order to reach higher memory density, continual efforts to reduce the single cell area have been made, scaling the process lithography. In addition to shrinkage of the feature size, the memory density can be increased by storing more information bits within a single cell. However this rises new reliability issues, to couple with multilevel cells memories resort to use on-chip error correction code (ECC).

In this chapter, after a brief overview of flash memories technology, the advantages and drawbacks of commonly used ECC, such as linear block codes, are analyzed. Thus a new ECC scheme for multilevel flash memories, based on trellis coded modulation strategy, is proposed.

### 2.1 Flash Memories

The core of a flash memory consists of an array of memory cells placed on a word-line/bitline grid. Although in the past different types of flash architectures have been proposed, today two of them can be considered as a standard: the common ground NOR flash that, due to its fast random read access time, is attractive for applications such as program-code storage and the NAND flash, optimized for high density data storage.


Figure 2.1: Flash cell cross section

### 2.1.1 Floating Gate Transistor

The flash cell of both memory types is basically the floating-gate MOS transistor shown in Fig.2.1, that is a transistor with a gate completely surrounded by dielectrics, the floating gate, and electrically governed by a capacitive coupled control gate [26]. Applying a high voltage between the source and the gate-drain terminals of the floating-gate MOS, causes electrons to be injected in the floating gate which, being electrically isolated, acts as a storing electrode for the device.

The charge injected onto the floating gate effectively shifts the I-V curves of the transistor, as shown in Fig.2.2, thus allowing modulation of the apparent threshold voltage $V_{T}$ seen from the control gate. Usually the neutral (or positively charged) state is associated with the logic state " 1 " while the negatively charged state, corresponding to electrons stored in the floating gate, is associated with the logical " 0 ".

### 2.1.2 NOR and NAND Flash

NOR and NAND flash memories [27] use the same basic cell memory described in Sec.2.1.1 but differ in the way the cells are arranged in an array, leading to different characteristics both in terms of memory density and flexibility.

In the NOR architecture, the cells are arranged in a matrix through rows and columns in a NOR-like structure, as shown in Fig.2.3. Flash cells sharing the same gate constitute a wordline, while those sharing the same drain contact constitute the bitline. In this array organization, every cell contains also a source contact. All the cells sources are connected to a common source electrode, which is usually connected to the ground.

The data stored in a NOR cell can be determined by measuring the threshold voltage of the floating gate MOS transistor. The best and fastest way to do it is by reading the


Figure 2.2: I-V curves of a floating-gate MOS without (curve A) and with (curve B) electrons stored in the floating gate


Figure 2.3: NOR flash array


Figure 2.4: NAND flash array
current driven by the selected cell at a fixed gate voltage and then comparing it with the current of a reference cell.

In order to write or to erase a flash cell, that is to force electrons above or across the dielectrics energy barrier so as to inject them onto the floating gate or to remove them from it, two physical mechanisms are used, which exploit two different physical effects:

- the channel hot electron mechanism, where electrons gain enough energy to pass the oxide-silicon barrier thanks to the electric field in the transistor channel between source and drain;
- the Fowler-Nordhein electron tunneling mechanism, where a quantum-mechanical tunnel is induced across a thin tunneling oxide between the silicon surface and the floating gate by applying a strong electric field.

A NOR flash memory cell is programmed by channel hot electron injection in the floating gate at the drain side and it is erased by means of the Fowler-Nordhein electron tunneling through the tunnel oxide from the floating gate to the silicon surface.

Using the NOR architecture, cells can be accessed directly, thus leading to fast random read access time. At the same time, the programming times are slow due to the need for precise control of the thresholds.

These properties make this style of flash memory attractive for applications such as program-code storage. Other applications, such as video or audio file storage, do not need fast random access, but are better served by large storage density, fast erasure and programming, and fast serial access.

These requirements are more readily provided by the NAND architecture, where the basic module consists of 16 or 32 floating-gate transistors connected in series, as shown in Fig.2.4. This chain is connected to the bit line and to the source line by means of two select transistors. By eliminating all contacts between word lines, the resulting cell size is approximately $40 \%$ smaller than the NOR cell.

To read a NAND cell, all the other memory cells connected in series with the selected one have to be activated by applying a gate voltage higher than the maximum programmable threshold voltage. The word line of the selected cell is biased at a fixed voltage, so as to conduct only if in the neutral or logic " 1 " equivalent state.

The programming and erasing of NAND flash are both performed using the FowlerNordhein electron tunneling mechanism, which reduces the current requirements compared to the channel hot electron one, thus allowing for the programming of many modules in parallel while keeping power consumption under control.

### 2.1.3 Multilevel Flash Cell

The most efficient way to scale the actual cell size for any given technology is offered by the multilevel concept [28-32].

The multilevel idea is based on the ability to precisely control the amount of charge injected in the floating gate, so that the threshold voltage of each cell can be programmed to any of $q=2^{m}$ levels, with $m>1$, each corresponding to a different logical state.

A single cell programmed using the multilevel approach is thus capable of storing $m$ bits, reducing considerably the cost-per-bit.

The threshold distributions of a 1-bit and a 2-bit memory cell are shown in Fig.2.5 [33]. With the multilevel approach, all the programmed threshold voltage levels must be


Figure 2.5: Conceptual representation of bilevel and multilevel threshold voltage distributions
allocated within a predetermined voltage window. This window is enlarged with respect to the case of conventional bilevel memories to provide more room for the stored levels. However, in practice, this increase can not be very high, so as to limit charge transfer through the gate oxide during program/erase operation and prevent excessive voltages across the oxide during storage time.

This leads to a reduced spacing between adjacent programmed voltage levels, which makes the reliability of multilevel memories more and more critical as the number of bits/cell increases.

### 2.1.4 Reliability Issues

Reliability issues are particularly critical for flash memories as data retention must be guaranteed after ten years of storage at room temperature and at least $10^{5}$ read, program
and erase cycles.
Several failure mechanisms can affect flash memory reliability, even if they can be traced back to three main effects: read disturb, program distrub and data retention degradation [34].

Read Disturb Read disturb affects the cell under reading and a number of unselected cells, that is the cells placed in the same wordline of the cell being read in a NOR flash memory and the cells belonging to the same wordline and the ones connected in series with the selected one in a NAND-type memory.

Due to the high voltages applied to the gate of these unselected cells, a so called softprogramming mechanism can take place, giving rise to an undesired threshold voltage shift, especially for cells programmed to low level.

Read disturb becomes more critical after program/erase cycling, due to the oxide degradation caused by the high electric fields applied during write and erase operations.

A secondary failure mechanism that takes place in NAND-type flash during read operation is related to the dependency of the reading current on the so called Background Pattern. The current driven by the cell being read can vary considerably with the threshold voltage levels programmed in the cells connected in series with the selected one.

Program Disturb Program disturb leads to an undesired threshold voltage shift due to soft programming in the unselected cells that experience the high voltages applied to write the memory cells. In order not to seriously affect the memory reliability, the programming voltages must be carefully chosen.

Program disturb in NAND-type arrays are also due to the capacitive coupling of adjacent cells floating gates. Several different programming sequences and algorithms have been studied to limit this effect, which results particularly detrimental in multilevel memories.

Data Retention Data retention degradation, due to leakage of electrons from the floating gate through the surrounding oxide, depends on the amount of charge stored and hence on the threshold voltage shift of programmed levels. It worsens in the presence of oxides degraded due to program/erase cycling.

All the reliability issues described above become more critical in the case of multilevel flash memories as compared to the bilevel ones, due to the reduced spacing between
adjacent programmed levels and to the higher threshold shift of the highest programmed states.

In particular, the enhanced cell threshold voltage range worsens the extrinsic charge loss, because this phenomena strongly depends on tunnel oxide retention electric field. Moreover, program disturbs are made more severe by the longer programming time necessary for multilevel programming.

### 2.2 Error Correcting Codes for Multilevel Flash Memories

As already pointed out in Sec.2.1.4, in multilevel flash memories issues such as disturbs and data retention become more and more critical due to the reduced space between adjacent programmed threshold voltage levels. As a consequence, multilevel memories are increasing relaying on error correction code techniques to ensure adequate reliability, in particular in all those applications where a large number of program/erase cycles are required. This is especially true for memories capable of storing more than two bits per cells, such as the 16-level NAND flash presented in [35].

As far as the error correction capability requirement has been moderate, that is for 1 or 2 bit/cell memory, linear block codes, such as Hamming or BCH codes, have been the ideal choice as they combine good performance and relative design simplicity [36]. Nonetheless, their complexity is deemed to increase significantly with the number of bits/cell, since a larger correction capability will be required in order to keep memory reliability the same.

The BCH decoder for 2 bit/cell NAND Flash presented in [36] exhibits a correction capability up to 5 errors with a letency time increasing from the $60 \mu$ s required to detect a single error to the $250 \mu$ s needed to correct 5 errors. Once the error condition has been detected by syndromes calculation, the Berlekamp-Massey algorithm [37,38] is used to compute the errors locator polynomial. Then, the error positions are found by a Chien machine, exploiting polynomial roots search in $G F\left(2^{15}\right)$. Since this operation is the most time-consuming, parallelism should be exploited. However, the complexity and area overhead of Chien finite-state machine grow dramatically as the parallelism increases. Since single error is more likely to happen than multiple errors in 2 bit/cells, two different Chien cores are implemented: a simplified Chien core finds single-error position while another one manages 2 to 5 errors cases.

However, it is worth to notice that the correction of a single-bit error, which is generally sufficient for bilevel flash, is not satisfactory for multilevel memories, where an error can in principle affect all the bits stored in a faulty cell. Codes that process all the bits belonging to the same cell as a symbol and are therefore capable of detecting and correcting symbol errors rather than bit errors could be more appropriate, even though at the cost of larger area of the decoding circuitry and increased access time penalty.

The key requirements of an ECC for a flash memory are a reduced area overhead, including that coming from the storage of parity information, minimum impact on access time and data transfer rate and a limited power consumption increase due to the coding and decoding circuitry. Moreover, any error corresponding to the failure of a single cell, which involves up to $m$ bits in multilevel memories, must be corrected.

The last requirement can be fulfilled in a multilevel flash by using nonbinary codes, which are based on arbitrary finite alphabets with more than two symbols [39]. The same way as the content of a bilevel cell is associated to a binary digit, the content of a $q$-level cell, with $q>2$, can be associated to a $q$-ary symbol. In such a way, a single-cell error corresponds to a single-symbol error that can be handled easily using a $q$-ary code. Many of the error correcting schemes used for bilevel memories can be fitted to multilevel ones by replacing binary codes with nonbinary codes.

## 2.3 q-ary Hamming Codes

As a simple case study, we consider a $q$-ary Hamming Code, with $q>2$. The concepts described in Appendix Sec.A. 2.2 for binary Hamming code, that is with $q=2$, can be easily extended to the case of $q$-ary Hamming codes. In particular, for any integer $r \geq$ 2, $q$-ary Hamming codes have block length $n=\left(q^{r}-1\right) /(q-1)$ and data length $k=$ $\left(q^{r}-1\right) /(q-1)-r$. The Hamming bound is given by (A.5), where $t$ indicates the error correction capability in terms of symbols.

The parity check matrix $\mathbf{H}$ over the Galois field $G F\left(2^{m}\right)$ can be constructed by choosing, as columns, all the nonzero $r$-uples of elements from $G F\left(2^{m}\right)$ in such a way that all the columns of $\mathbf{H}$ are linearly independent from one another. In this way we can construct a single-symbol ECC.

To implement this kind of code for nonbinary symbols there are two possible approaches. In the first approach, sum and multiplication operations over $G F\left(2^{m}\right)$ are implemented, so that encoder e decoder circuits are directly obtained from the nonbinary
parity check matrix $\mathbf{H}$. In the second, the nonbinary parity check matrix $\mathbf{H}$ is transformed into a binary form and standard binary operations are then implemented.

### 2.3.1 Analog Decoding

Linear block codes can be efficiently decoded in the analog domain by means of the Gallager algorithm [40], proposed by Galleger in 1962 to decode binary Low Density Parity Check (LDPC) codes.

The implementation of a CMOS analog decoder for binary Hamming code has already been demonstrate in [41], with encouraging results with respect to the digital counterpart, both in terms of area occupation and power consumption.

The basic building blocks of such a decoder are the soft-gates described in Appendix Sec.A.4.2. By choosing a different alphabet rather than the binary one $\{0,1\}$ for $X, Y$ and $Z$, we can easily realize in the analog domain sum and multiplication operations over $G F\left(2^{m}\right)$, from which the implementation of analog decoders for nonbinary codes is straightforward.

As an example, let's consider the shortened Hamming code $(36,32)$ over $G F(4)$ presented in [42], whose parity-check matrix $\mathbf{H}$ is given by:

$$
\begin{equation*}
\mathbf{H}=\left[\mathbf{P}^{T} \mid \mathbf{I}\right] \tag{2.1}
\end{equation*}
$$

where the nonsystematic part $\mathbf{P}^{T}$ is the 4 by 32 matrix:

$$
\mathbf{P}_{(36,32)}^{T}=\left(\begin{array}{l}
00011100000001111111110000111111  \tag{2.2}\\
01100100111110000011231111000011 \\
10101011001230012301001123112300 \\
11010023231002310010002311231123
\end{array}\right)
$$

The correspondence between binary and $G F(4)$ notation is given in Table.2.1.
The analog decoder complexity for such a code can be easily estimated from matrix $\mathbf{P}^{T}$ as each row of $\mathbf{P}^{T}$ corresponds to a check node, implemented in the analog domain by a soft-XOR, while the number of the decoder equal-gates is given by the information data symbols, each one represented by a matrix $\mathbf{P}^{T}$ column. The soft-XOR and equal-gate inputs are given by all non zeros elements of the corresponding matrix row or column. As the Hamming code $(36,32)$ is defined over $G F(4)$, the soft-XOR and equal-gate operations, which implement the sum and multiplication operations module 4 described in Table 2.2, have also to be carried out over $G F$ (4).

| GF $(4) \leftrightarrow$ binary notation |
| :---: |
| $0 \leftrightarrow 00$ |
| $1 \leftrightarrow 01$ |
| $2 \leftrightarrow 10$ |
| $3 \leftrightarrow 11$ |

Table 2.1: $G F(4) \leftrightarrow$ binary notation

| + | 0 | 1 | 2 | 3 |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 1 | 2 | 3 |
| 1 | 1 | 0 | 3 | 2 |
| 2 | 2 | 3 | 0 | 1 |
| 3 | 3 | 2 | 1 | 0 |


| $\times$ | 0 | 1 | 2 | 3 |
| :---: | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 2 | 3 |
| 2 | 0 | 2 | 3 | 1 |
| 3 | 0 | 3 | 1 | 2 |

Table 2.2: Sum and multiplication over $G F(4)$

The equation (A.56) for soft-XOR gates over $G F(4)$ can be written as:

$$
\left[\begin{array}{c}
p_{z}(0)  \tag{2.3}\\
p_{z}(1) \\
p_{z}(2) \\
p_{z}(3)
\end{array}\right]=\left[\begin{array}{l}
p_{x}(0) p_{y}(0)+p_{x}(1) p_{y}(1)+p_{x}(2) p_{y}(2)+p_{x}(3) p_{y}(3) \\
p_{x}(0) p_{y}(1)+p_{x}(1) p_{y}(0)+p_{x}(2) p_{y}(3)+p_{x}(3) p_{y}(2) \\
p_{x}(0) p_{y}(2)+p_{x}(1) p_{y}(3)+p_{x}(2) p_{y}(0)+p_{x}(3) p_{y}(1) \\
p_{x}(0) p_{y}(3)+p_{x}(1) p_{y}(2)+p_{x}(2) p_{y}(1)+p_{x}(3) p_{y}(0)
\end{array}\right]
$$

while for $G F(4)$ equal-gates equations (A.57) becomes:

$$
\left[\begin{array}{c}
p_{z}(0)  \tag{2.4}\\
p_{z}(1) \\
p_{z}(2) \\
p_{z}(3)
\end{array}\right]=\left[\begin{array}{c}
p_{x}(0) p_{y}(0) \\
p_{x}(1) p_{y}(1) \\
p_{x}(2) p_{y}(2) \\
p_{x}(3) p_{y}(3)
\end{array}\right]
$$

The relative circuit implementations are shown in Fig.2.6 and Fig.2.7 respectively.
The analog decoder core consists of two 18-inputs soft-XOR for the first two rows of the matrix $\mathbf{P}^{T}$, one 20-inputs soft-XOR for the third row and one 22-inputs soft-XOR for the fourth row. As an $N$-inputs soft-gate can be translated into $N-1$ 2-inputs soft-gates and each check node must be replied as many times as its input variables number, the decoder will consist of

$$
2 \cdot 18 \cdot(18-1)+20 \cdot(20-1)+22 \cdot(22-1)=1454
$$



Figure 2.6: 4-ary soft-XOR circuit implementation

2-inputs soft-XOR.
In the same way, the equal-gates number can be estimated by considering that each column of matrix $\mathbf{P}^{T}$ corresponds to an equal-gate whose inputs number is given by the column non zero elements. This leads to 183 -inputs and 144 -inputs equal-gates, thus giving a total amount of

$$
18 \cdot(3-1)+14 \cdot(4-1)=78
$$

2-inputs equal-gates. The decoding circuitry requires a total count of 1532 2-inputs softgates, which is slighter greater than the equivalent gates count given for a digital decoder in [42].

The soft decoder performance have been estimated by means of a C++ behavioral model of the decoder. In the model, the sum-product operations of the MAP decoder are ideal, with probabilities represented by double precision numbers and no source of distortion, offset or noise is taken into account. In Fig.2.8 the analog decoder performance after a finite number of iterations $N_{t}=6$, that is the diamond curve, are compared with


Figure 2.7: 4-ary equal gate circuit implementation
those of a hard decoding algorithm, represented by the square curve. The soft decoding algorithm shows a code gain of 4 dB with respect to the uncoded channel (circle curve) at $\mathrm{BER}=10^{-3}$, which reduces to 2 dB for the hard one. Thus the soft decoding algorithm, which is suitable to be implemented in the analog domain, offers a good advantage in terms of performance with respect to the digital or hard decision one, which can justify the slighter gates count increase required for the analog decoding circuitry implementation.

### 2.4 Convolutional Punctured Codes

As all the reliability issues described in Sec.2.1.4 become more serious for multilevel memories due to the reduced spacing between adjacent threshold voltage levels, codes with a higher correcting capability as the $q$-ary Hamming presented in Sec. 2.3 may be required to ensure memory reliability.

Convolutional codes described in Appendix Sec.A.2.3 show large free distance, which translates into good error correcting capability, and, due to their trellis structure, can be


Figure 2.8: BER curves for $\operatorname{Hamming}(36,32)$ code over $G F(4)$
naturally decoded in the analog domain, with advantages with respect to digital decoder implementation both in terms of area occupation and power consumption [8, 9, 43].

As one key requirement for multilevel memories ECC is a limited area overhead due both to parity bits and encoder/decoder circuitry, high-rate codes have in any case to be preferred. A way to obtain a high-rate convolutional code starting from a low-rate one is by "puncturing", as already described in Sec.1.1.1.

In general, a punctured rate $b / n$ code can be constructed from a low-rate $1 / n$ code, which is described by $n$ generator polynomials $G_{i}, i+1,2, \cdots, n$. The complexity of decoding such a code is reduced to that of decoding the $1 / n$ code. If we further impose a condition on the $n$ generator polynomials so that only two of them at most differ, then the punctured code may also be regarded as having been generated from a rate $1 / 2$ code and this can further reduce the complexity of decoding to that of decoding the corresponding rate $1 / 2$ code [44-46].

### 2.4.1 Analog Decoder Implementation

The performance of high-rate punctured convolutional codes depend on the original lowrate code and on the perforation patter. In order to evaluate if they are suitable to be used


Figure 2.9: Convolutional code trellis section
as ECC for multilevel memories, we consider the rate 15/16 8-state tail-baiting trellis code whose generator polynomials are given by:

$$
\begin{equation*}
G(D)=\left[D^{3}+D+1, D^{3}+D^{2}+D+1\right] \tag{2.5}
\end{equation*}
$$

with punturation matrix [47]:

$$
P=\left(\begin{array}{lllllllllllllll}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0  \tag{2.6}\\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 1
\end{array}\right)
$$

The corresponding trellis without puncturing is shown in Fig.2.9.
Thus $k$ user bits generate $2 k$ coded bits that are then punctured so as to obtain a $n$-bit codeword, where $n=16 / 15 k$. If we use the convolutional code as ECC for multilevel flash memory cells with $q=4$ levels, that is with 2-bit/cell, the $n$ coded bits are grouped 2 by 2 and saved into $n / 2$ memory cells.

If we analyze the structure of the perforation matrix $P$, we can see as the bits corresponding to the two first columns are written in the same memory cell. This is equivalent to say that the threshold voltage level programmed in the cell corresponds to the branch metric of the trellis section obtained by merging two consecutive trellis sections, as shown in Fig.2.10. Otherwise, if a column contains no zero, the information saved in the corresponding memory cell is relative to the branch metric of a single trellis section.


Figure 2.10: Merge of two consecutive trellis sections according to the $\mathrm{B}_{\mathrm{a}}$ scheme

The perforation matrix can thus be written as:

$$
P^{\prime}=\left(\begin{array}{ll|ll|ll|ll|l|ll|ll|ll}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0  \tag{2.7}\\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 1
\end{array}\right)
$$

where column 9 and 10 have been swept. The branch metric of each section is given by the threshold voltage level of the corresponding memory cell.

Convolutional codes can be efficiently decoded by means of the BCJR algorithm described in Sec.A.3.5. The analog implementation of this decoding algorithm is based on Sum-Product modules, which implement the (A.55) equation. A block diagram of the analog decoder implementation is shown in Fig.2.11. In particular, the B and C chains that implement the forward and backward recursion of the BCJR algorithm will consist of 5 different B and C cells types, corresponding to the 4 possible merging schemes of two consecutive trellis sections due to puncturization plus the case without puncturing. The same applies to blocks D , which compute the probability of each user alphabet symbols. In particular, cell $\mathrm{Ba}_{\mathrm{a}}$ corresponds to the perforation matrix section

$$
\left|\begin{array}{ll}
0 & 0 \\
1 & 1
\end{array}\right|
$$

which will be indicated as $|\times 1 \times 1|$. In the same way, cell $\mathrm{B}_{\mathrm{b}}$ relates to section $|\times 11 \times|$, $\mathrm{B}_{\mathrm{c}}$ to $|1 \times 1 \times|, \mathrm{B}_{\mathrm{d}}$ to $|1 \times \times 1|$ and B to $|11|$. The structure of C cells are specular to those of the corresponding B cells while D cells calculate the user bits probabilities of each


Figure 2.11: Convolutional code SISO block diagram
trellis section combining the information given by the forward and backward recursions. As an example, the trellis of cell $\mathrm{B}_{\mathrm{a}}$ and $\mathrm{D}_{\mathrm{a}}$ are reported in Fig..2.12.

### 2.4.2 Performance Analysis

The performance of the decoding algorithm have been estimated through a $\mathrm{C}++$ behavioral model of the decoder. To consider a realistic scenario, the analog decoder has been design to work on information data fields $M=256$ bits wide, as in the 2 bit/cell NOR memory described in [48], that uses as ECC a BCH $(274,256,2)$. The code is terminated, that is the 256 user bits plus 3 termination bits are coded by means of the rate $1 / 2$ convolutional code described in Sec.2.4 and then punctured with the pattern given by the perforation matrix

$$
P_{\text {Tot }}=\left(\begin{array}{llllllllllllllllll}
P^{\prime} & 1 & P^{\prime} & P^{\prime} P^{\prime} P^{\prime} P^{\prime} & 1  \tag{2.8}\\
& 1 & & P^{\prime} & P^{\prime} P^{\prime} P^{\prime} P^{\prime} & 1 & P^{\prime} & P^{\prime} P^{\prime} P^{\prime} P^{\prime} & 1 & P^{\prime} \\
& & & 1 & & & & & & &
\end{array}\right)
$$

where $P^{\prime}$ is defined by equation (2.7). Thus the resulting code rate is $R=256 / 280$. The punctured convolutional code performance are reported in Fig.2.13, together with that of the Hamming code $(36,32)$ over $G F(4)$ presented in Sec.2.3. The trellis code performance with and without puncturing are compared in Fig.2.14.

The punctured code shows a loss of 5 dB with respect to the non-punctured version at


Figure 2.12: Trellis sections according to the merging scheme $|\times 1 \times 1|$

BER $=10^{-3}$, which makes high-rate punctured convolutional codes poor ECC for multilevel memories.

### 2.5 Trellis Coded Modulation for Multilevel Flash Memories

Neither linear block codes nor convolutional punctured codes seem the right choice when dialing with memory cells with a storage capability equal or greater than 3 bits/cell. The former, because their complexity is deemed to increase when a higher error correction capability is required, as shown by the state-of-the-art BCH decoder demonstrated in [36], where both the BCH decoder area and latency time increase with the error correction capability; the latter due to their trade off between code performance and rate.


Figure 2.13: BER of rate $15 / 16$ punctured (diamonds) convolutional code, soft-decoded (stars) and hard-decoded (crosses) Hamming $(36,32)$ over $G F(4)$


Figure 2.14: BER of rate $15 / 16$ punctured (diamonds) and non-punctured (circles) convolutional code

(a) One 4-levels memory cell as 4 AM modulation

(b) Two 4-levels memory cells as 16 QAM modulation

Figure 2.15: Analogy between memory cell threshold voltage distributions and signal constellations

### 2.5.1 Multilevel Flash Memories as Signal Constellations

Trellis Coded Modulation (TCM) is a well-known technique for improving band-limited communication systems combining trellis codes (i.e. convolutional codes) and modulation, as described in Appendix Sec.A.2.5.

The concept of signal modulation can be easily extended to a flash memory by considering the analogy between signal constellations and threshold voltage levels (or charge distributions) positions within one or more memory cells as shown in Fig.2.15. In fact, in the case of memory cells, the symbol space is a one- or multi-dimensional discrete grid of (approximately) equally spaced voltage levels.

The noise deteriorating the transmitted signal is due to Gaussian white noise and intersymbol interference in the case of transmission channels, while it can be attributed to process variations and all the disturb effects described in Sec.2.1.4 in the case of flash memory cells. Nevertheless, in both cases the noise effect can be described by means of a triangular or Gaussian distribution.

As a result, a multilevel flash memory cell can be modeled as an Amplitude Shift Keying (ASK) modulation channel plus a white additive Gaussian noise (AWGN), thus allowing the use of TCM to either increase memory reliability or to enable higher effective storage capacity [49].

### 2.5.2 TCM for Multilevel Flash Memories

The effectiveness of TCM-based solutions for multilevel flash memories in terms of error correcting performance, coding redundancy, silicon cost and operation latency, has been successfully demonstrated in [50].

As TCM requires soft decoding algorithms, analog decoders could offer advantages with respect to digital implementations, both in terms of area occupation and power consumption, as already proved for Trellis and Turbo codes analog decoders [8, 9, 13, 14].

In order to study the feasibility and complexity of the analog approach for a TCM decoder, the case where the effective capacity of a given ML flash memory is to be increased from $m$ to $m+1$ information bits/cell is considered. To maintain the same level of reliability despite the decrease of spacing between adjacent voltage levels, the use a relatively low-rate TCM code is here proposed, based on the following strategy: the number of bits stored in a cell is increased from $m$ to $m+2$, with the additional bit (w.r.t. $m+1$ ) used as a parity bit. Thus a rate $(m+1) /(m+2)$ TCM code is implemented as, for realistic values of $m$, is powerful and relatively simple to decode with respect to higher rate block codes.

More specifically, as flash memory cells with up to 16 levels are by now the present technology [35], we consider a case where $m=2$, thus aiming at an effective memory capacity of 3 information bits/cell, achieved through the use of 4-bit, that is 16 -level, multilevel cells protected by a rate $3 / 4 \mathrm{TCM}$ code. To model the behavior of the multilevel memory cell, the following simplifying (but not unrealistic) assumptions are made:
(i) the $q$ threshold voltage values that can be programmed in a memory cell are equally spaced and bounded between fixed minimum and maximum voltage $V_{T M I N}$ and $V_{T M A X}$;


Figure 2.16: TCM encoder block diagram
(ii) the readout threshold voltage has a Gaussian distribution with standard deviation $\sigma_{n}$ centered around the nominal programmed value.

To improve memory reliability by means of a TCM code, there are different possibilities, depending on the constituent trellis code [50,51]. The use of a rate $1 / 2$ convolutional code allows to keep the decoder complexity low. Thus, depending on the modulation scheme chosen, three main different scenarios can be considered:
(a) one-dimensional 16-ASK modulation;
(b) bi-dimensional 12-ASK modulation;
(c) bi-dimensional 16-ASK modulation.

Case (a) and (c) are designed for a multilevel memory cell with $q=16$ levels and distance $\Delta$ between two adjacent threshold voltages, while case (b) refers to a multilevel memory cell with $q=12$ levels and thus with distance between two adjacent threshold voltages $\Delta^{\prime}=4 / 3 \Delta$. It can be proved [51-53] that the minimum distance between two codewords that differ only in the uncoded bits decreases from $4 \cdot \Delta$ in case (a) to $8 / 3 \cdot \sqrt{2} \cdot \Delta$ in case (b) down to $2 \cdot \sqrt{2} \cdot \Delta$ in case (c).


Figure 2.17: Cell mapping

Solution (b) can be discarded since it shows a lower codeword distance with respect to (a) with the same storage density of 3 information bits/cell. As the increased storage capacity of (c) comes at the price of a reduced error correction capability and also of a more complex decoder architecture with respect to (a), the more promising solution seems the latter one, which refers to a rate $3 / 4 \mathrm{TCM}$ with one-dimensional 16 -ASK modulation designed for a 4 bits/cell multilevel memory.

The encoding process can be better understood with the help of Fig.2.16, where a block diagram of the full TCM encoder is shown.

An $M$-bit information data field is divided into $n$-bit wide sub-fields ( $n=3$ in Fig.2.16); $k$ bits of the sub-field ( $k=1$ in Fig.2.16) are fed to a convolutional encoder, that generates a $k+1$-bit codeword. The $n-k$ uncoded bits and the $k+1$-bit codeword are fed to a modulator that maps them on a threshold voltage level that is then programmed in a multilevel cell.

In Fig.2.16, bits $a_{1, k}, a_{2, k}, a_{3, k}$ indicate the $n=3$-bit uncoded information bits, while bits $b_{1, k}, b_{2, k}, b_{3, k}, b_{4, k}$ represent the $(n+1)=4$-bit codeword. The codewords mapping into memory cell threshold voltage levels, together with the multilevel cells threshold voltage distributions, is shown in Fig.2.17.

The 16 threshold voltages are partitioned into 4 subsets $S_{0}, S_{1}, S_{2}, S_{3}$, each consisting


Figure 2.18: 4-state TCM trellis state diagram
of 4 threshold voltages with minimum intra-set distance $\delta=4 \cdot \Delta$. The two convolutional encoders output bits $b_{3, k}$ and $b_{4, k}$ select one subset out of 4 , while the two uncoded bits $b_{1, k}$ and $b_{2, k}$ select one of the 4 voltages within the chosen subset.

In order to analyze in depth the design trade-offs peculiar to the analog approach, the performance and limitations of two different fully analog TCM decoders, a 4-state and an 8 -state one, are compared. In both cases, the constituent convolutional code has been chosen in accordance with [51] so as to maximize the asymptotic coding gain.

The generator polynomials for the 4 -state trellis code are given by:

$$
\begin{equation*}
G(D)=\left[D^{2}+1, D\right] \tag{2.9}
\end{equation*}
$$

Its corresponding trellis state diagram is shown in Fig.2.18. It is worth to notice how each branch of the trellis diagram is actually constituted by 4 parallel branches which differ for the uncoded bits $b_{1, k}$ and $b_{2, k}$.

The generator polynomials for the 8 -state code, whose trellis state diagram is shown in Fig.2.19, are given by:

$$
\begin{equation*}
G(D)=\left[D^{3}+D+1, D^{2}\right] \tag{2.10}
\end{equation*}
$$

This choice leads to an asymptotic coding gain of 3.5 dB for the 4 -state code, which becomes 3.9 dB for the 8 -state one.

### 2.5.3 Analog Decoding Algorithm

If implemented in the digital domain, the core blocks of a TCM decoder are a demodulator, which computes the branch metrics for the Viterbi decoder based on the loglikelyhood of the threshold voltage levels read from the memory array, and a soft-Viterbi decoder matched to the convolutional code.


Figure 2.19: 8 -state TCM trellis state diagram

In the analog approach, the Viterbi decoder is replaced by a soft-input soft-output (SISO) module implementing the BCJR decoding algorithm with a current mode circuit [6]. In addition, also the demodulator is realized by means of a fully analog circuit.

A simplified block diagram of the decoder core is reported in Fig.2.20. The BCJR decoding algorithm implemented by the decoder core computes, for each cell (i.e., for each trellis section), the most probable subset $S_{i, k}$ of threshold voltage levels, while the demodulator finds the most probable level within each subset $S_{i, k}$.

The SISO is designed following the Sum-Product approach described in [6] and already successfully used in several analog decoder prototypes, such as [14]. In particular, $B$ and $C$ chains implement the forward and backward recursion of the BCJR algorithm; blocks D compute the probability of the trellis coded input bit $a_{3, k}$; blocks E compute the probability of each subset $S_{i, k}$; blocks F1 and F2 realize the demodulation computing the probability of bits $a_{1, k}$ and $a_{2, k}$.

The decoder inputs are the conditional probabilities $p\left(V_{T k}^{R} \mid S_{i, k}\right)$, that is the probabilities of the threshold voltage $V_{T k}^{R}$ read from memory cell $k$, given that the programmed level belongs to the subset $S_{i, k}$, with $i=0,1,2,3$, while the demodulator inputs are the conditional probabilities $p\left(a_{1, k} \mid S_{i, k}\right)$ and $p\left(a_{2, k} \mid S_{i, k}\right)$.
$p\left(V_{T k}^{R} \mid S_{i, k}\right)$ correspond to the branch metrics required by the MAP decoder and are


Figure 2.20: Block diagram of the analog decoder core
calculated from the threshold voltage conditional probabilities $p\left(V_{T k}^{R} \mid V_{T k}(j)\right)$. These latter ones are the probabilities of $V_{T k}^{R}$ given the $V_{T k}(j)$ programmed voltage level, with $j=0,1,2, \cdots, 15$, and are computed by the voltage to probability module described in Sec.3.2.

Recalling the mapping scheme already shown in Fig.2.17, the conditional probabilities $p\left(V_{T k}^{R} \mid S_{i, k}\right)$ are given by:

$$
\begin{aligned}
& p\left(V_{T k}^{R} \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(0)\right)+p\left(V_{T k}^{R} \mid V_{T k}(4)\right)+p\left(V_{T k}^{R} \mid V_{T k}(8)\right)+p\left(V_{T k}^{R} \mid V_{T k}(12)\right) \\
& p\left(V_{T k}^{R} \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(1)\right)+p\left(V_{T k}^{R} \mid V_{T k}(5)\right)+p\left(V_{T k}^{R} \mid V_{T k}(9)\right)+p\left(V_{T k}^{R} \mid V_{T k}(13)\right) \\
& p\left(V_{T k}^{R} \mid S_{2, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(2)\right)+p\left(V_{T k}^{R} \mid V_{T k}(6)\right)+p\left(V_{T k}^{R} \mid V_{T k}(10)\right)+p\left(V_{T k}^{R} \mid V_{T k}(14)\right) \\
& p\left(V_{T k}^{R} \mid S_{3, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(3)\right)+p\left(V_{T k}^{R} \mid V_{T k}(7)\right)+p\left(V_{T k}^{R} \mid V_{T k}(11)\right)+p\left(V_{T k}^{R} \mid V_{T k}(15)\right)
\end{aligned}
$$

where the programmed voltage level $V_{T k}(j)$ corresponds to the distribution $D j$ of the $k$-th memory cell.

In the same way, conditional probabilities $p\left(a_{1, k} \mid S_{i, k}\right)$ and $p\left(a_{2, k} \mid S_{i, k}\right)$ can be calculated by:

$$
\begin{aligned}
& p\left(a_{1, k}=0 \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(0)\right)+p\left(V_{T k}^{R} \mid V_{T k}(4)\right) \\
& p\left(a_{1, k}=0 \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(1)\right)+p\left(V_{T k}^{R} \mid V_{T k}(5)\right) \\
& p\left(a_{1, k}=0 \mid S_{2, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(2)\right)+p\left(V_{T k}^{R} \mid V_{T k}(6)\right) \\
& p\left(a_{1, k}=0 \mid S_{3, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(3)\right)+p\left(V_{T k}^{R} \mid V_{T k}(7)\right) \\
& p\left(a_{1, k}=1 \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(8)\right)+p\left(V_{T k}^{R} \mid V_{T k}(12)\right) \\
& p\left(a_{1, k}=1 \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(9)\right)+p\left(V_{T k}^{R} \mid V_{T k}(13)\right) \\
& p\left(a_{1, k}=1 \mid S_{2, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(10)\right)+p\left(V_{T k}^{R} \mid V_{T k}(14)\right) \\
& p\left(a_{1, k}=1 \mid S_{3, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(11)\right)+p\left(V_{T k}^{R} \mid V_{T k}(15)\right)
\end{aligned}
$$

and:

$$
\begin{aligned}
& p\left(a_{2, k}=0 \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(0)\right)+p\left(V_{T k}^{R} \mid V_{T k}(8)\right) \\
& p\left(a_{2, k}=0 \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(1)\right)+p\left(V_{T k}^{R} \mid V_{T k}(9)\right) \\
& p\left(a_{2, k}=0 \mid S_{2, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(2)\right)+p\left(V_{T k}^{R} \mid V_{T k}(10)\right) \\
& p\left(a_{2, k}=0 \mid S_{3, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(3)\right)+p\left(V_{T k}^{R} \mid V_{T k}(11)\right) \\
& p\left(a_{2, k}=1 \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(4)\right)+p\left(V_{T k}^{R} \mid V_{T k}(12)\right) \\
& p\left(a_{2, k}=1 \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(5)\right)+p\left(V_{T k}^{R} \mid V_{T k}(13)\right) \\
& p\left(a_{2, k}=1 \mid S_{2, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(6)\right)+p\left(V_{T k}^{R} \mid V_{T k}(14)\right) \\
& p\left(a_{2, k}=1 \mid S_{3, k}\right)=p\left(V_{T k}^{R} \mid V_{T k}(7)\right)+p\left(V_{T k}^{R} \mid V_{T k}(15)\right)
\end{aligned}
$$

$B$ and C modules The B and C modules compute the forward and backward recursion of the BCJR algorithm, taking as inputs the trellis branch metrics.

The forward recursion can be described by:

$$
\begin{aligned}
& B_{k}(0)=B_{k-1}(0) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+B_{k-1}(1) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
& B_{k}(1)=B_{k-1}(2) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+B_{k-1}(3) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right) \\
& B_{k}(2)=B_{k-1}(0) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+B_{k-1}(1) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
& B_{k}(3)=B_{k-1}(2) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+B_{k-1}(3) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)
\end{aligned}
$$



Figure 2.21: B module
for the 4-state code, and by:

$$
\begin{aligned}
& B_{k}(0)=B_{k-1}(0) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+B_{k-1}(1) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
& B_{k}(1)=B_{k-1}(2) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+B_{k-1}(3) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right) \\
& B_{k}(2)=B_{k-1}(4) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+B_{k-1}(5) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
& B_{k}(3)=B_{k-1}(6) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+B_{k-1}(7) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right) \\
& B_{k}(4)=B_{k-1}(0) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+B_{k-1}(1) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
& B_{k}(5)=B_{k-1}(2) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+B_{k-1}(3) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right) \\
& B_{k}(6)=B_{k-1}(4) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+B_{k-1}(5) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
& B_{k}(7)=B_{k-1}(6) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+B_{k-1}(7) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)
\end{aligned}
$$

for the 8 -state one. The relative trellises are shown in Fig.2.21
The backward recursion is computed according to the trellises of Fig.2.22, that are specular to the B cells ones. Thus, for the 4 -state code we have:

$$
\begin{aligned}
C_{k-1}(0) & =C_{k}(0) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+C_{k}(2) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
C_{k-1}(1) & =C_{k}(0) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+C_{k}(2) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
C_{k-1}(2) & =C_{k}(1) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+C_{k}(3) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right) \\
C_{k-1}(3) & =C_{k}(1) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+C_{k}(3) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)
\end{aligned}
$$



Figure 2.22: C module
which become:

$$
\begin{aligned}
& C_{k-1}(0)=C_{k}(0) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+C_{k}(4) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
& C_{k-1}(1)=C_{k}(0) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+C_{k}(4) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
& C_{k-1}(2)=C_{k}(1) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+C_{k}(5) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right) \\
& C_{k-1}(3)=C_{k}(1) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+C_{k}(5) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right) \\
& C_{k-1}(4)=C_{k}(2) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right)+C_{k}(6) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right) \\
& C_{k-1}(5)=C_{k}(2) \cdot p\left(V_{T k}^{R} \mid S_{0, k}\right)+C_{k}(6) \cdot p\left(V_{T k}^{R} \mid S_{2, k}\right) \\
& C_{k-1}(6)=C_{k}(3) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)+C_{k}(7) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right) \\
& C_{k-1}(7)=C_{k}(3) \cdot p\left(V_{T k}^{R} \mid S_{1, k}\right)+C_{k}(7) \cdot p\left(V_{T k}^{R} \mid S_{3, k}\right)
\end{aligned}
$$

for the 8 -state one.

D module The D modules, whose trellises are shown in Fig.2.23, compute the a posteriori probabilities for the convolutional encoder input bit $a_{3, k}$, implementing the equations:

$$
\begin{aligned}
& p\left(a_{3, k}=0\right)=B_{k}(0) \cdot C_{k}(0)+B_{k}(1) \cdot C_{k}(1) \\
& p\left(a_{3, k}=1\right)=B_{k}(2) \cdot C_{k}(2)+B_{k}(3) \cdot C_{k}(3)
\end{aligned}
$$

for the 4-state code, and:

$$
\begin{aligned}
& p\left(a_{3, k}=0\right)=B_{k}(0) \cdot C_{k}(0)+B_{k}(1) \cdot C_{k}(1)+B_{k}(2) \cdot C_{k}(2)+B_{k}(3) \cdot C_{k}(3) \\
& p\left(a_{3, k}=1\right)=B_{k}(4) \cdot C_{k}(4)+B_{k}(5) \cdot C_{k}(5)+B_{k}(6) \cdot C_{k}(6)+B_{k}(7) \cdot C_{k}(7)
\end{aligned}
$$

for the 8 -state one.


Figure 2.23: D module


Figure 2.24: E module

E module The E modules recompute the branch metric after the forward-backward recursions, according to the trellises of Fig.2.24 and the following equations:

$$
\begin{aligned}
& \phi_{k}(0)=B_{k-1}(0) \cdot C_{k}(0)+B_{k-1}(1) \cdot C_{k}(2) \\
& \phi_{k}(1)=B_{k-1}(2) \cdot C_{k}(1)+B_{k-1}(3) \cdot C_{k}(3) \\
& \phi_{k}(2)=B_{k-1}(0) \cdot C_{k}(2)+B_{k-1}(1) \cdot C_{k}(0) \\
& \phi_{k}(3)=B_{k-1}(2) \cdot C_{k}(3)+B_{k-1}(3) \cdot C_{k}(1)
\end{aligned}
$$

for the 4-state case, which become:

$$
\begin{aligned}
& \phi_{k}(0)=B_{k-1}(0) \cdot C_{k}(0)+B_{k-1}(1) \cdot C_{k}(4)+B_{k-1}(4) \cdot C_{k}(6)+B_{k-1}(5) \cdot C_{k}(2) \\
& \phi_{k}(1)=B_{k-1}(2) \cdot C_{k}(1)+B_{k-1}(3) \cdot C_{k}(5)+B_{k-1}(6) \cdot C_{k}(7)+B_{k-1}(7) \cdot C_{k}(3) \\
& \phi_{k}(2)=B_{k-1}(0) \cdot C_{k}(4)+B_{k-1}(1) \cdot C_{k}(0)+B_{k-1}(4) \cdot C_{k}(2)+B_{k-1}(5) \cdot C_{k}(6) \\
& \phi_{k}(3)=B_{k-1}(2) \cdot C_{k}(5)+B_{k-1}(3) \cdot C_{k}(1)+B_{k-1}(6) \cdot C_{k}(3)+B_{k-1}(7) \cdot C_{k}(7)
\end{aligned}
$$

for the 8 -state decoder.


Figure 2.25: F1 and F2 module

F1 and F2 modules The F1 and F2 modules, which exhibit the same trellis depicted in Fig.2.25, compute the a-posteriori probabilities for the uncoded bits $a_{1, k}$ and $a_{2, k}$.

The equations for both modules are the same for both the 4 -state and the 8 -state decoder. In particular, for F1 modules we have:

$$
\begin{aligned}
p\left(a_{1, k}=0\right)= & p\left(a_{1, k}=0 \mid S_{0, k}\right) \cdot \phi_{k}(0)+p\left(a_{1, k}=0 \mid S_{1, k}\right) \cdot \phi_{k}(1)+ \\
& p\left(a_{1, k}=0 \mid S_{2, k}\right) \cdot \phi_{k}(2)+p\left(a_{1, k}=0 \mid S_{3, k}\right) \cdot \phi_{k}(3) \\
p\left(a_{1, k}=1\right)= & p\left(a_{1, k}=1 \mid S_{0, k}\right) \cdot \phi_{k}(0)+p\left(a_{1, k}=1 \mid S_{1, k}\right) \cdot \phi_{k}(1)+ \\
& p\left(a_{1, k}=1 \mid S_{2, k}\right) \cdot \phi_{k}(2)+p\left(a_{1, k}=1 \mid S_{3, k}\right) \cdot \phi_{k}(3)
\end{aligned}
$$

while the equations for F 2 modules are given by:

$$
\begin{aligned}
p\left(a_{2, k}=0\right)= & p\left(a_{2, k}=0 \mid S_{0, k}\right) \cdot \phi_{k}(0)+p\left(a_{2, k}=0 \mid S_{1, k}\right) \cdot \phi_{k}(1)+ \\
& p\left(a_{2, k}=0 \mid S_{2, k}\right) \cdot \phi_{k}(2)+p\left(a_{2, k}=0 \mid S_{3, k}\right) \cdot \phi_{k}(3) \\
p\left(a_{2, k}=1\right)= & p\left(a_{2, k}=1 \mid S_{0, k}\right) \cdot \phi_{k}(0)+p\left(a_{2, k}=1 \mid S_{1, k}\right) \cdot \phi_{k}(1)+ \\
& p\left(a_{2, k}=1 \mid S_{2, k}\right) \cdot \phi_{k}(2)+p\left(a_{2, k}=1 \mid S_{3, k}\right) \cdot \phi_{k}(3)
\end{aligned}
$$

### 2.5.4 Performance Analysis

The error correction capabilities of the analog algorithm described above have been estimated through a C++ behavioral model of the decoder. The analog decoder taken into account is the 8 -state one, design to work on information data fields $M=192$ bits wide.


Figure 2.26: BER vs SNR with analog TCM and soft Viterbi decoder

After encoding, the 256 bit codewords are stored in 64 memory cells with $q=16$ levels. To avoid long connections in the transistor-level implementation, the code is terminated. Thus the 192 bits data field is divided into 189 information bits and 3 termination bits. The C++ analog decoder model assumes the availability at its inputs of the trellis code branch metrics computed by the voltage to probability circuit. The sum-product operations of the MAP decoder are ideal, with probabilities represented by double-precision numbers. No source of distortion, offset or noise due to transistor non-idealities is taken into account.

To compare the analog approach performance with that of a digital implementation based on the Soft-Output Viterbi Algorithm (SOVA) [54], a C++ behavioral model of the digital decoder has also been developed. For a fair comparison with the ideal model of the analog TCM, quantization effects have not been taken into account, so that the input branch metrics and the internal path metrics are represented by double-precision numbers. A survivor depth $D=64$ has been chosen and the code is terminated, as for the analog one. The resulting BER vs SNR performance obtained by simulation is reported in Fig. 2.26 (circles) together with the analog TCM curve (diamonds), where the SNR is calculated as the ratio of the signal power to the threshold voltage distribution variance
$\sigma_{n}:$

$$
\begin{equation*}
S N R=\sum_{j=1}^{q} V_{T}^{2}(j) / \sigma_{n} \tag{2.11}
\end{equation*}
$$

The analog TCM outperforms its SOVA counterpart by roughly 1 dB at $\mathrm{BER}=10^{-5}$. This result is not surprising as the SOVA is a suboptimum decoding algorithm since it minimizes the whole codeword error probability instead of the single symbol error probability, as done by the MAP algorithm [55].

Based on the analog decoder prototypes reported in the literature [14], the effect of transistor non-idealities, such as mismatch, noise and distortion, will be of the order of a few tenths of a dB with respect to the performance predicted by the ideal model. Thus, the analog implementation of TCM decoders is an interesting candidate to tackle the problem of the enhanced ECC requirements of multilevel flash memories.

# CMOS Analog TCM Decoders: Design and Performance Analysis 

In this chapter, the design and optimization of the fully analog TCM decoder, whose structure has been described in the previous chapter, is presented. In particular, the design trade-offs peculiar to the analog approach are analyzed in depth by comparing the performance and limitations of the 4 -state and the 8 -state analog decoder, so as to draw some design guidelines for future works.

Both decoders have been designed in the UMC $0.18-\mu \mathrm{m}$ CMOS process, whose main features have been described in Sec.1.1.5.

### 3.1 Cells Design

### 3.1.1 Preliminary Considerations

As already pointed out in Sec.2.5.4, both TCM codes are designed to work on a data field of $M=192$ bits. After encoding, the 256 bit codewords are stored in 64 memory cells with $q=16$ levels, that is with a storage capability of 4 bits/cell. As both codes are terminated, the 192 bits wide data field is divided into 190 information bits and 2 termination bits for the 4 -state code, which become 189 and 3 respectively for the 8 -state one.

The decoder core shown in Fig. 2.20 counts 64 identical sections, each constituted by all the 6 different cell types already presented in Sec.2.5.3.

At transistor level, all cells B-F are a variation of the well-known current mode Gilbert multiplier [56]. As an example, the schematic diagram of cell B for the 8 -state trellis code is shown in Fig.3.1 with the relative trellis section. In particular, the 8 input currents $I_{y, l m n}$ and 8 output currents $I_{z, r s t}$ represent the trellis state probability distributions $p(y)$ and


Figure 3.1: B cell schematic for the 8 -state analog decoder
$p(z)$, while the 4 input currents $I_{x, i j}$ are the trellis branch metrics probability distributions $p(x)$. The output distribution $p(z)$ is evaluated, in the form of a current vector, according to (A.53). The schematics for cell C, D, E, F1 and F2 can be easily drawn from the corresponding trellis presented in Sec.2.5.3.

The structure common to all the cells consists of a number of diode-connected nMOS transistors operating under weak-inversion, which provide that the input currents $I_{x, i j}$ and $I_{y, l m n}$ are made available to the cell core, an equal number of weak inversion nMOS which compute the products between all possible input values $I_{x, i j}, I_{y, l m n}$ and pMOS current mirrors in strong inversion, used to feed the other cells the output currents $I_{z, r s t}$. Every cell also contains a nMOS current source, whose function is to re-normalize the output currents to the bias current $I_{b}$ according to:

$$
\begin{equation*}
I_{x, i j ; y, l m n}=\frac{I_{x, i j}^{\prime} \cdot I_{y, l m n}}{I_{y}}=\frac{I_{y, l m n}}{I_{y}} \cdot \frac{I_{x, i j}}{I_{x}} \cdot I_{b} \tag{3.1}
\end{equation*}
$$

with $i, j, l, m, n, r, s, t \in\{0,1\} . I_{x}$ and $I_{y}$ indicate the sum of all $I_{x, i j}$ and $I_{y, l m n}$ input currents respectively.

The output currents re-normalization is necessary because of the cell trellis, where not all the possible branches between the input and the output states are present, leading to the fact that some currents products items are discarded. Without re-normalization, when

| Cell | $N_{x}$ | $N_{y}$ | $N_{z}$ | $N_{u}$ | $N_{t}$ | $N_{c}$ | $N_{\text {tot }}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| B | 4 | 4 | 4 | 3 | 46 | 64 | 2944 |
| C | 4 | 4 | 4 | 3 | 46 | 64 | 2944 |
| D | 4 | 4 | 2 | 1 | 34 | 64 | 2176 |
| E | 4 | 4 | 4 | 2 | 42 | 64 | 2688 |
| F 1 | 8 | 4 | 2 | 1 | 58 | 64 | 3712 |
| F 2 | 8 | 4 | 2 | 1 | 58 | 64 | 3712 |

Table 3.1: 4-state decoder transistors count

| Cell | $N_{x}$ | $N_{y}$ | $N_{z}$ | $N_{u}$ | $N_{t}$ | $N_{c}$ | $N_{\text {tot }}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| B | 4 | 8 | 8 | 3 | 82 | 64 | 5248 |
| C | 4 | 8 | 8 | 3 | 82 | 64 | 5248 |
| D | 8 | 8 | 2 | 1 | 94 | 64 | 6016 |
| E | 8 | 8 | 4 | 2 | 102 | 64 | 6528 |
| F 1 | 8 | 4 | 2 | 1 | 58 | 64 | 3712 |
| F 2 | 8 | 4 | 2 | 1 | 58 | 64 | 3712 |

Table 3.2: 8-state decoder transistors count
more cells are connected in sequence as shown in Fig.2.20, the total current reduces along the chain, due to the fact that the output current $I_{z}$ of every cell is less than the total input one $I_{x}$ or $I_{y}$. This would translate into a decoder performance degradation.

As all the cells show the same structure, the transistor count for each one can be calculated by equation:

$$
\begin{equation*}
N_{t}=1+2 N_{x}+N_{y}+N_{x} N_{y}+N_{z}+1+N_{z} N_{u} \tag{3.2}
\end{equation*}
$$

where $N_{x}$ and $N_{y}$ are the length of the two input vectors, $N_{z}$ the length of the output vector, $N_{u}$ indicates the number of blocks which use the cell output currents, and $N_{t}$ and $N_{c}$ are the transistors amount for each cell and the total cells number respectively. The total transistors count for the 4-state decoder core is reported in Table 3.1, while in Table 3.2 the data for 8 -state one are shown.

As it was expected, the total transistors count almost double for the 8-state decoder with respect to the 4 -state one, as it adds up to 30.464 for the former while it is equal to 18.176 for the latter one.

### 3.1.2 Weak Inversion Devices

The sum-product cell performance, given the nMOS transistors channel length $L$, depend mainly on the block bias current density $I_{b} / W$, where $W$ is the weak inversion nMOS channel width, as extensively studied in [43,57]. Increasing $I_{b} / W$ improves transistors speed but moves the devices progressively out of the pure exponential (weak inversion) region, thus negatively affecting the decoder static accuracy; on the other hand, a faster device settling time translates into a faster transient towards the asymptotic BER of the analog decoder. Transistors sizing has also opposite effects on devices matching, that is on the decoder static accuracy, and on parasitic capacitances, that is on the decoder speed.

As a consequence, the minimum device size in the sum-product cells was set based on accuracy considerations and then the cell bias current was chosen in order to satisfy the decoding time specifications.

Even if systems following the bio-inspired design stile [58] seem to be more robust against devices mismatch with respect to the conventional analog ones, nevertheless subthreshold devices show an exponential relation between the threshold voltage error and the drain current [59-61] due to the exponential dependence of the drain current on the voltage overdrive $V_{G S}-V_{T_{h}}$. The standard deviation of the threshold voltage error can be modeled by equation (1.21), with the technology dependent parameters $A_{V_{T}}$ and $C_{o}$ values still given in Table 1.1.

The choice of the transistors length $L$ has to take into account short-channel effects, which in principle do not seem the major source of performance loss in analog decoder [62], but whose impact is deemed to increase for large decoders.

These considerations lead to set $L=0.6 \mu \mathrm{~m}$, that is roughly $3 \cdot L_{\text {min }}$, and $W=0.8 \mu \mathrm{~m}$ in order to keep the sum-product cells nMOS transistors area roughly 10 times larger than the device minimum area (i.e., $0.24 \times 0.18 \mu \mathrm{~m}^{2}$ ). The goodness of this choice will be proved by means of extensive MonteCarlo simulations, which will be discussed in section 3.1.7.

### 3.1.3 Bias Current

Once the transistors dimensions are fixed, we need to chose the bias current so as to fullfill the decoder settling time specifications and to guarantee that the nMOS will work in their exponential region.

To set a reasonable value for the decoding time, we considered the state-of-the-art


Figure 3.2: Current-voltage characteristic for a $L=0.6 \mu \mathrm{~m} W=0.8 \mu \mathrm{~m} \mathrm{nMOS}$ transistor

BCH decoder demonstrated in [36], even if it is designed to work on 2 bit/cell memory. This decoder has a correction capability up to 5 errors with a maximum latency time of $250 \mu$ s for a data field of 2048 16-bit words. As the analog decoders work on a data field of 190 or 189 information bits, a $2048 \times 16$ bit data field is decoded in 173 steps for the 4 -state TCM and in 174 steps for the 8 -state one, thus leaving a maximum decoding time of $1.4 \mu$ s for each step in order to guarantee the same latency time of [36].

The current-voltage characteristic for a nMOS transistor with $L=0.6 \mu \mathrm{~m}$ and $W=$ $0.8 \mu \mathrm{~m}$ is shown in Fig.3.2. In order to keep the devices work under true exponential region, the bias current should be in the range of $10^{-10} \mathrm{~A}$, but settling $I_{b}=10^{-10} \mathrm{~A}$ would result in a decoding latency too long for our specifications.

If the devices are working under strong inversion, this introduces a performance degradation, as described in [62], especially if some products items are discarded and the total input current is less than the normalization current $I_{b}$. It is true that even if the bias current $I_{b}$ is quite large, some transistors work in the moderate or weak inversion region if the corresponding probabilities are small. However, in these cases, the decisions are clear and the largest probability are nearly unchanged so that we can neglect the error introduced by those devices working under strong inversion. The most critical situation occurs when no probability is dominant. In this case all the transistors work in the strong inversion region if $I_{b}$ is large enough. As a result, the value of $I_{b}$ should be chosen so that to avoid all the nMOS to work under moderate or strong inversion when all the inputs currents have nearly the same value.

Following these considerations, a current in the range of $\mu \mathrm{A}$ seems a reasonable choice for the cell biasing.

As already pointed out, the bias current $I_{b}$ also determines the decoder performance in terms of decoding speed. In order to settle the optimum value for $I_{b}$, the decoding time has been evaluated for different bias currents by means of transistor-level simulations.

The inputs of the decoder are generated using a C++ program that implements the following steps:

1. generates random binary 190 - or 189 -bits wide user words;
2. adds the termination bits;
3. encodes the data;
4. maps them on 64 memory cells threshold voltages;
5. simulates a discrete Gaussian channel (AWGN);
6. calculates the conditional probabilities $p\left(V_{T}^{R} \mid S_{i}\right), p\left(a_{1} \mid S_{i}\right)$ and $p\left(a_{2} \mid S_{i}\right)$ from the threshold voltage $V_{T}^{R}$ read from each memory cell;
7. makes demodulation and detection;
8. saves the decoded words.

The BER as a function of the cell bias current $I_{b}$ has been simulated at three different SNR, that is 26,27 and 28 dB . As a point of the BER vs SNR curve is considered reliable when a minimum number of 100 erroneous bit is detected, the number of words to be simulated for each SNR has been calculated as 100/(BER • 192), where BER indicates the expected BER and the words are considered 192-bits wide because they include the termination bits. Thus, in order to obtain a reliable BER value, 100 words have been simulated at $\mathrm{SNR}=26 \mathrm{~dB}$, which become 1.000 at $\mathrm{SNR}=27 \mathrm{~dB}$ and 11.500 at $\mathrm{SNR}=28 \mathrm{~dB}$.

In the BER simulations, the decoder is feed with a current vector that represents the calculated conditional probabilities $p\left(V_{T k}^{R} \mid S_{i, k}\right), p\left(a_{1, k} \mid S_{i, k}\right)$ and $p\left(a_{2, k} \mid S_{i, k}\right)$ values and is given a maximum time of $2 \mu$ s to settle. The soft outputs are sampled at subsequent time instants, so as to evaluate the BER as a function of the decoding time. The resulting BER curves for the 8 -state analog decoder are shown in Fig.3.3 for three different biasing current $I_{b}=2,4$ and $8 \mu \mathrm{~A}$.


Figure 3.3: BER as a function of the decoding time

At low SNR, where most of the cells have inputs with almost equal probability, the decoding speed is not affected by the cell bias current value, while it improves considerably with $I_{b}$ at high SNR, when some of the probabilities can be very small. If an error has to be corrected at high SNR, the corresponding cells require a much longer time to commute, because some of the weak inversion nMOS are almost switched off. Increasing the bias current, the nMOS come out of the off region more quickly, thus speeding up the cells settling time.

With $I_{b}=8 \mu \mathrm{~A}$ we can see that, after a settling time of $1.4 \mu \mathrm{~s}$ at $\mathrm{SNR}=28 \mathrm{~dB}$, the decoder BER is equal to $2.1 \cdot 10^{-4}$, which compared with the the ideal $\mathrm{C}++$ model BER of $4.5 \cdot$ $10^{-5}$, seems a good result.

Following these considerations, we choose $I_{b}=8 \mu \mathrm{~A}$ for the 8 -state decoder.
A similar analysis, performed for the 4 -state decoder, has lead to set a bias current of $I_{b}=2 \mu \mathrm{~A}$ for the 4 -state sum-product cells.

### 3.1.4 Bias Transistor

As the cell input currents are re-normalized at the cell bias current value at each decoding step, the bias current of each cell does not need to be replied with a high precision. As a
consequence, the sizing of the nMOS bias transistor is not particularly critical.
However, it is worth to notice how its drain voltage is subjected to strong variations, as all the $M_{x, i j}$ transistors operate under weak inversion. In fact, it is this very variation of the drain voltage that makes the re-normalization possible. Recalling the low of a MOS working under weak inversion:

$$
\begin{equation*}
I_{D}=I_{D 0} e^{\beta\left(V_{G}-V_{S}-V_{T 0}\right)} \tag{3.3}
\end{equation*}
$$

where $I_{D 0}$ and $\beta$ are technology dependent constant, if we assume that all the $M_{x, i j}$ and $M_{x, i j}^{\prime}$ transistors are working in the exponential region, we can write:

$$
\begin{aligned}
\sum I_{x, i j}^{\prime} & =\sum I_{D 0} \cdot \exp \left(\alpha\left(V_{G, i j}-V_{S}^{\prime}\right)\right) \\
& =\sum I_{D 0} \cdot \exp \left(\alpha\left(V_{G, i j}-V_{S}+V_{S}-V_{S}^{\prime}\right)\right) \\
& =\sum I_{x, i j} \cdot \exp \left(\alpha \cdot \Delta V_{S}\right)
\end{aligned}
$$

with $\Delta V_{S}=V_{S}^{\prime}-V_{S}$. Thus every input current is multiplied by a factor $e^{\beta \Delta V_{S}}$, so as to make $\sum I_{x, i j}^{\prime}=I_{b}$.

As the input current sum decreases with respect to $I_{b}$, the multiplying factor increases, lowering the bias transistor drain voltage. As a consequence, the bias nMOS has to be able to work properly with $V_{D S}$ lower that the bias voltage $V_{b x}$.

### 3.1.5 pMOS Current Mirrors

The precision of the pMOS current mirrors has a direct impact on the output probabilities accuracy. Thus, they have to be sized following all the design considerations that apply to standard current mirrors. In particular, since the maximum current flowing into the mirror is equal to the cell bias current $I_{b}$, the maximum voltage across the current mirror is given by:

$$
\begin{equation*}
\left|V_{G S, \max }\right|=\left|V_{t h, p}\right|+\left|V_{e f f}\left(I_{b}\right)\right|=\left|V_{t h, p}\right|+\sqrt{\frac{2 I_{b}}{k_{p} \frac{W}{L}}} \tag{3.4}
\end{equation*}
$$

### 3.1.6 Bias Voltages

As already pointed out, the bias voltage $V_{b x}$ determines the maximum drain to source voltage $V_{D S}$ for the nMOS current source and also the minimum value given the $M_{x, i j}^{\prime}$ transistors $\Delta V_{S}$. $V_{b x}$ has thus to be chosen so as to keep the nMOS current source working under strong inversion.

| 4 -states |  |
| :---: | :---: |
| $M_{p}$ | $\frac{0.5}{1}$ |
| $M_{n}$ | $\frac{0.8}{0.6}$ |
| $M_{b}$ | $\frac{4}{0.4}$ |


| 8 -states |  |
| :---: | :---: |
| $M_{p}$ | $\frac{2}{1}$ |
| $M_{n}$ | $\frac{0.8}{0.6}$ |
| $M_{b}$ | $\frac{16}{0.4}$ |

Table 3.3: Transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$

| Cell | $m_{\varepsilon}$ | $\sigma_{\varepsilon}$ |
| :---: | :---: | :---: |
| B | $2.259 e^{-3}$ | $3.457 e^{-3}$ |
| D | $1.511 e^{-5}$ | $5.676 e^{-3}$ |
| E | $2.464 e^{-5}$ | $4.832 e^{-3}$ |
| F 1 | $1.018 e^{-4}$ | $6.073 e^{-3}$ |

Table 3.4: Error statistic for the 4-state decoder cells

In the same way, the bias voltage $V_{b y}$ determines the $V_{D S}$ range for the $M_{x, i j}^{\prime}$ transistors. As both the sum currents $I_{x}$ and $I_{y}$ are unknown, we can chose $V_{b y}$ so as to optimize a "medium" case, when the $M_{x, i j}^{\prime}$ transistors drain voltage is equal to $V_{b y}$. This leads to:

$$
\begin{equation*}
V_{b y} \simeq \frac{V_{D D}-\mid V_{G S, m_{M A X}}-V_{b x}}{2} \tag{3.5}
\end{equation*}
$$

### 3.1.7 Single Cell Characterization

Following the previous considerations, we set the bias voltages $V_{b x}=300 \mathrm{mV}$ and $V_{b y}=$ 700 mV . The transistor dimensions for both the 4 -state and 8 -state decoder, are reported in Table 3.3.

The goodness of this choice has been estimated by characterizing the output error $\varepsilon$ of each of the 6 cells types, where $\varepsilon$ is defined by:

$$
\begin{equation*}
\tilde{f}(x, y)=f(x, y) \cdot(1+\varepsilon) \tag{3.6}
\end{equation*}
$$

As the error $\varepsilon$ depends on the cell input values, for each cell 100.000 different input vectors have been simulated. The corresponding error statistic is given in Table 3.4, where $m_{\varepsilon}$ indicates the mean error and $\sigma_{\varepsilon}$ its standard deviation. As cell C exhibits the same structure as cell B, and cell F1 the same as cell F2, their error statistics have not been simulated.

The effect of mismatch and process variations on the cell error has also been estimated

| Cell | $m_{\varepsilon}$ | $\sigma_{\varepsilon}$ |
| :---: | :---: | :---: |
| B | $5.5045 e^{-5}$ | $9.691 e^{-3}$ |
| D | $2.672 e^{-5}$ | $1.021 e^{-2}$ |
| E | $1.062 e^{-5}$ | $8.823 e^{-3}$ |
| F 1 | $9.26 e^{-5}$ | $9.22 e^{-3}$ |

Table 3.5: Error statistic for the 8 -state decoder cells

| Cell | $m_{m_{\varepsilon}}$ | $\sigma_{m_{\varepsilon}}$ | $m_{\sigma_{\varepsilon}}$ | $\sigma_{\sigma_{\varepsilon}}$ | $\max \sigma_{\sigma_{\varepsilon}}$ | $\min \sigma_{\sigma_{\varepsilon}}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| B | $1.715 e^{-3}$ | $5.897 e^{-3}$ | $1.847 e^{-2}$ | $2.825 e^{-2}$ | $1.165 e^{-1}$ | $4.053 e^{-7}$ |
| D | $1.035 e^{-3}$ | $6.612 e^{-3}$ | $6.721 e^{-3}$ | $2.13 e^{-2}$ | $1.654 e^{-1}$ | $5.145 e^{-7}$ |
| E | $1.035 e^{-3}$ | $5.132 e^{-3}$ | $1.774 e^{-2}$ | $3.075 e^{-2}$ | $1.397 e^{-1}$ | $7.361 e^{-7}$ |
| F 1 | $8.18 e^{-3}$ | $5.05 e^{-3}$ | $9.803 e^{-4}$ | $8.022 e^{-3}$ | $1.193 e^{-1}$ | $1.341 e^{-6}$ |

Table 3.6: Error statistic with MonteCarlo simulations for the 4 -state decoder cells
by MonteCarlo simulations: for each cell type 100 instances have been simulated, each one with 1.000 different input vectors. The results are reported in Table 3.6.

As the mean value $m_{\sigma_{\varepsilon}}$ of all the standard deviations $\sigma_{\varepsilon}$ calculated on each circuit instance is comparable or even smaller than the standard deviation $\sigma_{\sigma_{\varepsilon}}$, this indicates that the output error depends on the particular input configurations.

This is due to the fact that some transistors work in the moderate or strong inversion region if the corresponding probabilities are large. However, in these cases, the decisions are clear and the error introduced does not affect the decoder performance. As the output error depends on the input configurations, this indicates that we can not use a second order statistic to describe the random variable $\varepsilon$.

### 3.2 Voltage to Probability Cell

The interface between the memory array and the decoder core has the task to compute the branch metrics required by the MAP decoder $p\left(V_{T k}^{R} \mid S_{i, k}\right)$ and the uncoded bits conditional probabilities $p\left(a_{1, k} \mid S_{i, k}\right)$ and $p\left(a_{2, k} \mid S_{i, k}\right)$, defined in Sec.2.5.3, from the actual threshold voltage $V_{T k}^{R}$ programmed in the addressed cell, where $k$ indicates the cell index. Recalling that the branch metrics can be written as:

$$
\begin{equation*}
p\left(V_{T}^{R} \mid S_{i}\right)=\sum_{j \in S_{i}} p\left(V_{T}^{R} \mid V_{T}(j)\right) \tag{3.7}
\end{equation*}
$$



Figure 3.4: Analog circuit for conditional probabilities generation
the voltage to probability module has to calculate the conditional probabilities $p\left(V_{T}^{R} \mid V_{T}(j)\right)$.
These probabilities are defined as:

$$
\begin{equation*}
p\left(V_{T}^{R} \mid V_{T}=V_{T}(j)\right)=\frac{1}{\sqrt{2 \pi} \sigma_{n}} \exp \left(-\frac{\left(V_{T}^{R}-V_{T}(j)\right)^{2}}{2 \sigma_{n}^{2}}\right) \tag{3.8}
\end{equation*}
$$

where $\sigma_{n}^{2}$ is AWGN channel noise variance, $V_{T}^{R}$ is the threshold voltage read from the memory cell, $V_{T}$ is the threshold voltage programmed in the cell and $V_{T}(j), j=0 \div(q-1)$ are all the possible cell voltage levels, that correspond to the $\mathcal{A}$ alphabet symbols.

As already pointed out in Sec.2.5.3, we considered a reference memory cell with $q=16$ threshold voltage levels equally spaced between fixed minimum and maximum value $V_{T M I N}=0 \mathrm{~V}$ and $V_{T M A X}=5.6 \mathrm{~V}$, and a channel noise standard deviation $\sigma_{n}$ varying between 80 and 300 mV .

In the case of a binary alphabet like $\{-1,+1\}$, the (3.8) can be implemented by a simple differential pair, as described in Sec.1.2.2.

The same concept can be easily extended to the case of non binary alphabet, as proved by Frey in [63], by replacing the differential pair with the circuit shown in Fig.3.4.

In this case the output currents $I_{j}$ are proportional to the conditional probabilities $p\left(V_{T}^{R} \mid V_{T}(j)\right.$ according to:

$$
\begin{equation*}
p\left(V_{T}^{R} \mid V_{T}=V_{T}(j)\right)=I_{N} \frac{p\left(V_{T}^{R} \mid V_{T}=V_{T}(j)\right)}{\sum_{i \in \mathcal{A}} p\left(V_{T}^{R} \mid V_{T}=V_{T}(i)\right)} \tag{3.9}
\end{equation*}
$$

where the alphabet $\mathcal{A}=\{0 \div 15\}$.
If we assume that all the nMOS transistors are working in the exponential region, we can derive the expression of each output current $I_{j}$ as a function of the gate voltages $V_{j}$, $j \in \mathcal{A}$, as:

$$
\begin{equation*}
I_{j}=\frac{I_{N}}{\sum_{i \in \mathcal{A}} \exp \left(\frac{V_{i}-V_{j}}{n U_{T}}\right)}, j \in \mathcal{A} \tag{3.10}
\end{equation*}
$$



Figure 3.5: Conceptual schematic for approximated conditional probabilities generation
where $n$ is a technology dependent parameter. Its value for the UMC process has been estimated for nMOS and pMOS by means of transistor simulations, which gave $n_{n}=1.31$ and $n_{p}=1.57$.

It can be proved [63] that, in order to obtain the output currents $I_{j}$, the programmed threshold voltage level read from the memory cell has to be amplified by a different factor for each of the alphabet symbols so as to obtain the gate voltages:

$$
\begin{equation*}
V_{j}=V_{T}^{R} \cdot \frac{n U_{T} V_{T}(j)}{\sigma^{2}}-\frac{n U_{T}\left(V_{T}(j)\right)^{2}}{2 \sigma^{2}} \tag{3.11}
\end{equation*}
$$

As $n U_{T} \simeq 34 \mathrm{mV}$ and $\sigma=80 \div 300 \mathrm{mV}$, it is easy to see how this results in a gate voltages range which is not compatible with our technology.

### 3.2.1 Approximated Normalization

An output current $I_{j}$ proportional to the conditional probability $p\left(V_{T}^{R} \mid V_{T}(j)\right)$ can be obtained [57] with the circuit shown in Fig.3.5, which consists of 16 differential pairs whose nMOS are biased in weak inversion so that their input voltage - output current relation is an hyperbolic tangent. The differential pairs input voltages $V_{d_{j}}$ are given by:

$$
\begin{equation*}
V_{d_{j}}=\alpha \cdot\left[V_{T}^{R}-\frac{V_{T}(j)+V_{T}(j-1)}{2}\right] \tag{3.12}
\end{equation*}
$$

where the scaling factor $\alpha$ is equal to:

$$
\begin{equation*}
\alpha=\frac{n U_{T} \Delta}{\sigma_{n}^{2}} \tag{3.13}
\end{equation*}
$$

$\Delta$ indicates the distance between adjacent threshold voltage levels, which in our case is equal to 400 mV .

As an example, let's consider the output current $I_{1}^{\prime}=I_{1 / 0}-I_{2 / 1}$ which corresponds to the conditional probability $p\left(V_{T}^{R} \mid V_{T}=V_{T}(1)\right)$. According to (3.9), $I_{1}$ can be written as:

$$
\begin{equation*}
I_{1}=I_{N} \frac{p\left(V_{T}^{R} \mid V_{T}=V_{T}(1)\right)}{\sum_{i \in \mathcal{A}} p\left(V_{T}^{R} \mid V_{T}=V_{T}(i)\right)} \tag{3.14}
\end{equation*}
$$

Let's define

$$
\begin{aligned}
& p\left[V_{T}^{R} \mid V_{T}=V_{T}(0)\right]=p\left[V_{T}^{R} \mid 0\right] \\
& p\left[V_{T}^{R} \mid V_{T}=V_{T}(1)\right]=p\left[V_{T}^{R} \mid 1\right] \\
& p\left[V_{T}^{R} \mid V_{T}=V_{T}(2)\right]=p\left[V_{T}^{R} \mid 2\right]
\end{aligned}
$$

where $V_{T}(0), V_{T}(1)$ and $V_{T}(2)$ correspond to the programmed threshold voltage levels 0 V , 400 mV and 800 mV respectively. The circuit of Fig.3.5, generates the current $I_{1}^{\prime}$ which, according to (1.9), is given by:

$$
\begin{equation*}
I_{1}^{\prime}=I_{N} \frac{p\left[V_{T}^{R} \mid 1\right]}{p\left[V_{T}^{R} \mid 0\right]+p\left[V_{T}^{R} \mid 1\right]}+I_{N} \frac{p\left[V_{T}^{R} \mid 2\right]}{p\left[V_{T}^{R} \mid 1\right]+p\left[V_{T}^{R} \mid 2\right]} \tag{3.15}
\end{equation*}
$$

The (3.15) can be written as:

$$
\begin{equation*}
I_{1}^{\prime}=I_{N} \frac{p\left[V_{T}^{R} \mid 1\right]^{2}-p\left[V_{T}^{R} \mid 0\right] \cdot p\left[V_{T}^{R} \mid 2\right]}{p\left[V_{T}^{R} \mid 1\right]^{2}+p\left[V_{T}^{R} \mid 1\right] \cdot p\left[V_{T}^{R} \mid 2\right]+p\left[V_{T}^{R} \mid 0\right] \cdot p\left[V_{T}^{R} \mid 1\right]+p\left[V_{T}^{R} \mid 0\right] \cdot p\left[V_{T}^{R} \mid 2\right]} \tag{3.16}
\end{equation*}
$$

The factor $p\left[V_{T}^{R} \mid 0\right] \cdot p\left[V_{T}^{R} \mid 2\right]$ can be discarded as it is much smaller than all the other ones, as it is related to two non adjacent voltage levels. Thus (3.16) becomes:

$$
\begin{equation*}
I_{1}^{\prime}=I_{N} \frac{p\left[V_{T}^{R} \mid 1\right]}{p\left[V_{T}^{R} \mid 1\right]+p\left[V_{T}^{R} \mid 2\right]+p\left[V_{T}^{R} \mid 0\right]} \tag{3.17}
\end{equation*}
$$

which shows how the calculated conditional probability is an approximation of the ideal one. However, we will see in Sec.3.2.2 how this approximation is good enough for our purposes.

The expression for the input differential pair voltage $V_{d_{j}}$ given by equation (3.12) can be deduced by recalling that:

$$
\begin{equation*}
I_{D_{j}}=\frac{I_{N}}{1+\exp \left(\frac{-\alpha V_{d_{j}}}{n U_{T}}\right)} \tag{3.18}
\end{equation*}
$$

which, in order to satisfy equation (3.15), must be equal to:

$$
\begin{equation*}
I_{D_{j}}=\frac{I_{N}}{1+\exp \left(\frac{-\left(V_{T}^{R}-V_{T}(j-1)\right)^{2}+\left(V_{T}^{R}-V_{T}(j)\right)^{2}}{2 \sigma_{n}^{2}}\right)} \tag{3.19}
\end{equation*}
$$

By comparing (3.18) to (3.19), we obtain the expression for $V_{d_{j}}$ :

$$
\begin{equation*}
V_{d_{j}}=\frac{n U_{T}\left(V_{T}(j)-V_{T}(j-1)\right)}{\sigma_{n}^{2}}\left[V_{T}^{R}-\frac{\left(V_{T}(j)+V_{T}(j-1)\right)}{2}\right] \tag{3.20}
\end{equation*}
$$

which is equal to (3.12) if we note that $\left(V_{T}(j)-V_{T}(j-1)\right)=\Delta$.

### 3.2.2 Voltage to Probability Module Design

The voltage to probability module, which schematic is shown in Fig.3.6, receives as inputs the differential voltages $V_{d_{j}}$ calculated according to (3.12) and produces as outputs the currents $I_{j}$, which are proportional to the conditional probabilities $p\left(V_{T}^{R} \mid V_{T}(j)\right)$. In particular, the $16 I_{j}$ currents are given by:

$$
\begin{align*}
I_{0} & =I_{0 / 1} \\
I_{1} & =I_{1 / 0}-I_{2 / 1} \\
I_{2} & =I_{2 / 3}-I_{1 / 2} \\
I_{3} & =I_{3 / 2}-I_{4 / 3} \\
I_{4} & =I_{4 / 5}-I_{3 / 4} \\
I_{5} & =I_{5 / 4}-I_{6 / 5} \\
I_{6} & =I_{6 / 7}-I_{5 / 6} \\
I_{7} & =I_{7 / 6}-I_{8 / 7}  \tag{3.21}\\
I_{8} & =I_{8 / 9}-I_{7 / 8} \\
I_{9} & =I_{9 / 8}-I_{10 / 9} \\
I_{10} & =I_{10 / 11}-I_{9 / 10} \\
I_{11} & =I_{11 / 10}-I_{12 / 11} \\
I_{12} & =I_{12 / 13}-I_{11 / 12} \\
I_{13} & =I_{13 / 12}-I_{14 / 13} \\
I_{14} & =I_{14 / 15}-I_{13 / 14} \\
I_{15} & =I_{15 / 14}
\end{align*}
$$

where $I_{i / j}$ indicates the output current of the differential pair, which compares the threshold voltage levels $i$ and $j$, that is larger when the threshold voltage level read from the memory cell is equal to $V_{T}(i)$.


| 4 -states |  |
| :---: | :---: |
| $M_{p}$ | $\frac{0.5}{2}$ |
| $M_{c_{1}}$ | $\frac{2}{1}$ |
| $M_{c_{2}}$ | $\frac{2}{0.5}$ |
| $M_{n}$ | $\frac{40}{0.4}$ |
| $M_{b}$ | $\frac{2}{4}$ |

Table 3.7: Transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$

It is worth to notice how, with the scheme described above, each differential pair output current is used just once. This is important because it avoids the introduction of more current mirrors in the V2P cell, thus increasing the cell current consumption.

The design of the voltage to probability cell follows the same considerations as for the decoder core cells B-F. In particular, we first impose the same current consumption for both voltage to probability and decoder core cells, which means a bias current of $I_{N}=I_{b} / 15$ for each differential pair of the voltage to probability module. However, we saw how this choice had a negative impact on the overall decoder performance due to two different reasons:

- the voltage to probability settling time became the dominant factor in determining the decoding time, thus slowing down the overall decoder speed;
- the decoder precision was heavily affected by the the fact that the sum $I_{x}$ of the input currents $I_{x_{j}}$ was 15 times lower than the decoder cell bias current $I_{b}$.

As the settling time of cells with the same current density $I_{b} / W$ is faster for those cell which have a higher bias current $I_{b}$, we decided to set $I_{N}=I_{b}$.

As the approximate normalization relies on the hyperbolic tangent voltage to current characteristics of the differential pair nMOS, they have been sized so as to work always under weak inversion. This guarantees a more precise input probabilities to the decoder core.

In order to improve the cell accuracy, a cascode current mirror instead of a simple one has been used in all the differential pairs where no current subtraction is performed, as opposite to the nodes where the currents are subtracted, which exhibit a relatively low impedance.

The transistor size for both 4 -state and 8 -state decoder voltage to probability module are reported in Table 3.7.


Figure 3.7: Conditional probabilities calculated with the approximated normalization method and ideal at $\mathrm{SNR}=24 \mathrm{~dB}$ and $\mathrm{SNR}=30 \mathrm{~dB}$

As already stressed out in Sec.3.2.1, this method calculates an approximated normalization of the conditional probabilities, whose precision increases with SNR.

In Fig.3.7, the ideal conditional probabilities are compared with those calculated with the approximated normalization at two different input SNR of 24 dB and 30 dB respectively, while in Fig.3.8 the corresponding probability error is reported.

At $\mathrm{SNR}=24 \mathrm{~dB}$ the error introduced by the approximated normalization is less than 0.06. If we consider that at this SNR the noise standard deviation $\sigma_{n}=0.555 \cdot \Delta=222 \mathrm{mV}$, with an error equal to $2 \sigma_{n}$ on the read threshold voltage, we move to the adjacent voltage level, which from Fig.3.7 corresponds to a probability error grater than 0.6.

The maximum probability error as a function of the input SNR is shown in Fig.3.9, where we can see how it decreases at high SNR. For SNR grater than 30dB, the error remains around 0.01 . This effect is due to the fact that the differential pair input voltages $V_{d_{j}}$ are saturated to a maximum peak to peak value of 600 mV .

The effect of mismatch and process variations on the voltage to probability error has been estimated by means of MonteCarlo simulations. 100 cell instances have been simulated, each one with 1.000 different input vectors. The results are reported in Table 3.8.



Figure 3.8: Conditional probabilities error at $\mathrm{SNR}=24 \mathrm{~dB}$ and $\mathrm{SNR}=30 \mathrm{~dB}$


Figure 3.9: Conditional probability error

| $m_{m_{\varepsilon}}$ | $\sigma_{m_{\varepsilon}}$ | $m_{\sigma_{\varepsilon}}$ | $\sigma_{\sigma_{\varepsilon}}$ | $\max \sigma_{\sigma_{\varepsilon}}$ | $\min \sigma_{\sigma_{\varepsilon}}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $1.218 e^{-2}$ | $2.248 e^{-1}$ | $5.802 e^{-2}$ | $2.397 e^{-2}$ | $8.739 e^{-2}$ | $7.287 e^{-7}$ |

Table 3.8: Output currents error statistic for V2P

### 3.3 Decoder Optimization: the Reset

In order to make the decoding of a new frame independent from the elaboration results of the previous one, we decided to reset the cells to an uniform state probability before starting the decoding of a new word.

The reset has also the advantage to speed up the decoding time, as demonstrated by Fig.3.10, where the output currents transient with and without reset is shown.

We can see that if the decoding of a new frame starts from a reset state, we can avoid a lot of spurious commutations in the decoder output currents.

The uniform state probability is obtained by forcing the output currents of cells B and C to be all equal to $I_{b} / 8$. This is done by adding pass-transistors to the output current mirrors, so as to short the gate voltage of all the pMOS and thus forcing all the currents flowing through the output transistors to be equal.

At the same time, cell B, C, F1 and F2 inputs are reset to the uniform state probability, so as to generate output currents all equal to $I_{b} / 8$. Thus cell B and C inputs are set to:

$$
\begin{aligned}
& p\left(V_{T k}^{R} \mid S_{0, k}\right)=p\left(V_{T k}^{R} \mid S_{3, k}\right)=0.5 \\
& p\left(V_{T k}^{R} \mid S_{1, k}\right)=p\left(V_{T k}^{R} \mid S_{2, k}\right)=0
\end{aligned}
$$

while cell F1 and F2 inputs reset configurations are given by:

$$
\begin{aligned}
& p\left(a_{1, k}=1 \mid S_{0, k}\right)=p\left(a_{1, k}=0 \mid S_{3, k}\right)=0.5 \\
& p\left(a_{1, k}=1 \mid S_{1, k}\right)=p\left(a_{1, k}=1 \mid S_{2, k}\right)=p\left(a_{1, k}=1 \mid S_{3, k}\right)=0 \\
& p\left(a_{1, k}=0 \mid S_{0, k}\right)=p\left(a_{1, k}=0 \mid S_{1, k}\right)=p\left(a_{1, k}=0 \mid S_{2, k}\right)=0
\end{aligned}
$$

and:

$$
\begin{aligned}
& p\left(a_{2, k}=1 \mid S_{3, k}\right)=p\left(a_{2, k}=0 \mid S_{0, k}\right)=0.5 \\
& p\left(a_{2, k}=1 \mid S_{0, k}\right)=p\left(a_{2, k}=1 \mid S_{1, k}\right)=p\left(a_{1, k}=1 \mid S_{2, k}\right)=0 \\
& p\left(a_{2, k}=0 \mid S_{1, k}\right)=p\left(a_{2, k}=0 \mid S_{2, k}\right)=p\left(a_{1, k}=0 \mid S_{3, k}\right)=0
\end{aligned}
$$

respectively.

(a) Without reset

(b) With reset

Figure 3.10: Output probabilities transient with and without reset

The decoding time for the 8 -state decoder with and without reset has been simulated at $\mathrm{SNR}=27 \mathrm{~dB}$ and $\mathrm{SNR}=28 \mathrm{~dB}$, showing how the reset speeds up the decoding latency especially at high SNR. This is due to the fact that at high SNR the decisions are more clear, which means that after a frame has been decoded the transistors are either switched off or completely on. If a nMOS has to commute for the next frame from the switched off state, this requires a long time. During this time, the information coming from the other cells in the chain can be more updated and cause the cell outputs to have spurious commutations.

If the decoding of each codeword starts from an uniform state probability, these spurious commutations can be avoided, thus improving the decoding speed.


Figure 3.11: BER vs SNR for a memory cell with $q=8$ without ECC (circle), a memory cell with $q=16$ and 4 -state TCM C++ model (diamonds) and with TCM analog decoder (squares)

### 3.4 Overall Decoder Performance

The performance of both 4 -state and 8 -state decoder have been estimated by means of transistor-level simulations. In each simulation, the threshold voltage levels stored in 64 memory cells are read and applied to an AWGN channel. Then the conditional probabilities are calculated and the corresponding 4 input currents $I_{x}$ are fed to the decoders, which are given a time $T_{D}=1.3 \mu$ s to settle. The soft output of the decoders is then sliced and used to estimate the BER and the decoders are reset to uniform state probability in a time $T_{R}=100 \mathrm{~ns}$, thus giving an overall decoding time of $1.4 \mu \mathrm{~s}$.

The BER vs SNR curves for the 4 -states and 8 -states analog decoder are reported in Fig.3.11 and Fig.3.12 respectively. These curves are compared with the benchmark obtained with a C++ behavioral model of the decoders. In the model, the sum-product operations of the MAP decoder are ideal, with probabilities represented by double precision numbers and no source of distortion, offset or noise is taken into account. In both cases the simulations results show a performance loss of 0.5 dB with respect to the benchmark at a BER $=10^{-4}$, in line with the results already found for all-analog Turbo decoders [14].


Figure 3.12: BER vs SNR for a memory cell with $q=8$ without ECC (circle), a memory cell with $q=16$ and 8 -state TCM C++ model (diamonds) and with TCM analog decoder (squares)

Each point of the BER vs SNR curves is considered reliable when a minimum number of 100 erroneous bit is detected. As the number of frames to be simulated increases at lower BER, it was not possible to simulate the decoders behavior at SNR greater than 28.5 dB , due to the excessive computational load of transistor-level simulations. For both decoders, the BER at $\mathrm{SNR}=28.5 \mathrm{~dB}$ was estimated with a reduced number of 50 errors.

The effect of transistor mismatch has been estimated for the 4 -state decoder by means of transistor-level MonteCarlo simulations at $\mathrm{BER}=10^{-3}$, which corresponds to a $\mathrm{SNR}=$ 27 dB . In this case, the BER has been estimated with a reduced number of 50 errors, due to the long computational time required. The simulations results with 100 MonteCarlo iterations reported in Fig. 3.13 show for the BER a mean value $m=2.67 \cdot 10^{-3}$ with standard deviation $\sigma=1.54 \cdot 10^{-4}$, which corresponds to a loss between 0.4 and 0.6 dB with respect to the BER benchmark of $9.07 \cdot 10^{-4}$. The deviation with respect to the typical case is of 0.1 dB , in accordance with the simulation and experimental data reported in [14].

The decoder core estimated area occupation is $0.32 \mathrm{~mm}^{2}$ for the 4 -state one with a power consumption of 4 mW at 1.8 V supply, which become $0.55 \mathrm{~mm}^{2}$ and 16.5 mW re-


Figure 3.13: 4-state decoder MonteCarlo simulations at $\mathrm{SNR}=27 \mathrm{~dB}$
spectively for the 8 -state decoder.
The area occupation for the interface circuitry, that is the voltage to probability modules, has been estimated in $4.8 \mathrm{~mm}^{2}$ for the 8 -state decoder, with a power consumption of 13.8 mW , which reduce to $1.25 \mathrm{~mm}^{2}$ and 3.4 mW for the 4 -state one respectively.

### 3.5 Conclusions

The simulation results reported in this work suggest that a full analog implementation of a TCM decoder for multilevel flash memories can achieve a decoding speed comparable with the state-of-the-art linear block codes occupying a small area, with a BER close to that of the ideal decoding algorithm.

It is worth to notice how the most area and power consuming circuitry is not the one implementing the decoder core, but that used to realize the interface between the memory array and the decoder core itself. This can be traced back to the complexity of dealing with memory cells with 16 levels.

Transistor-level MonteCarlo simulations show how the analog decoder is robust with respect to transistors mismatch. However, as the performance loss in terms of BER of the 4 -state decoder with respect to the 8 -state one is only 0.5 dB while the power consumption increases by over a factor of 2 , our work demonstrates how the analog approach is all the
more a competitive solution for ECC in multilevel flash memories as far as the decoder states number is kept low.

## 4

# UWB-IR Transceiver Chipset for Sensor Network Applications 

The change in FCC regulations that allows unlicensed communication using pulsed ultrawideband (UWB) signalling has given new momentum to the research in this field, which has roots that can be traced back to the original Marconi spark gap radio. UWB signaling has many attributes that make it attractive for a wide range of applications, from ultra-low-power RFID tags and wireless sensors to streaming wireless multimedia and wireless USB at data rate greater than $1 \mathrm{~Gb} / \mathrm{s}$.

This chapter introduces UWB signaling and regulations, with more details on wireless sensor networks applications, to present a transceiver chipset for UWB Impulse Radio. The specifications, architecture and implementation of a UWB-IR non-coherent receiver are then outlined and a synchronization algorithm is proposed and successfully tested, while the transmitter design will be discussed in details in Chap. 5 .

### 4.1 UWB Definition

In February 2002, the FCC approved the use of the $3.1-10.6 \mathrm{GHz}$ band for UWB communication $[64,65]$, giving birth to a new technology for wireless communication.

The noise emissions limit for digital electronics above 960 MHz is set by the FCC at a constant $-41.3 \mathrm{dBm} / \mathrm{MHz}$ [66]. For example, personal computers are allowed to radiate noise below this level at any frequency above 960 MHz . The original intent of UWB communication was to transmit data within the emissions limits already placed on personal computers. However, due to interference concerns from UWB radiators to other existing wireless services the FCC placed "conservative" requirements on UWB emissions. These limits are shown in Fig.4.1 and reported in Table 4.1. Instead of a constant


Figure 4.1: FCC emissions limit for indoor (dashed) and outdoor (solid) UWB communication
$-41.3 \mathrm{dBm} / \mathrm{MHz}$ above 960 MHz , a deep notch is placed around GPS and PCS services because these receivers have higher sensitivities. GPS operates at 1.2 and 1.6 GHz and PCS at 1.9 GHz . There are also stricter requirements on outdoor or handheld UWB devices than indoor UWB devices.

UWB signaling has been used in the military since 1960's for both communication and radar. The UWB pulses used in radar applications were low-frequency, high power and generated with non-linear devices and transmission lines that can not be integrated in a high volume process. Some of this technologies for low-frequency pulse generation are still actively researched today [67], even if UWB low-power applications are gaining more and more popularity within the international research community.

UWB has several advantages over traditional narrowband architectures. From a channel prospective, the wide bandwidth can offer excellent robustness to multi-path fading [68]. Additionally, the narrow pulses in time offer the ability to perform precise locating combined with communication. UWB has the potential for spatial capacity that is orders of magnitude above other popular wireless standards such as $802.11 \mathrm{a}, 802.11 \mathrm{~b}$ and Bluetooth $[69,70]$.

The main limitations of UWB communication is the presence of strong, in-band interference that can easily saturate the UWB receiver front-end. The overlap between UWB

| Frequency Range <br> $[M H z]$ | Indoor Limit <br> $[d B m / M H z]$ | Outdoor Limit <br> $[d B m / M H z]$ |
| :---: | :---: | :---: |
| Below 960 | FCC 15.209 |  |
| $960-1610$ | -75.3 | -75.3 |
| $1610-1990$ | -53.3 | -63.3 |
| $1990-3100$ | -51.3 | -61.3 |
| $3100-10600$ | -41.3 | -41.3 |
| Above 10600 | -51.3 | -61.3 |

Table 4.1: FCC Mask Limits
and existing services is a major concern in both the transmitter and receiver design, since UWB transmitters will also raise the noise floor seen by narrowband victim receivers.

### 4.2 UWB Sensor Networks

Wireless sensor networks consist of tens to thousands of distributed low complexity nodes that have limitations both on process power and memory, and severe restriction on power consumption. By the very nature of the application, traffic in sensor networks is often bursty with long periods of no activity. For event detection operations, a device may remain idle for long periods, sending only "heart-beat" information, then suddenly be required to send significant amounts of data when an event occurs. For devices involved in continuous monitoring, the flow of traffic will be more stable. However, efficient multiple access, reliability and battery life are still major concerns.

Impulse-Radio-based UWB technology proprieties make it well suited to sensor networks applications. In particular, as already outlined, UWB-IR systems have potentially low complexity and low cost [71] with respect to classical narrow-band radio [72-74] or multi-band OFDM UWB [75,76]. Moreover, they exhibit noise-like signals, are resistant to severe multipath and jamming and have a very good time domain resolution, allowing for location and tracking applications.

The low complexity and low cost of impulse radio UWB systems arise from the essentially baseband nature of the signal transmission. Unlike conventional radio systems, UWB transmitters produce very short time domain pulses that are able to propagate without the need of an additional radio frequency (RF) mixing stage [77]. At the receiver side, the non-coherent energy detection approach may be adopted [78]. Non-coherent com-
munication does not require precise phase control, which allows both the transmitter and the receiver architecture to be simplified, particularly the high frequency circuits that can consume the majority of power in a wireless transceiver. A non-coherent energy detection scheme can further reduce hardware complexity while providing resilience to multi-path fading without the cost of high frequency Rake-base techniques [79]. Non-coherent solutions suffer from a reduced robustness with respect to narrow-band interferers, which can easily cause the receiver front-end to saturate. The IEEE 802.15.4a standard [80] has recognized the advantages offered by non-coherent communication and includes support for it.

The aim of our project is to realize a UWB-IR transceiver chipset for low-data rate wireless sensor netwoks. In particular, our target is a transmission data rate of $100 \mathrm{~kb} / \mathrm{s}$ over a link distance $d$ of at least 10 metres. As non-coherent communication advantages in terms of power consumption are significant especially for short distance links, we decided to use it for our system.

### 4.3 UWB Signal Choice

According to FCC standard, UWB signals must have a minimum continuous signal bandwidth of 500 MHz , a spectral mask of $-41.3 \mathrm{dBm} / \mathrm{MHz}$ within the $3.1-10.6 \mathrm{GHz}$ bandwidth and a peak power limit that can not be exceeded. Thus, the three major design choices for UWB signals are bandwidth, modulation and pulse type.

### 4.3.1 Bandwidth

Although the IEEE 802.15.4a standard physical layer for UWB-IR [81] can operate in several bands of 500 MHz or 1.5 GHz from 3.1 GHz to 10.6 GHz , we prefer to limit the pulse bandwidth from 7.25 GHz to 8.5 GHz . The advantage is twofold: first the system exploits only that subset of the UWB band that is allowed for transmission in USA, Europe and Japan [82]; secondly, the signal is concentrated in the upper part of the UWB spectrum, maximizing the frequency separation from WLAN interferes around 2.4 GHz and 5 GHz .

Thus, our system will operate in a single band $B$ of 1.25 GHz instead of bands of 500 MHz , in order to maximize the transmitted power.


Figure 4.2: Pulse position modulation

(a)

(b)

Figure 4.3: Pulse amplitude modulation

### 4.3.2 Modulation

Although information can be encoded in a UWB signal in a variety of methods, only some modulation schemes are suitable for non-coherent energy detection. Among these, the most popular modulation schemes developed up to date for UWB are pulse-position modulation (PPM) [83], pulse-amplitude modulation (PAM) [84], on-off keying modulation (OOK) [85] and binary phase-shift keying modulation (BFSK).

PPM PPM is based on the principle of encoding information with two or more positions in time, referred to the nominal pulse position, as shown in Fig.4.2. A pulse transmitted at the nominal position represents a 0 and a pulse transmitted after the nominal position represents a 1 . The drawing shows a two-position modulation, where one bit is encoded in one pulse. Additional positions can be used to provide more bits per symbol.

The time delay between positions is typically a fraction of a nanosecond, while the time between nominal positions is typically much longer to avoid interference between pulses.


Figure 4.4: Binary frequency-shift keying modulation


Figure 4.5: Binary phase-shift keying modulation

PAM and OOK PAM is based on the principle of encoding information with the amplitude of the pulses, as shown in Fig.4.3. The drawing shows a two-level modulation, respectively, for zero and lower amplitude, where one bit is encoded in one pulse. When the binary 0 is associated with the zero amplitude pulse, as shown in Fig.4.3.a, we speak of OOK.

As with pulse position, with PAM more amplitude levels can be used to encode more than one bit per symbol.

BFSK In frequency-shift keying (FSK) the digital information is encoded through discrete frequency changes of a carrier wave. The simplest FSK is binary FSK (BFSK). BFSK literally implies using a couple of discrete frequencies to transmit binary information. With this scheme, the 1 is called the mark frequency and the 0 is called the space frequency. The time domain of an FSK modulated carrier is illustrated in Fig.4.4.

BPSK In binary phase-shift keying modulation, information is encoded with the polarity of the pulse, as shown in Fig.4.5. The polarity of the pulses is switched to encode a 0 or a 1 . In this case, only one bit per impulse can be encoded because there are only two polarities available to choose among. Although BPSK modulation stand alone is not suited to non-coherent energy detection, nevertheless BPSK scrambling in addition to PPM is used to eliminate PPM spectrum tone lines.

The choice of the PPM scheme leads to a reduced receiver complexity with respect
to the other modulation schemes but, on the other side, PPM spectrum exhibits tone lines $10 \log _{10}(P R F / 1 \mathrm{MHz})$ above the BPSK spectrum while keeping all other factors, in particular the total pulse energy and the pulse repetition frequency $P R F$, equal [86]. This results in a PPM transmitter having to lower its power by this factor relative to a BPSK transmitter in order to meet the FCC mask. Therefore, high-order PPM or BPSK scrambling in addition to PPM is used to eliminate these tones and thus the need to reduce power [87]. Because BPSK decouples the scrambling problem from the modulation, it is typically preferred over high-order PPM, which adds complexity to the receiver hardware.

The advantage of PPM over BFSK consists in the removal of the two additional filters, centered at the mark and space frequency respectively, required for BFSK non-coherent detection. Indeed, while PAM or OOK needs the comparison with a reference threshold in order to extract the encoded information, with 2-PPM the bit value can be deduced by simply comparing the energy received in two time slots centered at the nominal and shifted time position respectively.

### 4.3.3 Pulse Shape Analysis

There are several pulse shapes found in literature for UWB communication, ranging from spectral inefficient [68,88-90] to precisely controlled frequency tolerance [91, 92]. The performance in terms of BER has been analyzed for a range of pulse shapes and modulation techniques [93-95]. However, we consider three metrics to quantify a pulse shape, that are: spectral efficiency, out-of-band emissions and time-bandwidth product.

Spectral Efficiency The spectral efficiency of a pulse quantifies how well the pulse spectrum utilizes the available bandwidth. The system performances in terms of BER depends only on the received pulse energy [96] and not on its actual shape. Therefore, given an average power limit and a -10dB channel bandwidth in the receiver, the transmitter must fill the channel spectrum as tightly as possible. The spectral efficiency of a pulse is the loss incurred from incomplete filling of the -10 dB channel bandwidth calculated by

$$
\begin{equation*}
\eta_{c h}=\frac{E_{c h}}{P_{F C C} \cdot B_{-10 d B}} \tag{4.1}
\end{equation*}
$$

where $E_{c h}$ is the pulse energy within the -10 dB channel bandwidth, $P_{F C C}$ is the maximum average power spectral density in W/MHz and $B_{-10 d B}$ is the -10 dB bandwidth in MHz .

Out-of-Band Emissions The out-of-band emissions metric of a pulse is the ratio of the energy outside the -10 dB channel to the energy within the -10 dB channel. This metric is used to analyze the adjacent channel interference and it is calculated by

$$
\begin{equation*}
\eta_{\text {out }}=\left(E_{\text {tot }}-E_{c h}\right) / E_{c h} \tag{4.2}
\end{equation*}
$$

where $E_{\text {tot }}$ is the total pulse energy given by

$$
\begin{equation*}
E_{t o t}=\int_{-\infty}^{+\infty} p^{2}(t) d t \tag{4.3}
\end{equation*}
$$

Time-Bandwidth Product The time-bandwidth product is a figure of merit which indicates the localization of a pulse both in time and frequency. The lower this number, the more localized a pulse is in both time and frequency, which generally produces the best combination of performance in both time and frequency domains. The time-bandwidth product is calculated by

$$
\begin{equation*}
T B_{w}=D_{p} \cdot d_{p} \tag{4.4}
\end{equation*}
$$

where

$$
\begin{equation*}
D_{p}^{2}=\frac{1}{2 \pi E} \int_{-\infty}^{+\infty} \omega^{2}|F(\omega)|^{2} d \omega \tag{4.5}
\end{equation*}
$$

and

$$
\begin{equation*}
d_{p}^{2}=\frac{1}{E} \int_{-\infty}^{+\infty} t^{2}|f(t)|^{2} d t \tag{4.6}
\end{equation*}
$$

$F(\omega)$ is the Fourier transform of the time domain pulse $f(t)$ and $E$ is the pulse energy calculated by

$$
\begin{equation*}
E=\int_{-\infty}^{+\infty}|f(t)|^{2} d t=\frac{1}{2 \pi} \int_{-\infty}^{+\infty}|F(\omega)|^{2} d \omega \tag{4.7}
\end{equation*}
$$

We consider five different time domain pulses: sinc, square, $2^{\text {nd }}$ order filtered, root-raised cosine $[97,98]$ and Gaussian pulse. Their performance metrics are reported in Table 4.2. The $2^{\text {nd }}$ order filtered pulse is a square pulse filtered by a $2^{\text {nd }}$ order low-pass filter.

The sinc and root-raised cosine pulses have the highest spectral efficiencies but require the most complex transmitters to be generated. The square pulse is the simplest to generate but it results in the highest out-of-band emissions. The Gaussian pulse has the lowest time-bandwidth product, that is why it is typically preferred and the most common pulse shape found in the literature. The $2^{\text {nd }}$ order filtered pulse performs similarly to the Gaussian pulse, but it requires area and power consuming filters to be generated.

|  | Spectral <br> Efficiency | Out-of-Band <br> Emissions | Time-BW <br> Product |
| :---: | :---: | :---: | :---: |
| Sinc | $100 \%(0 \mathrm{~dB})$ | $0 \%(-\infty \mathrm{dB})$ | $\infty$ |
| Square | $60.0 \%(-2.2 \mathrm{~dB})$ | $12.8 \%(-8.9 \mathrm{~dB})$ | $\infty$ |
| $2^{\text {nd }}$ order filtered | $59.2 \%(-2.3 \mathrm{~dB})$ | $2.8 \%(-15.6 \mathrm{~dB})$ | 0.55 |
| Root-raised cosine | $84.6 \%(-0.7 \mathrm{~dB})$ | $0.4 \%(-23.8 \mathrm{~dB})$ | 0.85 |
| Gaussian | $56.5 \%(-2.5 \mathrm{~dB})$ | $3.3 \%(-14.9 \mathrm{~dB})$ | 0.50 |

Table 4.2: Comparison of different pulse shapes

Following these considerations, we chose the Gaussian pulse shape for our application. However, as this pulse is a relatively complex pulse shape to generate with circuits, we will see in Chap. 5 how a good approximation with almost the same performance can be obtained with a very simple circuit.

### 4.3.4 Channel Model

A fundamental aspect to be taken into account in the design of our transceiver is the channel model. The equation for the path loss of a UWB signal is given by [99, 100]:

$$
\begin{equation*}
L(d)=L_{0}+10 \cdot \gamma \cdot \log _{10}\left(d / d_{0}\right)+S \tag{4.8}
\end{equation*}
$$

where

$$
\begin{equation*}
L_{0}=10 \cdot \gamma \cdot \log _{10}\left(\frac{4 \pi d_{0} f_{c}}{c}\right) \tag{4.9}
\end{equation*}
$$

Usually the reference distance $d_{0}$ is chosen as $1 \mathrm{~m} . f_{c}$ is the signal central frequency, $c$ the light speed in $\mathrm{m} / \mathrm{sec}$ and $S$ represents the shadowing factor of the channel, which is a zeromean Gaussian random variable which indicates the deviation of $L(d)$ from its nominal value. Finally, $\gamma$ represents the severity of the path loss. For line-of-sight, $\gamma$ is measured to be 2 but can grow as large as 3.34 for non-line-of-sight measurements.

The system level paramiters for our project are summarized in Table 4.3.

### 4.4 CMOS Technology

Both receiver and transmitter are designed in the UMC $0.13-\mu \mathrm{m}$ Mixed-Mode and RFCMOS process. The key features of this technology are:

| Modulation | $2-\mathrm{PPM}$ |
| :---: | :---: |
| Pulse bandwidth | $7.25-8.5 \mathrm{GHz}$ |
| Pulse shape | Gaussian |
| Data rate | $100 \mathrm{~kb} / \mathrm{s}$ |
| Link distance | $d \geq 10 \mathrm{~m}$ |

Table 4.3: System level parameters

- minimum channel length: $0.13-\mu \mathrm{m}$;
- dual supply voltage: 1.2 V and 3.3 V ;
- P-substrate;
- single poly, eight metal layers (1P8M);
- Twin-Well and Triple Well;
- Metal Metal capacitors;
- high performance mixed-mode signal capabilities;
- radio frequency MOS transistors.


### 4.5 Receiver

As already outlined in Sec.4.2, the non-coherent energy-detection approach [101] in the receiver may be preferred because it allows to avoid the integration of both a template pulse generator $[102,103]$ and a quadrature frequency synthesizer [104]. In such a noncoherent case, the received energy has to be estimated, performing a windowed integration on the received signal squared. The block diagram of a non-coherent receiver is reported in Fig.4.6. It consists of a low-noise amplifier (LNA), a variable-gain amplifier (VGA) to accommodate variations of the received signal strength and an energy detector, composed of a squarer and a windowed integrator. For each received bit, the integral of the received signal squared is computed separately over the time windows $T_{i n t 1}$ and $T_{\text {int } 2}$ so that a comparator allows to decide whether more energy is allocated in the first or in the second window.


Figure 4.6: Receiver block diagram

The receiver has been designed in UMC $0.13-\mu \mathrm{m}$ technology by Andrea Gerosa and Marco D'Aguanno and it is now under test.

The main features of each block are summarized hereafter.

LNA and VGA The LNA is an inductively degenerated common-source amplifier with a resonant load. It exploits fully differential conversion with embedded impedance matching by means of a monolithic integrated transformer.

The VGA has been designed with two stacked stages that exploit the same bias current, in order to minimize its power consumption. It also has resonant loads with capacitor tuning to adjust the tank center frequency.

The LNA and VGA parameters are reported in Table 4.4.

Energy detector The energy detector squarer exploits the quadratic non-linearity of MOS devices working under strong inversion region, as described in [105], to obtain an output current proportional to the squared input voltage. The integrator, realized by means of a transimpedance amplifier, integrates the squared signal over a capacitor. Two capacitors are used to estimate the received energy in two consecutive time slots.

The overall energy detector main parameters are reported in Table 4.5.

| Parameter | Value |
| :--- | :---: |
| LNA gain | 14 dB |
| LNA noise figure | 5 dB |
| LNA iIP3 | -10 dBm |
| LNA bias current | $500 \mu \mathrm{~A}$ |
| LNA + VGA gain | $15 \div 30 \mathrm{~dB}$ |
| LNA + VGA noise figure | 5.2 dB |
| LNA + VGA iIP3 | -10 dBm |
| LNA + VGA bias current | 1 mA |
| LNA + VGA area | $800 \mu \mathrm{~m} \times 500 \mu \mathrm{~m}$ |

Table 4.4: LNA + VGA parameters

| Parameter | Value |
| :--- | :---: |
| Energy detector conversion gain | $8.5 \mathrm{~mA} / \mathrm{V}^{2}$ |
| Current consumption | 1.4 mA |

Table 4.5: Energy detector parameters

### 4.5.1 Behavioral Model

In order to evaluate the performance of the receiver at system level so as to derive the transmitter specifications, a behavioral model of the whole receiver has been realized using Matlab. The block diagram of the model is shown in Fig.4.7. Signal $V_{R X}$ is generated modulating an ideal Gaussian pulse at a given energy with a random bitstream and adding the thermal noise at the antenna. The blocks in the first row of Fig.4.7 model the two gain stages. The circuit noise due to the LNA and the VGA is added to signal $V_{R X}$ as a white Gaussian noise, whose power depends on the amplifying stages noise figure that has been estimated by means of transistor-level simulations. A third-order non linearity is also accounted for, modeling the LNA transfer function with a third-order power series, extrapolating coefficient $b_{3}$ from the input-referred intercept point iIP3. Finally, three biquadratic filters, whose pass-band corresponds to the pulse bandwidth, model the inherent frequency selectivity of the LNA input matching network and of the resonant loads in the LNA and VGA.

The squarer is modeled using a power series to account for its non linear relationship


Figure 4.7: Matlab receiver equivalent model
between the input voltage and the output current. The power series is given by

$$
\begin{equation*}
I_{\text {out }}=\sum_{i=0}^{N} a_{i} \cdot V_{\text {in }}^{i} \tag{4.10}
\end{equation*}
$$

where the $a_{i}$ coefficients have been extracted from transistor-level simulations. In particular, a classical two-tone test with input signal components spaced by 100 MHz , namely at 8 GHz and 8.1 GHz , has been performed. The magnitude of the different harmonic components allows to estimate the coefficients $a_{i}$ values which are reported in Table 4.6. Similarly, the integrator model mimics its transistor-level frequency responce.

The circuit noise due to the squarer and to the integrator has been estimated with transistor-level simulations as well, and it is added as an equivalent noise source at the integrator output.

The integration results at the end of the two integrating phases $\Phi_{1}$ and $\Phi_{2}$ of Fig.4.6 are sampled and held before being compared to decide which of the two windows contains more energy. It is worth to notice that the integrator output voltage is clipped at 500 mV , in order to account for limited output swing of the real circuit.

### 4.5.2 BER and Sensitivity

In order to quantify the receiver performance, the uncoded BER is estimated as a function of the input power at the antenna, as reported in Fig.4.8. As shown, the minimum input power for which a BER lower than $10^{-3}$ is estimated, is $\simeq-95 \mathrm{dBm}$. The receiver performance are summarized in Table 4.7, where a process gain $G_{p}=5 \mathrm{~dB}$ has been accounted

| Parameter | Value |
| :--- | :---: |
| Data rate | $100 \mathrm{~kb} / \mathrm{s}$ |
| Pulses per bit | 10 |
| LNA + VGA gain | $6 \mathrm{~dB}-28 \mathrm{~dB}$ |
| LNA + VGA noise figure | 5.2 dB |
| LNA + VGA iIP3 | -10 dBm |
| $a_{1}, a_{2}, a_{4}$ | $1.6 \cdot 10^{-4}, 8.83 \cdot 10^{-3},-0.15$ |
| $A_{D C}$ | $233 \mathrm{~dB} \Omega$ |
| $\omega_{1}$ | 3 Hz |
| $\omega_{2}$ | 1.9 GHz |
| Energy detector noise power | $5.11 \cdot 10^{-6} \mathrm{~V}^{2}$ |

Table 4.6: Parameters of the receiver behavioral model
for, deriving from the transmission of 10 pulses per bit and the use of a by majority decision rule, as described in details in Sec.5.1.

Although the expected link margin of 6.8 dB would allow a link distance greater than 10 m , as low-data rate transmissions are peak power limited, we will see in Sec.5.1 how we need to reduce the average transmitted power to -17.7 dBm in order to be compliant with the FCC limits. This leads to a link margin of $\simeq 2 \mathrm{~dB}$.

### 4.6 Synchronization Algorithm

In order to maintain the low complexity nature of the receiver, we decided to use a synchronization algorithm based on the energy collection strategy to synchronize the receiver and the transmitter clocks before demodulation, as the one described in [106]. The algorithm is based on a preamble which contains 4 repetition of a $N_{c}=31$ bits Gold code sent at the maximum repetition frequency $P R F_{\text {syn }}=1 /\left(2 \cdot T_{i n t}\right)$, followed by an inverted Gold code sequence used to indicate the end of the synchronization phase. The structure of each data packet is reported in Fig.4.9. During the synchronization phase, the analog front end just integrates and then compares the energy received in consecutive time slots of duration equal to $T_{i n t}$. The data generated is then parallelized into two data streams sent to two identical correlators banks, as shown in Fig.4.10. The 31 correlators of each bank correlate the received data with shifted version of the 31 bit Gold code to account

| Parameter | Value |
| :--- | :---: |
| Throughput $\left(R_{b}\right)$ | $100 \mathrm{~kb} / \mathrm{s}$ |
| Pulses per bit $\left(N_{b}\right)$ | 10 |
| Bandwidth $(B)$ | 1.25 GHz |
| FCC Limit $\left(P_{F C C}\right)$ | $-41.3 \mathrm{dBm} / \mathrm{MHz}$ |
| Maximum TX Power $\left(P_{\max }=P_{F C C}+10 \log _{10}(B / 1 \mathrm{MHz})\right.$ | -10.33 dBm |
| Gaussian pulse spectral efficiency $(E)$ | -2.5 dB |
| Average TX Power $\left(P_{T}=P_{\max }+E\right)$ | -12.83 dBm |
| Path Loss @ $1 \mathrm{~m}\left(L_{1}\right)$ | 50.37 dB |
| Path Loss @ $10 \mathrm{~m}\left(L_{2}\right)$ | 20 dB |
| RX Power $\left(P_{R}=P_{T}-L_{1}-L_{2}\right)$ | -83.2 dBm |
| Process Gain $\left(G_{p}=10 \log _{10}\left(\sqrt{N_{b}}\right)\right)$ | 5 dB |
| Average noise power $\left(N=-174+10 \log _{10}\left(R_{b} N_{b}\right)\right)$ | -114 dBm |
| RX noise figure $\left(N_{f}\right)$ | 12 dB |
| Total noise power $\left(P_{N}=N+N_{f}\right)$ | -102 dBm |
| Minimum $E_{b} / N_{0}$ for $10^{-3} \mathrm{BER}(S)$ | 17 dB |
| Link margin $M=P_{R}+G-P_{N}-S$ | 6.8 dB |

Table 4.7: Parameters of the receiver behavioral model
for all the possible time differences between transmitter and receiver. Synchronization is declared if any, but only one, of the accumulator exceeds a programmed threshold.

The drawback of a such a simplicity is that the synchronization time resolution is equal to the integration window duration $T_{\text {int }}$ itself. In fact, different synchronization strategies have been developed to achieve higher synchronization time resolution, but at the price of an increased receiver complexity [107] or of a long locking time [108].

In most cases, synchronization is declared such as the pulse to be detected lies completely within a single integration window, as sketched in Fig.4.11.a. However, it may happen that the receiver is synchronized in a way such that a fraction of the pulse falls outside the correct integration window, as shown in Fig.4.11.b. The latter event impairs the detection capability of the receiver, because part of the signal energy is not accounted for in the integration result. Due to the mentioned finite time resolution of the algorithm, the only way to preserve the receiver performance is to make the probability of this event negligible, using a window duration $T_{i n t}$ sufficiently larger than the pulse duration. This


Figure 4.8: Simulated BER as a function of the input signal power at the antenna, using the model parameters given in Table 4.6

| 31 bit Gold Code | 31 bit Gold Code | 31 bit Gold Code | 31 bit Gold Code | Inverted Gold Code | Header | 1024 bit Payload |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |

Figure 4.9: Data packet structure
has also the advantage of reducing the clock frequency of the correlators, as synchronization data stream is sent at a frequency equal to $2 \cdot P R F_{\text {syn }}=1 / T_{i n t}$. On the other hand, a large integration window would worsen the receiver performance in terms of sensitivity, as shown in Fig.4.12 where the receiver BER is simulated for different $T_{\text {int }}$ values. In fact, larger integration windows reduce the SNR for the same signal power at the antenna [109], due to both a larger noise energy and to the finite output resistance of the integrator. Following these considerations we singled out $T_{i n t}=15 \mathrm{~ns}$ as a good design compromise.

The performance of the synchronization algorithm in terms of probability of detection $P_{d}$, that is the probability that synchronization is declared when the input and the Gold code are aligned, is reported in Fig.4.13. Such a probability is always larger than 0.9 that is generally considered a reasonable benchmark [110]. The probability of false acquisition


Figure 4.10: Block diagram of synchronization correlator banks


Figure 4.11: PPM signaling scheme with: (a) perfect synchronization; (b) half $T_{\text {int }}$ misalignment; (c) modified PPM scheme with half $T_{\text {int }}$ misalignment


Figure 4.12: Simulated BER as a function of the input signal power at the antenna with $T_{i n t}=5 \mathrm{~ns}(\mathrm{star}), T_{\text {int }}=10 \mathrm{~ns}($ circle $), T_{i n t}=15 \mathrm{~ns}$ (square) and $T_{i n t}=30 \mathrm{~ns}$ (diamond)
$P_{f a}$, that is the probability that synchronization is declared when the input and the Gold code are misaligned, is always less than $10^{-7}$.

Fig.4.14 compares the BER in case of perfect synchronization (as in Fig.4.8) with the one obtained including the synchronization phase, assuming a payload size equal to 1024 bits. Almost no power loss can be observed at the sensitivity; however the BER curve shows a floor at $10^{-4}$.

The floor is caused by the residual probability of the event illustrated in Fig.4.11.b. In particular, whenever the integration window misalignment is such that the received pulse is split across the two PPM windows, the noise influence on the demodulation result dominates regardless the signal power. This generates a BER floor above that input signal power for which this misalignment becomes the most relevant error source. A way to avoid this eventuality, is to insert a time slot $T_{i n t}=15 \mathrm{~ns}$ between the two PPM windows, as shown in Fig.4.11.c. As a consequence, the BER curve does not exhibit any floor, as reported in Fig.4.14.


Figure 4.13: Probability of detection


Figure 4.14: Simulated BER as a function of the input signal power at the antenna with perfect synchronization (square), original PPM synchronization (circle) and modified PPM synchronization (diamond)

## $\square$

## Transmitter

This chapter presents the design of the transmitter for the UWB-IR low-data rate wireless sensor network described in Chap.4. In particular, the transmitter specifications are derived first; then, a possible architecture for the overall system is presented to focus on the design of a novel energy efficient Gaussian pulse generator.

### 5.1 Specifications

As already outlined in Sec.4.1, the FCC limits the output power in the $3.1-\mathrm{to}-10.6 \mathrm{GHz}$ band in two ways $[64,65]$ :

1. The average power spectral density must be less or equal to -41.3 dBm . This corresponds to a theoretical maximum total power of -10.3 dBm for a 1.25 GHz bandwidth signal. In practice, this number is reduced by $2-4 \mathrm{~dB}$ due to pulse generation constraints.
2. The peak power may not exceed 0 dBm at the UWB signal center frequency $f_{c}$ in a 50 MHz resolution bandwidth (RBW). Since most spectrum analyzers are not equipped with a 50 MHz IF filter, the peak power measurement is typically performed at a lower RBW and the limit is conservatively set to be

$$
P_{p k} \leq 0 \mathrm{dBm}+20 \log _{10}(R B W / 50 \mathrm{MHz})
$$

Communication distance in a non-coherent energy-detecting UWB system is maximized when the SNR seen at the receiver during the integration window is maximized. This occurs when the transmitter generates maximum total output power under the regulatory limits. Since sensor networks typically communicate at low data rates, large amplitude pulses transmitted at the data rate are required to maximize power, and thus


Figure 5.1: Maximum single pulse amplitude allowed by the FCC average power spectral mask as a function of the PRF
communication distance, under FCC spectral masks. The voltage amplitude $V_{\max }$ for single pulses transmitted at a data rate equal to PRF are reported in Fig.5.1. We can see as a peak-to-peak voltage swing of $17.8 V_{p p}$ would be required to maximally satisfy FCC average power spectral mask at a PRF of 100 kHz .

On the other hand, while high data rate pulsed-UWB transmitters are typically average power limited, low data rate transmitters are typically peak power limited [111], as shown in Fig. 5.2 where the peak power corresponding to the transmission of a single pulse as a function of the pulse amplitude $V_{\max }$ together with the FCC limit is reported. For the peak power estimation a spectrum analyzer resolution bandwidth of 3 MHz has been considered.

The peak power as a function of the pulse voltage amplitude for a data rate transmission of $100 \mathrm{~kb} / \mathrm{s}$ is reported in Fig.5.3. We can see how a maximum peak-to-peak voltage swing of $9 V_{p p}$, corresponding to a $V_{\max }=4.5 \mathrm{~V}$ is allowed to be compliant with the FCC mask limit. However, this is still impractical for deep-submicron CMOS technologies, where supply voltages are of the order of 1 V .


Figure 5.2: Peak power of a maximum amplitude single pulse as a function of the PRF


Figure 5.3: Peak power as a function of the single pulse voltage amplitude for a $100 \mathrm{~kb} / \mathrm{s}$ data rate transmission


Figure 5.4: Maximum output voltage as a function of the pulses number per bit

An alternative approach to generate large swing pulses while maximizing total power under FCC masks is to reduce output voltage swings and increase the PRF, that is equivalent to transmit multiple pulses per bit. As the transmitter is to be implemented in the UMC $0.13-\mu \mathrm{m}$ CMOS process described in Sec. 4.4 with a maximum supply voltage of 1.2 V , and monolithic transformer with transformer ratio greater than 3 show poor performance, we fixed as a reasonable limit for the maximum output swing $\simeq 3 \mathrm{~V}$. Then we calculated the link distance as a function of the number of pulses per transmitted bit with the receiver parameters reported in Sec.4.5.2.

The maximum output voltage depends not only on technology limits but also on the FCC average power mask, as during the synchronization phase pulses are transmitted at a maximum $P R F_{\text {syn }}=1 /\left(2 \cdot T_{\text {int }}\right)$ and the average power is measured on a data packet basis, according to [64]. Increasing the output swing leads to a reduction of the pulse repetition frequency during synchronization phase $P R F_{\text {syn }}$, thus protracting the time required to synchronize transmitter and receiver. This has a negative impact on the energy per bit of the overall system, as during the synchronization phase the receiver is always switched on.

As pulses with a smaller amplitude can be more efficiently generated than large voltage swing ones, we decided to transmit more pulses per bit instead of a single pulse with


Figure 5.5: Link distance as a function of the pulses number per bit with the pulses amplitude given by Fig.5.4
a larger amplitude. We can see from Fig.5.4 and Fig.5.5 that by transmitting $N_{b}=10$ pulses per bit with a pulse output swing of $V_{\max }=1.6 \mathrm{~V}$, that is $3.2 V_{p p}$, we can reach a link distance of 12.4 m , which corresponds to a link margin of $\simeq 2 \mathrm{~dB}$.

However, due to FCC peak power limits, the pulse repetition frequency during synchronization $P R F_{\text {syn }}$ must be reduced to $1 /\left(2 \cdot 3 \cdot T_{\text {int }}\right)=11.1 \mathrm{MHz}$ instead of 33.3 MHz . This has no impact on the BER performance but it slightly increases the receiver hardware complexity as two banks of 6 correlators each, instead of 2 , have to be implemented.

The estimated energy per pulse, that is the energy spent to transmit a single pulse, is shown in Fig.5.6, together with the energy per bit of the overall system, including the amount of power spent during the synchronization phase. In particular, for a transmission of 10 pulses per bit, the estimated energy per bit is 3.2 nJ . This shows better performance than the result reported in [106], where an energy/bit of 2.5 nJ over a link distance of 3 m is declared, without including the energy spent to synchronize the receiver and the transmitter.

Following these considerations, we can derive the transmitter specifications, which are summarized in Table 5.1.



Figure 5.6: Energy/pulse and receiver (diamond), transmitter (circle) and overall (square) energy/bit as a function of the number of pulses per bit

| Parameter | Value |
| :--- | :---: |
| Throughput $\left(R_{b}\right)$ | $100 \mathrm{~kb} / \mathrm{s}$ |
| Pulses per bit $\left(N_{b}\right)$ | 10 |
| Bandwidth $(B)$ | 1.25 GHz |
| Carrier frequency $\left(f_{c}\right)$ | 7.875 GHz |
| Maximum output voltage $\left(V_{\max }\right)$ | 1.6 V |
| Energy per pulse $\left(E_{p}\right)$ | $\leq 120 \mathrm{pJ} /$ pulse |

Table 5.1: Main transmitter parameters


Figure 5.7: Transmitter block diagram

### 5.2 Architecture

The transmitter block diagram is shown in Fig.5.7. It consists of a digital controlled oscillator (DCO) which generates the carrier frequency $f_{c}=7.875 \mathrm{GHz}$ and a frequency divider by 16 , which generates the control signal $V_{R F}$ for the combined mixer and power amplifier (MXR-PA). The MXR-PA receives at its input the differential carrier frequency signal $V_{L O}$ and the pulse control signal $V_{R F}$ and produces at its output a Gaussian pulse with central frequency $f_{c}=7.875 \mathrm{GHz}$ and bandwidth $B=1.25 \mathrm{GHz}$. The carrier frequency $f_{c}$ accuracy is controlled by means of a phase-aligned frequency-locked loop (PA-FLL). The control signal $V_{R F}$ is further divided by 8 and compared with an external reference clock at 61.5 MHz by means of an early-late detector. A binary search algorithm is then implemented to adjust the carrier frequency $f_{c}$. However, it is worth to notice how a noncoherent signaling scheme does not require precise frequency tuning. Thus, the control system proposed reaches the required accuracy with a relative implementation simplicity. To exploit the low duty cycle nature of UWB-IR systems, the transmitter is activated at each pulse transmission by means of an external control signal. As we transmit a number of 10 pulses per bit, the pulses are sent at a repetition frequency of 1 MHz , that is 10 times the nominal data rate of $100 \mathrm{~kb} / \mathrm{s}$.

As the frequency accuracy requirement can be relaxed in non-coherent energy detecting systems, the VCO is implemented by means of a three-stage ring oscillator in order to ensure a fast start-up. However, its design is still at the preliminary stage, as the one of PA-FLL loop, and will not be discussed further. In the next section, we will thus describe


Figure 5.8: Gaussian pulse generator
in details the design of the Gaussian pulse generator.

### 5.3 Gaussian Pulse Generator

The variety of Gaussian pulse generators present in literature is very wide, ranging from all-digital CMOS pulse generators [112-114] to Gilbert cell BiCMOS based solutions [111]. However, the former reach a maximum output swing which is of the order of hundreds of millivolts, while the latter usually require discrete low-pass filters to meet the FCC mask requirements. We will see how a Gaussian pulse can be efficiently generated by the circuit shown in Fig. 5.8 that simultaneously permforms up-conversion to the central frequency $f_{c}=7.875 \mathrm{GHz}$ and mixing.

The basic idea is to generate a Gaussian pulse by multiplying a triangular shape pulse and the differential local oscillator output signal. In order to generate the required peak-to-peak output swing of $3.2 V_{p p}$ and perform the differential to single ended conversion, a monolithic transformer with transformer ratio $r=2$ is used.

The nMOS transistor M0 of Fig. 5.8 is driven by a triangular pulse $V_{T R}$ while transistors M1, M2, M3 and M4 act as switches driven by the local oscillator output signals $V_{L O+}$ and $V_{L O-}$. To have an idea of the design parameters values, we can consider the small signal analysis model of our circuit, even if the triangular pulse driving the bias transistor is a large signal. If we assume $V_{L O+}$ and $V_{L O-}$ to be square waves, the conversion gain is


Figure 5.9: Simplified buffer schematic
given by

$$
\begin{equation*}
\frac{2}{\pi} \cdot g_{m 0} \cdot \frac{R_{L}}{r^{2}} \tag{5.1}
\end{equation*}
$$

where $R_{L}=50 \Omega$ is the antenna resistance and $r$ the transformer ratio. If the local oscillator output waves are not square but sinusoidal, the pMOS and nMOS switches will be on simultaneously for a certain amount of time, giving rise to a conversion gain loss.

The triangular pulse $V_{T R}$ has a slope of $1.2 \mathrm{~V} / \mathrm{ns}$ as it has to rise from 0 V to $V_{d d}=1.2 \mathrm{~V}$ in 1 ns . However, the accuracy requirements for the $V_{T R}$ generation are very relaxed, as the output signal spectrum does not show any significant variation due to imprecise triangular pulse generation. As a consequence, $V_{T R}$ can be easily generated from the square control signal $V_{R F}$ by means of a cascade of two inverters.

It is worth to notice how the proposed circuit does not need any bias current, thus greatly improving the overall system efficiency.

However, in order to reduce the gain loss, we need to keep the switches resistance quite low. In particular, if we assume a minimum drain-source voltage of $\simeq 200 \mathrm{mV}$ for M0, so as to keep it always working in saturation, as the maximum transformer primary coil current is equal to

$$
I_{\max } \simeq \frac{r \cdot V_{\max }}{R_{L}}=64 \mathrm{~mA}
$$

the switches resistance as to be in the order of few Ohms. This leads to very large nMOS and pMOS transistors with a gate capacitance of the order of pFs . In order to drive such a capacitance, we use an inductor as feedback network for the switches buffers, so as to resonate the gate capacitance, as shown in Fig.5.9. The buffers are realized by means of an inverter. In order to avoid DC power consumption, both buffers are activated by means of a switch in series with the inverters nMOS transistors.

| $M X R-P A$ |  |  |
| :---: | :---: | :---: |
| $M 0$ | $\frac{32}{0.12}$ | 14 |
| $M 1$ | $\frac{32}{0.12}$ | 14 |
| $M 3$ | $\frac{32}{0.12}$ | 28 |

Table 5.2: MXR-PA transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$

| Buffer |  |  |
| :---: | :---: | :---: |
| $M_{N}$ | $\frac{4.8}{0.12}$ | 24 |
| $M_{P}$ | $\frac{1.8}{0.12}$ | 24 |
| Passive Elem. |  |  |
| $C_{\text {switch }}$ | 1.2 pF |  |
| $L_{\text {switch }}$ | 340 pH |  |

Table 5.3: Passive elements values and buffer transistor size in $\mu \mathrm{m} / \mu \mathrm{m}$


Figure 5.10: Transformer equivalent models

The transistor size for the Gaussian pulse generator are reported in Table 5.3, while those for the buffers together with the switches gate capacitance and the feedback inductor values are reported in Table 5.3. The third column of both tables indicates how many transistors are connected in parallel.

### 5.3.1 Transformer Design

As already outlined, a transformer with transformer ratio $r=2$ is needed in order to generate the required transmitter output power. We decided to implement it by means of an on-chip monolithic transformer, created by magnetically coupling two inductors.

As the transformer is used to achieve output matching in our power amplifier, it will be necessary to resonate some of the transformer inductance to minimize the loss [115]. A capacitor is also necessary at the primary side of the transformer to adjust its input reactance to the desired value for the driving transistors. This can be done using a parallel capacitor on the primary and another capacitor in series with the secondary, as shown in Fig.5.10, where a simplified transformer model is reported, together with the series and parallel tuning capacitance $C_{s}$ and $C_{p}$. In the model, $k$ indicates the coupling factor, $n$ the turn ratio between primary and secondary coils, $R_{L}$ the load resistance and $R_{p}$ and $R_{s}$ the primary and secondary inductor series resistances. $R_{p}$ and $R_{s}$ can be calculated from the inductors quality factors $Q_{p}$ and $Q_{s}$ as:

$$
\begin{equation*}
R_{p}=\frac{\omega L_{p}}{Q_{p}} \quad R_{s}=\frac{\omega L_{s}}{Q_{s}} \tag{5.2}
\end{equation*}
$$

The two transformer inductors, combined with two capacitors, can be used not only to achieve output matching but also to obtain a fourth-order bandpass ladder filter, so as to reduce the power amplifier out-of-band emissions. Thus, the four reactive elements values will be chosen so as to ensure a good efficiency and at the same time a filtering effect.

Given a load resistance of $50 \Omega$, we want to determine the capacitors and inductors values that allow to maximize the transformer efficiency $\eta$. This is defined as the ratio of the power delivered to the load $P_{\text {load }}$ to the total power delivered into PORT1 of the network, that is $P_{\text {total }}=P_{\text {diss }}+P_{\text {load }}$. It can be shown that the transmitter efficiency $\eta$ is equal to

$$
\begin{align*}
\eta & =\frac{P_{\text {load }}}{P_{\text {diss }}+P_{\text {load }}} \\
& =\frac{R_{L} / n^{2}}{\frac{R_{L}}{n^{2}}+\frac{\omega L_{p}}{Q_{s}}+\frac{\omega L_{p}}{Q_{p}} \cdot\left(\frac{R_{L} / n^{2}+\omega L_{p} / Q_{s}}{k \omega L_{p}}\right)^{2}} \tag{5.3}
\end{align*}
$$

where we assume $L_{s} \simeq n^{2} L_{p}$ and we use for $C_{s}$ the value given by

$$
\begin{equation*}
C_{s}=\frac{1}{\omega^{2} L_{s}} \tag{5.4}
\end{equation*}
$$

as it allows to cancel one of the terms at the denominator of (5.3). By differentiating equation (5.3), we can obtain the optimum value of $L_{p}$ resulting in the highest possible $\eta$, which is:

$$
\begin{equation*}
\omega L_{p}=\frac{R_{L}}{n^{2} \sqrt{\frac{1}{Q_{s}^{2}}+\frac{Q_{p}}{Q_{s}} \cdot k^{2}}} \tag{5.5}
\end{equation*}
$$



Figure 5.11: $L_{p}$ optimum values as a function of the coupling coefficient $k$

In Fig.5.11 the optimum value of $L_{p}$ as a function of the coupling coefficient $k$ with transformer ratio $r=n \cdot k=2$ and $Q_{p}=10$ for different values of $Q_{s}$ is reported, while the optimum values for $L_{s}$ and $C_{s}$ are shown in Fig.5.12 and Fig.5.13 respectively. The corresponding transformer efficiency is shown in Fig.5.14. If we assume $Q_{p}=Q_{s}=10$, as on-chip inductors with quality factor of the order of 10 can be realized without too much effort,, with components values

$$
\begin{align*}
L_{p} & =98 \mathrm{pH} \\
L_{s} & =2.45 \mathrm{nH}  \tag{5.6}\\
C_{s} & =166.7 \mathrm{fF}
\end{align*}
$$

we can reach a theoretical efficiency of $\simeq 65 \%$. With these reactive elements values, we will see how we can also create a fourth-order bandpass ladder filter, as the one shown in Fig.5.15, where the input source is represented with a current generator with infinite output resistance.

The values for the two capacitors $C_{1}$ and $C_{2}$, together with those for the two inductors $L_{1}$ and $L_{2}$, can be derived for different types of filters from the values tabulated for lowpass ladder filters [116] applying a low-pass to band-pass transformation. By means of


Figure 5.12: $L_{s}$ optimum values as a function of the coupling coefficient $k$


Figure 5.13: $C_{s}$ optimum values as a function of the coupling coefficient $k$


Figure 5.14: Transformer efficiency as a function of the coupling coefficient $k$
the transformer equivalent model of Fig.5.10, we can derive the primary and secondary coils values as:

$$
\begin{align*}
L_{p} & =L_{1} \\
L_{s} & =\frac{n^{2} \cdot k^{2} \cdot L_{2}}{1-k^{2}} \tag{5.7}
\end{align*}
$$

while the capacitors value are given by

$$
\begin{align*}
C_{p} & =C_{1} \\
C_{s} & =\frac{C_{2}}{n^{2} \cdot k^{2}} \tag{5.8}
\end{align*}
$$



Figure 5.15: Forth-order bandpass ladder filter


Figure 5.16: $L_{p}$ as a function of the filter $Q$ factor for different filter types

We consider seven different filters types, that is a Butterworth filter and six different Chebyshev filters with in-band ripple ranging from 0.1 dB to 3 dB . They correspond to filter type 1 up to 7 respectively. Figures $5.16,5.17$ and 5.18 show $L_{p}, C_{p}$ and $C_{s}$ values for filter $Q$ factors ranging from 1 to 6 , where $Q$ is defined as

$$
\begin{equation*}
Q=\frac{1}{\Delta}=\frac{\omega_{c}}{\omega_{2}-\omega_{1}} \tag{5.9}
\end{equation*}
$$

where $\Delta$ is the filter fractional bandwidth and $\omega_{c}$ the central frequency, which in our case is equal to

$$
\begin{equation*}
\omega_{c}=2 \pi f_{c}=\sqrt{\omega_{2} \omega_{1}} \tag{5.10}
\end{equation*}
$$

We can see how a $Q$ factor grater than 2 requires a primary inductance smaller than 100 pH , which is hardly implemented on-chip. Thus, we chose for our design $Q=2$. The secondary inductor value has to simultaneously satisfy two equations: $L_{s}=n^{2} L_{p}$ in order to ensure the required transformer ratio, and equation (5.7) so as to obtain the desired ladder filter. The corresponding values for the secondary inductor are plotted in Fig.5.19.a for the first condition and in Fig.5.19.b for the second one. By recalling the optimum components values to maximize the transformer efficiency given by (5.6), we


Figure 5.17: $C_{p}$ as a function of the filter $Q$ factor for different filter types


Figure 5.18: $C_{s}$ as a function of the filter $Q$ factor for different filter types


Figure 5.19: $L_{s}$ values with $Q=2$ as a function of the coupling coefficient $k$ for different filter types

| $L_{p}[p H]$ | $L_{s}[n H]$ | $C_{p}[p F]$ | $C_{s}[f F]$ | $k$ |
| :---: | :---: | :---: | :---: | :---: |
| 98 | 2.45 | 4.25 | 167 | 0.4 |

Table 5.4: Transformer parameters


Figure 5.20: Transformer layout
can see that a fourth order bandpass filter of type 4, which corresponds to a Chebyshev filter with in-band ripple of 0.5 dB , can be implemented with the reactive element values reported in Table 5.4.

Implementation The transformer has been implemented using concentrically wound planar spirals [117], as shown in Fig.5.20, where the actual transformer layout is reported. Using this configuration, the common periphery between the two windings is limited to just a single turn. Therefore, mutual coupling between adjacent conductors contributes mainly to the self-inductance of each winding and not to the mutual inductance between the windings. As a result, the concentric spiral transformer has less mutual inductance and more self-inductance than the interwound configuration, giving it a lower $k$-factor. However, this does not represent a limitation in our case, as we need a coupling coefficient $k=0.4$.

The electrical lumped model of the transformer has been derived from the physical


Figure 5.21: Transformer primary coil lumped model
layout by means of electromagnetic simulations. These have been performed with the software Momentum ${ }^{\mathrm{TM}}$ from Agilent Technologies. In particular, first the primary and secondary inductor models have been derived separately. Then, a transformer compact model, as the one described in [118], has been derived.

To model the primary inductor, the two-PI model [119] reported in Fig.5.21 has been used. In this model, $L_{0}$ is the inductance, $R_{0}$ the resistance of the metal strip, $L_{1}$ and $R_{1}$ model the surface layer inductance and resistance. They are evenly split into two parts and the coupling coefficient $k_{m}$ cross couples the two parts in order to properly capture the inductive coupling among metal lines. $C_{o x}$ is the oxide capacitance, $C_{s u b}$ and $R_{s u b}$ are the substrate capacitance and resistance while $C_{s}$ is the edge-to-edge feedthrough capacitance. In addition, we use $C_{s}$ and $R_{s c}$ to model the line-to-line coupling capacitance and direct turn-to-turn electric coupling through the dielectric materials and the conductive substrate respectively.

The equivalent inductance, resistance and quality factor for the EM-simulation and the lumped model, whose parameters are reported in Table 5.5, are shown in Fig.5.22.

| $L_{0}[p H]$ | $R_{0}[m \Omega]$ | $L_{1}[n H]$ | $R_{1}[\Omega]$ | $k_{m}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 97.79 | 479 | 2.81 | 856.4 | 0.177 |
|       <br> $C_{\text {ox }}[f F]$ $R_{\text {sub }}[\Omega]$ $C_{\text {sub }}[f F]$ $R_{\text {sc }}[\Omega]$ $C_{s}[f F]$ $C_{c}[f F]$ <br> 4.53 540.03 40.21 9.40 0.4 0 |  |  |  |  |

Table 5.5: Primary inductor lumped model parameters


Figure 5.22: Primary equivalent inductance, resistance and quality factor from the EMsimulation (solid) an the lumped model (dashed)

The secondary inductor can be better modeled by means of the PI-model described in [120] and reported in Fig.5.23. In this case, the skin effect is modeled by means of resistor $R_{\text {skin }}$ and inductor $L_{\text {skin }}$ whose coupling coefficient with $L_{0}$ is given by $K_{\text {skin }}$. An additional branch constituted by $L_{e d}$ and $R_{e d}$ takes into account the magnetic coupling between the coil and the substrate. The equivalent inductance, resistance and quality factor for the EM-simulation and the lumped model with the parameters reported in Table 5.6, are shown in Fig.5.24.

|  | $L_{0}[n H]$ | $R_{0}[\mathrm{~m} \Omega]$ | $L_{\text {skin }}[f H]$ | $R_{\text {skin }}[\Omega]$ | $k_{\text {skin }}$ |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 2.46 | 11.95 | 446.2 | 3.66 | 0.9 |  |
| $C_{\text {ox }}[f F]$ | $R_{\text {sub }}[\Omega]$ | $C_{\text {sub }}[f F]$ | ] ${ }^{\text {c }}$ S $[f F]$ | $L_{e d}[p H]$ | $R_{\text {ed }}[\Omega]$ | $k_{\text {ed }}$ |
| 17.08 | 220.76 | 90 | 12.29 | 83.26 | 14.04 | 0 |

Table 5.6: Secondary inductor lumped model parameters

Finally, the overall transformer lumped model is derived from the two inductors models by adding a capacitor $C_{a}$ connected between primary and secondary to model the


Figure 5.23: Transformer secondary coil lumped model


Figure 5.24: Secondary equivalent inductance, resistance and quality factor from the EMsimulation (solid) an the lumped model (dashed)


Figure 5.25: Transformer equivalent inductance, resistance and quality factor from the EM-simulation (solid) an the lumped model (dashed)

| $C_{a}[f F]$ | $k$ |
| :---: | :---: |
| 3.4 | 0.403 |

Table 5.7: Transformer lumped model parameters
interwinding capacitance, as described in [118].
The simulations result for the transformer model are shown in Fig.5.25, while its additional parameters values are reported in Table 5.7.

### 5.3.2 Simulation Results

The Gaussian pulse generator has been simulated in typical conditions. Its output transient waveform is reported in Fig.5.26, while the corresponding spectrum is shown in Fig.5.27. We can see how it reaches a maximum output power of $-46 \mathrm{dBm} / \mathrm{MHz}$, according to specification, and it meets the FCC out-of-band masks limits without the need of external filtering. The estimated energy/pulse is $\simeq 120 \mathrm{pJ} /$ pulse, thus in line with the transmitter specifications reported in Table 5.1. This result is in line with the most efficient transmitters for UWB-IR, as the one reported in [114], where $113 \mathrm{pJ} /$ pulse are declared at a data


Figure 5.26: Transient output waveform


Figure 5.27: Output spectrum

| Specification | $[114]$ | $[112]$ | $[113]$ | This work |
| :---: | :---: | :---: | :---: | :---: |
| Data rate | $100 \mathrm{~kb} / \mathrm{s}$ | $1 \mathrm{Mb} / \mathrm{s}$ | $10 \mathrm{Mb} / \mathrm{s}$ | $100 \mathrm{~kb} / \mathrm{s}$ |
| Bandwidth | 500 MHz | 500 MHz | 500 MHz | 1.25 GHz |
| Center frequency | 3.5 GHz | 10 GHz | 4.05 GHz | 7.875 GHz |
| Energy/pulse | 113 pJ | 87.5 pJ | 43 pJ | 124 pJ |
| Efficiency | 0.02 | 0.014 | 0.03 | 0.14 |

Table 5.8: Performance summary of UWB-IR transmitters
rate of $100 \mathrm{~kb} / \mathrm{s}$. However, a more fair comparison can be carried out by considering the transmitter efficiency $\eta_{T}$ defined as

$$
\begin{equation*}
\eta_{T}=\frac{\text { load energy/pulse }}{\text { energy/pulse }} \tag{5.11}
\end{equation*}
$$

where the load energy/pulse indicates the energy per pulse transferred to the load. In [114], with a maximum output swing of 710 mV pp , a central frequency $f_{c}=3.5 \mathrm{GHz}$ and a bandwidth $B=500 \mathrm{MHz}$, we have a transmitter efficiency $\eta_{T}=2.16 \mathrm{pJ} / 113 \mathrm{pJ} \simeq 0.02$. Our transmitter reaches an efficiency of $\eta_{T}=17.5 \mathrm{pJ} / 120 \mathrm{pJ} \simeq 0.15$, thus outperforming the result reported in [114] by a factor of almost 10 .

The performance comparison with state of the art transmitters are reported in Table 5.8, where the ring oscillator power consumption has also taken into account, leading to an estimated energy/pulse of $124 \mathrm{pJ} /$ pulse.

Although the energy consumed by the frequency dividers plus the early-late detector circuit has not been taken into account, results reported in Table 5.8 show how the proposed solution outperforms the state of the art transmitter efficiency by a factor of almost 10.

### 5.4 Conclusions and Future Work

A novel energy efficient transmitter for UWB-IR has been proposed, which uses an original combined mixer and power amplifier to generate a Gaussian pulse with 1.25 GHz bandwith and center frequency of 7.875 GHz . The combined MRX-PA includes a monolithic transformer to reach the maximum output voltage swing required to ensure a link distance of 10 meters with the non-coherent energy detector receiver described in Chap. 4 .

The transformer has been designed so as to maximize the power efficiency and at the
same time to realize a fourth-order ladder filter, in order to reduce the transmitter out-of band emissions.

## Conclusions

The first part of this work, devoted to analog decoding, reports the design of the input interface for an iterative fully analog decoder for a SCCC and of two analog TCM decoders for multi-level Flash memories, all realized in submicron CMOS technologies.

Both projects exhibit the advantages already demonstrated by the analog decoders with respect to their digital counterparts, that is a reduced area occupation, a lower power consumption and an higher throughput. In fact, the SCCC hybrid decoder reaches an efficiency of $2.1 \mathrm{~nJ} /$ bit which outperforms digital decoders with the same block length, that is around 5000 , of a factor up to 50 [25], while the full analog implementations of TCM decoders presented in Chapter 3 can achieve a decoding speed comparable with the state-of-the-art linear block codes occupying a smaller area, with a BER close to that of the ideal decoding algorithm.

At the same time, some traditional limitations of the analog implementations, due mainly to the fact that their circuitry complexity increases linearly with the codeword length, are overcame as the hybrid SCCC decoder is reconfigurable in both block length and code rate.

However, it is worth to notice how the most area and power consuming circuitry for both analog decoder projects is not the one implementing the decoder core intself, but the so-called I/O interface circuitry. This consists of an analog memory and a voltage to probability converter for the SCCC decoder, while it reduces only to the voltage to probability converter for the TCM decoders. Even if the design of an analog memory can be avoided in this latter case, as the data to be processed are already stored in the Flash memory array, the complexity of the interface between the memory array and the decoder core itself increases with respect to the binary case as we deal with memory cells with 16 levels. As a consequence, it is mandatory to optimize the interface design in terms of area, power and speed in order not to spoil the overall system performance.

Moreover, as the performance loss in terms of BER of the 4-state TCM decoder with respect to the 8 -state one is only 0.5 dB while the power consumption increases by over a factor of 2, our work demonstrates how the analog approach is all the more a competitive solution for ECC as far as the decoder states number is kept low.

The second part of this thesis is focused on the design of a transceiver for low data
rate UWB-IR. The transceiver is first analyzed at system level in order to draw the specifications for both receiver and transmitter so as to guarantee a link distance of at least 10 meters.

At the receiver side, the non-coherent energy detection approach has been adopted [78], as it does not require precise phase control, which allows both the transmitter and the receiver architecture to be simplified.

The transmitter uses a novel combined mixer and power amplifier to generate a Gaussian pulse with 1.25 GHz bandwith and center frequency of 7.875 GHz . The combined MRX-PA includes a monolithic transformer to reach a maximum output voltage swing of $3.2 V_{p p}$, necessary to ensure the required link distance. The transformer has been designed in order to maximize the power efficiency and at the same time to realize a fourth-order ladder filter, so as to reduce the transmitter out-of band emissions.

The efficiency of our design has been compared with state-of-the-art UWB-IR transmitters, showing how the proposed solution leads to an improvement in the transmitter efficiency of a factor of almost 10.

A synchronization algorithm has also been proposed and successfully tested, showing how the transceiver chipset is a promising candidate for a low-power UWB-IR.

## A

## Fundamentals of Error Correcting Coding

This appendix, after a brief introduction about general communication/storage systems, presents some basic error correcting codes, with a particular emphases on Trellis Coded Modulation and Turbo codes. The decoding issue is then tackled, to concentrate on all those aspects peculiar to the analog decoding.

## A. 1 The Shannon Limit

The most intuitive way to represent a data communication or storage system is shown in Fig.A.1. This is the famous Figure 1 of most books on error correcting coding theory. An information source emits a sequence of binary digits (bits), called the uncoded sequence u. This sequence is transformed into the coded sequence $\mathbf{x}$ by an encoder and transmitted over a communication channel or a storage medium. During the transmission, the coded sequence $\mathbf{x}$ is corrupted by a noise vector $\mathbf{n}$, where we assume that the noise is of additive nature and that no inter-symbol interference is present. Thus a noisy sequence $\mathbf{y}$ is received at the input of the decoder, whose task is to estimate the most probably sent data sequence $\hat{\mathbf{u}}$ using $\mathbf{y}$.


Figure A.1: SISO simplified diagram with switch between inner and outer configuration

In his 1984 pioneering work [121], Shannon showed that every communication channel has a maximum rate for reliable data transmission, which he called the channel capacity $C$, measured in bits per second. He demonstrated that it is always possible to send information at a rate $R$ lower than $C$ through a channel with an error probability as small as desired by properly encoding the information source. This statement on controlled error probability is not true for rates above $C$.

The capacity of an ideal band-limited channel corrupted by an additive white Gaussian noise (AWGN) is given by the famous formula:

$$
\begin{equation*}
C=B \log _{2}\left(1+\frac{S}{N}\right) \tag{A.1}
\end{equation*}
$$

where $C$ is the capacity in bits per second, B is the channel bandwidth in Hertz and $S / N$ is the signal-to-noise power ratio at the receiver.

However, until the advent of complex Turbo codes [2,122], practical error control schemes have been far away from this theoretical limit. In fact, Shannon's theorem sets a limit on the maximum transmission rate over a channel, but it is silent about the way to reach it. After Turbo codes, an even higher rate has been reached using very large low-density parity-check codes [123], which were originally invented by Gallager [124].

## A. 2 Types of Codes

Most of the codes that are common use today can be distinguish between two main types, block codes and convolutional codes. The output of a block encoder is strictly block oriented and it is generated by combinatorial operations, whereas the convolutional encoders create data streams of possibly infinite length. Additionally, the output of a convolutional encoder is created by a finite-state machine, that is the encoder incorporates memory that tracks the history of the incoming data bits.

## A.2.1 Block Codes

A block code is defined as an algebraic mapping from the vector space $G F(q)^{k}$ over the Galois field $G F(q)$ into the vector space $G F(q)^{n}$, with $n>k$ [39]. If this mapping from one vector space to another is linear, we speak of linear codes.

In block coding, the incoming data stream is segmented into blocks of length $k$ and then mapped into $n$-symbol long codewords. If we assume the information is in the form
of a sequence of symbols belonging to a given alphabet of $q$ symbols (if $q=2$, the symbols are named bits), than we speak of a $(n, k) q$-ary block code. Thus, a $(n, k) q$-ary block code is a set of $q^{k}$ codewords corresponding to the $q^{k}$ possible data blocks.

A linear block code is entirely described by its generator matrix G. A codeword is built using the relation

$$
\begin{equation*}
\mathbf{x}=\mathbf{u} \cdot \mathbf{G} \tag{A.2}
\end{equation*}
$$

where both the codeword $\mathbf{x}$ and the user word $\mathbf{u}$ are assumed to be row vectors. Thus linear codes transform the all-zero input vector into the all-zero codeword. Equivalently, every codeword $\mathbf{x}$ has always to satisfy the equation

$$
\begin{equation*}
\mathbf{H} \cdot \mathbf{x}^{T}=\mathbf{0}^{T} \tag{A.3}
\end{equation*}
$$

where $\mathbf{H}$ is the parity-check matrix, that can be derived from the generator matrix $\mathbf{G}$ such as $\mathbf{G} \mathbf{H}^{T}=0$.

The code rate of a block code is defined as the ratio between the number of bits carrying information and the total codeword length. Thus, if the generator matrix is a full rank $k \times n$ matrix, the code rate is given by

$$
\begin{equation*}
R=\frac{k}{n} \tag{A.4}
\end{equation*}
$$

The Hamming distance between two codewords is the number of positions in which they differ. The minimum distance $d_{\text {min }}$ of a code is the Hamming distance of the pair of codewords with the smallest Hamming distance [39]. A code with a minimum distance $d_{\text {min }}$ can correct up to $t=\left\lfloor\left(d_{\text {min }}-1\right) / 2\right\rfloor$ errors. If $d_{\text {min }}$ is even, the code can simultaneously correct $t=\left(d_{\text {min }}-2\right) / 2$ errors and detect $d_{\text {min }} / 2$ errors.

To ensure a minimum distance equal or greater than $2 t+1$ in any $(n, k) q$-ary block code, the following condition must be satisfied:

$$
\begin{equation*}
n-k \geq \log _{q}\left\{\sum_{i=0}^{t}\left[\binom{n}{i}(q-1)^{i}\right]\right\} \tag{A.5}
\end{equation*}
$$

which is know as Hamming bound.

## A.2.2 Hamming Codes

Hamming codes are a whole class of linear block codes that can correct single errors. If we consider a binary alphabet, that is $q=2$, Hamming codes of length $n=2^{r}-1$ with
$r \geq 2$ are defined to have a parity-check matrix $\mathbf{H}$ whose columns consist of all non-zero binary vectors of length $r$, each used once.

A Hamming code is thus a $n=2^{r}-1, k=2^{r}-1-r, d=3$ block code. The Hamming code as it was first defined is the $(7,4,3)$ Hamming code, whose parity-check matrix is given by [125]:

$$
\mathbf{H}=\left(\begin{array}{lllllll}
0 & 0 & 0 & 1 & 1 & 1 & 1  \tag{A.6}\\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 1
\end{array}\right)
$$

Any code consisting of rows that are created using linear combinations and column permutations of the original matrix $\mathbf{H}$ is said to be equivalent.

## A.2.3 Convolutional Codes

Unlike block codes, convolutional codes work on data streams of possibly infinite length. An encoder for a convolutional code can be seen as a finite-state machine, that is a sequential logic circuit, with a memory of order $m$. The generator matrix $\mathbf{G}$ of a convolutional code has a general form given by:

$$
G(D)=\left(\begin{array}{cccc}
G_{11}(D) & G_{12}(D) & \cdots & G_{1 n}(D)  \tag{A.7}\\
G_{21}(D) & G_{22}(D) & \cdots & G_{2 n}(D) \\
\vdots & \vdots & \ddots & \vdots \\
G_{k 1}(D) & G_{k 2}(D) & \cdots & G_{k n}(D)
\end{array}\right)
$$

Each element $G_{i j}(D)$ represents a transfer function of a linear discrete-time system of order $m$ :

$$
\begin{equation*}
G_{i j}(D)=\frac{a_{i j, m} D^{m}+a_{i j, m-1} D^{m-1}+\cdots+a_{i j, 0}}{b_{i j, m} D^{m}+b_{i j, m-1} D^{m-1}+\cdots+b_{i j, 0}} \tag{A.8}
\end{equation*}
$$

where $D$ indicates the unit delay element.
The block diagram of a simple convolutional code with generator polynomials

$$
\begin{equation*}
G(D)=\left[D^{2}+1, D^{2}+D+1\right] \tag{A.9}
\end{equation*}
$$

is shown in Fig.A.2.
The rate of a convolutional code is still given by the ratio of the generator matrix dimensions. Thus, the convolutional code of Fig.A. 2 has a code rate $R=1 / 2$.

If the uncoded data $\mathbf{u}$ is part of the codeword, we speak of systematic code. Given the generator matrix of a convolutional code, it is always possible to build its systematic


Figure A.2: 4-state rate $1 / 2$ binary convolutional encoder


Figure A.3: Systematic version of code of Fig.A. 2
version as

$$
G(D)=\left(\begin{array}{cccc}
1 & \frac{G_{12}(D)}{G_{11}(D)} & \cdots & \frac{G_{1 n}(D)}{G_{11}(D)}  \tag{A.10}\\
1 & \frac{G_{22}(D)}{G_{21}(D)} & \cdots & \frac{G_{2 n}(D)}{G_{21}(D)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & \frac{G_{k 2}(D)}{G_{k 1}(D)} & \cdots & \frac{G_{k n}(D)}{G_{k 1}(D)}
\end{array}\right)
$$

The systematic version of the code of Fig.A. 2 is reported in Fig.A.3. Its generator matrix is given by:

$$
\begin{equation*}
G(D)=\left[1, \frac{D^{2}+D+1}{D^{2}+1}\right] \tag{A.11}
\end{equation*}
$$

The systematic version of a convolutional code is also called recursive and exhibits the same performance of the original code.

The definition of the Hamming distance given in Sec.A.2.1 can not be applied to convolutional codes, as they work on codewords of possibly infinite length. Instead, for convolutional code, we speak of minimum Euclidean distance $d_{\text {free }}$ of a code, referring to the same concept [126].


Figure A.4: 4-state transition diagram of code of Fig.A. 2


Figure A.5: 4-state trellis diagram of code of Fig.A. 2

## A.2.4 Convolutional Codes Trellis Diagram

Since a convolutional encoder is a finite-state machine, it may be completely defined by a finite state-transition diagram, such as the 4 -state diagram shown in Fig.A. 4 for the encoder of Fig.A.2. The nodes in the transition diagram are the states of the finite-state machine and the branches represent the possible transitions between states. Each branch is labeled by the user bits $\mathbf{u}$ which cause the transition as well as by the corresponding output codeword $\mathbf{x}$.

If we index the state-transition diagram by both the states and the time index $r$, Fig.A. 4 expands into the trellis diagram of Fig.A.5. This is a two dimensional representation of the operation of the encoder, capturing all possible state transitions starting from an originating state that is usually state 0 . If the finite-state machine is driven back into the original state at a certain time $r=L$, as shown in Fig.A. 6 with $L=8$, we speak of terminated codes. To force the encoder back to the original state, the last $m$ branches, where $m$ is the number of memory elements, are predetermined and no information is


Figure A.6: 4-state terminated trellis diagram of code of Fig.A. 2


Figure A.7: Trellis section of code of Fig.A. 2
transmitted in those time units. This results in a rate loss of

$$
\frac{m k}{n(n+m)}
$$

. In order to avoid this rate loss, tail baiting codes have been proposed in 1986 [127]. The trellis of a tail baiting code is formed by connecting the outgoing states of the last trellis section to the incoming states of the first trellis sections. Such a tail baiting code forms a closed ring structure with no need for termination bits. A valid codeword is then defined by a path starting in any state at a certain point, i.e. non necessary the zero state, and terminating in the same state after one turn.

Since the size of the trellis transition diagram explodes for long user data sequences, a complete description of a convolutional code can also be given by a single section of the trellis diagram. The trellis section for the code of Fig.A. 2 is shown in Fig.A.7.

## A.2.5 Trellis Coded Modulation

 -The use of an error correcting scheme with rate $R=k / n$ to increase the reliability of binary transmission or storage systems, reduces the spectral efficiency from its maximum value $\mu=1 \mathrm{bit} / \mathrm{sec} / \mathrm{Hz}$ to $\mu=R<1 \mathrm{bps} / \mathrm{Hz}$. This leads to a faster signalling rate or to a larger bandwidth if the aim is to guarantee the same rate of the uncoded system.


Figure A.8: Signal constellations

Equivalently, the bit rate has to be reduced by a factor of $1 / B$ so as to keep the transmission symbol rate or bandwidth constant, where $B$ is the channel bandwidth.

To increase the data rate without increasing the bandwidth, Ungerboeck [51] and Imai and Hirakawa [128] used an expanded signal set, such as $2^{m}$-ary PSK or QAM digital modulation, and then applied an error correcting code to increase the Euclidean distance between codewords.

Several signal constellations used in digital communication systems are shown in Fig.A.8. From the viewpoint of digital signal processing, modulation is mapping, that is the process of assigning a $m$-dimensional binary vector $b$ to a signal point $(x(b), y(b))$ in the constellation. Using $2^{m}$-ary modulation instead of the binary one has the advantage that the number of bits per symbol is increased by a factor of $m$, thus increasing the spectral efficiency of the system. On the other hand, the required average energy of the signal increases, as in the QAM case, or the distance between modulation symbols decreases, as with PSK modulation. In practice, transmitted power or storage level is limited to a maximum value. This implies that the signal points become closer to each other. As a result, an error correcting code is needed to reduce the increased error probability and to


Figure A.9: Trellis section of code of Fig.A. 2
improve the system reliability.
The basic idea of TCM is thus to expand the signal constellation in order to obtain the redundancy needed for error correction coding and then to design a trellis code to increase the minimum Euclidean distance between codewords. In fact, the asymptotic code gain of a TCM scheme is given by:

$$
\begin{equation*}
G=10 \log _{10}\left(\frac{d_{\text {free }}^{2}}{d_{\text {unc }}^{2}}\right) \tag{A.12}
\end{equation*}
$$

where $d_{u n c}^{2}$ indicates the minimum squared Euclidean distance between uncoded signal sequences.

The design of a TCM, which is the joint design of a trellis code and a modulation scheme, is performed using a mapping by set partitioning, as proposed by Ungerboeck


Figure A.10: TCM encoder
in [51]. A basic trellis structure, associated with the state transitions of a finite-state machine, is selected and signal subsets mapped to trellis branches. Uncoded signals are assigned to parallel branches so as to increase system spectral efficiency.

Thus, a $2^{m}$-ary modulation signal set $S$ is partitioned in $m$ levels. For $1 \leq i \leq m$, at the $i$-th partition level, the signal set is divided into two subsets $S_{i}(0)$ and $S_{i}(1)$, such that the intra-set distance, $\delta_{i}$, is maximized. A label bit $b_{i} \in\{0,1\}$ is associated with the subset choice, $S_{i}\left(b_{i}\right)$, at the $i$-th partition level. This partition process results in a labelling of the signal points. Each signal point in the set has a unique $m$-bit label $b_{1} b_{2} \cdots b_{m}$ and is denoted by $s\left(b_{1}, b_{2}, \cdots, b_{m}\right)$. With this Ungerboeck partitioning of a $2^{m}$-ary modulation signal constellation, the intra-set distances are in nondecreasing order $\delta_{1}^{2} \leq \delta_{2}^{2} \leq \cdots \leq$ $\delta_{m}^{2}$. This strategy corresponds to a natural labelling for $M$-PSK modulations, i.e., binary representations of integers, whose value increases clockwise (or counter-wise). Fig.A. 9 shows a natural mapping of bits to signals for the case of a 8-PSK modulation, with $\delta_{1}^{2}=0.586, \delta_{2}^{2}=2$ and $\delta_{3}^{2}=4$. Ungerboeck regarded the encoder "simply as a finite-state machine with a given number of states and specified state transitions". He gave a set of pragmatic rules to map signal subsets and points to branches in a trellis. These rules can be summarized as follows:

1. all subsets should occur in the trellis with equal frequency and with a fair amount of regularity and symmetry;
2. state transitions that begin or end in the same state should be assigned to subsets


Figure A.11: Parallel concatenated convolutional code block diagram
separated by the largest Euclidean distance;
3. parallel transitions are assigned to signal points separated by the largest Euclidean distance, that is the highest partition levels.

The general structure of a TCM encoder is shown in Fig.A.10. In the general case of a rate $(m-1) / m$ TCM system, the trellis structure is inherited from a $k /(k+1)$ convolutional encoder. The uncoded bits introduce parallel branches in the trellis.

## A.2.6 Turbo Codes

In 1993, Berroux [2] presented his first article on Turbo codes, which has turned the view of coding theory upside down. This new coding scheme, whose performance are very close to the Shannon limit [129], consists of two or more convolutional codes connected in series or in parallel by a bit-interleaving structure $\pi$.

In particular, Parallel Concatenated Convolutional Codes PCCC consist of two convolutional encoders working in parallel, as shown in Fig.A.11. The first encoder receives a copy of the user data $\mathbf{u}$ while the second encoder is fed with a scrambled version of the same user data obtained by means of an interleaver $\pi$.

In the Serial Concatenated Convolutional Codes SCCC scheme [130], the two encoder are connected in series, as depicted in Fig.A.12. The first encoder (Outer) transforms the user data $\mathbf{u}$ into a temporary codeword, whose bits are permuted by the interleaver and then fed to the second encoder (Inner). A puncturer can be inserted between the two encoders to delete some parity bits so as to increase the code rate.


Figure A.12: Serial concatenated convolutional code block diagram

## A. 3 Decoding

Basically, the decoding is a decision-making process. Based on the observed data vector $\hat{\mathbf{u}}$, the decoder tries to figure out which information bit or information vector has been generated by the information source.

The so called hard decision algorithms have been widely used in the past, due to their straightforward implementation in the digital domain. In fact, the Viterbi algorithm has been the standard decoding algorithm for most convolutional codes for over a decade.

With the advent of Turbo codes, soft decision algorithms have become popular, because they allow the implementation of an iterative decoding process, necessary to the Turbo code decoding.

## A.3.1 Viterbi Algorithm

In 1967, Viterbi [131] introduced a new algorithm for decoding convolutional codes. It works on the trellis diagram of the code and it decide what codeword have the maximumlikelihood (or the minimum-distance) from the received string.

If we consider a trellis diagram of a convolutional code, as the one shown in Fig.A.5, we can associate to each possible path $\mathbf{x}$ the log-likelihood function

$$
\begin{equation*}
\log L(\mathbf{y} \mid \mathbf{x})=\log P(\mathbf{y} \mid \mathbf{x})=\sum_{i=0}^{n-1} \log P\left(y_{i} \mid x_{i}\right) \tag{A.13}
\end{equation*}
$$

where $\mathbf{y}$ is the received string.
We observe that if the max-log-likelihood path includes the state $s_{k}$ at time unit $k$, then the first $k$ branches of that path constitute the max-log-likelihood path of the partial trellis from time 0 to time $k$.

Therefore it suffices at time $k$ to determine and retain for each possible state $s_{k}$ only the biggest path from the unique state at time 0 to that state. This path is called the "survivor".

The time- $k+1$ survivors may be determined from the time- $k$ survivors by the following recursive "add-compare-select" rules:

1. for each branch from a state at time $k$ to a state at time $k+1$, add the metric of that branch to the metric of the time- $k$ survivor to get a candidate path metric at time $k+1$;
2. for each state at time $k+1$, compare all the candidate path metrics arriving at that state and select the path corresponding to the largest as the survivor. Store the new survivor path.

At the end of the trellis, there is a unique state, whose survivor is the max-loglikelihood path for the received string $\mathbf{y}$.

This regular recursive structure is attractive for software or hardware implementation because it requires only some memory to store the metrics and the temporary paths.

The Viterbi algorithm can be applied to decode the most likely TCM sequence as well, provided that the branch metric generator is modified to include parallel branches. The selection of the winning branch and surviving uncoded bits should be changed as well. The survivor path (or trace-back) memory should include the ( $m-1-k$ ) uncoded bits, as opposed to just one bit for rate- $1 / n$ binary convolutional codes.

However, for practical considerations, it was suggested in [132] that $2^{m}$-ary modulation signal constellations be partitioned in such a way that the cosets at the top two partition levels are associated with the output of a rate $1 / 2$ convolutional encoder. This mapping leads to a pragmatic TCM system. With respect to the general encoder structure shown in Fig.A.10, the value of $k=1$ is fixed. As a result, the trellis structure of a pragmatic TCM remains the same, as opposed to the first TCM proposed by Ungerboeck, for all values of $m>2$. The difference is that the number of parallel branches $m-2$ increases with the number of bits per symbol. This suggests a two-stage decoding method in which, at the first stage, the parallel branches in the trellis "collapse" into a single branch and a conventional off-the-shelf Viterbi decoder can be used to estimate the coded bits associated with the two top partition levels. In a second decoding stage, based on the estimated coded bits and the positions of the received symbols, the uncoded bits are estimated.

## A.3.2 MAP Decision Rule

Recalling the general communication system of Fig.A.1, if the data transmission is assumed to be over a time-invariant, memory-less and feedback-less channel, we can define


Figure A.13: $(7,4,2)$ code tanner graph
the conditional probability

$$
\begin{equation*}
P_{\mathbf{Y} \mid \mathbf{X}}(\mathbf{y} \mid \mathbf{x})=\prod_{i} P_{Y \mid X}\left(y_{i} \mid x_{i}\right) \tag{A.14}
\end{equation*}
$$

This conditional probability, called the a posteriori probability APP, represents the probability of receiving at the decoder input the noisy codeword $\mathbf{y}$ given the sent user data $\mathbf{x}$.

Given the received symbols sequence $\mathbf{y}$, a Maximum a Posteriori Probability algorithm finds the user data sequence $\mathbf{x}_{M A P}$ that maximizes (A.14), which means:

$$
\begin{equation*}
P_{\mathbf{Y} \mid \mathbf{X}}\left(\mathbf{y} \mid \mathbf{x}_{M A P}\right)=\max _{\mathbf{x} \in \mathbf{X}} P_{\mathbf{Y} \mid \mathbf{X}}(\mathbf{y} \mid \mathbf{x}) \tag{A.15}
\end{equation*}
$$

An algorithm implementing the MAP decision rule is the Sum-Product algorithm [133].

## A.3.3 Sum-Product Algorithm

The Sum-Product algorithm can be better described with the help of an example. Let's consider a simple binary block code $(7,4,2)$ with parity-check matrix:

$$
\mathbf{H}=\left(\begin{array}{lllllll}
1 & 1 & 1 & 0 & 0 & 0 & 0  \tag{A.16}\\
1 & 0 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
$$

Each row of the parity-check matrix corresponds to a parity equation. Thus, from the parity-check matrix (A.16), we can derive the parity check equations:

$$
\begin{align*}
& x_{1} \oplus x_{2} \oplus x_{3}=0 \\
& x_{1} \oplus x_{4} \oplus x_{5}=0  \tag{A.17}\\
& x_{1} \oplus x_{6} \oplus x_{7}=0
\end{align*}
$$



Figure A.14: Example of weights propagation through a function node
which have to be simultaneously satisfied from any valid codeword.
This code can also be described by means of its Tanner graph [134] reported in Fig.A.13, where the black circles called variable nodes represent the codeword symbols $x_{i}$, with $i=1,2, \cdots, 7$, and the so-called function nodes $\oplus$ indicate the parity relations (A.17).

After receiving a codeword $\mathbf{y}$, we assign a weight to each possible alphabet symbol $a_{k}$, where $a_{k} \in\{0,1\}$ for a binary code:

$$
\begin{equation*}
w_{k}\left(a_{k}\right)=P\left(y_{k} \mid x_{k}=a_{k}\right) \tag{A.18}
\end{equation*}
$$

Thus, the weight of a codeword can be calculated as:

$$
\begin{equation*}
w(\mathbf{x})=\prod_{k} w_{k}\left(x_{k}\right)=P(\mathbf{y} \mid \mathbf{x}) \tag{A.19}
\end{equation*}
$$

For each alphabet symbol $a_{k}$, the sum of the weights of all those codewords that present the value $a_{k}$ in correspondence of the variable $x_{k}$, given by:

$$
\begin{equation*}
\sum_{\mathbf{x} \in \mathbf{X}: x_{k}=a_{k}} w_{k}\left(x_{k}\right) w(\mathbf{x}) \tag{A.20}
\end{equation*}
$$

is proportional to the APP value (A.14) according to:

$$
\begin{equation*}
P\left(x_{k}=a_{k} \mid \mathbf{y}\right) \propto \sum_{\mathbf{x} \in \mathbf{X}: x_{k}=a_{k}} P(\mathbf{y} \mid \mathbf{x}) \tag{A.21}
\end{equation*}
$$

The most probable sent codeword is chosen as the one that maximizes the APP.
The weights associated with each variable node can be seen as messages that propagate thought the graph. The messages passing can be scheduled so as to make the calculation of the maximum APP simple and automatic. The first step consists in assigning to the graph leaf nodes, that are the variable nodes connected to just one function node,


Figure A.15: Example of weights propagation through a variable node
the weights corresponding to all the possible variable nodes values. These weights are then passed on to the closest function node. Each function nodes receives the leaf nodes weights as inputs and calculates the output weights according to the the XOR gate function. In the example if Fig.A.14, the two function node inputs are

$$
\left\{\begin{array}{l}
\left(w_{5}(0), w_{5}(1)\right)=(1,3) \\
\left(w_{4}(0), w_{4}(1)\right)=(2,5)
\end{array}\right.
$$

while its output is given by:

$$
\left\{\begin{array}{l}
w(0)=w_{4}(0) \cdot w_{5}(0)+w_{4}(1) \cdot w_{5}(1)=2+15=17 \\
w(1)=w_{4}(0) \cdot w_{5}(1)+w_{4}(1) \cdot w_{5}(0)=5+6=11
\end{array} \quad \Rightarrow(17,11)\right.
$$

Thus the variable node $x_{1}$ receives the three weights coming from the three neighbor function nodes. The local weight of the node $x_{1}$ is then propagated towards each of the function nodes after being multiplied by the sum of all the weights of the incoming branches but the one towards which it is propagated, as described in Fig.A.15. This sum-product process is repeated until reaching the leaf nodes. At the end we obtain a pair of weights (one incoming and one outgoing) for each graph edge. The total weight associated with each edge, that is the weight of the corresponding node, is given by the sum of these two weights. Thus, if we chose for each node the grater weight among all the ones obtained for all the possible alphabet symbols, we have a MAP decision.

Thus the Sum-Product algorithm computations, as described by Gallager for decoding of LDCP codes [124], can be summarized in the following steps:
initialization: each variable node $x_{k}$ is initialized with the conditional probabilities $P\left(y_{k} \mid x_{k}\right)=\lambda y_{k} \mid x_{k}=\lambda_{x_{k}}^{(0)}, \forall k=1, \cdots, q$ where $q$ indicates the alphabet symbols number


Figure A.16: Iterative decoding scheme
nth iteration, function-to-variable nodes: for each variable node $x_{k}$, we compute the $\oplus$ of all the messages coming from the neighbor variable nodes but $x_{k}$, that is $\forall r$ : $h_{r, k}=1$ :

$$
\begin{equation*}
\lambda_{\oplus_{r} \rightarrow x_{k}}^{(n)}=\bigoplus_{\forall c: h_{r, c}=1, c \neq k} \lambda_{x_{c} \rightarrow \oplus_{r}}^{(n-1)} \tag{A.22}
\end{equation*}
$$

nth iteration, variable-to-function nodes: for each function node $\oplus_{k}$, we need to compute the values of all the variable nodes connected to the function node, that is $\forall c: h_{k, c}=1$ :

$$
\begin{equation*}
\lambda_{x_{c} \rightarrow \oplus_{k}}^{(n)}=\lambda y_{k} \mid x_{k} \odot\left(\bigodot_{\forall r: h_{r, c}=1, r \neq k} \lambda_{\oplus_{r} \rightarrow x_{c}}^{(n)}\right) \tag{A.23}
\end{equation*}
$$

where $\odot$ corresponds to the logical function $\overline{E X O R}$.
final decision: after a fixed iteration number $N$, the sent codeword symbols are computed according to:

$$
\begin{equation*}
\lambda_{\hat{x}_{k}}=\lambda y_{k} \mid x_{k} \odot\left(\bigodot_{\forall r: h_{r, c}=1} \lambda_{\oplus_{r} \rightarrow x_{c}}^{(N)}\right) \tag{A.24}
\end{equation*}
$$

## A.3.4 Iterative Decoding

As the Sum-Product algorithm complexity increases linearly with the code states number, it can not be used for Turbo code decoding. Indeed, due to the interleaver presence, the Turbo codes states number is very high and besides difficult to compute.

Concatenated codes can be efficiently decoded by means of a suboptimum algorithm, whose complexity is almost independent on the interleaver length. This algorithm, called iterative, shows performance close to the Shannon limit [135, 136], even if its effectiveness has not been analytically proved yet.

The iterative decoding scheme is reported in Fig.A.16. The Turbo decoder uses two MAP decoders, one for each constituent code. Each MAP decoder has two inputs, the
channel output and an a priori probability generated by the other decoder, and generates at its output an extrinsic information, that is the information on the received symbols known the constituent code. This extrinsic information constitutes the a priori probability for the other decoder, as it is not correlated with the knowledge of its own constituent code.

The decoding algorithm As already pointed out, each decoder has to take a MAP decision on the bases of two inputs, the channel information and the extrinsic probabilities generated by the other decoder. Thus, the decoder has to compute for all the possible codewords the corresponding APP and then chose among them the user word $\hat{\mathbf{u}}$ with the maximum APP, that is:

$$
\begin{equation*}
\hat{u}_{k}=\max _{i}[A P P(k, i)] \tag{A.25}
\end{equation*}
$$

where $k$ is the symbol under analysis index, $i$ represents every possible alphabet symbol value and $\operatorname{APP}(k, i)$ is defined by:

$$
\begin{align*}
\left.A P P(k, i) \doteq p\left(\mathbf{r}_{1}, \mathbf{r}_{2}\right) \mid u_{k}=1\right) & \left.=\sum_{\mathbf{u}: u_{k}=i} p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) p\left(\mathbf{r}_{2}\right) \mid c_{2}(\mathbf{u})\right) p_{a}(\mathbf{u})  \tag{A.26}\\
p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) & =\prod_{j=1}^{N_{1}} p\left(r_{1 j} \mid c_{1 j}(\mathbf{u})\right)  \tag{A.27}\\
p\left(\mathbf{r}_{2} \mid c_{2}(\mathbf{u})\right) & =\prod_{m=1}^{N_{2}} p\left(r_{2 m} \mid c_{2 m}(\mathbf{u})\right)  \tag{A.28}\\
p_{a}(\mathbf{u}) & =\prod_{l=1}^{K} p_{a}\left(u_{l}\right) \tag{A.29}
\end{align*}
$$

where $\mathbf{u}$ is the transmitted codeword, $\mathbf{r}_{1}$ and $\mathbf{r}_{2}$ indicate the received codewords relative to the two codes, $c_{1}(\cdot)$ and $c_{2}(\cdot)$ are the ideal decoding functions of the two constituent codes, $p_{a}(\cdot)$ represents the a priori probability and $K, N_{1}$ and $N_{2}$ are the symbols number of the user word and of the two codes codeword respectively.

According to Berroux [2], equations (A.28) and (A.29) can be expressed as function of the single codeword symbols instead of the whole codeword $\mathbf{u}$. As a consequence, we can write equation (A.27) as a product of functions defined on the single symbol as well, that is:

$$
\begin{equation*}
A P P(k, i)=\tilde{P}_{1 k}(i) \cdot \tilde{P}_{2 k}(i) \cdot p_{a}(i) \tag{A.30}
\end{equation*}
$$

where $\tilde{P}_{1 k}(i)$ and $\tilde{P}_{2 k}(i)$ solve an opportunely derived non-linear system [2]. For the sake of brevity, only the solutions are reported hereafter:

$$
\begin{align*}
& \tilde{P}_{1 k}(i)=\sum_{\mathbf{u}: u_{k}=i} p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) \prod_{l \neq k} \tilde{P}_{2 l}\left(u_{l}\right) p_{a}\left(u_{l}\right)  \tag{A.31}\\
& \tilde{P}_{2 k}(i)=\sum_{\mathbf{u}: u_{k}=i} p\left(\mathbf{r}_{2} \mid c_{2}(\mathbf{u})\right) \prod_{l \neq k} \tilde{P}_{1 l}\left(u_{l}\right) p_{a}\left(u_{l}\right) \tag{A.32}
\end{align*}
$$

that can be computed as:

$$
\begin{gather*}
\tilde{P}_{1 k}^{(0)}(i)=1, k=1, \cdots, K \\
\vdots  \tag{A.33}\\
\tilde{P}_{1 k}^{(m)}(i)=\sum_{\mathbf{u}: u_{k}=i} p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) \prod_{l \neq k} \tilde{P}_{2 l}\left(u_{l}\right) p_{a}\left(u_{l}\right), k=1, \cdots, K \\
\tilde{P}_{2 k}^{(m)}(i)=\sum_{\mathbf{u}: u_{k}=i} p\left(\mathbf{r}_{2} \mid c_{2}(\mathbf{u})\right) \prod_{l \neq k} \tilde{P}_{1 l}\left(u_{l}\right) p_{a}\left(u_{l}\right), k=1, \cdots, K
\end{gather*}
$$

Log-Likelihood Ratio Algorithm If the symbol alphabet is binary, it can be convenient to use an additive version of the decoding algorithm previously described, especially in digital implementations where multiplications are expensive to implement. The additive version of the iterative decoding algorithm works on the so called Log-Likelihood Ratio, $L L R$, which are defined as:

$$
\begin{align*}
& L_{k}(A P P) \doteq \log \frac{\sum_{\mathbf{u}: u_{k}=0} p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) p\left(\mathbf{r}_{2} \mid c_{2}(\mathbf{u})\right) p_{a}(\mathbf{u})}{\sum_{\mathbf{u}: u_{k}=1} p\left(\mathbf{r}_{1} \mid c_{1}(\mathbf{u})\right) p\left(\mathbf{r}_{2} \mid c_{2}(\mathbf{u})\right) p_{a}(\mathbf{u})} \\
& L_{k} \doteq \log \frac{p\left(r_{k} \mid 0\right)}{p\left(r_{k} \mid 1\right)} \\
& L_{1 j} \doteq \log \frac{p\left(r_{1 j} \mid 0\right)}{p\left(r_{1 j} \mid 1\right)}, j=1, \cdots, N_{1} \\
& L_{2 m} \doteq \log \frac{p\left(r_{2 m} \mid 0\right)}{p\left(r_{2 m} \mid 1\right)}, m=1, \cdots, N_{2}  \tag{A.34}\\
& L_{a} \doteq \log \frac{p_{a}(0)}{p_{a}(1)} \\
& \pi_{1 l} \doteq \log \frac{\tilde{P}_{1 l}(0)}{\tilde{P}_{1 l}(1)} \\
& \pi_{2 m} \doteq \log \frac{\tilde{P}_{2 m}(0)}{\tilde{P}_{2 m}(1)}
\end{align*}
$$

After some computations, for which we refer to the literature [137], we obtain the desired formula:

$$
\begin{equation*}
L_{k}(A P P)=\pi_{1 k}+\pi_{2 k}+L_{a} \tag{A.35}
\end{equation*}
$$



Figure A.17: Soft-Input Soft-Output module


Figure A.18: An edge of the trellis section
where $\pi_{1 k}$ and $\pi_{2 k}$ can be calculated with an iterative process, as described by (A.3.4). It is worth to notice how, using the additive version of the iterative algorithm, only one value has to be propagated as opposite to the two required by its multiplicative version.

## A.3.5 Soft-Input Soft-Output Algorithm

The iterative decoding algorithm, as is it stated in Sec.A.3.4, can not be efficiently used for Turbo codes decoding because it requires two MAP decoders, whose complexity and memory grows linearly with the decoding latency.

In order to overtake this limit, a novel version of the iterative algorithm, called SoftInput Soft-Output, SISO has been introduced [138]. The SISO algorithm uses more modules, one for each constituent code, with real or soft input and output values.

An example of a SISO module is reported in Fig.A.17. Both its two inputs, $\mathbf{P}(c ; I)$ and $\mathbf{P}(u ; I)$ and two outputs, $\mathbf{P}(c ; O)$ and $\mathbf{P}(u ; O)$, represent probability distributions.

To illustrate the decoding algorithm, first presented by Bahl, Cocke, Jelinek and Raviv in 1974 [139], whence the name BCJR, we consider a code trellis section, as the one shown in Fig.A.18. Every trellis section is characterized by:

- a set of $N$ states $\mathcal{S}=\left\{s_{1}, \cdots, s_{N}\right\}$. The state of the trellis at time $k$ is $S_{k}=s$, with

$$
s \in \mathcal{S}
$$

- a set of $N \cdot N_{I}$ edges obtained by the Cartesian product

$$
\mathcal{E}=\mathcal{S} \times \mathcal{U}=\left\{e_{1}, \cdots, e_{N \cdot N_{I}}\right\}
$$

which represent all possible transitions between trellis states.
Indeed, the following functions are associated to each edge $e \in \mathbb{E}$ :

- the original state $s^{S}(e)$ (the projection of $e$ onto $S$ );
- the final state $s^{E}(e)$;
- the input symbol $u(e)$ (the projection of $e$ onto $\mathcal{U}$ );
- the output symbol $c(e)$.

In the case of systematic encoders the pair $\left(s^{S}(e), c(e)\right)$ also identifies the edge since $u(e)$ is uniquely determined by $c(e)$. In the following, we consider only the case in which the pair $\left(s^{S}(e), u(e)\right)$ uniquely identifies the final state $s^{E}(e)$; this assumption is always verified, as it is equivalent to say that, given the initial trellis state, there is a one-to-one correspondence between input sequences and state sequences, a property required for the code to be uniquely decodable.

We also indicate with $\mathbf{P}(c ; I)$ and $\mathbf{P}(u ; I)$ the code and the input a priori distribution respectively, while $\mathbf{P}(u ; O)$ is the user data a posteriori distribution.

If $k$ indicates the discrete index $(k \in\{1, \cdots, n\})$, the BCJR algorithm can be divided into two steps:

- at time $k$, the SISO output probability distributions are computed according to

$$
\begin{align*}
& \tilde{P}_{k}(c ; O)=\tilde{H}_{c} \sum_{e: c(e)=c} A_{k-1}\left[s^{S}(e)\right] P_{k}[u(e) ; I] P_{k}[c(e) ; I] B_{k}\left[s^{E}(e)\right]  \tag{A.36}\\
& \tilde{P}_{k}(u ; O)=\tilde{H}_{u} \sum_{e: u(e)=u} A_{k-1}\left[s^{S}(e)\right] P_{k}[u(e) ; I] P_{k}[c(e) ; I] B_{k}\left[s^{E}(e)\right] \tag{A.37}
\end{align*}
$$

- the quantities $A_{k}(\cdot)$ and $A_{k}(\cdot)$ are obtained through the so-called forward and backward recursions, respectively, as:

$$
\begin{align*}
& A_{k}(s)=\sum_{e: s^{S}(e)=s} A_{k-1}\left[s^{S}(e)\right] P_{k}[u(e) ; I] P_{k}[c(e) ; I] \quad, k=1, \cdots, n-1  \tag{A.38}\\
& B_{k}(s)=\sum_{e: s^{S}(e)=s} B_{k+1}\left[S^{S}(e)\right] P_{k+1}[u(e) ; I] P_{k+1}[c(e) ; I] \quad, k=n-1, \cdots, 1 \tag{A.39}
\end{align*}
$$

with initial conditions

$$
\begin{align*}
& A_{0}(s)= \begin{cases}1 & \text { if } s=S_{0} \\
0 & \text { otherwise }\end{cases}  \tag{A.40}\\
& B_{n}(s)= \begin{cases}1 & \text { if } s=S_{0} \\
0 & \text { otherwise }\end{cases} \tag{A.41}
\end{align*}
$$

The two quantities $\tilde{H}_{c}$ and $\tilde{H}_{u}$ are normalization constants defined as:

$$
\begin{align*}
& \tilde{H}_{c}: \sum_{c} \tilde{P}_{k}(c ; O)=1  \tag{A.42}\\
& \tilde{H}_{u}: \sum_{u} \tilde{P}_{k}(u ; O)=1 \tag{A.43}
\end{align*}
$$

$P_{k}[c(e) ; I]$ in (A.36) and $P_{k}[u(e) ; I]$ in (A.37) are constant with respect to the correspondent sum terms. Thus, defining

$$
\begin{align*}
& P_{k}(c ; O) \doteq H_{c} \frac{\tilde{P}_{k}(c ; O)}{P_{k}(c ; I)}, H_{c}: \sum_{u} P_{k}(c ; O)=1  \tag{A.44}\\
& P_{k}(u ; O) \doteq H_{u} \frac{\tilde{P}_{k}(u ; O)}{P_{k}(u ; I)}, H_{u}: \sum_{u} P_{k}(u ; O)=1 \tag{A.45}
\end{align*}
$$

it can be easily verified that

$$
\begin{align*}
& P_{k}(c ; O)=H_{c} \tilde{H}_{c} \sum_{e: c(e)=c} A_{k-1}\left[S^{S}(e)\right] P_{k}[c(e) ; I] B_{k}\left[s^{E}(e)\right] \\
& P_{k}(u ; O)=H_{u} \tilde{H}_{u} \sum_{e: u(e)=u} A_{k-1}\left[s^{S}(e)\right] P_{k}[c(e) ; I] B_{k}\left[s^{E}(e)\right] \tag{A.46}
\end{align*}
$$

In literature, $P_{k}(c ; O)$ and $P_{k}(u ; O)$ are called extrinsic information, as they represent the added value by the SISO module to the a priori distributions $P_{k}(c ; I)$ and $P_{k}(u ; I)$.

It is worth to notice how the implementation of the two equations (A.46) requires less hardware resources than that of (A.3.4), that is the reason way we used the SISO decoding algorithm for our decoder implementation.

As for the MAP algorithm, an additive version of the SISO algorithm can be easily
derived by considering the logarithm of all the quantities previously defined, that is:

$$
\begin{align*}
\pi(c: I) & \doteq \log \left[P_{k}(c: I)\right] \\
\pi(u: I) & \doteq \log \left[P_{k}(u: I)\right] \\
\pi(c: O) & \doteq \log \left[P_{k}(c: O)\right]  \tag{A.47}\\
\pi(u: O) & \doteq \log \left[P_{k}(u: O)\right] \\
\alpha_{k}(s) & \doteq \log \left[A_{k}(s)\right] \\
\beta_{k}(s) & \doteq \log \left[B_{k}(s)\right]
\end{align*}
$$

The forward and backward recursions can be written as:

$$
\begin{align*}
& \alpha_{k}(s)=\log \left[\sum_{e: s^{\varepsilon}(e)=s} \exp \left\{\alpha_{k-1}\left[s^{S}(e)\right] \pi_{k}[u(e) ; I] \pi_{k}[c(e) ; I]\right\}\right]  \tag{A.48}\\
& \beta_{k}(s)=\log \left[\sum_{e: s^{( }(e)=s} \exp \left\{\beta_{k+1}\left[s^{S}(e)\right] \pi_{k+1}[u(e) ; I] \pi_{k+1}[c(e) ; I]\right\}\right] \tag{A.49}
\end{align*}
$$

where $k \in\{1, \cdots, n-1\}$ for both equations. The initial conditions are given by:

$$
\begin{align*}
& \alpha_{0}(s)= \begin{cases}0 & \text { if } s=S_{0} \\
-\infty & \text { otherwise }\end{cases}  \tag{A.50}\\
& \beta_{n}(s)= \begin{cases}0 & \text { if } s=S_{0} \\
-\infty & \text { otherwise }\end{cases} \tag{A.51}
\end{align*}
$$

The algorithm computation can be further simplified considering the approximation

$$
\begin{equation*}
\log \left[\sum_{i}^{L} \exp \left(a_{i}\right)\right] \simeq \max _{i=1, \cdots, L} a_{i} \tag{A.52}
\end{equation*}
$$

which leads to good results for medium to high signal-to-noise power ratio. Using this simplification we can avoid to compute exponential and logarithm functions, which are quite complex to implement in the digital domain.

However, in the analog domain the implementation of the non additive SISO algorithm is straightforward.

## A. 4 Analog Decoding

In 1998, Hagenauer [140] showed how analog networks could be efficiently used to decode binary Turbo codes. The main attractiveness of the proposed solution was the possibility to get rid of the iteration cycles presented in all the Turbo codes decoding schemes.


Figure A.19: Analog decoders building block

Moreover, he highlighted how there is a simple and straightforward correspondence between the Sum-Product modules and simple analog circuits.

In his work [6], Loeliger proposed an implementation of such a decoder by means of analog VLSI based on Sum-Product module realized with simple analog transistor circuits. In particular, he showed how the whole family of Sum-Product modules can be obtained with small variations of the same basic analog circuit. Thus, the decoder becomes an asynchronous analog network that, after a certain time evolution, settles towards a state that corresponds to the decoded information data. The iterative cycles of the decoding algorithm are then substituted by the continuous time feedbacks of the analog circuit.

Furthermore, analog decoders proved to be robust against all analog circuits nonidealities and able to overcome the performance of their digital counterparts by a factor up to 100 in terms of decoding speed and power consumption.

## A.4.1 Sum-Product Module

The building blocks of an analog decoder are the Sum-Product modules. A generic SumProduct module, as the one shown in Fig.A.19, computes the output probability distribution $P_{Z}$, with $Z \doteq\left\{z_{1}, \cdots, z_{k}\right\}$, from the two input probability distributions $P_{X}$ and $P_{Y}$, with $X \doteq\left\{x_{1}, \cdots, x_{m}\right\}$ and $\mathcal{Y} \doteq\left\{y_{1}, \cdots, y_{n}\right\}$, according to

$$
\begin{equation*}
p_{Z}(z)=\gamma \sum_{x \in X} \sum_{y \in Y} p_{X}(x) p_{Y}(y) f(x, y, z) \forall z \in Z \tag{A.53}
\end{equation*}
$$

where $f$ is a function from $X \times \mathcal{Y} \times \mathcal{Z}$ into $\{0,1\}$ and where $\gamma$ is an appropriate scale factor that does not depend on $z$.

When implemented with analog circuits, a probability distribution $p_{X}$ is represented by means of a current vector $\left(I_{x_{1}}, \cdots, I_{x_{k}}\right)$ such as

$$
\begin{equation*}
\sum_{i} I_{x_{i}}=I_{x} \geq 0 \tag{A.54}
\end{equation*}
$$



Figure A.20: Basic circuit for Sum-Product module implementation

On the other hand, every current vector $\left(I_{1}, \cdots, I_{n}\right)$ with non-negative and all non-zero elements, can be seen as a probability distribution $p_{Y}$ with alphabet $\mathcal{Y} \doteq\left\{y_{1}, \cdots, y_{n}\right\}$, whose values are equal to $p_{Y}\left(y_{i}\right) \doteq I_{i} /\left(I_{1}+\cdots I_{n}\right)$. The analog building block proposed by Loeliger and shown in Fig.A. 20 is derived from the well-known Gilbert multiplier, from the name of the person who proposed it for the first time in 1968 [56]. Even if implemented with bipolar transistors in its original version, the Gilbert multiplier can also be realized in CMOS technology by means of MOS transistors working in their exponential or weak inversion region.

The Gilbert cell behavior is described by the following input-output equation:

$$
\begin{equation*}
I_{z, i, j}=I_{z} \cdot \frac{I_{x, i}}{I_{x}} \cdot \frac{I_{y, j}}{I_{y}} \tag{A.55}
\end{equation*}
$$

where $I_{x} \doteq \sum_{i} I_{x, i}, I_{y} \doteq \sum_{j} I_{y, j}$ and $I_{x} \doteq \sum_{i} \sum_{j} I_{z, i, j}=I_{x}$. Thus the Sum-Product module computes the products of the two probability mass functions $p_{X}(i) \doteq I_{x, i} / I_{x}$ and $p_{Y}(j) \doteq$ $I_{y, j} / I_{y}$.

## A.4.2 Soft-gates

The different computational modules needed for analog decoding can be easily obtained from the basic Sum-Product cell by an adequate choice of the $\{0,1\}$ valued function $f$.

In particular, if the function $f$ is defined as $f(x, y, z)=1$ if and only if $z=x \oplus y$ with $\oplus$ denoting the standard module-2 addition and $f(x, y, z)=0$ otherwise, the Sum-Product module becomes a soft-XOR gate. If $p_{X}$ and $p_{Y}$ are the distributions of two independent


Figure A.21: Soft-XOR trellis and circuit implementation
binary random variables $X$ and $Y$, respectively, then the distribution of $X \oplus Y, p_{Z}$, is given by:

$$
\left[\begin{array}{l}
p_{z}(0)  \tag{A.56}\\
p_{z}(1)
\end{array}\right]=\left[\begin{array}{l}
p_{x}(0) p_{y}(0)+p_{x}(1) p_{y}(1) \\
p_{x}(0) p_{y}(1)+p_{x}(1) p_{y}(0)
\end{array}\right]
$$

The corresponding circuit implementation, together with the trellis diagram, is reported in Fig.A.21.

It is worth to highlight the biunique correspondence between the $\{0,1\}$ valued function $f$ and its trellis diagram, which uniquely defines the Sum-Product module.

If the function $f$ is equal to 1 if and only if $x=y=z$ and $f(x, y, z)=0$ otherwise, the Sum-Product module realizes an equal gate. The output probability distribution can be computed as:

$$
\left[\begin{array}{c}
p_{z}(0)  \tag{A.57}\\
p_{z}(1)
\end{array}\right]=\gamma\left[\begin{array}{l}
p_{x}(0) p_{y}(0) \\
p_{x}(1) p_{y}(1)
\end{array}\right]
$$

where $\gamma$ is a scale factor to satisfy $p_{Z}(0)+p_{Z}(1)=1$.
Both soft gates can be obtained from a generic Sum-Product module, as the one shown in Fig.A.22, by a proper configuration of the interconnections. The paths without a corresponding branch in the trellis are connected together to a dummy transistor (not shown in the figure). The two bias voltages and the cell bias current are chosen so as to allow the transistors to work under their weak inversion region.


Figure A.22: Transistor network implementation of a general Sum-Product module

## Bibliography

[1] R. King, "Averting the it energy crunch," Online, 2007.
[2] C. Berroux, A. Glavieux, and P. Thitimjshima, "Near Shannon limit errorcorrecting coding and decoding: Turbo codes," in IEEE Proc. of International Conference on Communications, Geneva, Switzerland, May 1993, pp. 1064-1070.
[3] Third Generation Patnership Project (3GPP), 3G TS 25.212, v3.5.0, Multiplexing and Channel Coding (FDD), 2000.
[4] Digital Video Bradcasting (DVB). Interaction channel for satellite distribution systems, ETSI EN 301 790, v1.2.2, 2000.
[5] G. Aiello and G. Rogerson, "Ultra-wideband wireless systems," IEEE Microwaves Magazine, vol. 4, no. 2, pp. 36-47, June 2003.
[6] H.-A. Loeliger, F. Lustenberg, M. Helfenstein, and F. Taröy, "Probability propagation and decoding in analog VLSI," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 837-843, February 2001.
[7] J. Hagenauer and M. Winklhofer, "The analog decoder," in Information Theory, 1998. Proceedings. 1998 IEEE International Symposium on, August 1998, p. 145.
[8] F. Lustenberg, M. Helfenstein, G. Moschytz, H.-A. Loeliger, and F. Taröy, "Allanalog decoder for a binary $(18,9,5)$ tail-biting trellis code," in Proc. Eur. Solid State Circuits Conf., Duisburg, Germany, September 1999, pp. 362-365.
[9] M. Moerz, T. Gabara, R. Yan, and J. Hagenauer, "An analog $0.25 \mu \mathrm{~m}$ BiCMOS tailbiting MAP decoder," in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 2000.
[10] F. Lustenberger, "On the design of analog VLSI iterative decoders," Ph.D. dissertation, ETH, Zurich, 2000.
[11] M. Moerz, "Analog sliding window decoder core for mixed signal Turbo decoder," in Proc. ITG Conf. Source and Channel Coding, February 2004, pp. 63-70.
[12] M. Arzel, F. Seguin, C. Lahuec, and M. Jézéquel, "Semi-iterative analog Turbo decoding," in Proc. Int. Symp. Circ. and Syst., ISCAS'06.
[13] V. Gaudet and P. Gulak, "A $13.3-\mathrm{Mb} / \mathrm{s} 0.35-\mu \mathrm{m}$ CMOS analog Turbo decoder IC with a configurable interleaver," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 122-131, January 2004.
[14] D. Vogrig, A. Gerosa, A. Neviani, A. G. i Amat, G. Montorsi, and S. Benedetto, "A $0.35-\mu \mathrm{m}$ CMOS analog Turbo decoder for 40 -bit rate $1 / 3$ UMTS channel code," IEEE J. Solid-State Circuits, vol. 40, no. 3, pp. 735-762, March 2005.
[15] A. G. i Amat, G. Montorsi, and F. Vatta, "Analysis and design of rate compatible serial concatenated convolutional code," in Proc. IEEE Int. Symp. Inform. Theory, ISIT'05, September 2005, pp. 607-611.
[16] C. Winstead, J. Dai, W. Kim, S. Little, Y.-B. Kim, C. Mayers, and C. Schlegel, "Analog MAP decoder for $(8,4)$ Hamming code in subthhreshold CMOS," in Proc. of Advanced Research in VLSI Conf., March 2001, pp. 132-147.
[17] G. Montorsi and S. Benedetto, "Design of fixed-point iterative decoders for concatenated codes with interleavers," IEEE Journal on Select Areas in Communications, vol. 19, no. 5, pp. 871-882, May 2001.
[18] C. Winstead, C. Myers, C. Schlegel, and R. Harrison, "Analog decoding of product codes," in Proc. of Information Theory Workshop, Cairns, Australia, September 2001, pp. 131-133.
[19] G. Haller and B. Wooley, "A 700-MHz switched-capacitor analog waveform sampling circuit," IEEE J. Solid-State Circuits, vol. 29, no. 4, pp. 500-508, 1994.
[20] J. Krupar, R. Srowik, J. Schreiter, A. Graupner, R. Schüffny, and U. Jörges, "Minimizing charge injection errors in high-precision, high-speed SC-circuits," in Proc. Int. Symp. Circ. and Syst., ISCAS'01, vol. 1, May 2001, pp. 727-730.
[21] D. Johns and K. Martin, Analog Integrated Circuit Design, J. Wiley and Sons, Eds., 1997.
[22] J. Rabay, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits, second edition ed., C. Sodini, Ed. Prentice Hall, 2003.
[23] J. Deveugele, G. V. der Plas, M. Steyaert, G. Gielen, and W. Sansen, "A gradienterror and edge-effect tolerant switching scheme for a high-accuracy DAC," IEEE Trans. on Circuits and Systems, vol. 51, no. 1, pp. 191-195, 2004.
[24] S. Massaro, "Progettazione di una memoria RAM analogica per un decodificatore Turbo ibrido," Ph.D. dissertation, Universita’ di Padova, Padova, 2005.
[25] M. Bickerstaff, D. Garret, T. Prokop, C. Thomas, B. Widdup, G. Zhou, L. Davis, G. Woodward, C. Nicol, and R.-H. Yan, "A unified Turbo/Viterbi channel decoder for 3GPP mobile wireless in $0.18 \mu \mathrm{~m}$ CMOS," IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1555-1564, 2002.
[26] P. Cappelletti, C. Golla, P. Olivo, and E. Zanoni, Flash Memories, K. A. Publishers, Ed. Kluwer Academic Publishers, 1999.
[27] R. Bez, E. Camarlenghi, A. Modelli, and A. Visconti, "Introduction to flash memory," Proc. IEEE, vol. 91, no. 4, pp. 489-502, April 2003.
[28] B. Ricco, G. Torelli, M. Lanzoni, A. Manstretta, H. Maes, D. Montanari, and A. Modelli, "Nonvolatile multilevel memories for digital applications," Proc. IEEE, vol. 86, no. 12, pp. 2399-2423, December 1998.
[29] S.-P. Sim, W. Kwon, C. Lee, J. I. Han, W. Lee, C. Jung, H. Lee, Y. Jang, S. Park, J. Park, C.-K. Park, K. Kim, and K. Kim, "A 90nm generation NOR flash multilevel cell (MLC) with $0.44 \mu \mathrm{~m}^{2} / \mathrm{bit}$ cell size," in IEEE VLSI-TSA International Symposium on VLSI Technology, April 2005, pp. 35-36.
[30] G. Servalli, D. Brazzelli, E. Camerlenghi, G. Capetti, S. Costantini, C. Cupeta, D. D. Simone, A. Ghetti, T. Ghilardi, P. Gulli, M. Mariani, A. Pavan, and R. Somaschini, "A 65 nm NOR flash technology with $0.042 \mu \mathrm{~m}^{2}$ cell size for high performance multilevel application," in IEEE International Electron Devices Meeting, 2005, December 2005, pp. 849-852.
[31] S. Lee, Y.-T. Lee, W.-K. Han, D.-H. Kim, M.-S. Kim, S.-H. Moon, H. Cho, J.-W. Lee, D.-S. Byeon, Y.-H. Lim, H.-S. Kim, S.-H. Hur, and K.-D. Suh, "A 3.3 V 4 Gb four-level NAND flash memory with 90 nm CMOS technology," in Proc. IEEE Int. Solid-State Circuits Conf., vol. 1, San Francisco, CA, February 2004, pp. 52-513.
[32] T. Hara, K. Fukuda, K. Kanazawa, N. Shibata, K. Hosono, H. Maejima, M. Nakagawa, T. Abe, M. Kojima, M. Fujiu, Y. Takeuchi, K. Amemiya, M. Morooka, T. Kamei, H.Nasu, C.-M. Wang, K. Sakurai, N. Tokiwa, H. Waki, T. Maruyama, S. Yoshikawa, M. Higashitani, T. Pham, Y. Fong, and T. Watanabe, "A $146-\mathrm{mm}^{2}$ 8-Gb multi-level NAND flash memory with 70-nm CMOS technology," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 161-169, January 2006.
[33] G. Atwwood, A. Fazio, D. Mills, and B. Reaves, "Intel strataflash ${ }^{\text {tm }}$ memory technology overview," in Intel Technology Journal, 4th Quarter 1997, pp. 1-8.
[34] P. Pavan, R. Bez, P. Olivo, and E. Zanoni, "Flash memory cell - An overview," Proc. IEEE, vol. 85, no. 8, pp. 1248-1271, August 1997.
[35] N. Shibata, H. Maejima, K. Isobe, K. Iwasa, M. Nakagawa, M. Fujiu, T. Shimizu, M. Honma, S. Hoshi, T. Kawaaiand, K. Kanebako, S. Yoshikawa, H. Tabata, A. Inoue, T. Takahashi, T. Shano, Y. Komatsu, K. Nagaba, M. Kosakai, N. Motohashi, K. Kanazawa, K. Imamiya, H. Nakai, M. Lasser, M. Murin, A. Meir, A. Eyal, and M. Shlick, "A 70 nm 16 Gb 16-level-cell NAND flash memory," IEEE J. SolidState Circuits, vol. 43, no. 4, pp. 929-937, April 2008.
[36] R. Micheloni, R. Ravasio, A. Marelli, E. Alice, V. Altieri, A. Bovino, L. Crippa, E. D. Martino, L. D'Onofrio, A. Gambardella, E. Grillea, G. Guerra, D. Kim, C. Missiroli, I. Motta, A. Prisco, G. Ragone, M. Romano, M. Sangalli, P. Sauro, M. Scotti, and S. Won, "A 4Gb 2b/cell NAND flash memory with embedded 5b BCH ECC for 36MB/s system read throughput," in ISSCC Dig. Tech. Papers, February 2008, pp. 497-506.
[37] E. Berlekamp, Algebraic Coding Theory, rev. ed. ed., 1984.
[38] J. Massey, "Shift register syntesis and BCH decoding," IEEE Trans. Inform. Theory, vol. IT-15, no. 1, pp. 122-127, January 1969.
[39] R. Blahut, Theory and Practice of Error Control Codes, A.-W. P. Company, Ed. Reading, MA, 1983.
[40] R. Gallager, "Low density parity check codes," IRE Trans. Inform. Theory, vol. IT, pp. 21-28, January 1962.
[41] C. Winstead, J. Dai, S. Yu, C. Mayers, R. Harrison, and C. Schlegel, "CMOS analog MAP decoder for $(8,4)$ Hamming code," IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 122-131, January 2004.
[42] S. Gregori, A. Cabrini, O. Khouri, and G. Torelli, "On-chip error correction techniques for new-generation flash memories," Proc. IEEE, vol. 91, no. 4, pp. 602616, April 2003.
[43] A. Xotta, A. Gerosa, and A. Neviani, "CMOS implementation of all-analogue APP decoders: analysis of performance and limitations," Electronics letters, vol. 37, no. 25, pp. 1501-1503, December 2001.
[44] J. Cain, G. Clark, and J. Geist, "Punctured convolutional codes of rate $(n-1) / n$ and simplified maximum likehood decoding," IEEE Trans. Inform. Theory, vol. 25, no. 1, pp. 97-100, January 1979.
[45] D. Haccoun and G. Bégin, "High-rate punctured convolutional codes for Viterbi and sequential decoding," IEEE Trans. Commun., vol. 37, no. 11, pp. 1113-1125, November 1989.
[46] G. Bégin, D. Haccoun, and C. Paquin, "Further results on high-rate punctured convolutional codes for Viterbi and sequential decoding," IEEE Trans. Commun., vol. 38, no. 11, pp. 1922-1928, November 1990.
[47] A. G. i Amat, G. Montorsi, and S. Benedetto, "Design and decoding of optimal high-rate convolutional codes," IEEE Trans. Inform. Theory, vol. 50, no. 5, pp. 867-881, May 2004.
[48] C. Villa, D. Vimercati, S. Schippers, E. Confalonieri, M. Sforzin, S. Polizzi, M. L. Placa, C. Lisi, A. Magnavacca, E. Bolandrina, A. Martinelli, V. Dima, A. Scavuzzo, B. Calandrino, N. D. Gatto, M. Scardaci, F. Mastroianni, M. Pisasale, A. Geraci, M. Gaitiotti, and M. Sali, "A 125 MHz burst-mode flexible read-while-write 256 Mbit 2b/c 1.8V NOR flash memory," in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 2005.
[49] H.-L. Lou and C.-E. Sundberg, "Increasing storage capacity in multilevel memory cells by means of communications and signal processing techniques," IEE Proc.Circuits Devices Syst., vol. 147, no. 4, pp. 229-236, August 2000.
[50] F. Sun, S. Devarajan, K. Rose, and T. Zhang, "Design of on-chip error correction systems for multilevel NOR and NAND flash memories," IET Circuits Devices Syst., vol. 1, no. 3, pp. 241-249, June 2007.
[51] G. Ungerboeck, "Channel coding with multilevel/phase signals," IEEE Trans. Inform. Theory, vol. IT-28, no. 1, pp. 55-67, January 1982.
[52] __, "Trellis-coded modulation with redundant signal sets Part I: Introduction," IEEE Commun. Mag., vol. 25, no. 2, pp. 5-11, February 1987.
[53] __, "Trellis-coded modulation with redundant signal sets Part II: State of the art," IEEE Commun. Mag., vol. 25, no. 2, pp. 12-21, February 1987.
[54] J. Hagenauer and P. Hoher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. 1989 IEEE Global Telecomm. Conf. (GLOBECOM'89), Dallas, Texas, November 1989, pp. 47.1.1-47.1.7.
[55] J. Hagenauer, P. Robertson, and L. Papke, "Iterative (TURBO) decoding of systematic convolutional codes with the MAP and SOVA algorithms," in IEEE Proc. ITG Fachtagung, Munich, Germany, October 1994, pp. 21-29.
[56] B. Gilbert, "A precise four-quadrant multiplier with subnanosecond response," IEEE J. Solid-State Circuits, vol. 3, no. 4, pp. 365-373, December 1968.
[57] A. Xotta, "Implementazione analogica di turbo decoder CMOS per applicazioni hard disk," Ph.D. dissertation, Universita’ di Padova, Padova, 2002.
[58] C. Mead, Analog VLSI and Neural Systems, ser. Addison Wesley Computation and Neural Systems Series, A. Wesley, Ed. Addison Wesley, 1989.
[59] M. Polgrom, A. Duinmaijer, and A. Welbers, "Matching properties of MOS transistors," IEEE J. Solid-State Circuits, vol. 21, no. 5, pp. 1433-1440, October 1989.
[60] R. Gregor, "On the relationship between topography and transistor matching in an analog CMOS technology," IEEE Trans. on Electron Devices, vol. 39, no. 2, pp. 275-282, February 1992.
[61] F. Lustenberger and H.-A. Loeloger, "On mismatch errors in analog-VLSI error correcting decoders," in IEEE Proc. of International Symposium on Circuits and Systems, vol. 4, Sydney, Australia, May 2001, pp. 198-201.
[62] J. Dai, "Design methodology for analog VLSI implementations of error control decoders," Ph.D. dissertation, University of Utah, Salt Lake City, Utah, 2002.
[63] M. Frey, H.-A. Loeliger, and P. Merkli, "Analog circuits for symbol-likelihood computation," in Proc. 13th IEEE Int. Conf. on Electronics, Circuits and Systems, Nice, France, December 2006.
[64] First Report and Order, FCC 02-48, Federal Communications Commission, ET Docket 98-153, February 2002.
[65] Second Report and Order and Second Memorandum Opinion and Order, FCC 04285, Federal Communications Commission, ET Docket 98-153, December 2004.
[66] Code of Federal Regulations, Federal Communications Commission, Title 47, Chapter 1, Part 15, October 2004.
[67] J. Han and C. Nguyen, "Ultra-wideband electronically tunable pulse generators," IEEE Microwave and Wireless Components Letters, vol. 14, no. 3, pp. 112-114, March 2004.
[68] J. Gerritis and J. Farserotu, "Wavelet generation circuit for UWB impulse radio applications," Electronics Letters, vol. 38, no. 25, pp. 1737-1738, December 2002.
[69] J. Foerster, E. Green, S. Somayazulu, and D. Leeper, "Ultra-wideband technology for short- or medium-range wireless communication," in Intel Technology Journal, 2nd Quarter 2001.
[70] I. Oppermann, M. Hämäläinem, and J. Iinatti, UWB Theory and Applications, Wiley, Ed., 2004.
[71] D. Wentzloff, R. Blazquez, F. Lee, B. Ginsburg, J. Powell, and A.P.Chandrakasan, "System design considerations for ultra-wideband communication," IEEE Comтиnications Magazine, vol. 43, no. 8, pp. 114-121, August 2005.
[72] W. Kluge, F. Poegel, H. Roller, M. Lange, T. Ferchland, L. Dathe, and D. Eggert, "A fully integrated $2.4-\mathrm{GHz}$ IEEE 802.15.4-compliant transceiver for $\mathrm{ZigBee}^{\mathrm{tm}}$ applications," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2767-2775, December 2006.
[73] M. Camus, B. Butaye, L. Garcia, M. Sie, B. Pellat, and T. Parra, "A $5.4 \mathrm{~mW} / 0.07$ $\mathrm{mm}^{2} 2.4 \mathrm{GHz}$ front-end receiver in 90 nm CMOS for IEEE 802.15.4 WPAN standard," IEEE J. Solid-State Circuits, vol. 43, no. 6, pp. 1372-1383, June 2008.
[74] N. Stanic, A. Balankutty, P. Kinget, and Y. Tsividis, "A 2.4-GHz ISM-band slidingIF receiver with a $0.5-\mathrm{V}$ supply," IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1138-1145, May 2008.
[75] M. Ranjan and L. Larson, "A low-cost and low-power CMOS receiver front-end for MB-OFDM ultra-wideband systems," IEEE J. Solid-State Circuits, vol. 42, no. 3, pp. 592-601, March 2007.
[76] G. Cusmai, M. Brandolini, P. Rossi, and F. Svelto, "A $0.18 \mu \mathrm{~m}$ CMOS selective receiver front-end for UWB applications," IEEE J. Solid-State Circuits, vol. 41, no. 8, pp. 1764-1771, August 2006.
[77] D. Wentzloff and A. Chandrakasan, "A 47pJ/pulse 3.1-to-5GHz all-digital UWB transmitter in 90nm CMOS," in ISSCC Dig. Tech. Papers, February 2007, pp. 118591.
[78] S. Dubouloz, A. Rabbachin, S. de Rivaz, B. Denis, and L. Ouvry, "Performance analysis of low complexity solutions for UWB low data rate impulse radio," in Proc. Int. Symp. Circ. and Syst., ISCAS'06, May 2006.
[79] L. Stoica, A. Rabbachin, H. Repo, S. Tiuraniemi, and I. Oppermann, "An ultrawidebena system architecture for tag based wireless sensor networks," IEEE Transactions on Vehicular Technology, vol. 54, pp. 1632-1645, September 2005.
[80] "IEEE 802.15.4a wireless MAC and PHY specifications for LR-WPANs," Online, 2007.
[81] M. Win and R. Scholtz, "Impulse radio: how it woks," IEEE Communications Letters, vol. 2, no. 2, pp. 36-38, February 1998.
[82] W. Hirt, "The european UWB radio regulatory and standards framework: Overview and implications," in Ultra-Wideband, 2007. ICUWB 2007. IEEE International Conference on, September 2007, pp. 733-738.
[83] R. Scholtz and M. Win, "Impulse radio," in Wireless Communications: TDMA vs. CDMA, S. Glisic and P. Leppänen, Eds. Norwell, MA: Kluwer, 1997, pp. 245264.
[84] M. Ho, L. Taylor, and G. Aiello, "UWB architecture for wireless video networking," in Proc. IEEE Int. Conf. Consumer Electronics, June 2001, pp. 18-19.
[85] R. Fontana, A. Ameti, E. Richley, L. Beard, and D. Guy, "Recent advances in ultra wideband communications systems," in IEEE Conf. UWB Systems and Technologies Dig., May 2002, pp. 129-133.
[86] Y. Nakache and A. Molisch, "Spectral shaping of UWB signals for time hopping impulse radio," IEEE Journal on Selected Areas in Communications, vol. 24, no. 4, pp. 738-744, April 2006.
[87] E. Barajas, R. Cosculluela, D. Coutinho, M. Molina, D. M. amd J.L. Gonzalez, I. Cairo, S. Banda, and M. Ikeda, "A low-power template generator for coherent impulse-radio ultra wide-band receivers," in IEEE International Conference on Ultra-Wideband, September 2006, pp. 97-102.
[88] B. Jung, Y.-H. Tseng, J. Harvey, and R. Harjani, "Pulse generator design for UWB IR communication systems," in IEEE International Symposium on Circuits and Systems, vol. 5, May 2005, pp. 4381-4384.
[89] _—, "A novel CMOS/BiCMOS UWB pulse generator and modulator," in IEEE International Microwave Symposium, vol. 2, June 2004, pp. 1269-1272.
[90] ——, "Low-power full-band UWB active pulse shaping circuit using $0.18-\mu \mathrm{m}$ CMOS technology," in IEEE Radio Frequency Integrated Circuits Symposium, June 2006, pp. 510-513.
[91] S. Iida, K. Tanaka, H. Suzuki, N. Yoshikawa, N. Shoji, B. Griffiths, D. Mellor, F. Hayden, I. Butler, and J. Chatwin, "A 3.1 to 5 GHz CMOS DSSS UWB transceiver for WPANs," in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 2005, pp. 214-594.
[92] Y. Choi, "Gated UWB pulse signal generation," in IEEE Conference on Ultra Wideband Systems and Technologies, May 2004, pp. 122-124.
[93] M. Welborn, "System considerations for ultra-wideband wireless networks," in IEEE Radio and Wireless Conference, August 2001, pp. 5-8.
[94] X. Chen and S. Kiaei, "Monocycle shapes for ultra wideband systems," in IEEE International Symposium on Circuits and Systems, vol. 1, May 2002, pp. 597-600.
[95] F. Ramirez-Mirales, "On the performance of ultra-wideband signals in gaussian noise and dense multipath," IEEE Transactions on Vehicular Technology, vol. 50, no. 1, pp. 244-249, January 2001.
[96] J. Proakis and M. Salehi, Digital Communications, 5th ed., M. Hill, Ed. McGrawHill Science Engineering, 2007.
[97] W. Jones and J. Dill, "The square root raised cosine wavelet and its relation to the meyer functions," IEEE Transactions on Signal Processing, vol. 49, no. 1, pp. 248-251, January 2001.
[98] S. Chennakeshu and G. Saulnier, "Differential detection of pi/4-shifted-dqpsk for digital cellular radio," IEEE Transactions on Vehicular Technology, vol. 42, no. 1, pp. 46-57, February 1993.
[99] H. Hashemi, "The indoor radio propagation channel," Proc. IEEE, vol. 81, no. 7, pp. 943-968, July 1993.
[100] S. Ghassemzadeh, L. Greenstein, A. Kavicic, T. Sveinsson, and V. Tarokh, "UWB indoor path loss model for residential and commercial buildings," in IEEE Transactions on Vehicular Technology Conference, vol. 5, October 2003, pp. 3120-3125.
[101] F. Lee and A. Chandrakasan, "A $2.5 \mathrm{~nJ} /$ bit 0.65 V pulsed UWB receiver in 90 nm CMOS," IEEE J. Solid-State Circuits, vol. 42, no. 12, pp. 2851-2859, December 2007.
[102] Y. Tong, Y. Zheng, and Y.-P. Xu, "A coherent ultra-wideband receiver IC system for WPAN application," in Ultra-Wideband, 2005. ICU 2005. IEEE International Conference on, September 2005, pp. 60-64.
[103] H. Danneels, M. Verhelst, P. Palmers, W. Vereecken, B. Boury, W. Dehaene, M. Steyaert, and G. Gielen, "A low-power mixing DAC IR-UWB receiver," in Proc. Int. Symp. Circ. and Syst., ISCAS'08, vol. 5, May 2008, pp. 2607-2700.
[104] D. Barras, F. Ellinger, H. Jackel, and W. Hirt, "A robust front-end architecture for low-power UWB radio transceivers," in Microwave Theory and Techniques, IEEE Transactions on, vol. 54, no. 4, June 2006, pp. 1713-1723.
[105] A. Demosthenous and M. Panovic, "Low-voltage MOS linear transconductor/squarer and four-quadrant multiplier for analog VLSI," Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 52, no. 9, pp. 1721-1731, September 2005.
[106] D. Wentzloff, F. Lee, D. Daly, M. Bhardwaj, P. Mercier, and A. Chandrakasan, "Energy efficient pulsed-UWB CMOS circuits and systems," in Ultra-Wideband, 2007. ICUWB 2007. IEEE International Conference on, September 2007, pp. 282287.
[107] A. Rabbachin and I. Oppermann, "Synchronization analysis for UWB systems with a low-complexity energy collection receiver," in Ultra Wideband Systems, 2004. Joint with Conference on Ultrawideband Systems and Technologies. Joint UWBST \& IWUWBS. 2004 International Workshop on, May 2004, pp. 288-292.
[108] M. Crepaldi, M. Casu, M. Graziano, and M. Zamboni, "A 1-bit synchronization algorithm for a reduced complexity energy detection UWB receiver," in UltraWideband, 2007. ICUWB 2007. IEEE International Conference on, September 2007, pp. 703-708.
[109] M. Weisenborn and W. Hirt, "Robust non-coherent receiver exploiting UWB channel properties," in Proc. of UWBST-IWUWBS, May 2004, pp. 156-160.
[110] R. Blazquez, P. Newaskar, and A. Chandrakasan, "Coarse acquisition for ultra wideband digital receivers," in Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, vol. 4, April 2003, pp. 137-140.
[111] D. Wentzloff and A. Chandrakasan, "Gaussian pulse generators for subbanded ultra-wideband transmitters," IEEE Trans. Microwave Theory Tech., vol. 54, pp. 1647-1655, June 2006.
[112] F. Lee and A. Chandrakasan, "A 0.65-to-1.4nJ/burst 3-to-10 GHz UWB all-digital TX in 90nm CMOS for IEEE 802.15.4a," IEEE J. Solid-State Circuits, vol. 42, no. 12, pp. 2860-2869, December 2007.
[113] D. Wentzloff and A. Chandrakasan, "A 47pJ/pulse 3.1-to-5GHz all-digital UWB transmitter in 90nm CMOS," in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, February 2007, pp. 118-591.
[114] P. Mercier, D. Daly, and A. Chandrakasan, "A 19pJ/pulse UWB transmitter with dual capacitively-coupled digital power amplifiers," in Radio Frequency Integrated Circuits Symposium, 2008. RFIC 2008. IEEE, June 2008, pp. 47-50.
[115] I. Aoki, S. Kee, D. Rutledge, and A. Hajimiri, "Distributed active transformer A new power-combining and impedance-transformation technique," IEEE Trans. Microwave Theory and Techniques, vol. 50, no. 1, pp. 316-331, January 2002.
[116] G. Matthaei, L. Young, and E. Jones, Microwave filters, impedance-matching networks, and coupling structures, McGraw-Hill, Ed., 1964.
[117] S. Mohan, C. Yue, M. del Mar Hershenson, S. Wong, and T. Lee, "Modeling and characterization of on-chip transforms," in Proc. IEDM, December 1998, pp. 531534.
[118] J. Long, "Monolithic transformers for silicon RF IC design," IEEE J. Solid-State Circuits, vol. 35, no. 9, pp. 1368-1382, September 2000.
[119] Y. Cao, R. Groves, X. Huang, N. Zamdmer, J.-O. Plouchart, R. Wachnik, T.-J. King, and C. Hu, "Frequency-independet equivalent-circuit model for on-chip spiral inductors," IEEE J. Solid-State Circuits, vol. 38, no. 3, pp. 419-426, March 2003.
[120] I. Lai and M. Fujishima, "A new on-chip substrate coupled inductor model implemented with scalable expressions," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2491-2499, November 2006.
[121] C. Shannon, "A mathematical theory of communication," Bell Systems Technical Journal, vol. 27, pp. 379-423 (part I), 623-656 (part II), July 1984.
[122] S. Benedetto and G. Montorsi, "Unveiling Turbo codes: some results on parallel concatenated coding schemes," IEEE Trans. Inform. Theory, vol. 42, pp. 409-428, March 1996.
[123] T. Richardson, A. Shokrollahi, and R. Urbanke, "Design of provably good lowdensity parity check codes," in Proc. IEEE Int. Symp. on Inform, theory, Sorrento, Italy, June 2000, p. 199.
[124] R. Gallager, "Low density parity check codes," IRE Trans. Inform. Theory, vol. IT, pp. 21-28, January 1962.
[125] R. Hamming, "Error detecting and error correcting code," Bell Systems Technical Journal, vol. 26, pp. 147-150, July 1950.
[126] C. Schlegel, Trellis Coding, 1997.
[127] H. Ma and J. Wolf, "On tail baiting convolutional codes," IEEE Trans. Communications, vol. COM-34, no. 2, pp. 104-11, February 1986.
[128] H. Imai and S. Hirakawa, "A new multilevel coding method using error-correcting codes," IEEE Trans. Inform. Theory, vol. IT-23, no. 3, pp. 371-377, May 1977.
[129] S. Benedetto and G. Montorsi, "Performance evaluation of Turbo codes," Electronics Letters, vol. 31, no. 3, pp. 163-165, February 1995.
[130] S. Benedetto, D. Divslar, G. Montorsi, and F. Pollara, "Serial concatenation of interleaved codes: Performance analyisis, design and iterative decoding," IEEE Trans. Inform. Theory, vol. 44, no. 3, pp. 909-926, May 1998.
[131] A. Viterbi, "Error bounds for convolutional codes and asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. 13, no. 2, pp. 260-269, April 1967.
[132] A. Viterbi, J. Wolf, E. Zehavi, and R. Padovani, "A pragmatic approach to trelliscoded modulation," IEEE Commun. Mag., vol. 27, no. 7, pp. 11-19, July 1989.
[133] F. Kschischang, B. Frey, and H. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 17, no. 2, pp. 498-519, February 2001.
[134] R. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inform. Theory, vol. 27, no. 5, pp. 533-547, September 1981.
[135] S. Benedetto and G. Montorsi, "Iterative decoding of serially concatenated convolutional codes," Electronics Letters, vol. 32, no. 16, pp. 1186-1188, June 1996.
[136] S. Benedetto, D. Divslar, G. Montorsi, and F. Pollara, "Analysis, design and iterative decoding of double serial concatenated codes with interleavers," IEEE Journal on Selected Areas in Communications, vol. 16, no. 2, pp. 231-244, February 1998.
[137] S. Lin and D. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, Ed., 1983.
[138] J. Hagenauer, E. Offer, and L. Papke, "Iterative decoding of binary block and convolutional codes," IEEE Trans. Inform. Theory, vol. 12, no. 2, pp. 429-445, March 1996.
[139] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. 20, pp. 284-287, March 1974.
[140] J. Hagenauer, "Decoding of binary codes with analog network," in Proc. IEEE Int. Symp. on Information Theory, San Diego, CA, February 1998, pp. 13-14.

