
University of California

Los Angeles

High-Performance and Energy-Efficient

Decoder Design for Non-Binary LDPC Codes

A dissertation submitted in partial satisfaction

of the requirements for the degree

Doctor of Philosophy in Electrical Engineering

by

Yuta Toriyama

2016

© Copyright by

Yuta Toriyama

2016

Abstract of the Dissertation

High-Performance and Energy-Efficient

Decoder Design for Non-Binary LDPC Codes

by

Yuta Toriyama

Doctor of Philosophy in Electrical Engineering

University of California, Los Angeles, 2016

Professor Dejan Markovic, Chair

Binary Low-Density Parity-Check (LDPC) codes are a type of error correction code

known to exhibit excellent error-correcting capabilities, and have increasingly been applied

as the forward error correction solution in a multitude of systems and standards, such as

wireless communications, wireline communications, and data storage systems. In the pursuit

of codes with even higher coding gain, non-binary LDPC (NB-LDPC) codes defined over a

Galois field of order q have risen as a strong replacement candidate. For codes defined

with similar rate and length, NB-LDPC codes exhibit a significant coding gain improvement

relative to that of their binary counterparts.

Unfortunately, NB-LDPC codes are currently limited from practical application by the

immense complexity of their decoding algorithms, because the improved error-rate perfor-

mance of higher field orders comes at the cost of increasing decoding algorithm complexity.

Currently available ASIC implementation solutions for NB-LDPC code decoders are simul-

taneously low in throughput and power-hungry, leading to a low energy efficiency.

We propose several techniques at the algorithm level as well as hardware architecture level

in an attempt to bring NB-LDPC codes closer to practical deployment. On the algorithm

side, we propose several algorithmic modifications and analyze the corresponding hardware

cost alleviation as well as impact on coding gain. We also study the quantization scheme


for NB-LDPC decoders, again in the context of both the hardware and coding gain impacts,

and we propose a technique that enables a good tradeoff in this space. On the hardware

side, we develop an FPGA-based NB-LDPC decoder platform for architecture prototyping as

well as hardware acceleration of code evaluation via error rate simulations. We also discuss

the architectural techniques and innovations corresponding to our proposed algorithm for

optimization of the implementation. Finally, a proof-of-concept ASIC chip is realized that

integrates many of the proposed techniques. We are able to achieve a 3.7x improvement

in the information throughput and 23.8x improvement in the energy efficiency over prior

state-of-the-art, without sacrificing the strong error correcting capabilities of the NB-LDPC

code.


The dissertation of Yuta Toriyama is approved.

Richard D. Wesel

Gregory J. Pottie

Lara Dolecek

Dejan Markovic, Committee Chair

University of California, Los Angeles

2016


For my parents.


Table of Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Non-Binary Low-Density Parity-Check Codes . . . . . . . . . . . . . . . . . 2

1.2 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Decoding Algorithms for Non-Binary LDPC Codes and Their Implemen-

tation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Binary AWGN Channel . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.3 Probability Domain Decoding . . . . . . . . . . . . . . . . . . . . . . 12

2.1.4 FFT-QSPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.5 Min-Max . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 The Pruned Min-Max Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1 Figure-of-Merit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.1 Parameters and Assumptions . . . . . . . . . . . . . . . . . . . . . . 26

3.1.2 The Fully Parallel Architecture . . . . . . . . . . . . . . . . . . . . . 27

3.2 Algorithm Strategy: Pruned Min-Max Decoding . . . . . . . . . . . . . . . . 33

3.2.1 Derivation of the Proposed Simplification . . . . . . . . . . . . . . . . 33

3.2.2 Analysis of Decoding Performance . . . . . . . . . . . . . . . . . . . . 36

3.2.3 Cost Analysis of the Pruned Min-Max Algorithm . . . . . . . . . . . 40

4 Logarithmic Quantization Scheme for the Min-Max Algorithm . . . . . . 43


4.1 Prior Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2.1 Derivation of Computational Complexity of the Min-Max Algorithm . 46

4.2.2 Routing Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.3 Quantization Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.4 The Logarithmic Quantization Scheme . . . . . . . . . . . . . . . . . . . . . 52

4.4.1 The Proposed Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.4.2 Error Rate Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.4.3 Estimated Computational Complexity . . . . . . . . . . . . . . . . . 54

5 Implementation of FPGA Platform for Code Performance Evaluation . 60

5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.3.1 Hardware Resource Utilization on FPGA . . . . . . . . . . . . . . . . 68

5.3.2 Frame Error Rate Simulations . . . . . . . . . . . . . . . . . . . . . . 72

6 Augmented Hard-Decision Based Decoding Algorithm and Combination

with Soft Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

6.1 Iterative Hard-Decision Based Majority Logic Decoding . . . . . . . . . . . . 76

6.1.1 Augmented IHRB-MLGD . . . . . . . . . . . . . . . . . . . . . . . . 77

6.1.2 Detection of Erasure Condition . . . . . . . . . . . . . . . . . . . . . 79

6.2 Software Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.3 Combination with the Min-Max Algorithm . . . . . . . . . . . . . . . . . . . 83

7 ASIC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

7.1 Parity-Check Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87


7.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

7.2.1 Variable Node Architecture . . . . . . . . . . . . . . . . . . . . . . . 92

7.2.2 A-IHRB Check Node Logic Implementation in Variable Node . . . . . 94

7.2.3 Decoder Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . 95

7.3 AWGN Channel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.4 Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.5 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.5.1 Error Correction Performance . . . . . . . . . . . . . . . . . . . . . . 101

7.5.2 Hardware Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.5.3 Comparison Against Prior Art . . . . . . . . . . . . . . . . . . . . . . 107

8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

8.1 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


List of Figures

1.1 FEC in communication systems. . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Tanner graph construction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Binary AWGN channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Tensor circular convolution example. . . . . . . . . . . . . . . . . . . . . . . 21

2.4 Conceptual diagram of check node computations. . . . . . . . . . . . . . . . 22

3.1 Top-level architecture for a fully parallel decoder. . . . . . . . . . . . . . . . 27

3.2 VNU architectures for the Min-Max decoder. . . . . . . . . . . . . . . . . . . 28

3.3 Implementation of the CNU for the Min-Max algorithm. . . . . . . . . . . . 29

3.4 Architecture of butterfly MUX structure. . . . . . . . . . . . . . . . . . . . . 30

3.5 MIN-MAX computation in a tree architecture. . . . . . . . . . . . . . . . . . 30

3.6 FOM vs. q for the Min-Max decoder. . . . . . . . . . . . . . . . . . . . . . . 31

3.7 Tree representation of proposed Pruned Min-Max. . . . . . . . . . . . . . . . 34

3.8 FER simulation results of Pruned Min-Max. . . . . . . . . . . . . . . . . . . 35

3.9 A non-binary absorbing set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.10 Decoding evolution for Min-Max and Pruned Min-Max decoders. . . . . . . . 39

3.11 CNU architecture implementing the Pruned Min-Max algorithm. . . . . . . . 40

3.12 FOM comparison between the Min-Max and Pruned Min-Max architectures. 41

3.13 FER simulation results for (2, 4) codes in GF(4) and GF(8). . . . . . . . . . 41

4.1 FER curves for GF(16), (dv, dc) = (3, 6), (378, 189) code. . . . . . . . . . . . 49

4.2 Plot of number of symbol errors vs. iteration count. . . . . . . . . . . . . . . 52

4.3 FER curves for GF(16), (dv, dc) = (3, 6), (378, 189) code. . . . . . . . . . . . 54

4.4 FER curves for GF(32), (dv, dc) = (3, 12), (300, 75) code. . . . . . . . . . . . 55


4.5 Normalized computational complexity for Min-Max algorithm implementation. 56

5.1 Quasi-cyclic structure of parity-check matrix. . . . . . . . . . . . . . . . . . . 62

5.2 Partially parallel architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3 Partially parallel architecture with log quantization. . . . . . . . . . . . . . . 65

5.4 Variable node implementation for FPGA platform. . . . . . . . . . . . . . . 66

5.5 Check node implementation for FPGA platform. . . . . . . . . . . . . . . . . 66

5.6 Automated RTL generation scheme. . . . . . . . . . . . . . . . . . . . . . . . 67

5.7 FPGA accelerated FER simulation. . . . . . . . . . . . . . . . . . . . . . . . 71

5.8 FPGA accelerated FER simulation with log quantization. . . . . . . . . . . . 72

6.1 IHRB-MLGD modification strategy comparison. . . . . . . . . . . . . . . . . 81

6.2 IHRB-MLGD modification strategy comparison. . . . . . . . . . . . . . . . . 82

6.3 Comparison of FER in simulation with and without A-IHRB. . . . . . . . . 84

6.4 Comparison of BER in simulation with and without A-IHRB. . . . . . . . . 84

6.5 Comparison of simulation times with and without A-IHRB. . . . . . . . . . . 85

7.1 Protograph for code of ASIC design. . . . . . . . . . . . . . . . . . . . . . . 88

7.2 Comparison of codes across GF(q). . . . . . . . . . . . . . . . . . . . . . . . 89

7.3 Comparison of 8k codes across GF(q) with max iterations = 6. . . . . . . . . 90

7.4 Comparison of 8k codes across GF(q) with max iterations = 20. . . . . . . . 91

7.5 Varibale node architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

7.6 Decoder core architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7.7 AWGN generator block details. . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.8 Tausworthe PRNG block details. . . . . . . . . . . . . . . . . . . . . . . . . 99

7.9 General task flow for chip tapeout. . . . . . . . . . . . . . . . . . . . . . . . 100

7.10 Lab test setup for chip measurement. . . . . . . . . . . . . . . . . . . . . . . 102


7.11 Chip micrograph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.12 Frame error rate chip measurements. . . . . . . . . . . . . . . . . . . . . . . 104

7.13 Average number of iterations per frame chip measurements. . . . . . . . . . 105

7.14 Throughput and power versus operating frequency. . . . . . . . . . . . . . . 106

7.15 Shmoo plot measurements for three received chips. . . . . . . . . . . . . . . 106

7.16 Table of chip measurement results and comparison with prior art. . . . . . . 108


List of Tables

2.1 Summary of Selected Published Works 1 . . . . . . . . . . . . . . . . . . . . 23

3.1 Error Profile Comparison (FER ≈ 10−5) . . . . . . . . . . . . . . . . . . . . 38

4.1 Error Profile of Various Quantization Schemes . . . . . . . . . . . . . . . . . 59

4.2 Quantization Scheme Examples . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3 Percent Savings in Computational Complexity . . . . . . . . . . . . . . . . . 59

5.1 FPGA Synthesis Results (3,24) L130 P65 . . . . . . . . . . . . . . . . . . . . 69

5.2 FPGA Synthesis Results (3,27) L114 P57 . . . . . . . . . . . . . . . . . . . . 70

5.3 FPGA Synthesis Results (3,30) L104 P104 . . . . . . . . . . . . . . . . . . . 70

7.1 Possible Constant Values for the Tausworthe PRNG . . . . . . . . . . . . . . 98


Acknowledgments

First and foremost I would like to thank my advisor Professor Dejan Markovic for all

of the great support and advice that he has provided. None of my work would have been

possible without his mentorship. I would also like to thank Professor Lara Dolecek, Professor

Gregory Pottie, and Professor Richard Wesel for serving on my doctoral committee. Their

time and guidance have been invaluable.

I would also like to thank my current and former colleagues of the Parallel Data Ar-

chitectures group, including (but not limited to, in no particular order) Hariprasad Chan-

drakumar, Dejan Rozgic, Sina Basir-Kazeruni, Wenlong Jiang, Vahagn Hokhikyan, Alireza

Yousefi, Henry Chen, Dr. Richard Dorrance, Dr. Vaibhav Karkare, Dr. Cheng C. Wang,

Dr. Fang-Li Yuan, Dr. Sarah Gibson, Dr. Tsung-Han Yu, Dr. Rashmi Nanda, Dr. Victoria

Wang, Professor Fengbo Ren, and Professor Chia-Hsiang Yang. In addition I would like to

thank colleagues from other research groups as well, including (but not limited to) Ahmed

Hareedy, Homa Esfahanizadeh, Neha Sinha, Preeti Mulage, Dr. Sean Huang, Dr. Yousr

Ismail, Dr. Henry Park, Dr. Amir Amin Hafez, and Dr. Behzad Amiri. The time spent

discussing various topics with these people has probably been very educational, enlightening,

and inspiring.

The staff of the Electrical Engineering Department, in particular, Kyle Jung, Ryo Arreola,

and Deeona Columbia, have been very helpful behind the scenes, and I highly appreciate

their support.

Most of all, I would like to thank my parents Ichiro and Keiko Toriyama and my sister

Aika Toriyama for their continuous and unconditional support.


Vita

2001 – 2005 Rancho Bernardo High School, San Diego, California.

2007 Software Engineer Intern, NextWave Broadband, San Diego, CA.

2009 B.S. with High Honors, Electrical Engineering and Computer Sciences, University of California, Berkeley.

2009 – 2016 Graduate Student Researcher, Department of Electrical Engineering, University of California, Los Angeles.

2009 EE Departmental Fellowship, University of California, Los Angeles.

2011 M.S., Electrical Engineering, University of California, Los Angeles.

2011 Hardware Engineer Intern, Broadcom, Irvine, CA.

2013 – 2014 Broadcom Fellowship, University of California, Los Angeles.

2014 Intern, Toshiba Semiconductor and Storage Systems, Yokohama, Kanagawa Prefecture, Japan.

2015 Teaching Assistant, EE216A: Design of VLSI Circuits and Systems, University of California, Los Angeles.

2016 Teaching Assistant, EE115C: Digital Electronic Circuits, University of California, Los Angeles.


Publications

Y. Toriyama, B. Amiri, L. Dolecek, and D. Markovic, “Logarithmic Quantization Scheme for Reduced Hardware Cost and Improved Error Floor in Non-Binary LDPC Decoders,” in Proc. IEEE Global Comm. Conf. (GLOBECOM’14), pp. 3162-3167, Dec. 2014.

Y. Toriyama, B. Amiri, L. Dolecek, and D. Markovic, “Field-Order Based Hardware Cost Analysis of Non-Binary LDPC Decoders,” in Proc. 48th Asilomar Conference on Signals, Systems and Computers, pp. 2045-2049, Nov. 2014.

R. Dorrance, F. Ren, Y. Toriyama, A.A. Hafez, C.-K.K. Yang, D. Markovic, “Scalability and Design-Space Analysis of a 1T-1MTJ Memory Cell for STT-RAM,” IEEE Trans. Electron Devices (TED), vol. 59, no. 4, pp. 878-887, Apr. 2012.

F. Ren, H. Park, R. Dorrance, Y. Toriyama, A. Amin, C.-K.K. Yang, D. Markovic, “A Body-Voltage-Sensing-Based Short Pulse Reading Circuit for Advanced Spin-Torque Transfer RAMs (STT-RAMs),” in Proc. 13th Int. Symp. on Quality Electronic Design (ISQED’12), pp. 275-282, Mar. 2012.

R. Dorrance, F. Ren, Y. Toriyama, A. Amin, C.-K.K. Yang, D. Markovic, “Scalability and Design-Space Analysis of a 1T-1MTJ Memory Cell,” in Proc. ACM/IEEE Int. Symp. on Nanoscale Arch. (NANOARCH’11), pp. 53-58, Jun. 2011.


CHAPTER 1

Introduction

1.1 Non-Binary Low-Density Parity-Check Codes . . . . . . . . . . . . . . . 2

1.2 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


[Figure 1.1: Information Source → Channel Coding (Transmitter) → Communication Channel → Channel Decoding (Receiver) → Information Destination.]

Figure 1.1: The use of FEC in communication systems.

1.1 Non-Binary Low-Density Parity-Check Codes

Forward error correction (FEC) is an indispensable technique in any digital system that

communicates data over unreliable or noisy channels by means of sending redundancy. The

ability to detect and correct erroneous data without a need for a backward channel has

enabled the proliferation of high-throughput and low-power communication systems as well

as high-density storage (where the “noisy channel” is the loss of signal integrity between

when data is written and when data is read back out) (Figure 1.1). This capability, however,

comes at the cost of a channel bandwidth overhead as well as extra digital computation.

The demand for higher data rates and better energy efficiencies continues to rise, and thus

the realization of future systems will require the development of more powerful forward error

correction schemes.

Binary low-density parity-check (LDPC) codes are one type of FEC codes that were

initially discovered in the 1960s by Gallager [1] and have only recently become practical

due to the advancement of digital signal processing (DSP) techniques and complementary

metal-oxide-semiconductor (CMOS) processes. Relative to traditional block codes such as

Hamming codes [2], BCH codes [3], and Reed-Solomon codes [4], these codes are well-known

to exhibit excellent error correction performance, and thus have increasingly been applied as

an FEC solution in many systems and standards, such as 10 Gigabit Ethernet (10GBASE-

T), digital video broadcasting (DVB-S2), WiMAX (IEEE 802.16e), Wi-Fi (IEEE 802.11ac),


high-capacity data storage systems, and deep-space communications [5,6,7,8,9,10]. Various

systems impose various requirements on the FEC in terms of the code rate, length, and

target frame error rates (FER) and bit error rates (BER). Wireless communication systems

tend to employ shorter codes with lower rates and target FERs of ∼ 10−6, whereas wired

communication systems and storage systems require FERs of ∼ 10−12 and below, and thus

use longer codes and higher code rates.

The trouble with LDPC codes is that they are notorious for their decoding algorithm

complexity, and field-programmable gate array (FPGA) or application-specific integrated

circuit (ASIC) implementations of the decoding algorithms are costly in terms of hardware

resource utilization and energy efficiency. For these codes, however, research advancements

have made ASIC implementations of decoders relatively practical in terms of achievable

throughput and power consumption [9, 10, 11, 12, 13]. A number of techniques at both the

algorithm and architecture levels are employed that enable LDPC codes to be utilized in

the real applications mentioned above. The end user is never satiated, however, and the

requirement for higher communication throughput and higher storage density calls for the

development of codes with even better performance.

In the pursuit of codes with higher coding gain, non-binary LDPC (NB-LDPC) codes

defined over a Galois field of order q (GF(q)), where q is a power of 2, have risen as a strong

candidate [14], [15]. For codes defined with similar rate and length, NB-LDPC codes exhibit

a significant coding gain improvement relative to that of their binary counterparts, including

lower error floors [16]. Furthermore, NB-LDPC codes overcome some weaknesses of binary

LDPC codes in other respects, such as the error rate performance when shorter code lengths

or higher-order modulation are used [15], or their performance in non-AWGN channels, such

as those for storage devices [17]. Unfortunately, NB-LDPC codes are currently limited from

practical application by the immense complexity of their decoding algorithms, because the

improved error-rate performance of higher field orders comes at the cost of increasing de-

coding algorithm complexity [18]. This trade-off has spurred interest in research, both from

the algorithm perspective and the digital hardware perspective, on the implementation of

decoding algorithms at reduced costs [19,20,21,22,23,24,25,26]. However, ASIC implemen-


tation results have yet to attain numbers that rationalize the use of NB-LDPC schemes in

the communication and storage systems of today. At the algorithm level, too many sim-

plifications cannot be made or the coding gain will deteriorate to a point where NB-LDPC

codes no longer make sense. At the architecture level, the immense amount of computation

required is difficult to avoid, and there also exist large latencies inherent in the signal flow

due to data dependencies.

This dissertation presents our work on bringing NB-LDPC code decoders closer to prac-

tical deployment. From the algorithm side, we attempt to simplify the decoding algorithm

without sacrificing the coding gain performance, which is of course easier said than done.

We propose several decoding algorithm choices, each of which are suitable under varying

conditions. Thus, for a particular application an appropriate decoding method must be cho-

sen to optimize the decoder as much as possible. From the hardware implementation side,

we propose techniques for the evaluation of hardware implementation costs based on the

decoding algorithm enabling the evaluation of the Galois Field order GF(q) as a parameter

to our design space. Additionally, we present techniques such as our quantization scheme

and computation unit sharing architecture that finally enables our realization of the highly

optimized ASIC NB-LDPC decoder implementation. We develop a flexible FPGA platform

which enables both a FER simulation for code evaluation as well as hardware cost estima-

tion. Finally, we fabricate a proof-of-concept ASIC NB-LDPC decoder which incorporates

the techniques we propose and report the measured benefits relative to prior art.

1.2 Dissertation Outline

Chapter 2 introduces the decoding of NB-LDPC codes, and defines terms and variables

that will be used in the rest of the dissertation. A survey on prior art is also conducted

to explore the state-of-the-art designs of NB-LDPC code decoder implementations.

Chapter 3 details our proposed Pruned Min-Max algorithm that enables the simplifica-

tion of the check-node computations, which are often the bottleneck in the computational


complexity of NB-LDPC decoding. The hardware implementation cost is evaluated in

order to show the benefits achieved by adopting this scheme.

Chapter 4 details our proposed Logarithmic Quantization Scheme, an implementation

technique that allows us to greatly reduce the computational complexity cost of NB-

LDPC decoding with only a small performance penalty in terms of coding gain. An error

profile analysis as well as hardware implementation cost analysis is also presented.

Chapter 5 details the implementation of our FPGA platform for code performance eval-

uation, spurred by the requirement for hardware-accelerated NB-LDPC decoding simu-

lators for the observation of ultra-low FER regimes. Some of the techniques introduced

are employed here, and hypotheses from the previous chapter are verified by means of

the increased simulation capability brought on by this FPGA platform.

Chapter 6 introduces our proposed scheme of combining hard- and soft-decision based

decoders for a further optimized decoder design. The new algorithm allows for a much

higher throughput on average without penalty in coding gain. The applicability of this

technique as well as implementation considerations are discussed.

Chapter 7 discusses the design of our ASIC realization of an NB-LDPC decoder based

on the techniques proposed so far. Details concerning the architecture as well as indi-

vidual building blocks are discussed. We also present the fabrication and measurement

results of the ASIC implementation, along with a comparison with the state-of-the-art.

Chapter 8 concludes the dissertation and provides some possible directions for future

research.


CHAPTER 2

Decoding Algorithms for Non-Binary LDPC Codes

and Their Implementation

2.1 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Binary AWGN Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.3 Probability Domain Decoding . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.4 FFT-QSPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.5 Min-Max . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 19


2.1 Decoding Algorithms

2.1.1 Preliminaries

The ultimate goal in decoding NB-LDPC codes is the following: given some information

about each received symbol from the channel, try to find a codeword that satisfies the parity

check of the code. We begin by defining some variables.

A Galois field of order q, where q = 2p, is closed under the operations of addition and

multiplication modulo an irreducible polynomial of degree p, called the primitive polynomial,

whose coefficients are in GF(2). The elements of GF(q) are polynomials of degree less than

p, or equivalently, powers of the root of the primitive polynomial. An NB-LDPC code is a

linear block code defined by its M × N parity check matrix H whose entries hm,n, where

0 ≤ m ≤ (M − 1) and 0 ≤ n ≤ (N − 1), are field elements a ∈ GF(q). For a sufficiently

well constructed parity check matrix, the code rate is given as r = 1 − dv/dc = 1 − M/N. A

valid codeword x = (x0, x1, ..., xN−1), xn ∈ GF(q) is a vector in the nullspace of H, i.e. the

syndrome z = x×HT = 0. Practical decoding algorithms are based on the Tanner graph of

this code, constructed as follows:

· The columns of H are represented by N variable nodes, and the rows of H are repre-

sented by M check nodes (Figure 2.1(a)).

· For each column (variable node), every non-zero (hm,n ∈ {GF(q) \ 0}) element in that

column is represented by an edge connecting that variable node to the corresponding

check node (row). The edge is weighted by the non-zero element hm,n (Figure 2.1(b)).

· Equivalently, for each row (check node), every non-zero element in that row is rep-

resented by an edge connecting that check node to the corresponding variable node

(column), weighted by hm,n (Figure 2.1(c)).

The final result is shown in Figure 2.1(d) (edge weights abbreviated).

Let Im denote the set of variable nodes that are adjacent to check node m, and Jn denote

the set of check nodes adjacent to variable node n. Regular NB-LDPC codes have constant

[Figure 2.1: an example parity-check matrix H, with zero entries and non-zero entries 1, α, α², …, α⁶, drawn above its Tanner graph of variable nodes and check nodes; panels (a)–(d) build up the weighted edges column by column and row by row.]

Figure 2.1: Construction example of the Tanner graph representation of a parity check

matrix.


check and variable node degrees, denoted by dc = |Im| and dv = |Jn|, and are also referred to

as (dv, dc) codes. Let A(m) denote the collection of ordered sets of dc GF(q) elements that

satisfy check equation m. Furthermore, let A(m|xn = a) denote the collection of ordered

sets of (dc − 1) GF(q) elements that satisfy check equation m, given that the nth element in

the codeword x is equal to a.
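To make these definitions concrete, the short Python sketch below builds the adjacency sets Im and Jn from a toy parity-check matrix over GF(4) and evaluates the syndrome z = x × H^T. It is a minimal sketch under stated assumptions: the matrix, the table-free GF(4) arithmetic helper, and all function names are illustrative and not taken from the dissertation.

```python
def gf_mul(a, b, prim=0b111, p=2):
    """Multiply two GF(2^p) elements in the polynomial representation,
    reducing modulo the primitive polynomial 'prim' (x^2 + x + 1 for GF(4))."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << p):          # degree reached p: reduce
            a ^= prim
    return r

# Toy 2 x 4 parity-check matrix over GF(4); entries are field elements 0..3.
H = [[1, 2, 0, 3],
     [0, 1, 3, 2]]
M, N = len(H), len(H[0])

# I[m]: variable nodes adjacent to check node m; J[n]: check nodes adjacent to VN n.
I = [[n for n in range(N) if H[m][n] != 0] for m in range(M)]
J = [[m for m in range(M) if H[m][n] != 0] for n in range(N)]

def syndrome(x):
    """z_m = sum over n of h_{m,n} * x_n in GF(q); GF(2^p) addition is XOR."""
    z = []
    for m in range(M):
        acc = 0
        for n in I[m]:
            acc ^= gf_mul(H[m][n], x[n])
        z.append(acc)
    return z

print(I, J)
print(syndrome([1, 0, 1, 2]))     # [0, 0] -> this x lies in the nullspace of H
```

A decoder declares success when this syndrome is all-zero, which is exactly the termination check used by the iterative algorithms described in the following sections.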

2.1.2 Binary AWGN Channel

The Additive White Gaussian Noise (AWGN) channel often serves as a common baseline

for comparing the performance of a variety of codes. This is because this channel model is

relatively realistic (more so than simplistic channels such as the binary erasure channel (BEC)

or the binary symmetric channel (BSC)) while remaining mathematically manageable. For

example, thermal noise has a flat power spectral density (PSD) because the source of the

noise is the sum of the movement of charge of electrons excited by the external temperature,

which tends to have a Gaussian distribution due to the central limit theorem (CLT).

When non-binary GF(q) symbols are sent through a binary channel, each symbol is

divided into log2(q) bits which are the coefficients in the polynomial representation of the

symbol. The polynomial representation is chosen over the power representation (using log2(q)

bits to represent the exponent of the root of the primitive polynomial) because the GF(q)

operations of addition and multiplication are simpler.

In the AWGN channel (Figure 2.2), a signal x is sent by the transmitter, which encodes

some information that we wish to convey to the receiver. Noise n is added to this sent

value and the receiver observes y = x + n. The noise n is a random variable with a normal

distribution N(µ = 0, σ² = N0/2), where µ is the mean, σ² is the variance, and N0 is the noise

PSD. Assuming binary phase-shift keying (BPSK) modulation and equiprobable inputs,

the transmitter will send ±√Es and the received signal y will also be a random variable

with a normal distribution N(µ = ±√Es, σ² = N0/2), where Es is the energy per symbol sent
over the channel. With the redundancy introduced in FEC, the energy per information bit
transmitted can be expressed as Eb = Es/r. The signal-to-noise ratio Eb/N0 characterizes the


Figure 2.2: A visualization of the binary AWGN channel.

AWGN channel completely; in realistic scenarios and many simulations Es is kept constant

(the transmitter does not change its output power) and the noise level N0 is varied (the

channel conditions are changed), but equivalently, σ2 can be kept constant while µ is changed.

This signal-to-noise ratio can be expressed as:

$$\frac{E_b}{N_0} = \frac{\mu^2}{2r\sigma^2}. \qquad (2.1)$$

The probability density function (pdf) of a normal distribution N(µ, σ²) can be expressed as:

$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \qquad (2.2)$$

Given some received value y, the probability that the transmitter sent +√Es is:

$$p\left(x = +\sqrt{E_s} \;\middle|\; y\right) = p_0 = \frac{f(y, \mu, \sigma)}{f(y, \mu, \sigma) + f(y, -\mu, \sigma)}, \qquad (2.3)$$

and the probability that the transmitter sent −√Es is:

$$p\left(x = -\sqrt{E_s} \;\middle|\; y\right) = p_1 = 1 - p_0 = \frac{f(y, -\mu, \sigma)}{f(y, \mu, \sigma) + f(y, -\mu, \sigma)}. \qquad (2.4)$$


Since a GF(q) symbol is broken up into log2 q bits in transmission over the binary channel,

the probability that the symbol a ∈ GF(q) was sent given multiple received values can be

calculated as the product of log2 q probabilities given by Equations 2.3 and 2.4.

The log likelihood ratio (LLR) of a received bit can be computed as:

$$\ln\left(\frac{p_0}{p_1}\right) = \ln\left(\frac{p_0}{1 - p_0}\right) = \frac{4y\mu}{2\sigma^2}. \qquad (2.5)$$

We can normalize either µ or σ2 to be equal to 1. If µ = 1, y must have a distribution

N(µ = ±1, σ² = 1/(2r·Eb/N0)), and the LLR can be expressed as:

$$\ln\left(\frac{p_0}{p_1}\right) = 4yr\frac{E_b}{N_0}. \qquad (2.6)$$

If σ² = 1, y must have a distribution N(µ = ±√(2r·Eb/N0), σ² = 1), and the LLR can be expressed as:

$$\ln\left(\frac{p_0}{p_1}\right) = 2y\sqrt{2r\frac{E_b}{N_0}}. \qquad (2.7)$$

The LLR corresponding to a symbol a ∈ GF(q) is defined as:

$$L_n(a) = \ln\left(\frac{\Pr(x_n = \hat{a}_n)\big|_{\text{channel}}}{\Pr(x_n = a)\big|_{\text{channel}}}\right), \qquad (2.8)$$

where ân is defined to be the most likely field element for variable node n, i.e., Pr(xn = a)
is maximum when a = ân. Note that this definition of LLRs yields smaller LLRs for more

likely field elements, allowing the LLRs to be interpreted as a “distance” metric from the

most likely field element [26]. This symbol LLR can equivalently be calculated as the sum of

individual bit LLRs, normalized to the LLR of the most likely symbol so that the minimum

LLR is equal to zero. Either the probability or LLR a priori channel information is utilized

as input to the decoding algorithms.
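As a worked illustration of Equations 2.3 through 2.8, the sketch below converts received values into bit LLRs under the σ² = 1 normalization of Equation 2.7 and then combines log2(q) bit LLRs into normalized symbol LLRs Ln(a). The BPSK mapping (bit value 0 sent as +√Es), the example numbers, and the helper names are assumptions of this sketch only.

```python
import math

def bit_llr(y, r, ebno_db):
    """ln(p0/p1) for one received value with sigma^2 = 1 (Eq. 2.7):
    LLR = 2 * y * sqrt(2 * r * Eb/N0)."""
    ebno = 10.0 ** (ebno_db / 10.0)
    return 2.0 * y * math.sqrt(2.0 * r * ebno)

def symbol_llrs(bit_llrs):
    """Combine log2(q) bit LLRs into the symbol LLRs L_n(a) of Eq. 2.8,
    normalized so that the most likely symbol has LLR 0."""
    p = len(bit_llrs)
    logp = []
    for a in range(1 << p):
        s = 0.0
        for b in range(p):
            # bit b of the polynomial representation of a; a '0' bit adds
            # +LLR/2 to the symbol log-probability, a '1' bit adds -LLR/2
            s += -0.5 * bit_llrs[b] if (a >> b) & 1 else 0.5 * bit_llrs[b]
        logp.append(s)
    best = max(logp)
    return [best - lp for lp in logp]      # smaller value = more likely

# Two received values form one GF(4) symbol at rate r = 1/2 and Eb/N0 = 2 dB.
llrs = [bit_llr(y, 0.5, 2.0) for y in (0.9, -1.1)]
print(symbol_llrs(llrs))
```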


2.1.3 Probability Domain Decoding

In iterative decoding algorithms of NB-LDPC codes often referred to as “Message-Passing”

or “Belief-Propagation” algorithms, messages consisting of a vector of probabilities or LLRs

are passed back and forth between adjacent variable nodes and check nodes. Let Qm,n(a)

be the message from variable node n to check node m, and Rm,n(a) be the message from

check node m to variable node n, for the GF(q) element a. Let Qn(a) be the a posteriori

information of variable node n for the field element a, and yn be the hard decision of variable

node n, determined as the GF(q) element a associated with the minimum LLR in Qn(a).

Traditional NB-LDPC decoding is conducted as follows [16]. The superscript k indicates

the current iteration, and K is the maximum number of iterations allowed.

1) Initialization: The iteration index k is initialized to 0, and the a posteriori information

as well as the messages from the variable nodes are initialized to be equal to the a priori

probability information from the channel:

$$Q_n(a) = Q^{(0)}_{m,n}(a) = p_n(a). \qquad (2.9)$$

2) Termination Check : A hard decision y = (y0, y1, . . . , yN−1), y ∈ GF(q)N is made based

on the most likely symbol and the syndrome s = (s0, s1, . . . , sM−1), s ∈ GF(q)M is computed:

$$y_n = \arg\max_{a \in \mathrm{GF}(q)} Q_n(a), \qquad (2.10)$$

$$s = y \times H^T. \qquad (2.11)$$

If either s = 0 or k = K, then y is output as the result of the algorithm. Otherwise, k is

incremented by 1.

3) Check Node Processing : The messages from check nodes to variable nodes are updated:


$$R^{(k)}_{m,n}(a) = \sum_{(a_{n'}) \in A(m|x_n = a)} \left( \prod_{n' \in I_m \setminus \{n\}} Q^{(k-1)}_{m,n'}(a_{n'}) \right). \qquad (2.12)$$

The variable n′ is the index to the adjacent variable nodes for this check node m, except

for the destination of this message, n. The (dc−1)-tuple (an′) is a set of GF(q) elements that

satisfy check equation m, given xn = a. For each such solution set, the associated probability

is computed with the product (the probability that the first symbol is some symbol a0, AND

the second symbol is some symbol a1, AND etc.). The probabilities of all solutions sets are

summed (the probability that the correct solution is some set (a0), OR the correct solution

is some set (a1), OR etc.) and is used as the output message of this check node.

4) Variable Node Processing : The messages from variable nodes to check nodes are

updated:

$$Q'^{(k)}_{m,n}(a) = \alpha \times p_n(a) \times \prod_{m' \in J_n \setminus \{m\}} R^{(k)}_{m',n}(a), \qquad (2.13)$$

where α is some normalization scaling factor. This outgoing message from a variable node

is the product of probabilities from the channel and the adjacent check nodes except for

the destination check node of the message. The product is normalized so that the sum of

probabilities becomes 1.

In addition, the a posteriori information is updated (used to make a hard decision in the

next iteration):

$$Q_n(a) = p_n(a) \times \prod_{m \in J_n} R^{(k)}_{m,n}(a). \qquad (2.14)$$

5) Iteration: Go to step 2) Termination Check. ♦

Because Equation 2.12 is computed as the sum of products, this algorithm is also referred

to as the sum-product algorithm.
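For a small code, the check-node update of Equation 2.12 can be evaluated by brute force, which makes the sum-of-products structure explicit. The sketch below does exactly that for one check node; it is exponential in dc and intended only as a reference model, not as a practical implementation. It reuses the gf_mul helper from the GF(q) sketch in Section 2.1.1 (parameterized here for GF(4)), and all other names are illustrative.

```python
from itertools import product as cartesian

def check_node_update_spa(h_row, Q_in, n, q, prim=0b111, p=2):
    """R_{m,n}(a) per Eq. (2.12): enumerate every assignment of the other
    variable nodes, keep those satisfying sum_{n'} h_{m,n'} * x_{n'} = 0
    over GF(q), and accumulate the product of their probabilities."""
    others = [i for i in range(len(h_row)) if i != n and h_row[i] != 0]
    R = [0.0] * q
    for assignment in cartesian(range(q), repeat=len(others)):
        acc, prob = 0, 1.0
        for idx, x in zip(others, assignment):
            acc ^= gf_mul(h_row[idx], x, prim, p)   # partial GF(q) sum (XOR add)
            prob *= Q_in[idx][x]
        # the check is satisfied when h_{m,n} * x_n equals the partial sum
        for a in range(q):
            if gf_mul(h_row[n], a, prim, p) == acc:
                R[a] += prob
                break
    return R
```

Because |A(m|xn = a)| grows as q^(dc−2), this direct enumeration is precisely the complexity bottleneck that the FFT-QSPA and Min-Max simplifications of the following sections address.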


2.1.4 FFT-QSPA

Within the algorithm described in the previous section, the most computationally cumber-

some portion is Equation 2.12. Namely, the set A(m|xn = a) is large and impractical to find

directly. The Fast Fourier Transform-based Q-ary Sum-Product Algorithm (FFT-QSPA)

has been proposed by [18] as a simplification to the traditional decoding method, and has

yielded a speed-up in software simulations.

The key insight to deriving this simplification (as well as to get a good intuitive under-

standing of the decoding of NB-LDPC codes in general) is to observe that, in GF(q), the sum

of products described in Equation 2.12 can be thought of as a (log2 q)-dimensional circular

convolution of (dc−1) tensors with two discrete points in each dimension. Figure 2.3 depicts

an example in GF(8).

An element a ∈ GF(8) can be represented as a set of three bits {a2, a1, a0}, which

indicates the set of coefficients of the polynomial representing that element. To compile a

message of probabilities for some variable node, the probability of that variable node being

some element a ∈ GF(8) is placed in the tensor in the corresponding location, and the

tensor is sent back and forth as the message (instead of a one-dimensional vector) in the

message-passing algorithm. Once the tensor is populated, a permutation of the tensor, one

corresponding to each element a ∈ GF(8), can be defined (Figure 2.3(a)). The permutation

is conducted in such a way that any dimension with a “1” becomes swapped (indicated by

a gap in the figure).

In passing a tensor from a variable node to a check node, the tensor contents are shuffled so

that each location contains the probability corresponding to a×hi,j ∈ GF(8) where hi,j is the

edge label on the Tanner graph. To compute the element of the output tensor corresponding

to, for example {1, 0, 1}, we must find all of the sequences that would satisfy the check

equation (the finite field sum equals zero), if the last element were {1, 0, 1}. Equivalently, the

finite field sum of all elements except for the last element must equal {1, 0, 1}. This condition

is achieved through the permutation of the tensor; when one of two tensors undergoes a

permutation of {1, 0, 1}, then each element-by-element finite field sum becomes equal to


{1, 0, 1} (Figure 2.3(a)). Thus to populate the {1, 0, 1} space in the output tensor, the

sum of element-by-element products is computed. To compute the entire convolution, the

sum of products is computed for each permutation. Because the convolution operation is

associative, the entire check node computation can be taken care of two tensors at a time.

We can take this one step further by applying a common technique employed to reduce

the (computational and conceptual) complexity, which is to convert signals into the Fourier

domain. That is, this convolution in the “time” domain can be computed as a multiplica-

tion in the “frequency” domain. To perform this conversion, a (log2 q)-dimensional 2-point

discrete Fourier transform (DFT) is applied to each message tensor. A multi-dimensional 2-

point DFT is equivalent to a Walsh-Hadamard transform, which, unlike the one-dimensional

q-point DFT, does not require any multiplications by the so-called “twiddle factors” (ex-

cept for negation). After the tensors in the “frequency” domain are multiplied together

element-wise, the inverse DFT (conveniently, the Walsh-Hadamard transform is involutive)

is performed to complete the check node computation.

Armed with this insight, we can now define the FFT-QSPA [18] as follows.

1) Initialization: Same as Section 2.1.3.

2) Termination Check : Same as Section 2.1.3.

3) FFT-Based Check Node Processing : The messages from variable nodes to check nodes

are first permuted according to the corresponding element in the parity-check matrix H:

$$\tilde{Q}^{(k-1)}_{m,n}(a) = Q^{(k-1)}_{m,n}\!\left(h^{-1}_{m,n} \times a\right). \qquad (2.15)$$

Then, the element-wise product is computed in the “frequency” domain:

$$U^{(k)}_{m,n} = F\!\left(\tilde{Q}^{(k-1)}_{m,n}\right). \qquad (2.16)$$

$$V^{(k)}_{m,n}(a) = \prod_{n' \in I_m \setminus \{n\}} U^{(k)}_{m,n'}(a). \qquad (2.17)$$

$$\tilde{R}^{(k)}_{m,n} = F^{-1}\!\left(V^{(k)}_{m,n}\right). \qquad (2.18)$$

Finally, the outgoing messages to variable nodes are permuted back:

$$R^{(k)}_{m,n}(a) = \tilde{R}^{(k)}_{m,n}(h_{m,n} \times a). \qquad (2.19)$$

4) Variable Node Processing : Same as Section 2.1.3.

5) Iteration: Same as Section 2.1.3. ♦

The FFT-QSPA performs very well in software simulations and is often used as a basis

for comparison between various constructions of finite-length codes, and so on.
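A compact software model of this transform-domain check-node combination is sketched below: each incoming message is permuted by its edge weight, transformed with a fast Walsh-Hadamard transform (the (log2 q)-dimensional 2-point DFT described above), multiplied element-wise, and transformed back. The sketch combines all dc incoming messages at once (a real decoder excludes the destination edge and permutes the result back as in Equation 2.19), it reuses the gf_mul helper from Section 2.1.1 with its default GF(4) field parameters, and the routine names are illustrative assumptions.

```python
def wht(v):
    """Fast Walsh-Hadamard transform of a length-q vector (q a power of two).
    Applying it twice returns q times the original vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

def permute_by_h(msg, h_coef, prim=0b111, p=2):
    """Shuffle indices so that entry a holds Pr(h * x = a), cf. Eq. (2.15)."""
    out = [0.0] * len(msg)
    for x in range(len(msg)):
        out[gf_mul(h_coef, x, prim, p)] = msg[x]
    return out

def cn_combine_fft(msgs, h_row, prim=0b111, p=2):
    """'Frequency-domain' product over one check node: permute each message by
    its edge weight, transform, multiply element-wise, transform back,
    and rescale (the inverse WHT is WHT divided by q)."""
    q = len(msgs[0])
    acc = [1.0] * q
    for msg, h_coef in zip(msgs, h_row):
        u = wht(permute_by_h(msg, h_coef, prim, p))
        acc = [a * b for a, b in zip(acc, u)]
    return [x / q for x in wht(acc)]
```

The q·log2(q) butterfly structure of wht replaces the explicit multi-dimensional circular convolution, which is where the speed-up of the FFT-QSPA comes from.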

2.1.5 Min-Max

Thus far, the decoding algorithms have been manipulating probabilities, which are numbers

residing between 0 ≤ p ≤ 1. However, it is highly desirable to deal with LLRs in a hardware

implementation of the decoding, from a numerical stability perspective (for example, when

two small probabilities are multiplied together, the result is an extremely small number).

Furthermore, the multiplications (in the probability domain) required in the variable node

computations simplify to additions (in the LLR domain).

However, the transformation of probabilities to LLRs makes the addition of probabilities

rather difficult. Not only does this mean that the DFT becomes difficult to apply, but also

even without the DFT the check node computation becomes difficult. Thus, several approx-

imations (akin to those in simplified binary LDPC decoding schemes) becomes necessary.

Since the sum of two probabilities is dominated by the larger probability, we can approxi-

mate pa + pb ≈ max (pa, pb). In the LLR domain, this becomes the minimum function, since

symbols with higher probability have a smaller LLR (as mentioned in Section 2.1.2).

Another approximation can be made in the same vein, in order to simplify the decoding

further. While the multiplication of probabilities can be computed as the sum of LLRs, the

outcome of this function is dominated mostly by the smaller of probabilities, or the larger


of LLRs (essentially an approximation of the L1 norm by the L∞ norm). Therefore, the

sum-of-products in Equation 2.12 can be computed as the minimum-of-maximums, allowing

the decoding to occur in the LLR domain. This leads to the definition of the Min-Max

algorithm [26]:

1) Initialization: The iteration index k is initialized to 0, and the a posteriori information

as well as the messages from the variable nodes are initialized to be equal to the a priori

LLR information:

$$Q_n(a) = Q^{(0)}_{m,n}(a) = L_n(a). \qquad (2.20)$$

2) Termination Check : A hard decision y = (y0, y1, . . . , yN−1), y ∈ GF(q)N is made and

the syndrome s = (s0, s1, . . . , sM−1), s ∈ GF(q)M is computed:

$$y_n = \arg\min_{a \in \mathrm{GF}(q)} Q_n(a), \qquad (2.21)$$

$$s = y \times H^T. \qquad (2.22)$$

If either s = 0 or k = K, then y is output as the result of the algorithm. Otherwise, k is

incremented by 1.

3) Check Node Processing : The messages from check nodes to variable nodes are updated:

$$R^{(k)}_{m,n}(a) = \min_{(a_{n'}) \in A(m|x_n = a)} \left( \max_{n' \in I_m \setminus \{n\}} Q^{(k-1)}_{m,n'}(a_{n'}) \right). \qquad (2.23)$$

The variable n′ is the index to the adjacent variable nodes for this check node m, except

for the destination of this message, n. The (dc − 1)-tuple (an′) is a set of GF(q) elements

that satisfy check equation m, given xn = a. From each such solution set, the least likely

symbol and its LLR are found by the max function and are associated with the set. Of these

LLRs, the most likely one is found by the min function and is used as the output message

of this check node.


4) Variable Node Processing : The messages from variable nodes to check nodes are

updated:

$$Q'^{(k)}_{m,n}(a) = L_n(a) + \sum_{m' \in J_n \setminus \{m\}} R^{(k)}_{m',n}(a), \qquad (2.24)$$

and

$$Q^{(k)}_{m,n}(a) = Q'^{(k)}_{m,n}(a) - \min_{a \in \mathrm{GF}(q)} Q'^{(k)}_{m,n}(a). \qquad (2.25)$$

The outgoing message from a variable node is the sum of LLRs from the channel and

the adjacent check nodes except for the destination check node of the message. This sum is

normalized so that the LLR of the most likely symbol is always 0.

In addition, the a posteriori information is updated:

$$Q_n(a) = L_n(a) + \sum_{m \in J_n} R^{(k)}_{m,n}(a). \qquad (2.26)$$

5) Iteration: Go to step 2) Termination Check. ♦

To avoid direct computation of Equation 2.23, the forward-backward computation is

often employed [26]. Forward and backward metrics are first calculated serially based on

input messages from adjacent variable nodes, and the output messages to variable nodes are

calculated by combining the forward and backward metrics. Let ni, 0 ≤ i ≤ (dc − 1), be

indices of the adjacent variable nodes to some check node m, i.e. ni ∈ Im. The metrics and

the output messages are calculated recursively as follows:

Forward Metrics (i = {0, 1, ..., dc − 2}):

$$F_0(a) = Q_{m,n_0}\!\left(h^{-1}_{m,n_0} \times a\right), \qquad (2.27)$$

$$F_i(a) = \min_{a' + h_{m,n_i} \times a'' = a} \left( \max\left(F_{i-1}(a'),\; Q_{m,n_i}(a'')\right) \right). \qquad (2.28)$$


Backward Metrics (i = {dc − 1, dc − 2, ..., 1}):

$$B_{d_c-1}(a) = Q_{m,n_{d_c-1}}\!\left(h^{-1}_{m,n_{d_c-1}} \times a\right), \qquad (2.29)$$

$$B_i(a) = \min_{a' + h_{m,n_i} \times a'' = a} \left( \max\left(B_{i+1}(a'),\; Q_{m,n_i}(a'')\right) \right). \qquad (2.30)$$

Output Messages (i = {0, 1, ..., dc − 1}):

$$R_{m,n_0}(a) = B_1(a), \qquad (2.31)$$

$$R_{m,n_i}(a) = \min_{a' + a'' = -h_{m,n_i} \times a} \left( \max\left(F_{i-1}(a'),\; B_{i+1}(a'')\right) \right), \qquad (2.32)$$

$$R_{m,n_{d_c-1}}(a) = F_{d_c-2}(a). \qquad (2.33)$$

The basic operations inherent in the forward-backward computations are finding the

maximums of pairs of numbers, then finding the minimum of many numbers (Figure 2.4).

This minimum-of-maximum function is a core building block when considering the decoder

implementation.
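The elementary step of these forward-backward recursions (Equation 2.28 and its mirror, Equation 2.30) can be written in a few lines; the q² pairings it scans are exactly the pair-wise maximum followed by the overall minimum described above. The sketch below is a minimal model that assumes LLR vectors indexed by the polynomial representation of GF(2^p) elements (so field addition is a bitwise XOR) and reuses the gf_mul helper from Section 2.1.1; the function name is illustrative.

```python
def min_max_step(F_prev, Q_i, h_i, prim=0b111, p=2):
    """One forward step of Eq. (2.28):
    F_i(a) = min over {a' + h_i * a'' = a} of max(F_{i-1}(a'), Q_i(a''))."""
    q = len(F_prev)
    F = [float("inf")] * q
    for a1 in range(q):                          # a'  : previous forward metric
        for a2 in range(q):                      # a'' : incoming message index
            a = a1 ^ gf_mul(h_i, a2, prim, p)    # GF(2^p) addition is XOR
            cand = max(F_prev[a1], Q_i[a2])
            if cand < F[a]:
                F[a] = cand
    return F
```

Running this step dc − 1 times in each direction and combining the forward and backward metric chains per Equations 2.31–2.33 produces the check-node outputs; the fully parallel architecture of Chapter 3 implements this same kernel with a MIN-MAX tree.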

2.2 Hardware Implementation

The improved coding gain over binary LDPC codes has sparked interest and a large amount

of research in NB-LDPC codes and their decoders, but these codes remain impractical due

to the decoder hardware implementation complexity. One of the first reported realizations of

NB-LDPC decoders in hardware was [27], which implemented a GF(8) code with N = 720,

with an achieved throughput of 1Mbps on their FPGA prototype. Their algorithm of choice

was the FFT-QSPA (but with conversions between probability and LLR domains), and the

throughput is low because of the highly serial architecture.

A summary of selected prior art is shown in Table 2.1. The complexity problem


of NB-LDPC codes is quite clear. On one hand, implementations of standard decoding

algorithms with respectable code parameters [19, 20, 21, 22] are all quite costly for moderate

throughputs. On the other hand, to achieve high data throughputs, either a trivial code

must be chosen [23] or a simplistic decoding algorithm must be implemented [24], both of

which result in a severe degradation of coding gain and a high error floor.

[Figure 2.3(a): the three tensor dimensions a2, a1, a0 and the bit patterns 000–111 locating each GF(8) element; Figure 2.3(b): the sum of element-by-element products that populates output entry 101.]

Figure 2.3: Example in GF(8) of a 3-dimensional circular convolution of tensors. (a)

The indication of dimensions when each element a ∈ GF(8) is represented as three bits

{a2, a1, a0}, and the corresponding permutations of tensors. (b) An example for finding one

element of the output tensor of the convolution operation as the sum of products.


Figure 2.4: Conceptual diagram of check node computations with forward-backward cal-

culations, for dc = 6. The solid circles represent basic minimum-of-maximum computations,

whereas the dotted circles are simple connections with no computations. F and B indicate

where the forward and backward metrics are calculated.


Table 2.1: Summary of Selected Published Works¹

                                     [19]                             [20]
GF(q)                            32       32       32       32       32       32
Code Length (Symbols)            620      744      837      837      620      248
dv, dc                           3,6      3,24     4,27     4,27     3,6      4,8
Decoding Algorithm               Min-Max  Min-Max  Min-Max  Min-Max  Min-Max  Min-Max
Frequency (MHz)                  200      200      200      260      260      260
Maximum Iterations               15       15       15       15       10       10
Throughput (Mbps)                21       21       16       29       66.6     47.7
Gate count estimate (10^6 NAND)  1.24     1.07     1.37     3.28     2.14     1.92

                                     [21]     [22]              [23]     [24]
GF(q)                            32       32       32       64       32
Code Length (Symbols)            837      744      837      160      837
dv, dc                           4,27     3,24     4,27     2,4      4,26
Decoding Algorithm               Min-Max  SMS      SMS      EMS      SES-GBFDA
Frequency (MHz)                  150      200      200      700      277
Maximum Iterations               15       15       15       10–30    10
Throughput (Mbps)                10       64       64       1150     716
Gate count estimate (10^6 NAND)  1.6      1.05     1.29     2.78     0.47 (XOR)


CHAPTER 3

The Pruned Min-Max Algorithm

3.1 Figure-of-Merit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.1 Parameters and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 26

3.1.2 The Fully Parallel Architecture . . . . . . . . . . . . . . . . . . . . . . . 27

3.2 Algorithm Strategy: Pruned Min-Max Decoding . . . . . . . . . . . . . . 33

3.2.1 Derivation of the Proposed Simplification . . . . . . . . . . . . . . . . . 33

3.2.2 Analysis of Decoding Performance . . . . . . . . . . . . . . . . . . . . . 36

3.2.3 Cost Analysis of the Pruned Min-Max Algorithm . . . . . . . . . . . . . 40


In the ASIC implementation of any DSP algorithm, changes in the algorithm itself have
the largest impact on obtainable hardware performance (throughput, power, etc.).

Therefore it is natural to first investigate potential ways to simplify the decoding of NB-

LDPC codes in order to make them more practical. In this chapter we introduce our proposed

simplifications to the Min-Max algorithm, which we call the Pruned Min-Max algorithm,

and explore the effects of our proposed changes on the coding gain as well as computational

complexity.

The contents of this chapter are mostly published in [28].

3.1 Figure-of-Merit

Before we discuss the algorithm itself, first we will introduce a figure-of-merit (FOM) which

we will utilize to quantify the computational complexity and how that translates into hard-

ware resources.

An analytical expression for the throughput of an NB-LDPC decoder is relatively straight-

forward to derive. First, let z be the number of clock cycles required to calculate a single

iteration of the decoding algorithm. Next, we assume that the decoder is operating in a

low FER/BER regime so that the output of the decoder converges to a codeword relatively

quickly most of the time, and the average number of iterations required per codeword is

denoted as Kavg. The average number of iterations per codeword, Kavg, is assumed to be

independent of the maximum number of iterations K, because the input noise realizations

that cause the decoder to take many iterations to converge are rare (although other factors

such as the maximum latency of the decoder or the required input buffer length are deter-

mined by the worst case K). Then, the product z ×Kavg is the number of clock cycles the

decoder requires per codeword on average. If the digital circuitry operates at some clock

frequency fclk, and the decoder is designed for a particular code whose length is B bits, then


the average throughput T of the iterative decoder in bits per second can be expressed as:

$$T\left[\tfrac{\text{bits}}{\text{s}}\right] = \frac{f_{clk}\left[\tfrac{\text{cycles}}{\text{s}}\right] \times B\left[\tfrac{\text{bits}}{\text{codeword}}\right]}{z\left[\tfrac{\text{cycles}}{\text{iteration}}\right] \times K_{avg}\left[\tfrac{\text{iterations}}{\text{codeword}}\right]}. \qquad (3.1)$$

Of these four parameters, fclk, B, and z contribute to the cost of hardware directly, whereas

Kavg does not. Therefore, we propose the following definition for a new figure-of-merit

(FOM):

$$\mathrm{FOM} = \frac{\text{Throughput} \times K_{avg}}{\text{NAND gate count}} = \frac{f_{clk} \times B}{z \times G}, \qquad (3.2)$$

where G is the equivalent number of NAND gates in the design. This FOM is a single number
indicative of hardware “efficiency,” that is, of how well the circuit in
question fares in the speed-area tradeoff space. This is obviously closely related to the area

efficiency, and thus it can be interpreted in a similar way (the higher, the better). The

multiplication by Kavg can be thought of as adjusting the throughput to be the hypothetical

throughput if the decoder completed decoding codewords in a single iteration. Therefore,

this proposed FOM is more agnostic of Kavg, a coding gain related parameter, and thus

is a more accurate indicator of the implications of implementation than is the simplistic

throughput-area ratio. Through this formulation, we will arrive at estimations of the FOM

as a function of q by deriving estimates for z and G.
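The bookkeeping behind Equations 3.1 and 3.2 is summarized in the small sketch below; the numbers plugged in at the bottom are placeholders chosen only to show the units, not measured results from this work.

```python
def avg_throughput_bps(f_clk, B, z, k_avg):
    """Average decoder throughput of Eq. (3.1): f_clk * B / (z * K_avg)."""
    return f_clk * B / (z * k_avg)

def fom(f_clk, B, z, gates):
    """Figure-of-merit of Eq. (3.2): Throughput * K_avg / gate count
    = f_clk * B / (z * G), which is independent of K_avg."""
    return f_clk * B / (z * gates)

# Placeholder numbers: 500 MHz clock, 1200-bit codeword, 40 cycles/iteration,
# 3 iterations/codeword on average, 2 million NAND-equivalent gates.
print(avg_throughput_bps(500e6, 1200, 40, 3) / 1e9, "Gb/s (hypothetical)")
print(fom(500e6, 1200, 40, 2.0e6), "bps*iterations/gate (hypothetical)")
```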

3.1.1 Parameters and Assumptions

We begin by defining input parameters to our model and listing the assumptions we make

in our modeling approach. Let B be the length of the codeword in bits, and w be the

quantization, or the number of bits in each message LLR. We will limit our discussion to

Galois fields whose order is a power of 2. Thus, the number of variable nodes in the Tanner

graph is N = B/ log2(q).

Standard-cell areas are based on estimates from a 65nm general-purpose standard-cell

library. We approximate the areas of D-Flip-Flops (DFF), full-adder cells (FA), and 2-to-1


Figure 3.1: Top-level architecture for a fully parallel decoder.

multiplexors (MUX) to be equivalent to 5, 5 and 2.5 NAND gate areas, respectively. We

assume that a 2-input, w-bit adder is implemented as a ripple-carry adder and consists

of w FAs and w DFFs, therefore consuming 10w NAND gates in area and 1 clock cycle

to execute. Furthermore, we assume that a 2-input w-bit minimum (MIN) or maximum

(MAX) function to be equivalent in area and latency to a 2-input w-bit adder. Conceptually,

this assumption makes sense because a similar “carry” signal must be generated for both

operations. Moreover, a simple gate-level synthesis of these blocks for various w and fclk

validates this approximation. An N -input adder is implemented with a tree of (N − 1)

2-input adders which takes log2(N) clock cycles to finish computation. A similar approach

is taken for MIN, MAX, and MUX. To arrive at a total equivalent NAND gate count, we

assume that storage of one SRAM bit requires roughly 1.5 NAND gates [20]. The latency of

any memory block is assumed to be 1 clock cycle.

Because we are aiming to arrive at a FOM that captures the ratio of throughput and area,

decisions to implement low-level functions in serial or parallel are taken to have negligible

effect in our result. Also, the details of implementation or scheduling are not optimized and

the control logic overhead is ignored for simplicity.
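For reference, the area and latency assumptions above can be collected into a few helper functions like the ones sketched below, which later subsections implicitly use to tally GVNU, GCNU, zVNU, and zCNU. The constants mirror the numbers stated in the text; the function names and the tree reading of the N-input MUX are my own illustrative choices.

```python
import math

NAND_PER_DFF = 5.0        # D-flip-flop
NAND_PER_FA = 5.0         # full adder
NAND_PER_MUX2 = 2.5       # 2-to-1 multiplexor
NAND_PER_SRAM_BIT = 1.5   # on-chip storage, per stored bit

def gates_2in_w(w):
    """2-input, w-bit ripple-carry adder, MIN, or MAX: w FAs + w DFFs = 10w."""
    return w * (NAND_PER_FA + NAND_PER_DFF)

def tree_2in_w(n_inputs, w):
    """N-input adder/MIN/MAX tree: (N - 1) two-input units, log2(N) cycles."""
    return (n_inputs - 1) * gates_2in_w(w), math.ceil(math.log2(n_inputs))

def mux_tree_gates(n_inputs, w):
    """N-to-1, w-bit multiplexor assumed built from (N - 1) 2-to-1 MUXes per bit."""
    return (n_inputs - 1) * w * NAND_PER_MUX2

def memory_gates(bits):
    """NAND-equivalent area of SRAM message storage (1.5 gates per bit)."""
    return bits * NAND_PER_SRAM_BIT
```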

3.1.2 The Fully Parallel Architecture

In the two-phase fully parallel decoder architecture, all the variable node messages are com-

puted in one phase, and all the check node messages are computed in another phase, allowing

for a high throughput [29]. The overall architecture requires N variable node computation

[Figure 3.2: VNU signal flow with channel input Ln, incoming messages R, outgoing messages Q, and hard-decision output Y, for panels (a) and (b).]

Figure 3.2: VNU architectures for the Min-Max decoder, for (a) dv = 2 and (b) dv = 3.


Figure 3.3: Implementation of the CNU for the Min-Max algorithm.

units (VNU) and M check node computation units (CNU), with memories for the messages

embedded in these computation units (Figure 3.1). Therefore, the total NAND gate count

for this architecture is:

G = N ×GVNU +M ×GCNU, (3.3)

and the total number of clock cycles required per iteration is:

z = zVNU + zCNU, (3.4)

assuming there is no overlap.

The VNU receives dv incoming messages from adjacent CNUs, as well as the channel a

priori message, and generates dv outgoing messages to adjacent CNUs. The input messages

and channel a priori information are summed together, and the resulting message is nor-

malized (Figure 3.2). We will limit our discussion to codes with dv = {2, 3} for simplicity,

although the analysis holds more generally. The q-input minimum function used for normal-

ization can be implemented as a tree of (q − 1) 2-input minimum functions, which would

take log2(q) clock cycles to compute its output. Now that the VNU is implemented with


blocks for which the gate count and delay are known, GVNU as well as zVNU can be estimated

based on the assumptions outlined above. For example, for dv = 2, GVNU = 10w(7q − 2),

and zVNU = log2(q) + 2.
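As an illustration of this block-level bookkeeping (the helper names below are ours, and the sketch is not the exact cost spreadsheet behind the reported numbers), the following Python fragment encodes the 10w-gate, 1-cycle adder primitive and the (N − 1)-operator tree rule, and prints the cost of the q-input minimum used for VNU normalization, for the w = 6 assumed later in this section:

```python
import math

def adder_2in(w):
    """2-input, w-bit ripple-carry adder from the model:
    w FA cells + w DFFs (5 NAND equivalents each) = 10w gates, 1 clock cycle."""
    return {"gates": 10 * w, "cycles": 1}

def op_tree(n, w):
    """n-input adder, MIN, or MAX built as a tree of (n - 1) 2-input operators,
    finishing in log2(n) clock cycles (n assumed to be a power of 2)."""
    unit = adder_2in(w)
    return {"gates": (n - 1) * unit["gates"],
            "cycles": int(math.log2(n)) * unit["cycles"]}

if __name__ == "__main__":
    w = 6  # message wordlength assumed in the FOM comparison of this section
    for q in (4, 8, 16, 32, 64):
        norm = op_tree(q, w)  # q-input minimum used for VNU normalization
        print(f"GF({q}): q-input MIN tree ~ {norm['gates']} gates, {norm['cycles']} cycles")
```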

The CNU receives dc incoming messages from adjacent VNUs and generates dc outgoing

messages. The Min-Max algorithm is implemented by computing the forward-backward


Figure 3.4: Architecture of butterfly MUX structure, shown for GF(8). This particular

example shows addition by the GF(q) element α + 1, where α is the root of the primitive

polynomial and all GF(q) elements are in the polynomial representation.


Figure 3.5: MIN-MAX computation in a tree architecture, for GF(8).


metrics (Figure 3.3). This divides the task of the CNU down to performing the minimum-of-

maximum operation between two vectors at a time. The variable permutation of messages,

represented as the block labeled “P” in Figure 3.3, can be implemented in a butterfly MUX

structure (Figure 3.4) (a similar structure has been proposed in [30]). This structure allows a

vector of LLRs to be permuted correctly, given that the select signal binary representation as

well as the message vector LLR ordering are both in the polynomial representation of GF(q)

elements. The MIN-MAX block computes the minimum of the pair-wise maximums, which

is the basic computation necessary in the forward-backward calculations (Figure 3.5) [19].
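To make these two CNU building blocks concrete, the Python sketch below models the message permutation of Figure 3.4 and the elementary min-of-pairwise-max step of Figure 3.5. It assumes GF(2^p) addition in the polynomial representation (an XOR of the indices) and an unweighted combination in which the field coefficients have already been absorbed by the permutation; the function names are ours.

```python
def gf_add_permute(llr, h):
    """Permute a length-q LLR vector by GF(2^p) addition with the element h.
    In the polynomial representation, field addition is a bitwise XOR of the
    indices; the butterfly MUX of Figure 3.4 realizes the same mapping with
    log2(q) stages of 2-to-1 multiplexers, one stage per bit of h."""
    q = len(llr)
    return [llr[a ^ h] for a in range(q)]

def min_max_step(A, B):
    """Elementary forward-backward step (the MIN-MAX block of Figure 3.5):
    Theta(a) = min over all (a1, a2) with a1 + a2 = a of max(A(a1), B(a2)),
    with the GF coefficients assumed to be absorbed by prior permutations."""
    q = len(A)
    return [min(max(A[a1], B[a1 ^ a]) for a1 in range(q)) for a in range(q)]
```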

From equations (3.3) and (3.4), and the analysis of individual computation units, FOM

estimations for the overall architecture can be computed as a function of q (Figure 3.6).

For comparison, the FOM of published works with the same (dv, dc) [19, 20] are estimated

using reported throughputs and gate counts, and are overlaid on the same plot. Although

implementation results are only available in one field order, the match increases our confi-

dence in the accuracy of our FOM calculations. In our analysis, we assume w = 6, because

the bitwidths in reported works of [19, 20] vary from 5 to 7. We also assume that the fclk

achievable by the architecture is 500 MHz, which, while somewhat an arbitrary choice, is


Figure 3.6: FOM (in bps·iterations/gates) vs. q for the Min-Max decoder, for (3, 6) codes. Published estimates are from [19] and [20].


also a reasonable one given the technology node as well as the conservative insertion of flip-

flops in the estimation of required NAND gates. This choice will also be validated in a later

section through the use of physical synthesis tools. It is also noted that [19] and [20] esti-

mate achievable throughputs under the assumption that each codeword takes K maximum

iterations to decode. Therefore, their FOMs are calculated using K rather than Kavg to be

consistent.

Because the FOM is an indicator of the inherent tradeoff between speed and area, this

result quantifies the amount of penalty that must be paid when choosing to implement a

code in a higher field order. It is also interesting to note that for a given (dv, dc), G grows

linearly with respect to B, because in equation (3.3), N = B/ log2(q), and M = N(dv/dc).

Since T also increases linearly with B as indicated by equation (3.1), it follows that the

FOM of the overall architecture is constant with respect to B. Therefore, the code length

implemented will be constrained by other considerations, such as the overall required system

latency.

Our definition of the FOM and its analytical expression in equation (3.2) has remained

generic and thus can be applied to analyze binary LDPC decoders. However, we provide

limited discussion on binary LDPC decoders for the following two reasons. First, the accuracy

of our modeling approach is degraded, because binary decoders are more likely to have a

costly routing network and a low silicon area utilization [9, 13, 29], relative to NB-LDPC

decoders. Therefore, a block-level resource estimation based only on the required operations

will most likely overestimate the FOM. Second, the fairness of a direct comparison of FOMs

of existing works between binary and non-binary decoders is somewhat questionable. Not

only are the implemented algorithms different, but also the maturity of the field of binary

LDPC decoder implementations has yielded various features in designs which differ from

those of the existing NB-LDPC decoders. Namely, the architectures of the state-of-the-art

binary LDPC decoders are for irregular codes and have rate programmability [9, 12], for

higher performance in practical systems.

Crude estimations can give us some information, however. For example, the architecture

in [12], based on the rate-1/2 LDPC code in the WiMAX standard, has an estimated FOM of


∼ 13600 according to equation (3.2), which is ∼ 7 times higher than that of the Min-Max

algorithm implemented for (3, 6) codes in GF(4), even without taking rate programmability

into account. This is the gap which must be closed (or the penalty that must be paid)

in order to realize NB-LDPC decoders as a practical solution to communication systems.

In general, while FOMs can be estimated from published works for binary decoders, it is

difficult to draw informed conclusions beyond the fact that binary LDPC decoders have

achieved higher FOMs than NB-LDPC decoders.

3.2 Algorithm Strategy: Pruned Min-Max Decoding

3.2.1 Derivation of the Proposed Simplification

The notable complexity in the Min-Max algorithm comes from the check node computation.

This computation is conceptually complex because the set A(m|xn = a) is very large; more

specifically,

\[
|A(m \mid x_n = a)| = q^{d_c - 2}. \tag{3.5}
\]

Out of this set, one LLR for each element in GF(q) must be found as a particular message

LLR, Rm,n(a). The forward-backward calculations mitigate this problem by conducting the

search for the output LLR indirectly. However, an intelligent reduction of the set A(m|xn =

a) may potentially further simplify calculations while retaining error-correcting performance.

We propose the reduction of this set through the following steps.

1) Tentative Hard Decisions : Compute the tentative hard decision of the output messages

of variable nodes:

\[
a_{n_i} = \operatorname*{argmin}_{a \in \mathrm{GF}(q)} Q_{m,n_i}(a). \tag{3.6}
\]

2) Assumption of Existence of Errors: For any output message from a check node, assume that, out of the (dc − 1) tentative hard decisions, at most e < (dc − 1) of them are erroneous.


Figure 3.7: Tree representation of the proposed simplification in the check node computa-

tion, for dc = 4 and e = 2. The dotted lines indicate the “pruned” branches.

Thus, a new set A′(m|xn = a) can be defined as the set of GF(q) elements that satisfy check

equation m given xn = a, with the additional constraint that at least (dc − e − 1) of the

elements in this set must be a tentative hard decision. With this simplification, the size of

the set of LLRs to be considered in the check node computation is reduced:

\[
|A'(m \mid x_n = a)| = \binom{d_c - 1}{e} \times q^{\,e-1}. \tag{3.7}
\]

Thus, a new check node computation step in a simplified Min-Max algorithm can be

defined by replacing the set A with A′. A similar reduction has been proposed in [22] for the

EMS algorithm. We will take advantage of the fact that each output message of a check node in the Min-Max algorithm is one of the LLRs from the output messages of variable nodes

and propose an additional simplification step.

3) Pruning of Hard-Decision LLRs : To further reduce the search space of LLRs in

the check node computation, we generate a tree of LLRs that is considered in the check

node computation (Figure 3.7). The root of this tree represents the output message of

the check node. Each branch stemming from the root represents one element in the set

A′(m|xn = a). There are (dc − 1) leaves connected to each of these branches, signifying each LLR


Figure 3.8: FER simulation results for (3, 6) codes of length ∼ 1500 bits and various GF(q).

corresponding to the GF(q) element in the sequence that satisfies the check equation. In

the Min-Max algorithm, the leaves with maximum LLRs in each branch are found. Then,

of those maximum LLRs, the minimum is found as the output.

However, in the simplification that we have proposed above, (dc − e− 1) of the LLRs in

each branch are actually LLRs corresponding to tentative hard decisions, which means they

are the minimum LLR out of the message vector that comes from variable nodes. In fact,

the LLRs corresponding to hard decisions are always zero, with our particular definition

of LLRs and the normalization scheme that occurs at the end of variable node processing.

Therefore, these LLRs do not need to be considered because they will never be selected as

the maximum LLR of that particular branch. The new check node computation step is now

given by:

\[
R^{(k)}_{m,n}(a) = \min_{(a_{n'}) \in A'(m \mid x_n = a)} \left( \max_{n' \in I_m \setminus \{n,\,(n)\}} Q^{(k-1)}_{m,n'}(a_{n'}) \right), \tag{3.8}
\]


where (n) indicates the set of adjacent variable nodes whose LLRs are tentative hard deci-

sions.

In the first simplification step, we proposed to reduce the number of branches that stem

from the root of the tree by changing the search space from A(m|xn = a) to A′(m|xn = a). In

the second step, we proposed to eliminate leaves at the bottom of the tree by not considering

tentative hard-decision LLRs. Due to this action of pruning the LLR tree, we call our

proposed algorithm the Pruned Min-Max algorithm.
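As a behavioral illustration of equation (3.8) (not the hardware datapath of Section 3.2.3), the Python sketch below computes one pruned output LLR. It assumes an unweighted check equation with GF(2^p) addition modeled as XOR, inputs normalized so that the hard-decision LLR is zero, and hypothetical function and variable names.

```python
from itertools import combinations, product

def pruned_cnu_output(Q, h, n, a, e=2, q=8):
    """One output LLR R_{m,n}(a) under the Pruned Min-Max simplification.
    Q[i][s]: variable-to-check LLR of neighbor i for symbol s (normalized so
    that Q[i][h[i]] = 0); h[i]: tentative hard decision of neighbor i; n: the
    target neighbor; a: the target symbol.  The check equation is taken, for
    illustration only, to be the plain (unweighted) GF sum of all neighbor
    symbols equal to zero, with GF addition modeled as XOR."""
    others = [i for i in range(len(Q)) if i != n]
    best = None
    # choose which e of the (dc - 1) other neighbors are allowed to deviate
    for dev in combinations(others, e):
        fixed_sum = 0
        for i in others:
            if i not in dev:
                fixed_sum ^= h[i]          # pruned leaves: hard decisions, LLR = 0
        # sweep e - 1 deviating symbols freely; the last one is forced by the check
        for free in product(range(q), repeat=e - 1):
            forced = a ^ fixed_sum
            for s in free:
                forced ^= s
            syms = dict(zip(dev[:-1], free))
            syms[dev[-1]] = forced
            branch_max = max(Q[i][syms[i]] for i in dev)  # hard-decision LLRs (= 0) are pruned
            best = branch_max if best is None else min(best, branch_max)
    return best
```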

3.2.2 Analysis of Decoding Performance

The Pruned Min-Max algorithm with e = 2 is simulated for a variety of codes, and the FER

and BER performance is compared against that of the original Min-Max algorithm (Figure

3.8). It can be seen that the modifications of the Pruned Min-Max algorithm incur very

little decoding performance degradation relative to the Min-Max algorithm. Simulations are

conducted for a variety of code lengths (∼ 1500, 2500 bits), parity-check matrix structures

(random, quasi-cyclic), field orders (GF(4, 8, 16, 32)), and variable-node degrees (dv = 2, 3),

which are not shown but give similar results. The choice of e is a critical factor which

affects both the performance and the hardware cost. For the most savings in computational

complexity, we would like to minimize e (in the limit, e = 0 is a decoder which passes around

only hard information). We have found through simulations that e < 2 incurs significant

performance degradation (not shown), whereas e = 2 maintains the decoding performance

close to that of the Min-Max algorithm, leading us to the choice of e = 2.

However, a simple direct comparison of error performances of the two algorithms seems

rather superficial and insufficient to conclude that one is a valid replacement candidate for the

other. Therefore, in order to understand the similar performances of the Min-Max and the

Pruned Min-Max decoding algorithms, we analyze the error profiles of these two algorithms

through simulations [31]. Given the same channel realizations, which are the inputs to

the decoding algorithms, the output vectors in error have been compared and investigated,

again for a variety of code parameters. To minimize the uncertainty of the simulation, a


sufficiently large sample size of frame errors (∼ 100) is simulated. Once the errors are

identified by simulating an appropriate sample size in both decoders, the following scenarios

are considered: (i) identical errors that are caused by a particular channel realization in

both decoders, (ii) different errors caused by the same channel realization in both decoders,

and (iii) errors in only one of the two decoders caused by any realization. The sets of errors described by (i), (ii), and (iii) are denoted as X, Y, and Z, respectively.

We first characterize the three scenarios of X, Y , and Z by considering the non-binary

absorbing sets of the code. Absorbing sets are of interest because decoding algorithms

have been shown to converge to these non-codeword objects in the Tanner graph, causing

erroneous outputs [32]. A subset V of the variable nodes, with |V| = a, is an (a, b) non-binary

absorbing set over GF(q) if there exists a vector of GF(q) elements (v1, v2, . . . , va) for V such

that 1) there are exactly b unsatisfied check nodes connected to V , and 2) for each variable

node in V , the number of adjacent satisfied check nodes is larger than the number of adjacent

unsatisfied checks. An example of a (4, 4) non-binary absorbing set over GF(8) is shown in

Figure 3.9. In this example, if (v1, v2, v3, v4) = (1, α, α2, 1), each variable node is adjacent to

exactly 3 satisfied (light square) and 1 unsatisfied (dark square) check nodes. Therefore, the

two conditions for the absorbing set are satisfied. As the number of iterations progresses,

the decoder can converge to these absorbing sets and remain stuck, causing errors in the

output (Figure 3.10).
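For concreteness, a small Python sketch of this membership test is given below. The data layout and function names are ours; GF(8) multiplication uses the primitive polynomial p(x) = x^3 + x + 1 of Figure 3.9, and variable nodes outside the candidate set are assumed to take the value zero.

```python
def gf8_mul(a, b):
    """Multiply two GF(8) elements in the polynomial basis, p(x) = x^3 + x + 1."""
    r = 0
    for _ in range(3):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:        # reduce modulo the primitive polynomial
            a ^= 0b1011
    return r

def check_absorbing_set(checks, values):
    """Test the two (a, b) non-binary absorbing-set conditions.
    `checks`: {check_index: [(variable_index, gf_coefficient), ...]} restricted to
    edges touching the candidate variable set; `values`: {variable_index: GF(8)
    symbol}.  Returns (conditions_satisfied, number_of_unsatisfied_checks)."""
    unsat = set()
    for c, edges in checks.items():
        syndrome = 0
        for v, coeff in edges:
            syndrome ^= gf8_mul(coeff, values.get(v, 0))
        if syndrome != 0:
            unsat.add(c)
    for v in values:
        adj = [c for c, edges in checks.items() if any(v == vv for vv, _ in edges)]
        n_unsat = sum(1 for c in adj if c in unsat)
        if len(adj) - n_unsat <= n_unsat:   # condition 2: more satisfied than unsatisfied
            return False, len(unsat)
    return True, len(unsat)                 # condition 1: b = number of unsatisfied checks
```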

In our simulations, a large majority of the errors in the set X are non-binary absorbing

set errors. In this case, both the Min-Max and Pruned Min-Max decoders converge to

some non-binary absorbing set before they reach their maximum number of iterations. In

the set Y , the Min-Max decoder converges to an absorbing set error before reaching the

maximum number of iterations, whereas the Pruned Min-Max algorithm does not. Finally,

in the majority of the cases in set Z, the Min-Max decoder outputs the correct codeword,

whereas the Pruned Min-Max algorithm does not converge. For the sets Y and Z, we

observe that the outputs of the Pruned Min-Max algorithm are close to the outputs of the

Min-Max algorithm, and validate through simulation that increasing the maximum number

of iterations for the Pruned Min-Max algorithm, for the set of inputs causing Y and Z,


Figure 3.9: A non-binary (4, 4) absorbing set over GF(8) based on the primitive polynomial

p(x) = x3+x+1 whose root is α. Circles indicate variable nodes, and squares indicate check

nodes.

Table 3.1: Error Profile Comparison (FER ≈ 10⁻⁵)

              GF(4)    GF(8)    GF(16)   GF(32)
Ratio of X    0.856    0.793    0.678    0.625
Ratio of Y    0.103    0.119    0.213    0.292
Ratio of Z    0.041    0.088    0.108    0.083

results in convergence to the correct codeword or an absorbing set error. This observation

suggests that the simplification of the Pruned Min-Max decoding results in a slightly slower

convergence of the decoder. This is further validated by observing the decoding evolution, or

the number of variable nodes in error as a function of the iteration index (Figure 3.10). As

can be seen, the Pruned Min-Max algorithm has at most a few more symbols in error as the

number of iterations progresses. For this particular error example, both decoders converge to

a (6, 4) absorbing set after a large enough number of iterations, but the Min-Max algorithm

is slightly faster. Thus, if the maximum number of iterations allowed was 25, then this error

would fall in set Y , whereas if the maximum number of iterations was 27 or larger, this error

would be in set X.

We now observe the relative sizes of the sets X, Y , and Z (Table 3.1). Our error profile

analysis shows that for the codes simulated, a large majority of channel realizations which


Figure 3.10: Decoding evolution for Min-Max and Pruned Min-Max decoders for one

channel realization, for one particular error simulated with the (3, 6) code in GF(8). Both

decoders converge to a (6, 4) absorbing set.

cause decoding errors in either algorithm are common between both decoding algorithms. In

other words, the set X ∪ Y is a significant portion of the set X ∪ Y ∪Z. We further observe

that of these channel realizations that cause erroneous output in both decoding algorithms,

a large portion of them result in the same decoding errors (X is a large portion of X ∪ Y ).

Therefore, not only are the errors within each set X, Y , and Z similar, but also the relative

sizes of the sets are indicative of similar behavior between the two algorithms.

We have thoroughly analyzed the simulation results to find not only that the coding gain performances are similar, but also that the behavior of the two decoding algorithms in terms of the error profiles is very similar. Therefore, the Pruned Min-Max algorithm is

a viable alternative decoding algorithm to the Min-Max algorithm in terms of its decoding

performance.


Figure 3.11: CNU architecture implementing the Pruned Min-Max algorithm, for dc = 4.

3.2.3 Cost Analysis of the Pruned Min-Max Algorithm

The proposed simplifications leading to the Pruned Min-Max algorithm have been ap-

proached from the perspective of conceptually simplifying equation (2.23), but the actual

cost savings or loss due to the proposed algorithm still need to be evaluated.

The overall architecture considered will be the fully parallel architecture, as before. Fur-

thermore, since the proposed simplification is in the check node computation, the VNU will

also remain the same. To analyze the implementation of the Pruned Min-Max algorithm,

the CNU substructure is adjusted so that the proposed computations take place (Figure

3.11). The FOM for the implementation is estimated and plotted with the FOM estimations

for the original Min-Max algorithm (Figure 3.12). The FOM analysis, in conjunction with

error-rate simulations (Figure 3.13), reveals the exact benefits of the proposed algorithm.

One possible design choice in the given example of (2, 4) codes would be to implement the Pruned Min-Max algorithm in GF(4), which yields a 2 times improvement in the

FOM, without any loss in the coding gain. Alternatively, the Pruned Min-Max algorithm

will allow a decoder in GF(8) to be implemented for almost the same cost as a Min-Max

decoder in GF(4), and yields a 1 dB performance improvement at FER = 10⁻⁶. This type

of informed exploration of the design space of NB-LDPC decoders is made possible due to


Figure 3.12: FOM comparison between the Min-Max and Pruned Min-Max architectures

across Galois field orders, for (2,4) codes.

Figure 3.13: FER simulation results for (2, 4) codes in GF(4) and GF(8) of length 2520

bits.


the proposed modeling approach, whereas asymptotic bound analysis of decoding algorithm

complexities would reveal at most the scaling behavior of each algorithm.

To gain confidence in the accuracy of the modeling methodology, we implement the Pruned

Min-Max algorithm for a variety of field orders and code lengths. We generate RTL through

scripting (the details of this procedure will be described later in Chapter 5), which reads

in a file containing the parity-check matrix and outputs the necessary Verilog files which

completely describe the implemented architecture specific to the input code. The generated

RTL is synthesized to obtain a gate-level description in a 65nm technology. The total area

is divided by the area of a NAND gate in this technology to arrive at an equivalent number

of NAND gates for the design. With the synthesized area estimates for a variety of codes,

the FOM can also be estimated with high accuracy (Figure 3.12). This strongly validates

our modeling methodology and thus allows our simple modeling approach to be utilized for

NB-LDPC decoder design space exploration.


CHAPTER 4

Logarithmic Quantization Scheme for the Min-Max

Algorithm

4.1 Prior Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2.1 Derivation of Computational Complexity of the Min-Max Algorithm . . 46

4.2.2 Routing Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.3 Quantization Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.4 The Logarithmic Quantization Scheme . . . . . . . . . . . . . . . . . . . 52

4.4.1 The Proposed Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.4.2 Error Rate Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.4.3 Estimated Computational Complexity . . . . . . . . . . . . . . . . . . . 54


A very important consideration in the hardware implementation of any DSP algorithm

is the finite wordlength and its effects on the algorithm performance. In the case of LDPC

decoders, fixed-point quantization is used and the number of bits is often highly limited.

This is due to the nature of LDPC decoding. Many of the commonly used decoding algo-

rithms can be broken down into fairly simple operations, such as summation, maximum,

minimum, and so on. The inherent complexity in decoding stems from the sheer number of

these operations that must be performed. Therefore, while there is not much room for sim-

plification of the operations themselves, quantization can make quite a significant impact in

terms of overall cost of implementation. However, designers may fall into certain traps when

deciding on the bit width used, negatively impacting the decoder performance. This chapter

will discuss and investigate the impact of quantization on the error profile. In addition, we

will propose what we call the logarithmic quantization scheme which does not significantly

degrade the coding gain but greatly alleviates the hardware implementation cost.

The contents of this chapter are mostly published in [33].

4.1 Prior Art

Two important considerations in practical ASIC implementations of NB-LDPC decoders are

the wordlength, or the number of bits used to represent a number, and correspondingly the

quantization scheme, or how these bits are used to represent what numbers. The wordlength

has an obvious direct impact on the hardware implementation cost; not only are the com-

putation costs a function of the wordlength, but also the signal routing overhead, notorious

for LDPC decoders, can change with the wordlength. On the other hand, the quantization

scheme affects the error rates achieved by the implemented decoder. Therefore, the choices

of these design parameters are of utmost importance in the design of hardware implemen-

tations of these decoders, because they affect both the coding gain and the implementation

costs. In the case of binary LDPC code decoders, the random-like interconnect connecting

the nodes is known to be a bottleneck in hardware implementations, whereas the compu-

tational complexity is relatively low [11]. Therefore, wordlength reduction solutions such as


the “Split-Row Threshold” algorithm [13] successfully reduce the hardware cost by improv-

ing the logic utilization. However in the case of NB-LDPC decoders which naturally have a

higher logic utilization [29], the check node computations are of primary concern in terms of

attempting to reduce the hardware complexity, although the interconnect of course should

not be disregarded.

A popular and straightforward method for selecting the wordlength and quantization

parameters in published NB-LDPC decoder implementations [19,22,34] is to choose a quan-

tization scheme with minimum wordlength that does not degrade the frame error rate (FER)

in simulations. This has often lead to a solution of five (or more) quantization bits, with three

integer and two fractional bits being particularly popular [19, 22, 34]. However, the imple-

mentation solutions remain costly in area, and thus a more aggressive wordlength reduction

is desirable. Meanwhile, the minimum FER in simulation in these works for determining the

wordlength is limited to ∼ 10−5, which seems rather simplistic. In fact, it is known in the

case of binary LDPC code decoders that the quantization scheme can be a source of error

floors [35,36]. Thus, previously published NB-LDPC code decoder implementations are not

only too costly in silicon area to be practical, but also potentially vulnerable to error floor

regions, due to the limited number of bits reserved for integer representation. While sophis-

ticated non-uniform quantization schemes have been proposed to mitigate the rise of error

floors in binary LDPC codes [35, 36], the effect of complicating the quantization scheme on

the decoder implementation complexity has been ignored. This is potentially an even more

significant problem for NB-LDPC code decoders of higher field orders, and traditionally,

uniform quantization schemes have been favored for their simplicity [15]. Interestingly, a

unique aspect of the Min-Max decoding algorithm for NB-LDPC codes is that the noto-

riously costly check node computations require only comparison operations. This insight

instigates the search for a monotonic (but not necessarily uniform) quantization scheme that

not only performs well in the decoding sense, but also reduces the overall hardware cost

by maintaining enough simplicity so as to not incur a large cost overhead for implementing

other arithmetic where necessary (such as variable nodes).


4.2 Computational Complexity

To gain a sense of the hardware implementation cost, we analyze the computational com-

plexity per iteration of the Min-Max algorithm. Our unit of measurement will be “operations

per bit,” or OP/b, where an operation is a 2-input addition (subtraction) or comparison.

For example, the computational complexity to add two b-bit numbers together would be b

OP/b (we assume that adders saturate), and the complexity to find the minimum of n b-bit

numbers would be (n − 1)b OP/b. Roughly speaking, 1 OP/b corresponds to a full-adder

cell and a D flip flop, because those cells would correspond to the cost to implement a single

bit addition. The costs of b-bit additions and b-bit comparisons are confirmed to be simi-

lar through synthesis estimates. Because we would like to use this complexity measure to

compare the implementation costs of codes with various q, we will normalize the total cost,

Ctot, by the length of the codeword in bits, B. Our final result, Ctot/B, can be interpreted

as the computational complexity required per iteration to process and decode one bit of the

output codeword.

4.2.1 Derivation of Computational Complexity of the Min-Max Algorithm

For a decoder of an NB-LDPC code defined by an N ×M parity check matrix, N , M , dv,

and dc are related by the total number of edges E in the Tanner graph:

\[
E = N d_v = M d_c. \tag{4.1}
\]

Because each GF(q) symbol contains log2 q bits:

\[
B = N \log_2 q. \tag{4.2}
\]

Let bv and bc be the wordlengths of numbers in the variable node computations and check

node computations, respectively. If Cv is the number of operations in a single variable


node computation, then the computational complexity of a single variable node is Cvbv.

Similarly, the computational complexity of a single check node is Ccbc. Therefore, the total

computational complexity in a single iteration of decoding is:

\[
C_{\mathrm{tot}} = N C_v b_v + M C_c b_c. \tag{4.3}
\]

Thus, from Equations (4.1), (4.2), and (4.3), we can derive the total computational com-

plexity per codeword bit per iteration:

\[
\frac{C_{\mathrm{tot}}}{B} = \frac{1}{\log_2 q} \left( C_v b_v + \frac{d_v}{d_c}\, C_c b_c \right). \tag{4.4}
\]

Now we are tasked to find Cv and Cc, which can be found from the respective equations

that define the variable and check node computations. A variable node computation is

defined by equations (2.24), (2.25), and (2.26). The total number of two-input operations

required to compute all Qm,n(a) and Qn(a) based on these equations is:

\[
C_v = q d_v^2 + q d_v - d_v. \tag{4.5}
\]

Similarly, a check node computation with the forward-backward calculations is defined by

equations (2.27)-(2.33), which form a series of 3(dc − 2) minimum-of-maximum computations

(Figure 2.4). Each minimum-of-maximum computation finds the larger value in q pairs of

numbers, then finds the minimum value out of those q numbers. The total number of two-

input operations required to compute all Rm,n(a) is:

\[
C_c = 3\,(d_c - 2)\left(q\,(q + (q - 1))\right). \tag{4.6}
\]

Equations (4.4), (4.5), and (4.6) give us the computational complexity of implementing the

Min-Max algorithm, as a function of the parameters q, dv, dc, bv, and bc. We see that bv


and bc appear only in equation (4.4) and thus directly impact the computational complexity.

It is important to note that if Cv or Cc change due to some algorithmic modifications, the

overall complexity Ctot will change but the impact of the wordlengths still remains as is in

equation (4.4). For example, if bv and bc are both reduced by a factor of two, then Ctot will

also be reduced by the same factor, regardless of the specific algorithm.
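As a quick way to evaluate these expressions for a candidate code, the following Python sketch implements equations (4.4)-(4.6) directly; the function name and the example parameters are ours.

```python
import math

def complexity_per_bit(q, dv, dc, bv, bc):
    """Min-Max computational complexity per codeword bit per iteration (OP/b),
    following Equations (4.4)-(4.6)."""
    Cv = q * dv**2 + q * dv - dv                  # (4.5): one variable node
    Cc = 3 * (dc - 2) * (q * (q + (q - 1)))       # (4.6): one check node
    return (Cv * bv + (dv / dc) * Cc * bc) / math.log2(q)   # (4.4)

# Example: a (3, 6) code over GF(16) with 5-bit messages at both node types
print(complexity_per_bit(q=16, dv=3, dc=6, bv=5, bc=5))
```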

4.2.2 Routing Overhead

Although not part of the computational complexity of the defining equations, the routing

of signals is a significant overhead in the implementation of these decoders and thus must

be considered. Although the logic utilization impact of the routing is difficult to estimate,

quantified comparisons can be made by counting the required number of wires for the in-

terconnect. As described in Equation (4.1), there are E output messages from all variable

nodes, and E output messages from all check nodes. Each message consists of q LLRs, each

of which are represented with either bv or bc bits. Therefore, in a fully parallel architecture,

the total number of wires W required for these connections is:

\[
W = qE\,(b_v + b_c). \tag{4.7}
\]

Therefore, the routing interconnect is also directly impacted by bv and bc. Architectural

modifications, for example to serialize the node computations, or algorithmic modifications,

for example to maintain only the nm < q most important LLRs in each message, will change

the total number W but a large portion of W will still be linearly related to (bv + bc).

4.3 Quantization Effects

We will now study the impact of the wordlength and quantization scheme on the performance

of NB-LDPC decoders, particularly in the error-floor region. Through a discussion of the error

profile, we will justify that the maximum representable number of messages is the main


Figure 4.1: FER curves for GF(16), (dv, dc) = (3, 6), (378, 189) code, for selected quanti-

zation schemes, and K = 20.

quantization design parameter which determines the performance of quantized NB-LDPC

decoders in the error-floor region. In particular, we observe that most of the so-called “non-

absorbing set errors” and “oscillation errors” are corrected as we increase the maximum

representable number of the messages in our quantization scheme. This reduced number

of errors results in a better error floor performance for quantizations with large maximum

representable numbers.

In this manuscript, a fixed-point quantization scheme with b total bits and f fractional

bits is denoted as “ubdf” (u for uniform). It is noted that f need not be positive, and a

“negative” number of fractional bits can be used to increase the maximum representable

number at the cost of precision.

The FER curves for various wordlengths and quantization schemes are simulated for a

GF(16), (dv, dc) = (3, 6), (378, 189) code over the binary-input additive white Gaussian noise (BI-

AWGN) channel (Figure 4.1). The u8d2 scheme employs a large number of bits and has good

precision, and is plotted as a point of reference. We first observe that some quantization


schemes introduce an early error-floor region. In fact, the u3d0 and the u5d2 schemes,

which have a similar maximum representable number (≈ 7), have similar error-floor regions.

Furthermore, these curves cross over with the curve for the u3d(-1) scheme, which has a

larger maximum representable number (= 14) at SNR≈ 3.6dB. Thus, even though the u3d(-

1) scheme suffers from a performance degradation in the waterfall region, this scheme is

a better choice for implementation than are u3d0 or u5d2, assuming NB-LDPC codes are

employed in low-FER applications. A similar observation is that the u4d0 and u4d(-1) schemes, which have the same number of bits, also have a crossover point at SNR ≈ 4.4

dB. Again, the limited maximum representable number of the u4d0 scheme introduces an

error floor region which is not observed for quantization schemes with larger maximum

representable numbers.

Additionally, we observe that quantization schemes with similar precisions (u3d0 and

u4d0, for example) initially have overlapping waterfall curves, until their limited maximum

representable number yields their respective error floor regions. This phenomenon of the pre-

cision controlling the performance in the waterfall region and the maximum representable

number determining the location of the error floor is observed in simulations for codes with

varying field orders and rates. Therefore, we empirically conclude that the precision deter-

mines the performance in the waterfall region, whereas the maximum representable number

determines the location of the error floor regime.

To gain insight into the causes of different error curve shapes, the error profile of selected

schemes at SNR = 4.4 dB is observed for the GF(16), (dv, dc) = (3, 6), (378, 189) code (Table

4.1). We categorize the errors at the output of the Min-Max decoder into the following

classes:

Fixed-point errors (not to be confused with fixed-point number representation):

· Absorbing-set (AS) errors : the decoder converges to an absorbing set [32].

· Non-absorbing-set (NAS) errors : the decoder converges to a subset of variable nodes

which does not satisfy absorbing set conditions.


Non-fixed-point errors :

· Oscillating (OS) errors : the output of the decoder oscillates between two errors.

· Non-converging (NC) errors : the decoder does not converge to any specific object.

The distributions of error types for each quantization scheme can be attributed to the

maximum representable numbers, which affect the occurrence of message saturation. For

example, a low maximum leads to messages consisting of many LLRs that equal the max-

imum after a small number of iterations. This causes the more frequent occurrence of OS

and NAS errors in quantization schemes such as u3d0 and u5d2. Because NB-LDPC codes

will target applications requiring very low error rates, it is imperative for the quantization

scheme in ASIC implementations to have the ability to represent large numbers in order to

avoid message saturation that causes the aforementioned errors. Therefore, for a given num-

ber of uniformly quantized bits, it is better to increase the maximum representable number

by trading off precision so that the waterfall curve may shift but no error floor arises, rather

than to increase the precision so as to match the waterfall curve to ideal values in high FER

regions but generate an avoidable error floor region.

The message saturation phenomenon can also be observed in the error evolution, or the

number of symbol errors as a function of the iteration count. This error evolution is plotted

for the u3d0 and u8d2 quantization schemes (Figure 4.2). We observe that for u3d0, the

errors of the decoder are generally stable after a few iterations. Therefore, increasing the

maximum number of iterations does not improve the performance. On the other hand, many

errors for u8d2 are not stabilized (i.e., not converged to a fixed point) before the decoder

reaches its maximum number of iterations. As a result, increasing the maximum number

of iterations would improve the performance of the decoder since it enables the decoder to

converge to a fixed point (most probably the correct codeword).


Figure 4.2: Plot of number of symbol errors vs. iteration count.

4.4 The Logarithmic Quantization Scheme

The need for cost-effective hardware implementation drives us to reduce the wordlengths of

LLRs, but we have also observed the negative impact of limited wordlengths, especially of

the maximum representable number. To achieve a cost-effective hardware implementation

without paying a severe penalty in the coding gain, we propose a logarithmic quantization

scheme which maintains a large dynamic range for even a short wordlength.

4.4.1 The Proposed Scheme

We propose a quantization rule as follows: for a b-bit scheme, all-zeros represents the number

0, and the other numbers are successive powers of 2. Thus, the number X (interpreted as an

unsigned integer) with b bits represents the number Y , which are related by the following:

\[
Y =
\begin{cases}
0 & X = 0 \\
2^{(X - 1 - f)} & X \neq 0,
\end{cases} \tag{4.8}
\]


where f is a factor which allows the control of the smallest and largest numbers representable.

We will denote our proposed logarithmic quantization scheme with b total bits and the factor

f as “lbdf”. The uniform and logarithmic quantization schemes, as well as some illustrative

examples, are summarized in Table 4.2.
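A one-line software model of the mapping in equation (4.8) is sketched below (the function name is ours); it reproduces the l3d1 and l3d0 columns of Table 4.2.

```python
def lbdf_value(X, f):
    """Value represented by code X in the proposed logarithmic 'lbdf' scheme:
    the all-zeros code is 0, every other code is a power of two offset by f."""
    return 0 if X == 0 else 2.0 ** (X - 1 - f)

for X in range(8):                                   # b = 3 bits
    print(X, lbdf_value(X, f=1), lbdf_value(X, f=0))  # l3d1 and l3d0 columns
```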

The logarithmic quantization scheme enhances the dynamic range dramatically while

maintaining the ability to represent small numbers, at the cost of increasing the maximum

“rounding” error for numbers of large magnitude. In the context of the Min-Max decoding

algorithm, this tradeoff makes sense because smaller LLRs represent more likely symbols

and the small differences may make a difference in the final hard decision of the algorithm,

whereas larger LLRs represent less likely symbols and small differences in their likelihoods

may not affect the outcome of the decoding. We will observe simulation results to validate

this intuition.

4.4.2 Error Rate Comparison

Software simulations allow the observation of error rates only to a certain level, beyond which

the simulation times become impractical. However, limited maximum representable numbers

give rise to an error-floor region, which may be beyond the FERs observed. Therefore, it

is only fair to compare error rate curves between quantization schemes which have similar

maximum representable numbers, so that the error-floor regions are matched and coding

gains can be compared in the waterfall region.

The FER curves for various quantization schemes with a matched maximum representable

number of ≈64 are simulated and plotted, for codes of selected field orders and rates (Figures

4.3, 4.4). In our observed cases, a logarithmic quantization scheme with three bits is enough

to closely follow the performance of that of a five-bit uniform quantization scheme, with

a similar maximum representable number. Also, with the traditional uniform quantization

scheme, reducing b and adjusting f to increase the maximum representable number quickly

deteriorates the waterfall curve due to the lost precision. In addition, we observe the error

profile of our proposed logarithmic quantization scheme for the GF(16), (dv, dc) = (3, 6),


Figure 4.3: FER curves for GF(16), (dv, dc) = (3, 6), (378, 189) code, for various quantiza-

tion schemes with matched maximum representable number.

(378, 189) code (Table 4.1). The types of errors that appear for the proposed scheme are

similar to that of uniform quantization schemes with more bits and larger dynamic range (for

example, compare l3d0 with u8d2). Therefore, our logarithmic quantization scheme allows

for a more aggressive wordlength reduction than with a uniform scheme, for a multitude of

field orders and rates.

4.4.3 Estimated Computational Complexity

Although the logarithmic quantization scheme achieves a larger dynamic range for the same

number of bits, numerical operations with a non-uniform quantization are not straightfor-

ward to implement. For example, full-adder cells, which are generally highly optimized, can

be cascaded for a cheap implementation of an addition of uniformly quantized numbers.

However, it is not as trivial to implement additions in a non-uniform quantization scheme,

for example as seen with floating-point [37] arithmetic units which are cumbersome and ex-

pensive. Therefore, we will apply the computational complexity analysis of Section 4.2 to

quantify the cost associated with utilizing our proposed quantization scheme.


Figure 4.4: FER curves for GF(32), (dv, dc) = (3, 12), (300, 75) code, for various quantiza-

tion schemes with matched maximum representable number.

Our primary concern in the computational complexity is the check-node computations,

which usually dominate the complexity. However, an important aspect of the Min-Max de-

coding algorithm is that the only operations required in the check nodes are comparisons.

Therefore, as long as the quantization scheme of choice is monotonic, the check-node com-

putations can be implemented with the same logic as the uniform-quantization operations.

Because our logarithmic quantization scheme follows this, any wordlength reduction achieved

will directly reduce the computational complexity of the check-node computations.

On the other hand, variable-node computations contain sum operations. Therefore, at

the variable nodes, we convert the incoming messages into a uniformly quantized number,

and convert the outgoing messages into a logarithmically quantized number. Because the

numbers represented by the logarithmic quantization are powers of 2 (or zero), their repre-

sentation in the uniform domain is a one-hot number (or all zeros). Therefore, conversion

into the uniform domain is implemented with a simple binary-to-one-hot converter. Conver-

sion to the logarithmic domain is conducted by outputting the index of the most significant

bit that equals 1, and adding 1 to the output if the next bit also equals 1 (this implements


Figure 4.5: Normalized computational complexity for Min-Max algorithm implementation

with uniform and logarithmic quantization schemes, for (dv, dc) = (3, 6).

rounding). Also, we will maintain the input information from the channel to be uniformly

quantized. Thus, the variable node computations can be implemented by inserting these

converters that convert numbers between the lbdf and u(2^b − 1)df schemes. Equation (4.4) is

updated to include the cost of converters:

\[
\frac{C_{\mathrm{tot}}}{B} = \frac{1}{\log_2 q} \left( C_v b_v + C_{\mathrm{conv}} + \frac{d_v}{d_c}\, C_c b_c \right), \tag{4.9}
\]

where Cconv is the cost of quantization scheme conversion, and bv = 2^{bc} − 1. Because the

conversions are not operations closely related to additions, the logic for both converters is

implemented in Verilog and Cconv is estimated by comparing the synthesis area estimates

with that of adders targeted for the same clock frequency.
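A minimal software model of the two converter blocks is sketched below, assuming f = 0 and b = 3 as defaults; the function names are ours, and the hardware versions are small combinational circuits rather than these loops.

```python
def log_to_uniform(X, f=0):
    """Logarithmic-to-uniform conversion: the represented value is a power of two
    (or zero), so the uniform-domain word is a one-hot pattern (or all zeros)."""
    return 0 if X == 0 else 1 << (X - 1 - f)     # assumes f chosen so the shift is non-negative

def uniform_to_log(u, b=3, f=0):
    """Uniform-to-logarithmic conversion with rounding, as described above: output
    the index of the most significant 1 bit, plus 1 if the next lower bit is also 1."""
    if u == 0:
        return 0
    msb = u.bit_length() - 1
    X = msb + 1 + f
    if msb > 0 and (u >> (msb - 1)) & 1:
        X += 1                                   # round up toward the next power of two
    return min(X, (1 << b) - 1)                  # saturate at the largest code
```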

The computational complexities of uniform and logarithmic quantization schemes, as

a function of the Galois field order q, are normalized to the complexity of the l3 scheme

and compared for (dv, dc) = (3, 6) codes (Figure 4.5). We observe that the computational


complexity of the l3 scheme is less than that of the u4 scheme beyond GF(8), and the l3

scheme complexity approaches that of the u3 scheme as GF(q) increases. This is because

the complexity of the check node computation begins to dominate, but the logarithmic

scheme allows us to save on the complexity in the check nodes in exchange for added complexity in the variable nodes. The savings by moving from a u5 scheme to the l3 scheme can

be calculated for a variety of GF(q) and (dv, dc), and is summarized in Table 4.3. For

our running example of a GF(16), (3, 6) code, employing the l3d0 quantization scheme

allows us to maintain a similar error correction curve as the u5d(-1) scheme, but reduce the

computational complexity by 32.0%. Similarly for the GF(32), (3, 12) code investigated in

Figure 4.4, the computational complexity reduction by moving from u5d(-1) to l3d0 is 36.7%.

As GF(q) increases, the overheads associated with implementing the proposed logarithmic

scheme diminish relative to the check node computational complexity, and the savings by

moving from u5 to l3 approach 40%. For any specific code, simulations comparing the coding

gains of uniform and logarithmic schemes can be used in conjunction with the computational

complexity analysis to calculate the savings achieved by utilizing the logarithmic scheme.

As for the interconnect, the expression for the number of wires (Equation (4.7)) remains

the same. Thus, a change in the wordlength will still directly impact the routing overhead.

However, in the logarithmic quantization scheme, the conversion logic can be placed at the

input and output of variable nodes, thus allowing all of the routing to be conducted with

the reduced wordlength. For example, for an l3 scheme, both bv = bc = 3 in Equation

(4.7). Therefore, the change in the number of wires is directly proportional to the change

in wordlength of the utilized scheme. For example, relative to a u5 scheme, the l3 scheme

reduces the number of wires by 40%. Thus, the logarithmic quantization scheme is effective

in reducing both the total computational complexity, especially for higher field orders, as

well as the routing congestion.

We have identified that a quantization scheme which limits the maximum representable

number causes particular types of errors to appear more often and thus causes the error

floor region to rise. We have proposed a logarithmic quantization scheme that, when ap-

plied to ASIC implementations of Min-Max decoders, allows for a reduced wordlength while


maintaining a large dynamic range. These qualities result in the decoder maintaining good

coding gain and a suppressed error floor, as well as reduced total computational complex-

ity especially for higher field orders. While it is easy to overlook the significance of the

quantization scheme, the combined benefits of our proposed solution substantially ease the

complexity-performance trade-off, notorious for NB-LDPC decoders.


Table 4.1: Error Profile of Various Quantization Schemes

Scheme     NC    NAS   OS    size-4 AS   size-5 AS   size-6 AS   size-7 AS   size-8 AS
u3d0        0     27    65       2           0           6          13           5
u3d(-1)     0     10    36       2           2          13          28          14
u4d0       15     16    26      11           7          13          22           8
u4d(-1)    23      8    11       3           7          10          10           6
u5d2        1     16    50       1           2           5           7           6
u8d2       46      0     7       3           8          11          12           8
l3d1       38      0     6       9          13          14          19          10
l3d0       41      4     5       9          13           6           7           8

Table 4.2: Quantization Scheme Examples

X                    ubdf              u5d2    u4d(-1)   lbdf            l3d1   l3d0
0 (0...000b)         0                 0       0         0               0      0
1 (0...001b)         2^-f              0.25    2         2^-f            0.5    1
2 (0...010b)         2·2^-f            0.5     4         2^(1-f)         1      2
3 (0...011b)         3·2^-f            0.75    6         2^(2-f)         2      4
...
2^b−2 (1...110b)     (2^b−2)·2^-f      7.5     28        2^(2^b−3−f)     16     32
2^b−1 (1...111b)     (2^b−1)·2^-f      7.75    30        2^(2^b−2−f)     32     64

Table 4.3: Percent Savings in Computational Complexity

(dv, dc)    GF(4)    GF(8)    GF(16)   GF(32)   GF(64)
(3, 6)      12.3%    24.8%    32.0%    35.9%    37.9%
(3, 12)     16.6%    27.5%    33.5%    36.7%    38.3%
(3, 24)     18.3%    28.5%    34.1%    37.0%    38.5%


CHAPTER 5

Implementation of FPGA Platform for Code

Performance Evaluation

5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.3.1 Hardware Resource Utilization on FPGA . . . . . . . . . . . . . . . . . 68

5.3.2 Frame Error Rate Simulations . . . . . . . . . . . . . . . . . . . . . . . 72


Simulation of NB-LDPC decoding is an essential research tool commonly utilized to eval-

uate the performance of NB-LDPC codes in terms of the error rate. Software implementations of decoding algorithms are employed to observe and validate the performance of many design

parameter choices, such as the parity-check matrix construction methodology, decoding al-

gorithm modifications, and so on. However, there is a limit to the error rate level which can

be observed in such simulations with software implementations, because beyond a certain

point, simulations simply take much too long even when a large amount of computing power

is utilized for the problem. For example, to observe a frame error rate of 10⁻⁵, approximately 10⁷ frames should be simulated. This is because to claim any frame error rate with statis-

tical significance, at least 100 frame errors (as a rule of thumb) at that signal-to-noise ratio

should be observed. Although exact simulation times vary with computing power as well as

simulation parameters, simulation times often become too long beyond frame error rates of

around 10−5 or 10−6, as is often observed as the limit of software simulations in published

works. Therefore, there is a strong demand for hardware acceleration of NB-LDPC decoding

simulations. However, not only is a hardware implementation of NB-LDPC decoding costly

(in hardware and engineer resources), but also maintaining enough flexibility to allow for

multiple tuning knobs in the simulation framework is quite tricky. In this chapter we discuss

the details of our implementation of a flexible FPGA platform for NB-LDPC decoding sim-

ulation acceleration, enabling the evaluation of code performance at lower frame error rates

relative to software solutions.

5.1 Architecture

The hardware architecture for the decoder comes hand in hand with the code construction

and parity-check matrix structure. In our platform, we would like to achieve as much flexi-

bility as possible to be able to simulate a variety of codes, without the architecture becoming

too inefficient or the design effort becoming too difficult.

The choice to restrict the possible codes to quasi-cyclic codes [38, 39] comes naturally.

While there do exist other code construction methods that yield coding gain benefits (such as


Figure 5.1: Quasi-cyclic structure of parity-check matrix. The matrix consists of either

the identity matrix or a circulant matrix, which is a “rotated” identity matrix.

progressive edge growth [40]), most systems and standards (see for example [8], [7], [5]) utilize

quasi-cyclic codes because the structured nature allows for higher levels of parallelism in

hardware implementations (Figure 5.1). This still allows for the exploration of the protograph

as a design parameter.

The matrix consists of either the identity matrix or a circulant matrix, which is a “ro-

tated” identity matrix. What enables high parallelism is that in this construction, a group

of rows (columns) can be taken at a time without any of those rows (columns) sharing a

connection to the same column (row). In other words, the selected sub-matrix has row

weight (column weight) of 1. Thus, the parallelism can be as high as the size of the circulant

matrices, enabling an overall high throughput.
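The short sketch below (Python with NumPy; the helper names are ours) illustrates why this structure is hardware friendly: every sub-matrix is a rotated identity, so routing a block of messages across such an edge reduces to a cyclic (barrel) shift, which is exactly the interconnect used by the architecture described later in this section.

```python
import numpy as np

def circulant(size, shift):
    """size x size circulant sub-matrix: the identity matrix rotated by `shift`."""
    return np.roll(np.eye(size, dtype=int), shift, axis=1)

def barrel_shift(block_of_messages, shift):
    """Route one block of messages across a circulant edge: a cyclic rotation is
    the only interconnect the quasi-cyclic structure requires."""
    return np.roll(block_of_messages, shift, axis=0)
```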

In our platform, there is no restriction placed on the size of the circulant matrices,

although the level of parallelism is upper-bounded by this size. The level of parallelism,

however, can be set to a number smaller than the circulant size, so that we can ensure that our design will fit on a reasonably sized FPGA.


Another design choice is the field order (GF(q)) over which the code is defined. For this

exercise we restrict our field order to be at GF(8). This is for the following reasons:

(1) Restriction to a single field order greatly increases simplicity and ease of implementa-

tion.

(2) Realistically speaking, the choice of decoding algorithm is itself not independent of the field order. For example, the EMS algorithm [23] only works for high field

orders. This is because only a subset of each message is passed between nodes, which

requires computational overhead to find this subset (for example, sorters in hardware).

However, the field order must be large enough so that a subset suffices as the message.

For example, only keeping 32 out of 256 message LLRs may result in a significant

hardware reduction without a noticeable degradation in coding gain, because 32 out

of 256 is a small fraction but 32 is still a large number. However, keeping 4 out of 8

message LLRs will probably actually increase the hardware cost due to the overhead

required to find the 4 to keep, but also at the same time the coding gain might be

significantly degraded because 4 is a small number of LLRs to keep.

(3) GF(8) is a significant improvement over binary LDPC codes, yet the increase in hardware cost is still tolerable.

Although the fixed GF(8) may seem restrictive, we will see in Chapter 7 that this is actually

not a bad design choice, even in terms of the coding gain.

Our final design parameters are the node degrees. We will confine ourselves to regular

codes for simplicity. In general, variable node degrees (dv) are small, whereas check node

degrees (dc) are large. Furthermore, check node calculations employ forward-backward com-

putations (see Chapter 2), which induces a large latency due to data dependencies (thus,

the lack of ability to parallelize). As a result, variations in the check node degree are fairly

simple to account for by simply changing the control logic. However, changes in the variable

node degree require slightly more work, simply because dv is small to begin with and even a

slight change requires a significant change in the implementation. Therefore, we choose to


Figure 5.2: Partially parallel architecture for FPGA platform.

keep the variable node degree at dv = 3, while allowing the check node degree to vary (up

to 31). One method to allow for another variable node degree (dv = 4, for example) would have been to maintain a library of variable node computation units and instantiate the correct one as necessary. However, this was not pursued in the interest of time.

The overall architecture employs a partially parallel scheme (Figure 5.2), enabled by

the quasi-cyclic nature of the parity-check matrix. A fully parallel scheme would limit the

maximum length of the code implementable, due to the finite amount of resources on a

single FPGA. The variable node and check node units take turns reading from and writing

to their respective message memories. Barrel shifting and reverse barrel shifting suffice for

the edge connections, and the amounts by which the messages are barrel shifted are signals generated by the controller (not shown). The amount of parallelism is controllable (described

in the next section). The proposed logarithmic quantization scheme (Chapter 4) is simple to

implement, as all that is required are the conversion blocks between uniform and logarithmic

quantization domains at the input and output of the variable node unit (Figure 5.3). Thus,

the barrel shifts, message memories, and check node units are all reduced in complexity

directly by the reduction in bitwidth, at the cost of a slight overhead in the variable node


Figure 5.3: Partially parallel architecture for FPGA platform, with inclusion of logarithmic

quantization scheme.

unit for the conversion blocks. This allows us to utilize our FPGA platform not only for

FER simulations, but also to extract real hardware resource utilization benefits due to the

proposed logarithmic quantization scheme by observing FPGA synthesis results.

The variable node computations are implemented in a fairly straightforward manner (Figure

5.4). The top input is the channel LLR, whereas the other two inputs are check-to-variable

node messages, chosen appropriately to compute the correct output. The minimum and

subtraction blocks normalize the message vector so that the minimum value is equal to zero.

Because the variable node degree is three, this particular architecture takes three clock cycles

to compute the outputs of a single variable node.
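A behavioral Python model of this computation is given below (our own naming; the hardware version is serialized over the three clock cycles described above):

```python
def vnu_output(channel_llr, c2v_msgs):
    """Variable-node output message: add the channel LLR vector to the incoming
    check-to-variable messages selected for this output, then normalize the
    result so that its minimum entry (the hard-decision symbol) is zero."""
    q = len(channel_llr)
    s = [channel_llr[a] + sum(m[a] for m in c2v_msgs) for a in range(q)]
    offset = min(s)
    return [x - offset for x in s]
```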

The check node unit employs a forward-backward computation scheme (Figure 2.4).

Thus, the architecture computes output messages serially, and contains internal local memory

to store intermediate forward and backward messages (Figure 5.5). The core computations

in this block are contained in the “MM” blocks, which implement the Min-Max functionality

in a tree (with a pipeline depth of two to improve clock frequency). There are 8 MM blocks


Figure 5.4: Variable node implementation for FPGA platform.


Figure 5.5: Check node implementation for FPGA platform.


Figure 5.6: Automated RTL generation scheme for FPGA platform.

working in parallel, computing each of the LLR elements in the message corresponding to

each of the GF(8) elements. The Galois field permutations are hard-wired, costing nothing

in hardware. There are two memories in the local memory, one responsible for the forward

messages and the other responsible for the backward messages. In one pass, the forward and

backward messages are computed for a particular set of variable-to-check messages, while the

check-to-variable output messages are computed from the forward and backward messages

that are already stored in the local memory. Therefore, there are actually a pair of forward

and backward memories in the local storage working in a ping-pong fashion, where one pair

is used as input to compute the output messages, and the other pair is simultaneously used

as storage for the forward and backward messages being computed from the input. In the

next pass, the roles of these pairs of memories will switch.
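To make the forward-backward flow concrete, the following C++ sketch models the elementary Min-Max combining step performed by the "MM" blocks and the forward/backward recursion for one check node. It is a software model under simplifying assumptions: the hard-wired Galois field edge-weight permutations are omitted, and the data types are illustrative.

#include <algorithm>
#include <array>
#include <limits>
#include <vector>

constexpr int Q = 8;
using LlrVec = std::array<int, Q>;

// Elementary Min-Max combine: for each output symbol a, take the minimum over
// all GF(8) pairs (a1, a2) with a1 XOR a2 = a of max(A[a1], B[a2]).
LlrVec mm(const LlrVec& A, const LlrVec& B) {
    LlrVec out;
    out.fill(std::numeric_limits<int>::max());
    for (int a1 = 0; a1 < Q; ++a1)
        for (int a2 = 0; a2 < Q; ++a2) {
            int a = a1 ^ a2;   // GF(8) addition is bit-wise XOR
            out[a] = std::min(out[a], std::max(A[a1], B[a2]));
        }
    return out;
}

// Forward-backward schedule: c2v[i] combines every input except v2c[i].
std::vector<LlrVec> check_node(const std::vector<LlrVec>& v2c) {
    const int dc = static_cast<int>(v2c.size());
    std::vector<LlrVec> fwd(dc), bwd(dc), c2v(dc);
    fwd[0] = v2c[0];
    bwd[dc - 1] = v2c[dc - 1];
    for (int i = 1; i < dc; ++i)      fwd[i] = mm(fwd[i - 1], v2c[i]);
    for (int i = dc - 2; i >= 0; --i) bwd[i] = mm(bwd[i + 1], v2c[i]);
    c2v[0] = bwd[1];
    c2v[dc - 1] = fwd[dc - 2];
    for (int i = 1; i < dc - 1; ++i)  c2v[i] = mm(fwd[i - 1], bwd[i + 1]);
    return c2v;
}

In the FPGA architecture, the fwd and bwd arrays correspond to the forward and backward local memories, one pair of which is filled while the outputs are derived from the pair filled during the previous pass.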

5.2 Design Methodology

Accommodating all of the flexibility mentioned in the previous section in the hardware itself

is quite difficult. Therefore, we instead choose to have a flexible RTL generation scheme

(Figure 5.6).


The inputs to an FER simulation that directly impact the hardware are the signal

bitwidths (quantization scheme), and the parity-check matrix itself. The signal bitwidth

is taken care of by simply parameterizing the Verilog and changing the appropriate parame-

ter at the top level. Most of the RTL can be taken care of in this way, especially lower level

computational blocks such as the variable-node and check-node units. There are portions of

the RTL, however, that depend on the parity-check matrix and/or the amount of parallelism,

and these portions cannot be accommodated by a simple parametrization of the RTL. For

example, the module that instantiates all of the variable-node units must know how many

modules to instantiate. Also, the control logic that addresses the message memories must

know the locations of the non-zero elements within the parity-check matrix as well as the

amount of parallelism. Therefore, these modules (the top level modules, barrel shifters, and

control logic) are generated by a script that takes in the parallelism and the parity-check

matrix as input. These generated Verilog files, along with the parametrized Verilog files, are

taken together into synthesis to generate the bitstream to program the FPGA.

The SNR of the simulation changes the way the channel LLRs are initialized. These

initialization values (which also depend on the code rate) are also generated upfront, and

used by the firmware that controls the programmed FPGA to initialize the memories. The

firmware can be edited to also control parameters such as the clock frequency, maximum

number of iterations, maximum number of frames to simulate, and so on.
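For reference, one plausible way to precompute these channel LLR initialization values for a GF(8) code with BPSK signaling over an AWGN channel is sketched below in C++. The scaling convention, the symbol-to-bit mapping, and the final fixed-point quantization are assumptions made for illustration and are not necessarily what the firmware does.

#include <algorithm>
#include <array>
#include <cmath>

// Bit LLR for BPSK over AWGN: with code rate r, the noise variance is
// sigma^2 = 1 / (2 * r * Eb/N0), and the conventional LLR is 2*y / sigma^2.
double bit_llr(double y, double ebn0_linear, double rate) {
    double sigma2 = 1.0 / (2.0 * rate * ebn0_linear);
    return 2.0 * y / sigma2;
}

// Combine three bit LLRs into a GF(8) symbol LLR vector, normalized so that
// the minimum entry is zero (the "smaller is more likely" convention).
std::array<double, 8> symbol_llrs(const std::array<double, 3>& l) {
    std::array<double, 8> v{};
    for (int a = 0; a < 8; ++a)
        for (int b = 0; b < 3; ++b)
            if (a & (1 << b)) v[a] += l[b];   // penalty for each bit of a that is 1
    double m = *std::min_element(v.begin(), v.end());
    for (auto& x : v) x -= m;                 // normalize to minimum zero
    return v;
}

These floating-point values would then be rounded and saturated to the selected signal bitwidth before being written into the channel LLR memories.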

5.3 Results

In this section we discuss some of the results obtained from the implemented FPGA platform.

5.3.1 Hardware Resource Utilization on FPGA

Due to the flexibility provided by the automated RTL generation process, we can compare

the FPGA resource utilizations across varying design parameters. Of note, we are specifically

interested in the effect of the quantization scheme. Thus, we synthesize several designs across

Table 5.1: FPGA Synthesis Results, (3,24) code, L = 130, P = 65

Bitwidth | Logic Util. | Combinational ALUTs | Total Block Memory Bits | Max Clock Freq | Synthesis CPU Time (min)
3        | 41 %        | 127,021 (30 %)      | 4,838,928 (23 %)        | 87.2 MHz       | 147
4        | 48 %        | 150,775 (35 %)      | 5,931,536 (28 %)        | 74.8 MHz       | 349
5        | 58 %        | 180,788 (43 %)      | 6,857,744 (32 %)        | 71.0 MHz       | 508
6        | 67 %        | 206,960 (49 %)      | 7,783,952 (37 %)        | 64.9 MHz       | 521
3 log    | 45 %        | 155,893 (37 %)      | 8,399,120 (40 %)        | 87.9 MHz       | 171

varying codes and bitwidths and compare the synthesis results (Tables 5.1, 5.2, 5.3). In the

tables, L indicates the circulant size of the parity-check matrix and P indicates the amount

of parallelism.

It is noted that the "Total Block Memory Bits" column is not an accurate reflection of the

actual required amount of memory, due to a design procedure limitation. In FPGA synthesis,

block RAM modules are utilized for memory, which have modular address spaces and word

bitwidths. If the amount of parallelism increases, while the total number of required bits of

storage does not change, the number of RAM ports will increase, and thus the number of

bits per RAM block decreases. However, to facilitate the design of the FPGA platform, the

address space of the block RAM used in the design was not optimized for larger amounts

of parallelism, leading to an excessive reported memory usage. One exception is the last

row of Table 5.3, where the address space of the block RAM used for storing channel LLR

information was halved (to make the design fit).

The effects of the bitwidth within the realm of the conventional uniform quantization

scheme are fairly obvious; the logic utilization increases, the combinational ALUT utiliza-

tion increases, the required memory increases, the maximum clock frequency of operation

Table 5.2: FPGA Synthesis Results, (3,27) code, L = 114, P = 57

Bitwidth | Logic Util. | Combinational ALUTs | Total Block Memory Bits | Max Clock Freq | Synthesis CPU Time (min)
3        | 38 %        | 116,498 (27 %)      | 4,568,592 (22 %)        | 82.4 MHz       | 128
4        | 44 %        | 137,691 (32 %)      | 5,571,088 (26 %)        | 83.1 MHz       | 142
5        | 53 %        | 164,996 (39 %)      | 6,427,664 (30 %)        | 72.5 MHz       | 316
6        | 61 %        | 188,202 (44 %)      | 7,284,240 (34 %)        | 70.7 MHz       | 482
3 log    | 41 %        | 142,822 (34 %)      | 8,179,216 (39 %)        | 90.9 MHz       | 111

Table 5.3: FPGA Synthesis Results, (3,30) code, L = 104, P = 104

Bitwidth | Logic Util. | Combinational ALUTs | Total Block Memory Bits | Max Clock Freq | Synthesis CPU Time (min)
3        | 62 %        | 190,061 (45 %)      | 5,517,840 (26 %)        | 70.5 MHz       | 750
4        | 67 %        | 222,560 (52 %)      | 7,262,736 (34 %)        | 70.8 MHz       | 937
5        | 81 %        | 265,805 (63 %)      | 8,687,120 (41 %)        | 63.8 MHz       | 1425
6        | 94 %        | 310,621 (73 %)      | 10,025,616 (47 %)       | 45.4 MHz       | 7177
3 log    | 66 %        | 229,592 (54 %)      | 7,983,122* (38 %)       | 76.4 MHz       | 735


Figure 5.7: FPGA accelerated FER simulation for a GF(8), (3,27), (3078, 342) code and

comparison with software simulation results.

decreases, and the time it takes for the computer to synthesize the design increases, all as

the bitwidth is increased. Interestingly, a decrease in bitwidth from 4 to 3 sometimes does

not yield benefits (such as the maximum clock frequency in Tables 5.2, 5.3).

Additionally, the benefits of the proposed logarithmic scheme are pronounced. Logic

utilization is below that of 4 bit quantization, and combinational ALUT utilization is only

slightly larger. The maximum clock frequency and synthesis CPU time (the ease of meeting

timing) are on par with 3 bit quantization. The total memory usage is much larger because

of the design shortcoming mentioned previously, and also because while the check node and

message memories are in the 3 bit logarithmic quantization domain, the channel LLR storage

is in a 7 bit uniform quantization domain. With optimization of these memories, the total

memory requirement can be brought down significantly.


Figure 5.8: FPGA accelerated FER simulation for a GF(8), (3,24), (1584, 198) code and

comparison with software simulation results.

5.3.2 Frame Error Rate Simulations

The FPGA platform provides a speed-up in FER simulations. One such example is shown

in Figure 5.7. The FER curve derived from software simulations only allows for observation

of the FER down to the 10⁻⁴ ∼ 10⁻⁵ range, limited by the simulation speed. It is noted that

the C++ software is not run on a single CPU, but rather on a computing cluster with

many parallel cores (a single CPU simulation would thus be further limited by at least an

order of magnitude). The FPGA simulations greatly accelerate the FER simulations (the

curve is slightly offset from the software simulation due to the slight inaccuracy in LLR

calculations on hardware). In the same amount of real time, the FER curve can be extended

by approximately two orders of magnitude. This proves to be quite informative, as we know

that poor design choices can lead to early error floor regions (Chapter 4). In Figure 5.7, we

see an example of this phenomenon. However, the early error floor region occurs below the

observable FER level of software simulations. Therefore, it is likely that a very poor design

choice could have been made had we not realized this FPGA platform to observe low FERs.


Figure 5.8 shows another example of the importance of the quantization scheme choice. As the decimal point is shifted to the left, the error floor region rises, even with the logarithmic quantization scheme. Given the error floor levels, it is highly likely that only the poorest curve in Figure 5.8 would have been observable in software, again because of the speed limitation. With the FPGA platform, however, we are able to claim definitively that the effect of the quantization scheme on the error floor is gradual, as we had suspected.

Thus, the FPGA prototyping effort has proven effective not only in evaluating the effect of decoder parameter choices on hardware implementation cost, but also in accelerating FER simulations so that phenomena at lower FER levels can be observed.

We can conclude with higher confidence that the logarithmic quantization scheme is able

to suppress the hardware implementation cost effectively while maintaining superior coding

gain characteristics.


CHAPTER 6

Augmented Hard-Decision Based Decoding Algorithm

and Combination with Soft Decoding

6.1 Iterative Hard-Decision Based Majority Logic Decoding . . . . . . . . . . 76

6.1.1 Augmented IHRB-MLGD . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.1.2 Detection of Erasure Condition . . . . . . . . . . . . . . . . . . . . . . . 79

6.2 Software Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.3 Combination with the Min-Max Algorithm . . . . . . . . . . . . . . . . . 83


It is difficult to yield an efficient hardware implementation solely based on soft decoding

algorithms such as the Min-Max algorithm, because the severe complexity cannot be dimin-

ished too much before a non-negligible coding gain penalty arises. At the extreme, crude

decoding methods such as hard-decision based algorithms aimed at simplification generally

yield extremely poor coding gain and potentially very high error floors. What we would

like to achieve is the best of both worlds: good coding gain and low error floor of the soft

decoding algorithm, and the computational efficiency of the hard decoding algorithm.

The key observation we make is that in a high-SNR, low-FER regime, most of the time

a soft-decision decoding is unnecessary to arrive at the correct codeword. For example, if at

a particular SNR a soft decoder can achieve an FER of 10⁻⁹ and a hard decoder an FER of 10⁻², then although the hard decoder has terrible coding gain, the effort put into all the computations of the soft decoder was really only necessary about 1% of the time.

In fact, as long as the errors made by the hard decoder are detectable errors, then we can

initially let the crude decoder attempt to decode the received word, and only have the soft

decoder activated when the hard decoder fails. This strategy has the potential to yield great

benefits in the hardware implementation, because most of the time the decoder will exhibit

properties (power consumption, throughput, etc.) of the hard decoder, while the coding gain

performance can approach that of the soft decoder.
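In pseudocode form, the intended control flow is simply the following C++ sketch; the decoder entry points and data types are hypothetical placeholders for the two algorithms.

struct ChannelLlrs;   // per-symbol channel LLR vectors (defined elsewhere)
struct Codeword;      // hard-decision decoder output (defined elsewhere)

// Hypothetical entry points; each returns true iff the syndrome check passes
// within its own iteration limit.
bool decode_hard(const ChannelLlrs&, Codeword&);
bool decode_soft(const ChannelLlrs&, Codeword&);

// Two-stage strategy: the cheap hard-decision attempt handles the vast majority
// of frames; the soft decoder is invoked only when that attempt fails.
bool decode_frame(const ChannelLlrs& chan, Codeword& out) {
    if (decode_hard(chan, out)) return true;
    return decode_soft(chan, out);
}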

It remains for us to find a “crude” decoding algorithm appropriate for this strategy. The

suitable algorithm must satisfy several properties:

· The algorithm must have very low computational complexity.

· The algorithm must have a “good enough” waterfall regime (the coding gain loss

relative to the Min-Max algorithm cannot be excessive). This is because for some target

FER the crude algorithm should take care of most of the codewords for the hardware

implementation to reap the benefits due to reduced computational complexity.

· The algorithm can have a high error floor. The existence of a floor for the hard

algorithm does not matter because the residual errors will be cleaned up by the Min-

Max algorithm.


6.1 Iterative Hard-Decision Based Majority Logic Decoding

The Iterative Hard-Decision Based Majority Logic Decoding (IHRB-MLGD) algorithm [25,

41] is a sensible initial candidate for our crude decoding. The IHRB-MLGD algorithm can

be described as follows:

1) Initialization: The iteration index k is initialized to 0, and the a posteriori information

is initialized to be equal to either 0 if the symbol is the most likely one or some (positive)

value Γ otherwise:

Q_n^{(0)}(a) =
\begin{cases}
0, & \text{if } a = \arg\max_{a' \in \mathrm{GF}(q)} p_n(a') \\
\Gamma, & \text{otherwise}
\end{cases}
\qquad (6.1)

The messages from the variable nodes to check nodes Q_{m,n}^{(0)} are initialized to be the most likely symbol a itself:

Q_{m,n}^{(0)} = \arg\min_{a \in \mathrm{GF}(q)} Q_n(a). \qquad (6.2)

2) Termination Check: A hard decision y = (y_0, y_1, \ldots, y_{N-1}), y \in \mathrm{GF}(q)^N is made and the syndrome s = (s_0, s_1, \ldots, s_{M-1}), s \in \mathrm{GF}(q)^M is computed:

y_n = \arg\min_{a \in \mathrm{GF}(q)} Q_n(a), \qquad (6.3)

s = y \times H^T. \qquad (6.4)

If either s = 0 or k = K, then y is output as the result of the algorithm. Otherwise, k is

incremented by 1.

3) Check Node Processing : The messages from check nodes to variable nodes are updated:

R_{m,n}^{(k)} = h_{m,n}^{-1} \times \sum^{\mathrm{GF}(q)}_{n' \in I_m \setminus \{n\}} \left( h_{m,n'} \times Q_{m,n'}^{(k-1)} \right), \qquad (6.5)

where the \sum operator indicates addition in GF(q), which corresponds to bit-wise XOR operations.


Equation (6.5) essentially represents the parity-check equation of the corresponding row.

4) Variable Node Processing : The a posteriori information is updated, and the messages

from variable to check nodes are correspondingly derived:

Q_n^{(k)}(a) = Q_n^{(k-1)}(a) + \sum_{m' \in I_n}
\begin{cases}
0, & \text{if } a = \arg\min_{a' \in \mathrm{GF}(q)} R_{m',n}(a') \\
\delta, & \text{otherwise}
\end{cases}
\qquad (6.6)

Q_{m,n}^{(k)} = \arg\min_{a \in \mathrm{GF}(q)} Q_n^{(k)}(a), \qquad (6.7)

where δ is some positive number. For practical implementations, Qn should be normalized

such that min (Qn(a)) = 0. Essentially, the variable node updates its belief about which

symbol it should be based on the “votes” it receives from neighboring check nodes.

5) Iteration: Go to step 2) Termination Check. ♦
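The variable node portion of this algorithm is small enough to capture in a few lines; the following C++ sketch models Equations (6.6) and (6.7) for one node, with hypothetical data structures (it is an illustrative software model, not the implementation of [25, 41]).

#include <algorithm>
#include <array>
#include <vector>

constexpr int Q = 8;
using Reliab = std::array<int, Q>;   // smaller value = more reliable symbol

// votes[i] is the symbol carried by the message from the i-th neighboring check
// node. Every other symbol is penalized by delta (Eq. (6.6)); the outgoing
// message and hard decision are the argmin (Eq. (6.7)).
int vn_update(Reliab& Qn, const std::vector<int>& votes, int delta) {
    for (int vote : votes)
        for (int a = 0; a < Q; ++a)
            if (a != vote) Qn[a] += delta;
    int m = *std::min_element(Qn.begin(), Qn.end());
    for (auto& x : Qn) x -= m;   // keep the minimum at zero for numerical stability
    return static_cast<int>(std::min_element(Qn.begin(), Qn.end()) - Qn.begin());
}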

While the above algorithm is quite simple and cheap to implement, the performance

degradation relative to soft decoding algorithms is quite large. A larger variable node degree

helps the performance of IHRB-MLGD [25], but this does not suit our needs. The final

average computational complexity will be heavily influenced by the frame error rate of the

coarse decoding. However, in order for this entire scheme to be effective, the frame error rate

of the coarse decoding must be sufficiently low at a reasonably low signal-to-noise ratio. If

the Min-Max and coarse decoding error rate curves are too far away, the region in between

will be a place where Min-Max provides good coding gain but the hardware will struggle

to have good performance in terms of throughput and power. Thus, to close this “gap,”

we propose several modifications to the IHRB-MLGD algorithm in order to improve the

performance while maintaining the low computational complexity.

6.1.1 Augmented IHRB-MLGD

There are two fairly simple modifications we can make that will improve the performance of

the IHRB-MLGD algorithm. The proposed modified algorithm, which we call the Augmented


IHRB-MLGD (A-IHRB), will also better suit our needs as an initial crude decoding method.

The first change to make is to utilize soft input information from the channel, instead of

initializing the channel information to be a simple vector consisting of 0 or Γ as in Equation

(6.1). This is avoided in implementations of IHRB-MLGD [25] from the perspective of the

increased storage requirements. However, in our case, this is not an issue as the soft channel

information is required by the Min-Max algorithm anyways.

The other change we propose is to differentiate each outgoing message of a variable node.

Equations 6.6 and 6.7 are replaced with the following:

4) Variable Node Processing for A-IHRB : The a posteriori information is updated, and

the messages from variable to check nodes are correspondingly derived:

Q_{m,n}^{(k)}(a) = Q_{m,n}^{(k-1)}(a) + \sum_{m' \in I_n \setminus \{m\}}
\begin{cases}
0, & \text{if } a = \arg\min_{a' \in \mathrm{GF}(q)} R_{m',n}(a') \\
\delta, & \text{otherwise}
\end{cases}
\qquad (6.8)

Q_{m,n}^{(k)} = \arg\min_{a \in \mathrm{GF}(q)} Q_{m,n}^{(k)}(a), \qquad (6.9)

where δ is some positive number. Again, Q is normalized every iteration for numerical

stability in an actual implementation. Additionally, the a posteriori LLR is also calculated

and kept for hard decisions.

Traditionally, variable nodes in IHRB-MLGD only maintain one “state” each that is

the cumulative information of the channel input and “votes” that come in from neighboring

check nodes every iteration. However, this is not in the spirit of traditional belief-propagation

decoding where for each outgoing message from a node, the incoming message to that node

from the target node is excluded. The original IHRB-MLGD does not do this because

maintaining a “state” for each adjacent check node increases the storage requirement at the

variable node linearly with the variable node degree dv, and IHRB-MLGD is most commonly

applied to codes with very high dv for improved performance. In our case, this change will

still increase the storage requirement at the variable node. However, we can curb this increase

by maintaining a moderate dv since we are not seeking that much performance out of the


coarse decoder itself.

6.1.2 Detection of Erasure Condition

The employment of a completely hard-decision-based decoding strategy relaxes the computa-

tional complexity at the expense of a usually unacceptable degradation in coding gain. This

extreme tradeoff stems from the fact that each message only conveys information about one

symbol out of q possible symbols, and even though more might be known (at some particu-

lar variable node, for example), this other information is not transmitted as a message. To

make use of this additional information without a large increase in computational complex-

ity, we propose an additional modification to IHRB-MLGD, which we call the detection of

an erasure condition, inspired by LDPC decoding on the Binary Erasure Channel (BEC).

The modification is as follows. Each message passed between nodes will have one more bit

of information, indicating whether the symbol being passed is “erased” or not. The outgoing

symbols are still computed in the same way in both the variable and check nodes. In the

variable-to-check messages, the symbol being sent is indicated as “erased” if the variable

node is not very sure if this symbol is the correct one. Mathematically, the erasure condition

is asserted if there exists another symbol in GF(q) besides the most likely one whose LLR

falls within some threshold (the LLR of the most likely symbol is 0). This means that, for

example, the second most likely symbol is also actually somewhat likely to be the correct

solution (although it may not be the most likely at this point in time). In the check-to-

variable messages, the symbol being sent is indicated as “erased” if again, the check node

is not very sure if this symbol is the correct one. Mathematically, the erasure condition

is asserted if any of the input symbols used to produce this output is indicated as erased

(similar to the check node operations in decoding over the BEC). This is because if even a

single input to the check node equation changes, the output symbol can change drastically

(because the check node operations are a bit-wise XOR). Finally, the variable node updates


its internal state according to Equation 6.6, but with a slight modification:

Q_n^{(k)}(a) = Q_n^{(k-1)}(a) + \sum_{m' \in I_n}
\begin{cases}
0, & \text{if } a = \arg\min_{a' \in \mathrm{GF}(q)} R_{m',n}(a') \\
\delta_e, & \text{if erasure is indicated} \\
\delta, & \text{otherwise}
\end{cases}
\qquad (6.10)

where 0 ≤ δe < δ. By using a δe smaller than δ, the strength of an erased message is

weakened in the state update step of the variable node computation.

In terms of the hardware implementation of this algorithm, the additional cost overhead

is fairly minimal. The routing will require one extra bit per symbol to indicate the erasure

state. The check node will require some OR logic to compute the erasure state. The variable

node can either find the second minimum in the LLR vector and compare to the threshold

to see if the erasure condition upholds, or compare all of the elements in the LLR vector to

the threshold directly to see if any of the elements satisfies the erasure condition.
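As a sketch, the variable node side of this test reduces to a comparison loop such as the following (C++; the threshold is a hypothetical design parameter, and the LLR vector is assumed normalized so that the most likely symbol has LLR 0).

#include <array>

// A variable-to-check message is flagged as "erased" when some symbol other than
// the most likely one (index best, LLR 0) also has an LLR within the threshold,
// i.e., the node is not confident in its choice.
bool erasure_condition(const std::array<int, 8>& llr, int best, int threshold) {
    for (int a = 0; a < 8; ++a)
        if (a != best && llr[a] <= threshold)
            return true;
    return false;
}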

6.2 Software Simulation Results

To validate the effectiveness of our proposed modifications to the original IHRB-MLGD as

well as to determine which flavor of our modifications to adopt as the initial coarse decoding

scheme, we turn to software simulation for performance comparison.

The performance of the Min-Max algorithm (with logarithmic quantization) is compared

against the various modifications proposed to the IHRB-MLGD (Figure 6.1). The code in

this particular example is a GF(8) (3,27) quasi-cyclic code of length 3132 symbols. The

four coarse decoding candidates are: (1) IHRB-MLGD, (2) IHRB-MLGD with soft channel

information (FSOFT), (3) IHRB-MLGD with soft channel information (FSOFT) and an

enhanced variable node to store different information for each adjacent check node (VENH),

and (4) IHRB-MLGD with soft channel information (FSOFT), an enhanced variable node

(VENH), and the detection of the erasure condition (ERAS). The maximum number of


Figure 6.1: Frame error rate comparison of the various proposed IHRB-MLGD modification

strategies.

iterations for each of these is set to be 10.

The original IHRB-MLGD exhibits too large a coding gain degradation, as was hinted

before. Especially because the code is so long and high-rate, the Min-Max curve has a very

steep waterfall slope, exacerbating the gap between it and the IHRB-MLGD curve. Simply

using the soft channel information greatly brings the error rate curve closer to that of Min-

Max, but the gap is still rather large. The compound effect of using soft channel information

and enhancing the variable node yields an error rate curve much closer to what is desired.

Including the erasure condition detection does improve the coding gain but the change is not

as drastic as the two previous techniques. Therefore, for our A-IHRB algorithm, we choose

to employ the first two techniques (FSOFT and VENH) but not detect the erasure condition

(ERAS).

We are able to get to about 0.5dB coding gain loss with respect to Min-Max at a frame


Figure 6.2: Frame error rate comparison of the various proposed IHRB-MLGD modification

strategies, with increased iterations.

error rate of 10⁻² in the case of this particular code. Though the slope is degraded for

A-IHRB and the floor is high, these factors do not matter as much since Min-Max will take

care of the residual errors at higher SNRs.

Another set of simulations is run for a similar length (3,24) code (Figure 6.2). The

maximum number of iterations for the IHRB-MLGD variants are increased to 60. A variant

with erasure condition detection but without an enhanced variable node is also implemented

and plotted. This variant, while seemingly having a lower error floor, shows some non-negligible degradation in the waterfall region, justifying the inclusion of the enhanced variable node and

its corresponding costs. The performance difference between the VENH variants with and without ERAS

is somewhat intriguing. When the error floor region starts to creep in, there is a crossover

between these two frame error rate curves. Thus, again, the erasure condition detection

helps improve the error floor characteristics, but in addition, it has some degradation in the


waterfall region. For the purposes of our use of A-IHRB as an initial coarse decoder, we

actually care more about the performance in the waterfall region, because we want to close

the “gap” between A-IHRB and Min-Max for high frame error rates. This example further

validates our choice to exclude the erasure condition detection from A-IHRB in our final

decoding strategy. While the erasure condition detection as an addition to IHRB-MLGD

may potentially be useful as a stand-alone algorithm, we will not investigate this further.

6.3 Combination with the Min-Max Algorithm

We will quickly discuss some of the aspects concerning the combination of A-IHRB with

Min-Max before diving deeply into the implementation details.

The FER and BER performance simulation results of the proposed dual-algorithm scheme

over the AWGN channel are shown (Figures 6.3, 6.4). Regardless of the maximum number

of iterations for the Min-Max algorithm (6 and 20 are shown), the error rate curves overlap,

which is fairly obvious. In fact, the simulation results are almost exactly the same in the

sense that the same channel inputs cause errors to occur. Very rarely does A-IHRB help

the performance by decoding a channel input correctly where Min-Max was not able to. On

the other hand, A-IHRB did unfortunately yield some undetectable (zero syndrome) errors,

causing the dual-algorithm scheme to not be able to move on to Min-Max. This phenomenon

was observed more for shorter codes, codes with smaller node degrees, and higher maximum

number of A-IHRB iterations (beyond 10). This limits the applicability of our proposed

scheme to codes that satisfy some of these characteristics. Additionally, bringing down the

maximum number of A-IHRB iterations will help, although this will degrade the hardware

performance. For example, if the maximum number of iterations of A-IHRB is set to be 8,

there are zero such undetectable errors over the simulated range of input channel Eb/N0.

Although not a good measure of the exact performance improvements, we can still observe

some of the benefits yielded by our proposed scheme in the software run-time of simulations.

We measure the time it takes for the software to decode 1000 frames as a function of Eb/N0,


Figure 6.3: Comparison of FER in simulation with and without A-IHRB.

Figure 6.4: Comparison of BER in simulation with and without A-IHRB.


Figure 6.5: Comparison of simulation time for decoding 1000 frames with and without

A-IHRB.

and compare between Min-Max only and A-IHRB with Min-Max (Figure 6.5). At low

Eb/N0, the FER of A-IHRB is high and thus there is very little difference between with

and without A-IHRB since Min-Max is invoked most of the time anyways. However, as the

Eb/N0 increases and the FER of A-IHRB decreases, Min-Max is invoked much less frequently

and thus software simulation time is shortened by a significant margin.


CHAPTER 7

ASIC Implementation

7.1 Parity-Check Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

7.2.1 Variable Node Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 92

7.2.2 A-IHRB Check Node Logic Implementation in Variable Node . . . . . . 94

7.2.3 Decoder Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 95

7.3 AWGN Channel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.4 Chip Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.5 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.5.1 Error Correction Performance . . . . . . . . . . . . . . . . . . . . . . . . 101

7.5.2 Hardware Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.5.3 Comparison Against Prior Art . . . . . . . . . . . . . . . . . . . . . . . 107


The final requirement for the practical use of an NB-LDPC decoder is its realization on a

chip. This chapter details the ASIC implementation effort, where the techniques proposed

so far are put into actual use for proof-of-concept. The optimizations made at each level

of hierarchy (algorithm, architecture, and circuit) will be detailed. Finally, measurement

results will be presented.

7.1 Parity-Check Matrix

The parity-check matrix parameters depend on the application space. For example, wireless

communication requires lower code rates and shorter code lengths, while wireline and storage

applications are the opposite and require higher code rates and longer code lengths. For the

application of NB-LDPC codes to make sense, we target applications that require extremely low error rates, specifically storage. In this space, code lengths are generally longer than

8k information bits, and code rates are typically higher than 0.85.

The protograph of the adopted code for our ASIC design is shown in Figure 7.1. Each

box with a number n indicates a circulant matrix σ, where:

\sigma =
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0
\end{bmatrix}, \qquad (7.1)

and the content of each box is σ^n (when n = 0, the box contains the identity matrix). Thus, n dictates the "rotation" of each circulant matrix. The dimensions of each circulant (or identity) matrix are 116 × 116 (this size dictates the maximum value that n can take in such a design methodology, since σ^116 = I). Since there are three (block) rows and 27 (block) columns, the parity-check matrix has M = 348 rows and N = 3132 columns. Since the code is defined over GF(8), this code carries 3132 × log₂(8) × 8/9 = 8352 information bits.
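For clarity, the sketch below shows how the non-zero positions of the full parity-check matrix can be expanded from the protograph exponents of Figure 7.1; it is a software model (the GF(8) edge weights would be assigned separately), not part of the design flow itself.

#include <utility>
#include <vector>

// Hexp[i][j] = n means block (i, j) is the circulant sigma^n of size L x L.
// Row r of sigma^n has its single 1 in column (r + n) mod L.
std::vector<std::pair<int, int>> expand_qc(const std::vector<std::vector<int>>& Hexp, int L) {
    std::vector<std::pair<int, int>> nonzeros;   // (row, column) of each non-zero entry
    for (std::size_t i = 0; i < Hexp.size(); ++i)
        for (std::size_t j = 0; j < Hexp[i].size(); ++j)
            for (int r = 0; r < L; ++r)
                nonzeros.emplace_back(static_cast<int>(i) * L + r,
                                      static_cast<int>(j) * L + (r + Hexp[i][j]) % L);
    return nonzeros;
}

// For the adopted code: 3 block rows, 27 block columns, and L = 116,
// giving a 348-row by 3132-column parity-check matrix.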

Figure 7.1: Protograph of the parity-check matrix adopted for the ASIC design, and examples of circulant matrices.

The entries in each box are chosen so that the parity-check matrix is guaranteed to have

a girth greater than four [42]. Edge weights are assigned randomly without optimizations

because (1) the changing of edge weights does not affect the hardware cost, and (2) the edge

weight optimization will change based on the communication channel [47,48].

As an extension of Chapter 5, the GF order is chosen to be GF(8). Given the well-known

trend for the performance to improve as the GF order increases, the choice of GF(8) may

seem rather low. However, we emphasize that the GF order is not the only parameter that determines the error correcting capability of the code, and thus, a blind pursuit of higher GF orders is in fact a poor design choice.

A simple example is shown in Figure 7.2. A set of quasi-cyclic codes with lengths of approximately 4000 information bits and code rate = 0.9, defined over various GF orders, is compared. Although there is certainly some amount of monotonicity in the improvement

of the coding gain as the GF order is increased, the improvement of GF(16) is not observ-

able until the error floor region (which is somewhat high because of the shorter code length

and very high code rate). In fact, the waterfall region overlaps, presumably because as the

number of bits per symbol increases, the number of variable nodes decreases and the Tanner

graph shrinks because the code length in bits is kept constant. It is noted that these codes

are not that short; even at GF(16), there are still more than one thousand variable nodes,

Figure 7.2: Comparison of quasi-cyclic codes of length approximately 4k information bits and code rate = 0.9 across GF(q).

which is far from making the Tanner graph trivially small. Thus, a simple increase in the

GF(q) order over which the code is defined yields diminishing returns at best.

Another set of comparisons is shown in Figures 7.3 and 7.4. All of the codes plotted have

lengths of approximately 8k information bits and a code rate of 8/9. The difference between the two figures is only in the maximum number of iterations allowed. There are three binary GF(2)

quasi-cyclic (QC) codes plotted, with variable node degrees (column weights) of 3, 4, and 5.

These binary codes are decoded with the Min-Sum algorithm [43,44]. As is well known, the

column weight should be larger than 4 in order to maintain good performance. The column

weight 3 code shows a relatively high error floor. The column weight 5 code falls off sharply

but there is a relatively large loss in coding gain.

The authors in [23] introduce a fully-parallel decoder architecture for a code over GF(64)

with 160 variable nodes, column weight 2, and rate 0.5 (row weight 4). The fully parallel

architecture allows for a random code construction. Although this only holds true for the very

short code, we still construct a GF(64) code randomly but with 1539 variable nodes, column

Figure 7.3: Comparison of codes for which ASIC solutions of similar codes have been presented, except with matched lengths ≈ 9200 bits and code rate = 8/9. Maximum number of iterations is 6.

weight 2, and row weight 18 to match the length and rate to our needs. Potentially, the code

may be further optimized so that the apparently high error floor could be lowered. However,

the code length remains an issue; our required code length is more than 10 times that of what

has been implemented in [23], nullifying our ability to employ their architectural techniques.

Furthermore, the higher required code rate implies a much larger row weight, which can also

be problematic with their design (which only has a row weight of 4). Therefore, GF(64) and

column weight 2 are incompatible code parameters with our target application.

The authors in [45] introduce a fully-parallel decoder for a code defined over GF(256)

with 110 variable nodes, column weight 2, and rate 0.8 (row weight 10). Again, a random

code construction is viable due to the short code length. We plot the frame error rate for

a random GF(256) code with 1152 variable nodes, column weight 2, and row weight 18 to

again match the length and rate to our needs. Unlike the previous code in GF(64), this

code performs fairly well (more so with 20 iterations than 6) despite having column weight

Figure 7.4: Comparison of codes for which ASIC solutions of similar codes have been presented, except with matched lengths ≈ 9200 bits and code rate = 8/9. Maximum number of iterations is 20.

2 perhaps due to the high GF order. However, to arrive at this performance we have of

course had to increase the code length by more than ten times, rendering the fully-parallel

architecture infeasible.

The final curve plotted in Figures 7.3 and 7.4 is our proposed code (Figure 7.1), defined

in GF(8) but with a column weight of 3 (row weight of 27). This code outperforms the

binary codes as well as the column weight 2 code in GF(64), and performs similarly to or

better than the column weight 2 code in GF(256), depending on the maximum number of

iterations. Therefore, GF(8) is highly likely to be best in terms of the trade-off space between

coding gain and hardware cost of implementation. It is noted that code design optimization

strategies based on Tanner graph edge weights such as those provided in [17,46] are not taken

into consideration here, because of two reasons. First, the edge weight optimization strategy

highly depends on the channel characteristics [17, 47, 48]. Second, the edge weight actually

does not affect the hardware cost in any way. The other well-known form of optimization,


namely the structural optimization of the binary LDPC matrix, is limited anyways since we

must employ a quasi-cyclic code.

Unfortunately, the choice to implement a code in GF(8) does not automatically translate

to cheap or even feasible hardware costs (for example, see [27]). This is because while GF(8)

codes are much more expensive than binary codes, many of the techniques utilized in prior

art such as [23], [45] do not apply at GF(8). Most importantly, the EMS algorithm [14], [15]

does not yield benefits until the GF order is at least approximately 32, because there is no

good choice for the truncated message length n_m (where n_m < q) when q is small. Either n_m is too small and the coding gain is excessively sacrificed, or n_m is too close to q and

the overhead cost in hardware (for sorters, etc.) becomes too large (or both). Thus, it is

imperative for us to design the architecture carefully to reap the benefits of our GF(8) code.

7.2 Architecture

The algorithm of choice will be the dual-algorithm scheme with A-IHRB and Min-Max

decoding algorithms, detailed in Chapter 6. While the algorithm has been discussed in

detail, we have still yet to consider the actual hardware implementation strategy. An upfront

concern is the area overhead of the dual-algorithm scheme; if multiple algorithms must be

implemented, then that much extra cost must be paid in hardware. We will seek to mitigate

this overhead by close observation of the variable node computation.

7.2.1 Variable Node Architecture

The variable node computation in the Min-Max algorithm is described in Equations 2.24,

2.25, and 2.26. In essence, the core of these equations consists of summations and normalizations (minimum and subtraction). On the other hand, the variable node computations in the A-IHRB algorithm are described in Equations 6.8 and 6.9. The crux of these equations is the summations and the argmin function. It is fairly straightforward to share the

summation operations, since there will be the same number of addition operations required

Figure 7.5: Variable node architecture details.

in either algorithm. The only requirement is for the symbols in the A-IHRB algorithm to be

converted into a “one-hot” vector format, where the values in the vector are all zero, except

for the element indicated by the input symbol where the value is δ (In our implementation,

we choose δ = 1). The argmin function of the A-IHRB, when implemented in a binary tree

fashion, will require each stage of two-input minimums to pass forward the minimum value

as well as the index. Therefore, the minimum and argmin functions can also share most of

the computational cost.

The decoding algorithm also requires multiplications in GF(8) when messages are passed

between the check and variable nodes (Equations (2.27), (2.28) and so on). In the Min-Max

path, LLR vectors are multiplied by some GF element by means of a series of multiplexors.

In the A-IHRB path, the GF(8) multiplication is simple combinational logic (consisting

primarily of XOR gates).


The final variable node architecture is shown in Figure 7.5. The two data paths for A-

IHRB and Min-Max are separated by muxes, and as mentioned, the core computations of

summation and argmin/normalization are shared. In the A-IHRB path, GF multiplications

(and divisions) are implemented using combinational logic, and the symbols are converted

into a vector format to feed into the summation blocks. In the Min-Max path, the GF

multiplications are permutations, and the signals in the logarithmic quantization domain

(Chapter 4) are converted into the uniform quantization domain so that numbers may be

added (and the adders may be shared). These converters are placed as close to the adders as

possible so that everything outside (muxes for GF multiplication by permutation in the Min-

Max path, the wiring overhead, storage requirements, etc.) can be directly reduced by the

number of bits required for the representation of LLRs. The variable node also must compute

the a posteriori LLRs to make hard decisions as output of the decoder, as well as to check

the syndrome for early termination. This is shown in the top path of Figure 7.5. The “LLR

ch” signal in the Min-Max path is the channel information given as input to the algorithm

itself. The “LLR ch mod” signal in the A-IHRB path is the set of signals corresponding

to Q(k−1)m,n (a) in Equation (6.8) as well as the a posteriori LLR for hard decisions (refer to

Section 6.1.1 for details).

7.2.2 A-IHRB Check Node Logic Implementation in Variable Node

The combinational GF(8) subtraction and addition blocks in the A-IHRB data path in Figure

7.5 are for check node computations in A-IHRB. The check node computations in A-IHRB,

described by Equation (6.5), are actually simple bit-wise XOR operations on GF symbols.

Thus, instead of computing Equation (6.5) for each output edge on a check node, we can

instead compute a check node “state” as follows:

S_m^{(k)} = \sum^{\mathrm{GF}(q)}_{n' \in I_m} \left( h_{m,n'} \times Q_{m,n'}^{(k-1)} \right), \qquad (7.2)

from which the output message for each edge can then be calculated:

R_{m,n}^{(k)} = h_{m,n}^{-1} \times \left( S_m^{(k)} + h_{m,n} \times Q_{m,n}^{(k-1)} \right), \qquad (7.3)

effectively reducing the number of XOR operations necessary, especially for a large check node

degree d_c. As the variable node computations are conducted in blocks, the Q_{m,n'} become available as outputs. Therefore, the cumulative summation in Equation (7.2) for the state S_m^{(k)} can be accumulated one term at a time, concurrently with the variable node computations. Additionally, if the state from the previous iteration S_m^{(k-1)} is stored, the check-node-to-variable-node messages R_{m,n}^{(k)} (in Equation (6.5)) can also be calculated on the fly by Equation

(7.3). Since these are relatively simple combinational logic operations, we propose to absorb

these operations into the variable node computation itself, as shown in Figure 7.5. While this

slightly elongates the critical path in this architecture, we eliminate the need for a separate

phase for the check node computations in A-IHRB. The feedback nature of the data flow

also disallows pipelining this architecture. However, only one clock cycle is necessary for a

block of variable node computations, including the absorbed check node computations. This

allows a high-throughput for the partially parallel architecture.
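In software form, Equations (7.2) and (7.3) amount to the following C++ sketch. The GF(8) arithmetic shown assumes the primitive polynomial x^3 + x + 1; the actual field polynomial used in the design is not restated here, so this choice is an illustrative assumption.

#include <cstdint>
#include <vector>

// GF(8) multiplication for the primitive polynomial x^3 + x + 1 (assumed).
uint8_t gf8_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 3; ++i) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x4;
        a = static_cast<uint8_t>((a << 1) & 0x7);
        if (carry) a ^= 0x3;   // reduce using x^3 = x + 1
        b >>= 1;
    }
    return r;
}

// Multiplicative inverses in GF(8) for the same polynomial (index = element).
const uint8_t gf8_inv[8] = {0, 1, 5, 6, 7, 2, 3, 4};

// Eq. (7.2): the check node "state" is a running GF(8) sum (bit-wise XOR)
// of the weighted incoming symbols, accumulated one term at a time.
uint8_t check_state(const std::vector<uint8_t>& h, const std::vector<uint8_t>& q) {
    uint8_t S = 0;
    for (std::size_t n = 0; n < q.size(); ++n)
        S ^= gf8_mul(h[n], q[n]);
    return S;
}

// Eq. (7.3): each outgoing message is derived on the fly from the stored state.
uint8_t c2v_message(uint8_t S, uint8_t h_mn, uint8_t q_mn) {
    return gf8_mul(gf8_inv[h_mn], static_cast<uint8_t>(S ^ gf8_mul(h_mn, q_mn)));
}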

7.2.3 Decoder Core Architecture

The overall architecture of the core is depicted in Figure 7.6. The variable node computations

include the A-IHRB check node operations, as explained in the previous section. Thus, in

the initial decode attempt by A-IHRB, decoding occurs purely within the variable node

cores and memories. A total of 58 variable node cores are instantiated for high parallelism

and high throughput. The local memory attached to each variable node stores data that does not traverse the edges of the Tanner graph, such as Q_{m,n}^{(k)}(a) in Equation (6.9) for A-IHRB. These

signals do not need to pass through the barrel shifters because the same variable node core

will be the only core that requires these signals for computation. On the other hand, signals

that do traverse through the edge in the Tanner graph (to reach check nodes) are passed

Figure 7.6: Decoder core overall architecture.

through barrel shifters and stored in the message memories.

When A-IHRB fails the decoding attempt, the Min-Max check node cores are invoked.

For the purposes of this proof-of-concept chip, only two Min-Max check node cores are

instantiated. This choice does not greatly affect the overall (average) throughput since these

cores are invoked very rarely anyways, but the worst-case latency is highly degraded. Thus,

in a real application, the number of Min-Max check node cores must be chosen (along with

other parameters, such as the maximum number of iterations) such that the worst-case

latency requirements are met.

The channel input LLR memories are initialized at the beginning of each frame decode

attempt. The codeword and syndrome are checked each iteration for early termination, as

well as for collecting statistics on error rates. Serial-to-parallel and parallel-to-serial blocks

are placed at the periphery of the core as interface to the outside world.

Figure 7.7: AWGN generator block details.

7.3 AWGN Channel Simulation

To have an on-chip AWGN simulation platform to measure frame error rates (without having

a high-speed interface between the chip and the outside world), an AWGN generator block

was placed outside the NB-LDPC decoder core. The details of this block are depicted in

Figure 7.7. Uniform random numbers are generated massively in parallel by Tausworthe

pseudorandom number generators [49]. The Tausworthe pseudorandom number generator

pseudocode is given in Algorithm 7.1. The possible values for Q and S are given in Table

7.1. The generator is implemented in hardware as shown in Figure 7.8.

The Tausworthe PRNGs generate uniformly distributed pseudo-random numbers. To

generate a normal distribution, the uniformly distributed numbers are grouped into eight

taus() {
    // one step of each of the three component generators; the combined
    // pseudorandom output is conventionally the XOR of the three states
    b1     = (((state1 << Q1) ^ state1) >> (31 - S1));
    state1 = (((state1 & 32'hFFFFFFFE) << S1) ^ b1);
    b2     = (((state2 << Q2) ^ state2) >> (29 - S2));
    state2 = (((state2 & 32'hFFFFFFFC) << S2) ^ b2);
    b3     = (((state3 << Q3) ^ state3) >> (28 - S3));
    state3 = (((state3 & 32'hFFFFFFF8) << S3) ^ b3);
}

Algorithm 7.1: Pseudocode for the Tausworthe PRNG.

Table 7.1: Possible Constant Values for the Tausworthe PRNG

Q1 | Q2 | Q3 | S1 | S2 | S3
13 |  2 |  3 | 12 |  4 | 17
 7 |  2 |  9 | 24 |  7 | 11
 3 |  2 | 13 | 20 | 16 |  7

and added together, which by the central limit theorem generates a distribution that resem-

bles a Gaussian. The output of this is scaled properly to produce a normal distribution with

zero mean and variance one. Then, the signal-to-noise ratio is used to transform the Gaus-

sian distributed random variable into a log-likelihood ratio that represents the information

received from the observation from the channel. Three of these LLRs are taken at a time

and combined properly to generate a vector of 8 LLRs, since the NB-LDPC code is defined

over GF(8). This LLR vector is normalized so that the minimum LLR in the vector is 0.

Finally, the LLRs are quantized to the correct number of bits, with the correct decimal point

location and fed into the channel input LLR memory of the decoder core.
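A rough software model of this chain is sketched below in C++; the exact scaling, rounding, and fixed-point formats used on chip are not reproduced, and the standard-library generator simply stands in for the Tausworthe hardware.

#include <cmath>
#include <random>

// Approximate a standard normal sample by summing eight uniform variates
// (central limit theorem): the sum has mean 4 and variance 8/12.
double gaussian_clt(std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double s = 0.0;
    for (int i = 0; i < 8; ++i) s += uni(rng);
    return (s - 4.0) / std::sqrt(8.0 / 12.0);
}

// Channel observation and bit LLR for BPSK over AWGN at a given Eb/N0 and rate.
double channel_bit_llr(int tx_bit, double ebn0_linear, double rate, std::mt19937& rng) {
    double sigma = std::sqrt(1.0 / (2.0 * rate * ebn0_linear));
    double y = (tx_bit ? -1.0 : 1.0) + sigma * gaussian_clt(rng);
    return 2.0 * y / (sigma * sigma);   // then grouped into GF(8) LLR vectors and quantized
}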

7.4 Chip Design

The general task flow is depicted in Figure 7.9. The parity-check matrix is generated by

a script in MATLAB and is used in a software implementation of the decoding algorithm

in C++ to observe the error rate performance in simulation. The C++ implementation is

Figure 7.8: Tausworthe PRNG block details.

written in such a way that it can also be used for testing and verifying the Verilog implemen-

tation for bit-accuracy. The core architecture is written in Verilog from scratch, but some of

the blocks required for error rate testing (mostly the AWGN generation) that reside outside

the decoder core were developed with the Synphony Model Compiler [50] environment to

reduce design time. After behavioral level verification of the Verilog, the decoder core and

the test circuitry were pushed through logical synthesis and physical place and route (PnR)

flows individually. This was to place the decoder core and test circuitry in separate power

domains, so that the power consumption of the decoder core could be measured by itself.

Then, the designed blocks can be verified with the gate-level Verilog and the associated

timing information using standard delay format (SDF) annotation. These blocks are then

Figure 7.9: General task flow for chip tapeout.

combined with a top-level synthesis and PnR step, which again is verified to make sure that

there are no setup or hold timing violations. Finally, the “gds” file is generated for tapeout


in the Cadence Virtuoso environment.

7.5 Measurement Results

The proof-of-concept chip is taped out in a 40nm LP process. Since our bottleneck is in

the speed/throughput rather than energy consumption, the low threshold voltage (LVT)

flavor of transistors and standard cell library are used. The lab test setup is depicted in

Figure 7.10. A custom printed circuit board (PCB) is fabricated and assembled with off-

the-shelf components such as low voltage differential signaling (LVDS) drivers, low-dropout

regulators (LDO), etc. to power the chip-under-test and to interface to the packaged chip. A

zero insertion force (ZIF) socket is used so that multiple chips can be tested using the same

PCB. The chips are wire bonded in a pin grid array (PGA) package. Although the electrical

characteristics of these packages are not optimal for high-speed communications, the chip

contains a serial-parallel interface in the test circuitry, and therefore any communication

between the chip and the outside world could be conducted slowly and the high decoder

throughput could be recorded within the chip itself. The ROACH board [51] based on a

Virtex-5 FPGA [52] that was developed by the CASPER [53] research group is used to

communicate to the chip serially to program the chip and to read out decoding results and

statistics. Communication to the ROACH board is conducted through the KATCP [54]

protocol.

The decoder core measures 1782 µm × 2655 µm with an area of 4.73 mm², while the AWGN generation block and serial interface outside of the core measure 279 µm × 2655 µm with an area of 0.74 mm² (Figure 7.11; the pad ring is open because the core area is shared with other research projects).

7.5.1 Error Correction Performance

The on-chip AWGN and corresponding LLR generation is utilized for frame error rate sim-

ulations (after functional verification). The measured frame error rate is plotted in Figure

Figure 7.10: Lab test setup for chip measurement.

7.12. The “raw” frame error rate of the A-IHRB is also measured and plotted. The maximum

number of iterations in A-IHRB is set to be 8. The maximum number of iterations for the Min-Max algorithm is varied between 5 and 8. The frame error rate curves under these conditions do not show an error floor until ∼ 10⁻⁸, which, since each frame is 9396 coded bits long, is

equivalent to a bit error rate of about 3 or 4 orders of magnitude lower (we unfortunately

had not included a method to measure the bit error rate on the chip).

The average number of iterations corresponding to the FER performance in Figure 7.12

is measured and plotted in Figure 7.13. For the Min-Max curves, only frames that invoke

the Min-Max algorithm are counted (otherwise, the average number of iterations per frame

would be a small fraction). Because errors are such a rare event, the average number of

iterations per frame for the Min-Max algorithm tends to converge to a single line regardless

of the maximum number of iterations.

Figure 7.11: Chip micrograph.

7.5.2 Hardware Performance

The hardware performance such as the average throughput and energy efficiency depends

mostly on the frame error rate of A-IHRB, and is relatively unaffected by changing the

maximum number of iterations of Min-Max. The operating principle that we have mentioned

makes this clear; the complex Min-Max algorithm is only invoked at the detectable frame

error rate of the A-IHRB algorithm. Based on Figure 7.12, we choose to make chip measurements

at Eb/N0 = 5 dB, with six maximum Min-Max iterations, which achieves an A-IHRB frame error rate of 10⁻² and an overall frame error rate of 10⁻⁸. This decision is somewhat arbitrary, and

for example, the maximum number of iterations in the Min-Max algorithm can be tuned for

the required combination of frame error rate and worst-case latency. Our parameter choices

are more driven by demonstrability of our technique in correspondence with the FER curve,

and not so much the absolute performance itself. For example, the hardware can exhibit

similar characteristics even as the maximum number of iterations of Min-Max is increased

Figure 7.12: Frame error rate chip measurements. One hundred frame errors are observed for each plotted point.

and the FER is simply unobservable. The fact that the FER is too low to be simulated has

nothing to do with the average throughput and energy efficiency.

With these parameters, at a nominal supply voltage of 1.2V, we achieve 2.551 Gbps

coded throughput, which is equivalent to a 2.267 Gbps information throughput. The clock

frequency is 125MHz, and the core power consumption is 212.4mW. This operating point

gives us an energy efficiency of 93.7pJ/bit. The target clock frequency during the logical

synthesis and physical place-and-route design stages was slightly higher (∼ 160MHz), but

several factors have caused a diminished maximum possible clock frequency. First, we know

that the chips received from the foundry are at the slow-slow (SS) corner, by observing the

frequency versus supply voltage curve of the ring oscillator circuits placed at the periphery

of the chip. Also, the power distribution of the large core is problematic and the center of

the core potentially sees a relatively large supply drop due to a non-negligible resistance.

To improve the energy efficiency (energy per decoded bit), we can scale the supply

Figure 7.13: Average number of iterations per frame chip measurements.

voltage and operating clock frequency (Figure 7.14). An information throughput of 2.086

Gbps is maintained at a supply of 1.1V and a clock frequency of 115 MHz, where the power

consumption is 143 mW and the energy efficiency is improved to 68.6pJ/bit. An information

throughput of 1.088 Gbps is achieved at a supply of 0.8V and a clock frequency of 60 MHz,

where the power consumption is 38.72 mW and the energy efficiency is further improved to

35.6pJ/bit. At the extreme, the chip is functional at a minimum supply voltage of 0.65V

and a clock frequency of 30 MHz (below this supply voltage, the chip fails to function at any

clock frequency), where the throughput is 544.2 Mbps and the power consumption is 12.35

mW, resulting in a minimum energy efficiency of 22.7pJ/bit. These measurement results are

fairly consistent across multiple chips (Figure 7.15). The results of Figure 7.14 are based on

the chip with worst case measurements (CHIP1 in Figure 7.15).
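For reference, the reported efficiency numbers follow directly from the measured core power divided by the information throughput at each operating point, for example:

\[
\frac{212.4\ \text{mW}}{2.267\ \text{Gbps}} \approx 93.7\ \text{pJ/bit},
\qquad
\frac{38.72\ \text{mW}}{1.088\ \text{Gbps}} \approx 35.6\ \text{pJ/bit}.
\]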

Figure 7.14: Throughput and power versus operating frequency, with points labeled with supply voltage and energy efficiency.

Figure 7.15: Shmoo plot measurements (pass/fail versus supply voltage, 0.6 V to 1.2 V, and clock frequency, 20 MHz to 130 MHz) for the three received chips (CHIP1, CHIP2, CHIP3). All three chips are functional at 125 MHz with a 1.2 V supply and at 30 MHz with a 0.65 V supply.

7.5.3 Comparison Against Prior Art

A comparison of our ASIC results with prior art is shown in Figure 7.16. In somewhat

older works (such as [20]), code parameters tended not to be sacrificed for cheaper hardware.

Therefore, the column weight is high (4), but there exists a multiple-orders-of-magnitude gap in information throughput and energy efficiency between these works and binary LDPC decoders (see, for example, [10]). In more recent works such as [45] and [23], decoders for

higher-order GF(q) codes are achieved with moderate throughputs. However, the power consumption, and correspondingly the energy efficiency, remains quite poor. Furthermore,

as discussed in Section 7.1, code parameters such as the code length and column weight

are sacrificed to achieve the impressive hardware performance, which is a poor tradeoff

against simply choosing a binary LDPC decoder. Our code boasts a high code length,

moderate column weight, and high code rate, which outperforms a binary LDPC code with similar characteristics. The hardware is able to achieve very high information throughputs

while consuming very little power, yielding a highly energy-efficient design. This work is the first NB-LDPC decoder to achieve greater than 2 Gbps information throughput

and less than 100 pJ/b energy efficiency. Without taking into account technology scaling

(which is difficult to estimate in deep sub-micron technologies), our work is a 5.2x and

23.8x improvement in throughput and energy efficiency over [45], and a 3.7x and 64.7x

improvement in throughput and energy efficiency over [23]. While there still remains some

gap between our work and binary LDPC decoders in terms of hardware performance metrics,

this certainly brings NB-LDPC code decoders much closer to practical deployment.
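For concreteness, the improvement factors quoted above can be recomputed directly from the Figure 7.16 entries; the short sketch below does so (numbers copied from the table).

# Improvement factors over prior art, recomputed from the Figure 7.16 entries
# (information throughput in Mb/s, energy efficiency in pJ/bit).

this_work = {"tput": 2267.0, "epb": 93.7}
prior = {
    "[45]": {"tput": 436.8, "epb": 2234.0},
    "[23]": {"tput": 611.0, "epb": 6062.0},
}

for name, ref in prior.items():
    tput_gain = this_work["tput"] / ref["tput"]
    energy_gain = ref["epb"] / this_work["epb"]
    print(f"vs {name}: {tput_gain:.1f}x throughput, {energy_gain:.1f}x energy efficiency")
# Expected: ~5.2x / ~23.8x versus [45], and ~3.7x / ~64.7x versus [23].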

Figure 7.16: Table of chip measurement results and comparison with prior art (no technology scaling). Key entries: this work (GF(8), (dv, dc) = (3, 27), 9396-bit code with 8352 information bits, rate 0.89, 40 nm, 4.73 mm2 core, A-IHRB & Log Min-Max decoding) achieves an information throughput of 2267 Mb/s at 93.7 pJ/b (1.2 V, 125 MHz, 212.4 mW) and 544 Mb/s at 22.7 pJ/b (0.65 V, 30 MHz, 12.35 mW); [20] (GF(32), (4, 8), 1240-bit code, 90 nm, 10.33 mm2) achieves 26.2 Mb/s at 18313 pJ/b; [45] (GF(256), (2, 10), 880-bit code, 28 nm, 1.289 mm2, RTBCP decoding) achieves 436.8 Mb/s at 2234 pJ/b; and [23] (GF(64), (2, 4), 960-bit code, 65 nm, 7.04 mm2) achieves 611 Mb/s at 6062 pJ/b.


CHAPTER 8

Conclusion

8.1 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


8.1 Research Contributions

Specific accomplishments of this research are:

· Proposed several algorithmic simplifications and analyzed their effects:

– The Pruned Min-Max algorithm, which was shown to have a negligible impact on performance while directly alleviating hardware implementation costs for lower-rate codes.

– The A-IHRB algorithm, which in combination with a soft decoding algorithm is highly amenable to hardware implementation, resulting in high throughput and high energy efficiency.

· Analyzed the effect of the quantization scheme on the decoding performance of NB-

LDPC codes, and identified the effects of limiting the number of bits used to represent

LLRs.

· Proposed the Logarithmic quantization scheme, which is easily applicable to any existing soft decoder and yields a large (up to 40% in our comparison) hardware cost reduction without paying a price in coding gain; a generic illustrative sketch follows this list.

· Created a multi-purpose FPGA platform:

– The Min-Max algorithm, implemented in hardware, accelerates frame error rate simulations, enabling the observation of much lower frame error rates than are reachable with software simulations.

– The logarithmic quantization scheme is applied, demonstrating both the hardware resource savings and the negligible performance degradation (especially important to observe at lower frame error rates).

– The FPGA platform features a flexible script-based RTL generation scheme, en-

abling the demonstration of FER simulation results as well as hardware resource

utilization reports across a variety of parity-check matrix designs.


· Fabricated a proof-of-concept ASIC chip. The maximum information throughput

achieved was 2.267 Gbps at 1.2V supply and 125MHz clock, where the power con-

sumption was 212.4mW. The energy efficiency at this operating point was 93.7pJ/bit.

The throughput and energy efficiency are 3.7x and 23.8x improvements, respectively, over the prior state-of-the-art. The energy efficiency can be further improved by scaling the

supply voltage down to 0.65V, where the information throughput is 544.2 Mbps and

the power consumption is 12.35mW, yielding an energy efficiency of 22.7pJ/bit.
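As noted in the quantization bullet above, the general idea behind the logarithmic quantization scheme can be illustrated with a small sketch; the level spacing, thresholds, and function names below are illustrative assumptions only and do not reproduce the exact scheme implemented in this work.

# Generic illustration of logarithmic (exponentially spaced) quantization of LLR
# magnitudes versus uniform quantization with the same number of levels (16 here).
# All parameters below are illustrative assumptions, not the scheme's actual settings.
import math

LEVELS = 16

def quantize_uniform(x, step=1.0):
    """Equally spaced levels: fine resolution, but a small dynamic range (clips at 15)."""
    return min(round(x / step), LEVELS - 1) * step

def quantize_log(x, base=1.7):
    """Exponentially spaced levels: coarser for large values, but a wide dynamic range."""
    if x < 1.0:
        return 0.0
    return base ** min(round(math.log(x, base)), LEVELS - 1)

for llr in (0.4, 2.3, 14.0, 120.0):
    print(f"|LLR| {llr:6.1f}: uniform -> {quantize_uniform(llr):6.1f}, "
          f"log -> {quantize_log(llr):6.1f}")

With the same number of levels, the exponentially spaced quantizer covers a much wider dynamic range, which is the property that allows the bit width of stored messages to be reduced.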

8.2 Future Work

Areas of potential future work include:

· Application of algorithmic techniques to spatially coupled NB-LDPC codes. While the

proposed techniques such as the logarithmic quantization scheme should seamlessly

integrate with spatially coupled codes and decoders such as windowed decoders, this

has not been explicitly verified or investigated. It will also be worthwhile to investigate

the hardware implications of our techniques in this context.

· Alternatives to the A-IHRB algorithm. The A-IHRB algorithm as it stands is fairly simple, and it is unlikely that a large performance improvement can be made without introducing

too much complexity. However, it is possible that the “coarse” decoder can play an

increased role in the dual-algorithm decoding scheme. For example, could the channel

input LLR vector be “preprocessed” so that the soft decoding algorithm improves

its performance? Although not discussed in this manuscript, we did investigate this

possibility and found that it is not trivial to find such a “preprocessing” scheme without

introducing high complexity or a coding gain degradation (sometimes in the form of

an early error floor, which can be missed). However, there may still be an opportunity for improvement in this regard.

· Further improvement of ASIC specifications. While we have made large advancements

in terms of the information throughput, energy efficiency per decoded bit, etc., there


is still a gap in the performance relative to binary LDPC code decoders.

– One major bottleneck is the relatively long critical path in the proposed variable

node unit, which limits the overall clock frequency of the decoder. Reducing the logic depth or allowing pipelining (although neither would be trivial) would be of high interest for our architecture.

– Another high-cost element is the storage of messages. The memory elements

take a large amount of real estate on silicon, and this cost only increases as the

code length increases (to 2K and 4K byte codewords). Techniques to reduce this

storage requirement will further help reduce the decoder cost.

· Consideration of rate-programmability. The architecture design of rate-programmable NB-LDPC decoders for storage applications remains an open question.


References

[1] R. G. Gallager, “Low-Density Parity-Check Codes,” Ph.D. dissertation, Cambridge, MA: MIT Press, 1963.

[2] R. Hamming, “Error detecting and error correcting codes,” The Bell System Technical Journal, vol. 29, no. 2, pp. 147–150, Apr. 1950.

[3] R. Bose and D. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Information and Control, vol. 3, pp. 68–79, Mar. 1960.

[4] I. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, pp. 300–304, June 1960.

[5] IEEE Standard for Information Technology - Telecommunications and Information Exchange Between Systems - Local and Metropolitan Area Networks - Specific Requirements Part 3: Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications, IEEE Std. 802.3an Std., 2006.

[6] ETSI Standard TR 102 376 V1.1.1: Digital Video Broadcasting (DVB) User Guidelines for the Second Generation System for Broadcasting, Interactive Services, News Gathering and Other Broadband Satellite Applications (DVB-S2), ETSI Std. TR 102 376 Std., 2005.

[7] IEEE Standard for Local and Metropolitan Area Networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems Amendment 2: Physical and Medium Access Control Layers for Combined Fixed and Mobile Operation in Licensed Bands and Corrigendum 1, IEEE Std. 802.16e Std., 2006.

[8] IEEE Standard for Information Technology: Telecommunications and Information Exchange between Systems: Local and Metropolitan Area Networks: Specific Requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 4: Enhancements for Very High Throughput for Operation in Bands below 6 GHz, IEEE Std. 802.11ac-2013 Std., 2013.

[9] C.-H. Liu et al., “An LDPC decoder chip based on self-routing network for IEEE 802.16e applications,” IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 684–694, Mar. 2008.

[10] Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolic, “An Efficient 10GBASE-T Ethernet LDPC Decoder Design With Low Error Floors,” IEEE J. Solid-State Circuits, vol. 45, no. 4, April 2010.

[11] M. Mansour and N. Shanbhag, “High-throughput LDPC decoders,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 6, pp. 976–996, Dec. 2003.

[12] K. Zhang, X. Huang, and Z. Wang, “A high-throughput LDPC decoder architecture with rate compatibility,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 4, pp. 839–847, Apr. 2011.

[13] T. Mohsenin, D. Truong, and B. Baas, “A low-complexity message-passing algorithm for reduced routing congestion in LDPC decoders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 5, pp. 1048–1061, May 2010.

[14] D. Declercq and M. Fossorier, “Decoding Algorithms for Nonbinary LDPC Codes Over GF(q),” IEEE Trans. Commun., vol. 55, no. 4, pp. 633–643, April 2007.

[15] A. Voicila et al., “Low-complexity decoding for non-binary LDPC codes in high order fields,” IEEE Trans. Commun., vol. 58, no. 5, pp. 1365–1375, May 2010.

[16] M. Davey and D. MacKay, “Low-density parity check codes over GF(q),” IEEE Commun. Lett., vol. 2, no. 6, pp. 165–167, June 1998.

[17] A. Hareedy, B. Amiri, R. Galbraith, and L. Dolecek, “Non-Binary LDPC Codes for Magnetic Recording Channels: Error Floor Analysis and Optimized Code Design,” IEEE Transactions on Communications, vol. 64, no. 8, pp. 3194–3207, Aug. 2016.

[18] L. Barnault and D. Declercq, “Fast decoding algorithm for LDPC over GF(2^q),” in Proc. Inform. Theory Workshop, Mar. 2003, pp. 70–73.

[19] X. Chen, S. Lin, and V. Akella, “Efficient configurable decoder architecture for non-binary quasi-cyclic LDPC codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 1, pp. 188–197, Jan. 2012.

[20] Y.-L. Ueng et al., “An efficient layered decoding architecture for nonbinary QC-LDPC codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 2, pp. 385–398, Feb. 2012.

[21] X. Zhang and F. Cai, “Reduced-complexity decoder architecture for non-binary LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp. 1229–1238, July 2011.

[22] X. Chen and C.-L. Wang, “High-throughput efficient non-binary LDPC decoder based on the simplified min-sum algorithm,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.

[23] Y. S. Park, Y. Tao, and Z. Zhang, “A Fully Parallel Nonbinary LDPC Decoder With Fine-Grained Dynamic Clock Gating,” IEEE J. Solid-State Circuits, vol. 50, no. 2, pp. 464–475, Feb. 2015.

[24] F. García-Herrero, M. J. Canet, and J. Valls, “Nonbinary LDPC Decoder Based on Simplified Enhanced Generalized Bit-Flipping Algorithm,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 6, pp. 1455–1459, June 2014.

[25] X. Zhang, F. Cai, and S. Lin, “Low-complexity reliability-based message-passing decoder architectures for non-binary LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 11, pp. 1938–1950, Nov. 2012.

[26] V. Savin, “Min-Max decoding for non binary LDPC codes,” in Proc. IEEE Int. Symp. Inf. Theory, July 2008, pp. 960–964.

[27] C. Spagnol, E. M. Popovici, and W. P. Marnane, “Hardware Implementation of GF(2^m) LDPC Decoders,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 12, pp. 2609–2620, Dec. 2009.

[28] Y. Toriyama, B. Amiri, L. Dolecek, and D. Markovic, “Field-order based hardware cost analysis of non-binary LDPC decoders,” in Proc. IEEE Asilomar Conf. Signals, Syst., and Comput., Nov. 2014, pp. 2045–2049.

[29] Y. S. Park, Y. Tao, and Z. Zhang, “A 1.15Gb/s fully parallel nonbinary LDPC decoder with fine-grained dynamic clock gating,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2013, pp. 422–423.

[30] J. Lin, J. Sha, Z. Wang, and L. Li, “Efficient decoder design for nonbinary quasicyclic LDPC codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 5, pp. 1071–1082, May 2010.

[31] B. Amiri, S. Srinivasa, and L. Dolecek, “Quantization, absorbing regions and practical message passing decoders,” in Proc. IEEE Asilomar Conf. Signals, Syst., and Comput., Nov. 2012, pp. 1255–1259.

[32] B. Amiri, J. Kliewer, and L. Dolecek, “Analysis and enumeration of absorbing sets for non-binary graph-based codes,” in Proc. IEEE Int. Symp. Inform. Theory (ISIT), July 2013, pp. 398–409.

[33] Y. Toriyama, B. Amiri, L. Dolecek, and D. Markovic, “Logarithmic quantization scheme for reduced hardware cost and improved error floor in non-binary LDPC decoders,” in IEEE Glob. Commun. Conf. (GLOBECOM), Dec. 2014, pp. 3162–3167.

[34] J. Lin and Z. Yan, “Efficient Shuffled Decoder Architecture for Nonbinary Quasi-Cyclic LDPC Codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 9, pp. 1756–1761, Sept. 2013.

[35] X. Zhang and P. H. Siegel, “Quantized min-sum decoders with low error floor for LDPC codes,” in Proc. IEEE Int. Symp. Inform. Theory (ISIT), July 2012, pp. 2871–2875.

[36] X. Zhang and P. H. Siegel, “Will the Real Error Floor Please Stand Up?” in Int. Conf. Sig. Proc. Comm. (SPCOM), July 2012, pp. 1–5.

[37] “IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2008, pp. 1–70, August 2008.

[38] L. Lan et al., “Construction of Quasi-Cyclic LDPC Codes for AWGN and Binary Erasure Channels: A Finite Field Approach,” IEEE Trans. Information Theory, vol. 53, no. 7, pp. 2429–2458, July 2007.

[39] L. Zhang et al., “Quasi-Cyclic LDPC Codes: An Algebraic Construction, Rank Analysis, and Codes on Latin Squares,” IEEE Trans. Commun., vol. 58, no. 11, pp. 3126–3139, November 2010.

[40] X.-Y. Hu, E. Eleftheriou, and D. M. Arnold, “Progressive Edge-Growth Tanner Graphs,” in IEEE Glob. Telecommun. Conf. (GLOBECOM), Nov. 2001, pp. 995–1001.

[41] C. Y. Chen, Q. Huang, C. C. Chao, and S. Lin, “Two low-complexity reliability-based message-passing algorithms for decoding non-binary LDPC codes,” IEEE Trans. Commun., vol. 58, no. 11, pp. 3140–3147, November 2010.

[42] M. P. C. Fossorier, “Quasicyclic low-density parity-check codes from circulant permutation matrices,” IEEE Trans. Information Theory, vol. 50, no. 8, pp. 1788–1793, Aug. 2004.

[43] J. Chen and M. P. C. Fossorier, “Density evolution for two improved BP-based decoding algorithms of LDPC codes,” IEEE Commun. Lett., vol. 6, no. 5, pp. 208–210, May 2002.

[44] J. Zhao, F. Zarkeshvari, and A. H. Banihashemi, “On implementation of min-sum algorithm and its modifications for decoding low-density parity-check (LDPC) codes,” IEEE Trans. Commun., vol. 53, no. 4, pp. 549–554, April 2005.

[45] J. Lin and Z. Yan, “An Efficient Fully Parallel Decoder Architecture for Nonbinary LDPC Codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2649–2660, Dec. 2014.

[46] B. Amiri, J. A. F. Castro, and L. Dolecek, “Design of non-binary quasi-cyclic LDPC codes by absorbing set removal,” in IEEE Information Theory Workshop (ITW), Nov. 2014, pp. 461–465.

[47] A. Hareedy et al., “Non-Binary LDPC Code Optimization for Partial-Response Channels,” in IEEE Glob. Commun. Conf. (GLOBECOM), Dec. 2015, pp. 1–6.

[48] A. Hareedy, C. Lanka, and L. Dolecek, “A General Non-Binary LDPC Code Optimization Framework Suitable for Dense Flash Memory and Magnetic Storage,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 9, pp. 2402–2415, Sept. 2016.

[49] P. L’Ecuyer, “Maximally Equidistributed Combined Tausworthe Generators,” Mathematics of Computation, vol. 65, pp. 203–213, 1996.

[50] “Synphony Model Compiler.” [Online]. Available: http://www.synopsys.com/Tools/Implementation/FPGAImplementation/Pages/synphony-model-compiler.aspx

[51] “ROACH.” [Online]. Available: https://casper.berkeley.edu/wiki/ROACH

[52] “Virtex-5 Family Overview.” [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf

[53] “Collaboration for Astronomy Signal Processing and Engineering Research (CASPER).” [Online]. Available: https://casper.berkeley.edu

[54] “KATCP.” [Online]. Available: https://casper.berkeley.edu/wiki/KATCP