
Implementation of Raptor Code on GPU

Linjia Hu¹, Todor Mladenov², Saeid Noshabadi¹

¹Department of Computer Science, Michigan Technological University, MI, USA
²Department of Information and Communication and Department of Nanobio Materials and Electronics, Gwangju Institute of Science and Technology, South Korea

Raptor Code principle
Raptor codes are an improvement on LT codes: they perform as close as possible to the Shannon channel limit while providing linear encoding and decoding time. They have been chosen as the forward error correction (FEC) scheme in the 3GPP and DVB-H standards. The block diagram below shows the systematic Raptor encoder and decoder. On the encoding side, d denotes the input vector of the Raptor encoder; it contains K source symbols and (S + H) zero symbols.

d[0:L-1] = [z^T t^T]^T,  where L = K + S + H

The Code Constraints Processor multiplies d by the inverse of the preprocessor matrix A to produce the intermediate symbols c:

c[0:L-1] = A^{-1}_{L×L} · d[0:L-1]

The LT Encoder can then generate any number of encoded symbols e, such that

G_LT · c = e[0:N-1]

On the decoding side, the positions of the Code Constraints Processor and the LT Encoder are exchanged:

c[0:L-1] = [z^T e'^T] · A^{-1}_{M×L}   and   t[0:K-1] = G_LT · c[0:L-1]

where M = N' + S + H and N' is the number of received encoded symbols.

Matrix Inversion on CPU
Code profiling of the Raptor decoder shows that the inversion of the preprocessor matrix A accounts for more than 90% of the decoding time, so we concentrate on optimizing the matrix inversion algorithm. The most common matrix inversion algorithm is Gaussian Elimination (GE), so we first implement GE over the Galois field GF(2) on the CPU, as follows.
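The original listing did not survive in this transcript; the following is a minimal sketch of GE over GF(2), with forward and backward elimination merged into one pass per column (Gauss-Jordan style), assuming a dense byte-per-element matrix A and a symbol table D stored row by row. Function and variable names are illustrative, not the authors' code.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// XOR symbol row 'src' into symbol row 'dst' (the GF(2) row operation on D).
static void xor_symbol_rows(uint8_t* dst, const uint8_t* src, size_t sym_len) {
    for (size_t b = 0; b < sym_len; ++b) dst[b] ^= src[b];
}

// Reduces A to the identity; D then holds A^{-1} applied to the original D.
// Returns false if A is singular over GF(2).
bool gf2_gaussian_elimination(std::vector<uint8_t>& A, std::vector<uint8_t>& D,
                              size_t L, size_t sym_len) {
    for (size_t j = 0; j < L; ++j) {
        // Find a pivot row with A[k][j] == 1 at or below row j.
        size_t pivot = j;
        while (pivot < L && A[pivot * L + j] == 0) ++pivot;
        if (pivot == L) return false;                       // not invertible
        if (pivot != j) {                                   // row exchange in A and D
            for (size_t c = 0; c < L; ++c) std::swap(A[j * L + c], A[pivot * L + c]);
            for (size_t b = 0; b < sym_len; ++b)
                std::swap(D[j * sym_len + b], D[pivot * sym_len + b]);
        }
        // Eliminate column j from every other row, in both A and D.
        for (size_t i = 0; i < L; ++i) {
            if (i != j && A[i * L + j] == 1) {
                for (size_t c = 0; c < L; ++c) A[i * L + c] ^= A[j * L + c];
                xor_symbol_rows(&D[i * sym_len], &D[j * sym_len], sym_len);
            }
        }
    }
    return true;
}
```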

References
[1] T. Mladenov, S. Nooshabadi, and K. Kim, "Implementation and Evaluation of Raptor Codes on Embedded Systems," IEEE Transactions on Computers, Dec. 2011.
[2] T. Mladenov, S. Nooshabadi, and K. Kim, "Strategies for the Design of Raptor Decoding in Broadcast/Multicast Delivery Systems," IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 423-428, May 2010.

Performance
Our test platform uses a 3.20 GHz Intel Core i7 quad-core CPU, a GeForce GTX 570 graphics card with 2.5 GB of video memory, CUDA 4.0, and the Fedora 13 operating system. For large block and symbol sizes, the decoding speedup of the PACKED WORD version approaches 36x, and that of the WORD version approaches 46x.

[Figure: block diagram of the systematic Raptor encoder and decoder. At the encoder, the Code Constraints Processor (built from G_LDPC(1..S), G_Half(1..H), G_LT(1..K), the identity blocks I_S and I_H, and the zero block Z_{S×H}) inverts A to map d to the intermediate symbols c, and the LT Encoder (G_LT(1..N)) maps c to the encoded symbols e sent over the channel. At the decoder, the received symbols e' pass through the Code Constraints Processor (using G_LT(1..N')) and the LT Decoder (G_LT(1..K)) to recover the source symbols t.]


Matrix Inversion on GPU
We implement Raptor codes on the GPU in order to process large block and symbol sizes efficiently. As recommended by the 3GPP and DVB-H standards, the maximum block size is 8192 symbols and the maximum symbol size is 8192 bytes. For the large matrix we use two data-to-memory mappings: "WORD," which stores each one-bit matrix element in its own 32-bit word, and "PACKED WORD," which packs 32 matrix elements into a single 32-bit word.
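As an illustration of the two mappings (a sketch with hypothetical names, not the authors' code), the row XOR at the heart of GE touches 32 times fewer words under the PACKED WORD mapping than under the WORD mapping:

```cpp
#include <cstdint>

// WORD mapping: one 32-bit word holds a single 1-bit matrix element,
// so a row of L elements occupies L words.
void xor_row_word(uint32_t* dst, const uint32_t* src, int L) {
    for (int j = 0; j < L; ++j) dst[j] ^= src[j];      // one word per element
}

// PACKED WORD mapping: 32 elements are packed into one 32-bit word,
// so a row of L elements occupies ceil(L / 32) words.
void xor_row_packed(uint32_t* dst, const uint32_t* src, int L) {
    int words = (L + 31) / 32;
    for (int w = 0; w < words; ++w) dst[w] ^= src[w];  // 32 elements at a time
}

// Reading element j of a row under the PACKED WORD mapping.
inline int get_bit_packed(const uint32_t* row, int j) {
    return (row[j >> 5] >> (j & 31)) & 1;
}
```

The trade-off is shift/mask overhead on every element access (PACKED WORD) versus a 32-times-larger memory footprint and memory traffic (WORD).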



The matrix inversion on the GPU is organized into six kernels:

Step 1 (kernel 1): find the pivot row for the forward reduction. If A[i, j] == 1, go to Step 3; otherwise search column j for a pivot row, record its row index for the row exchange, and go to Step 2.

Step 2 (kernel 2): exchange row i and row k in both A and D for the forward reduction. One thread processes one pair of elements.

Step 3 (kernel 3): extract the current column and copy it to the host; the host calculates the number of rows that need reduction, and their row indexes, for the forward reduction.

Step 4 (kernel 4): forward reduction from row i to all the rows that need it, in both matrix A and vector D. Row i is sent to the shared memory of each thread block performing the forward reduction.

Step 5 (kernel 5): extract the current column j and copy it to the host. The host calculates the number of rows that need to be substituted and their row indexes.

Step 6 (kernel 6): backward substitution from row i to all the rows above it that need it. Only vector D is updated; the final updated vector D is the solution.
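A minimal CUDA sketch of the forward-reduction step (kernel 4 above), assuming the WORD mapping and a host-built list of target row indexes; the kernel name, parameters, and launch configuration are illustrative, not the authors' implementation:

```cuda
#include <cstdint>

// Forward reduction (Step 4): each thread block handles one target row and
// XORs the pivot row into it, in both matrix A (WORD mapping: one 32-bit
// word per element) and the symbol table D (sym_len bytes per row).
// Launch: forward_reduce<<<num_target_rows, 256, L * sizeof(uint32_t)>>>(...)
__global__ void forward_reduce(uint32_t* A, uint8_t* D, const int* target_rows,
                               int pivot_row, int L, int sym_len) {
    extern __shared__ uint32_t pivot[];        // pivot row of A staged in shared memory
    int target = target_rows[blockIdx.x];      // row reduced by this block

    // Stage the pivot row of A into shared memory, strided across the block.
    for (int j = threadIdx.x; j < L; j += blockDim.x)
        pivot[j] = A[(size_t)pivot_row * L + j];
    __syncthreads();

    // XOR the pivot row into the target row of A.
    for (int j = threadIdx.x; j < L; j += blockDim.x)
        A[(size_t)target * L + j] ^= pivot[j];

    // XOR the corresponding symbol rows of D.
    for (int b = threadIdx.x; b < sym_len; b += blockDim.x)
        D[(size_t)target * sym_len + b] ^= D[(size_t)pivot_row * sym_len + b];
}
```

Under the PACKED WORD mapping, a row of A shrinks to ceil(L/32) words, so the same kernel structure applies with proportionally less shared memory and global traffic.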

Speedup (CPU/GPU) of decoding with large block size (K = 8192):

Symbol size (bytes)     256     512    1024    2048    4096    8192
WORD version          16.25   16.93   18.38   22.68   32.85   46.27
PACKED WORD version    1.53    2.81    5.39   10.30   20.28   36.07