Page 1

A near real time decoding for LDPC based distributed video coding using CUDA

(Original Chinese title: CUDA 架構下針對低密度奇偶校驗碼為基礎之分散式編碼的近即時解碼設計)

Su, Tse-Chung (蘇則仲)

Advisor: Prof. Wu, Ja-Ling (吳家麟)

CMLab, CSIE, NTU

2011/6/9

Page 2

Outline

Motivation and Introduction
LDPC decoding & LDPCA in DVC
Parallel LDPCA Decoding in CUDA
Early Stop Detection Mechanism Using CUDA
Evaluation of Decoding Speed
Conclusions and Future Work

Page 3

Conventional Video Codec

MPEG-2, H.264, HEVC (H.265)

Encoder: heavyweight. Decoder: lightweight.

Page 4

Distributed Video Coding (DVC)

A new paradigm for video compression

Encoder: lightweight. Decoder: heavyweight.

Page 5

Application of DVC

Video conferencing with mobile devices

[Diagram: a DVC encoder (low complexity) sends a DVC encoded bitstream to a DVC to H.264 transcoder running on cloud computational resources; the transcoder sends an H.264 encoded bitstream to an H.264 decoder (low complexity). The diagram is labelled "Realtime system".]

Page 6

Distributed Video Coding

[Diagram labels: Channel Encoder, Channel Decoder, LDPC Encoder, LDPC Decoder]

D. Varodayan, A. Aaron, and B. Girod, "Rate-Adaptive Codes for Distributed Source Coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.

Page 7

Decoding Complexity of DVC

Our DVC codec (state-of-the-art), parallelized with OpenMP and CUDA on 12 cores + GPGPU (Fermi): ~1 FPS

Decoder: heavyweight

Page 8

Amdahl's law

The maximum speedup is reached by improving the most critical part of the system: LDPC decoding in the DVC decoder.

[Pie charts, LDPCA vs. others (QCIF): 86%~94% in the first chart; 29%~36% in the second chart, at 15.39 FPS]
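As a rough worked bound, and assuming the 86%~94% figure on this slide is the fraction p of decoding time spent in LDPCA, Amdahl's law limits the overall decoder speedup S when LDPCA alone is accelerated by a factor s. The symbols p, s, and S are introduced here for illustration and do not appear on the slide.

```latex
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}

p = 0.86 \;\Rightarrow\; S_{\max} = \frac{1}{0.14} \approx 7.1, \qquad
p = 0.94 \;\Rightarrow\; S_{\max} = \frac{1}{0.06} \approx 16.7
```

Under this reading, no amount of LDPCA acceleration alone can push the whole decoder beyond roughly 7x to 17x, which is why the remaining stages start to matter once LDPCA is fast.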

Page 9

Outline

Motivation and Introduction
LDPC decoding & LDPCA in DVC
Parallel LDPCA Decoding in CUDA
Early Stop Detection Mechanism Using CUDA
Evaluation of Decoding Speed
Conclusions and Future Work

Page 10

LDPC decoding: Sum-Product Algorithm (Message Passing)

[Figure: factor graph for message passing. Seven variable nodes (1 to 7) take the side information (real numbers) as soft inputs a to g. Three check nodes (labelled 甲, 乙, 丙 on the slide, i.e. first, second, third) take the syndrome bits 0, 1, 1 from the DVC encoder. Messages alternate between vertical processing and horizontal processing, from a1 ... g1 in the first iteration to a25 ... g25. The decode output is a hard decision on the sign: + gives 0, - gives 1.]

Kschischang, F.R., Frey, B.J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory.

Page 11

Sum-Product Algorithm: Vertical Processing

[Figure: message matrix with rows A B C D E, F G H I J, and K L M N O; syndrome bits 0, 1, 1; soft inputs a to g; updated messages P and Z]

P = K + F + a

Z = F + P + a
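The vertical-processing rule P = K + F + a can be sketched on the host as follows. This is a minimal C++ illustration, not the author's kernel: the function names are hypothetical, the column degree of 3 is taken from the later kernel slide, and the hard-decision mapping (positive to 0, negative to 1) follows the sum-product overview slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Column degree of the regular code described on the kernel slide.
constexpr std::size_t kColDeg = 3;

// Hypothetical helper: given the soft input (intrinsic value) of one variable
// node and the three messages arriving from its check nodes, return the three
// outgoing messages. Each output leaves out the message that arrived on the
// same edge, e.g. P = K + F + a leaves out A.
std::array<float, kColDeg> vertical_update(float intrinsic,
                                           const std::array<float, kColDeg>& in) {
    float total = intrinsic;
    for (float m : in) total += m;
    std::array<float, kColDeg> out{};
    for (std::size_t e = 0; e < kColDeg; ++e) out[e] = total - in[e];
    return out;
}

// Hard decision on the full sum: non-negative maps to bit 0, negative to bit 1.
int hard_decision(float intrinsic, const std::array<float, kColDeg>& in) {
    float total = intrinsic;
    for (float m : in) total += m;
    return total < 0.0f ? 1 : 0;
}
```

Calling `vertical_update(a, {A, F, K})` returns the replacements for A, F, and K in that order; computing the total once and subtracting keeps the work per variable node linear in the column degree.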

Page 12

Sum-Product Algorithm: Horizontal Processing

[Figure: message matrix with rows P Q R S T, U V W X Y, and Z A B C D; syndrome bits 0, 1, 1; soft inputs a to g; updated messages H and K]

$H_{\mathrm{mag}} = \varphi\big(\varphi(|Q|) + \varphi(|R|) + \varphi(|S|) + \varphi(|T|)\big)$

$K_{\mathrm{mag}} = \varphi\big(\varphi(|P|) + \varphi(|Q|) + \varphi(|R|) + \varphi(|T|)\big)$
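The magnitude part of horizontal processing can be sketched on the host as below. The slide does not define φ, so this sketch assumes the usual sum-product choice φ(x) = -ln(tanh(x/2)); the function names are hypothetical, and the sign part of each message is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed definition (not given on the slide): phi(x) = -ln(tanh(x / 2)), x > 0.
// The input is clamped away from 0 so the result stays finite.
double phi(double x) {
    x = std::max(x, 1e-12);
    return -std::log(std::tanh(0.5 * x));
}

// Hypothetical helper: magnitudes of the messages leaving one check node.
// Output i combines every incoming magnitude except the one on edge i,
// in the same way that H_mag leaves out |P|.
std::vector<double> horizontal_magnitudes(const std::vector<double>& in) {
    assert(in.size() >= 2);
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < in.size(); ++j)
            if (j != i) sum += phi(std::fabs(in[j]));
        out[i] = phi(sum);
    }
    return out;
}
```

With this definition φ is its own inverse, so a check node with only two edges simply swaps the two magnitudes; with more edges each output is smaller than the smallest of the other inputs.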

Page 13

LDPC Accumulate (LDPCA) codes

Rate adaptivity

D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Section on Distributed Source Coding, 2006.

Page 14

[Figure label: 65 LDPC codes]

Page 15

Outline

Motivation and Introduction
LDPC decoding & LDPCA in DVC
Parallel LDPCA Decoding in CUDA (Kernel Design)
Early Stop Detection Mechanism Using CUDA
Evaluation of Decoding Speed
Conclusions and Future Work

Page 16

Previous CUDA implementation

Vertical Processing Kernel (VPK)
Column degree is constant 3 (regular LDPC)
Shared memory

Horizontal Processing Kernel (HPK)
Each message can be updated by one thread (SIMD)
Variable row degree in each LDPC code
Data structure: circular linked list

[Figure: CUDA thread blocks 0 and 1, each with its own shared memory. Message data: A B C D E | F G H | I J K L at positions 0 to 11. Index data: 1 2 3 4 0 6 7 5 9 10 11 8. CUDA thread 3 is highlighted.]

Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for LDPC based distributed video coding. In Proc. of ACM International Conference on Multimedia.
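The circular linked list in the index data can be walked as in the following host-side sketch. It uses the index values shown on this slide; the helper names are hypothetical, and a plain sum stands in for the φ-based update so that only the traversal is illustrated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index data from the slide: entry i holds the position of the next message
// in the same row, and the last message of a row points back to the first.
const std::vector<int> kNext = {1, 2, 3, 4, 0, 6, 7, 5, 9, 10, 11, 8};

// Hypothetical helper emulating one thread of the HPK: starting after its own
// message, follow the links once around the row and accumulate every other
// message. The row degree never has to be stored explicitly.
float sum_of_others(const std::vector<float>& msg, const std::vector<int>& next,
                    int self) {
    assert(msg.size() == next.size());
    float sum = 0.0f;
    for (int j = next[self]; j != self; j = next[j]) sum += msg[j];
    return sum;
}

// Number of messages in the row containing `self`, found by walking the ring.
int row_degree(const std::vector<int>& next, int self) {
    int deg = 1;
    for (int j = next[self]; j != self; j = next[j]) ++deg;
    return deg;
}
```

For thread 3 the walk visits positions 4, 0, 1, 2 and then stops, so rows of degree 5, 3, and 4 are all handled by the same loop, at the cost of a data-dependent trip count per thread.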

Page 17

CUDA Implementation Strategy 1

Reduction of Φ Function in HPK
Texture memory in VPK

Page 18

Texture Binding in VPK

[Figure: thread t0 reads messages A ... L from global memory with a non-coalescing access pattern; the global memory is bound to a texture.]

Speedup on both 1.x and 2.x compute capability

Page 19

LDPCA decoding time in previous CUDA implementation

Decoding time for 100 iterations:

LDPC (n, m)     HPK        VPK
(1584, 48)      8.29 ms    1.49 ms
(1584, 192)     3.40 ms    1.52 ms
(1584, 336)     3.04 ms    1.53 ms
(1584, 480)     2.31 ms    1.55 ms
(1584, 624)     2.29 ms    1.54 ms
(1584, 768)     2.00 ms    1.52 ms
(1584, 912)     1.82 ms    1.52 ms
(1584, 1056)    1.81 ms    1.52 ms
(1584, 1200)    1.79 ms    1.50 ms
(1584, 1344)    1.79 ms    1.51 ms
(1584, 1488)    1.78 ms    1.56 ms
……

Page 20

Reduction of Φ Function in HPK

[Figure: global memory holds index data 1 2 3 4 0 6 7 5 9 10 11 8 and message data A ... L. Threads t0 to t7 of CUDA thread block 0 copy indices 1 2 3 4 0 6 7 5 and messages A ... H to shared memory.]

Copy to shared memory

$t_0:\ \varphi\big(\varphi(|B|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_1:\ \varphi\big(\varphi(|A|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_2:\ \varphi\big(\varphi(|A|) + \varphi(|B|) + \varphi(|D|) + \varphi(|E|)\big)$

Page 21

Reduction of Φ Function in HPK

[Figure: global memory holds index data 1 2 3 4 0 6 7 5 9 10 11 8 and message data A ... L. Threads t0 to t7 of CUDA thread block 0 store indices 1 2 3 4 0 6 7 5 and the values $\varphi(|A|)$, $\varphi(|B|)$, ..., $\varphi(|H|)$ in shared memory.]

Calculate functions before copying to shared memory

$t_0:\ \varphi\big(\varphi(|B|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_1:\ \varphi\big(\varphi(|A|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_2:\ \varphi\big(\varphi(|A|) + \varphi(|B|) + \varphi(|D|) + \varphi(|E|)\big)$

Number of φ(x): row degree

2

Page 22

LDPCA Performance -- foreman sequence (QCIF)

Step                                     LDPCA time    Step speedup    Cumulative speedup
Previous implementation                  124.47 sec
Strategy 1: reduce φ, texture memory     52.94 sec     2.35x           2.35x

Page 23

CUDA Implementation Strategy 2

Parallel Partial Reduction in HPK

$t_0:\ \varphi\big(\varphi(|B|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$

Page 24

Parallel Reduction

Values (shared memory):               10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
Step 1, stride 8, thread IDs 0 to 7:   8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 2, stride 4, thread IDs 0 to 3:   8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 3, stride 2, thread IDs 0 to 1:  21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 4, stride 1, thread ID 0:        41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Sequential addressing is conflict free

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology
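The four reduction steps on this slide can be reproduced with a small host-side emulation. This is a sketch in plain C++, not CUDA: the inner loop stands in for the active threads of one block, the end of each outer iteration stands in for a barrier, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the sequential-addressing reduction. In each step the
// first `stride` "threads" add the element `stride` positions away. Reads come
// from the upper half and writes go to the lower half, so the order in which
// the emulated threads run within a step does not matter.
int reduce_sequential_addressing(std::vector<int>& values) {
    const std::size_t n = values.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length assumed
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid)
            values[tid] += values[tid + stride];
    }
    return values[0];
}
```

Run on the 16 values from the slide, the function returns 41 and leaves 20, 13, 13 in positions 1 to 3, matching the last row of the figure.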

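Page 25

Computation Overlapping in HPK

Each thread's sum over the other four terms:

$t_0:\ \varphi\big(\varphi(|B|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_1:\ \varphi\big(\varphi(|A|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_2:\ \varphi\big(\varphi(|A|) + \varphi(|B|) + \varphi(|D|) + \varphi(|E|)\big)$
$t_3:\ \varphi\big(\varphi(|A|) + \varphi(|B|) + \varphi(|C|) + \varphi(|E|)\big)$
$t_4:\ \varphi\big(\varphi(|A|) + \varphi(|B|) + \varphi(|C|) + \varphi(|D|)\big)$

equals the row magnitude minus the thread's own term:

$t_0:\ \varphi\big(\mathrm{Mag} - \varphi(|A|)\big)$
$t_1:\ \varphi\big(\mathrm{Mag} - \varphi(|B|)\big)$
$t_2:\ \varphi\big(\mathrm{Mag} - \varphi(|C|)\big)$
$t_3:\ \varphi\big(\mathrm{Mag} - \varphi(|D|)\big)$
$t_4:\ \varphi\big(\mathrm{Mag} - \varphi(|E|)\big)$

Magnitude: $\mathrm{Mag} = \varphi(|A|) + \varphi(|B|) + \varphi(|C|) + \varphi(|D|) + \varphi(|E|)$, computed once per row by threads t0 to t4 with parallel partial reduction.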
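The equivalence between the two forms on this slide can be checked numerically with the sketch below. It assumes φ(x) = -ln(tanh(x/2)), which the slides do not state, and the function names are hypothetical; the row total is accumulated serially here, whereas the slide computes it with a parallel partial reduction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed definition (not stated on the slides): phi(x) = -ln(tanh(x / 2)), x > 0.
double phi(double x) { return -std::log(std::tanh(0.5 * x)); }

// Direct form: every output re-adds the phi values of all other messages.
std::vector<double> check_update_direct(const std::vector<double>& mags) {
    std::vector<double> out(mags.size());
    for (std::size_t i = 0; i < mags.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < mags.size(); ++j)
            if (j != i) sum += phi(mags[j]);
        out[i] = phi(sum);
    }
    return out;
}

// Overlapped form: compute the row total Mag once, then each "thread"
// subtracts its own term, as in t0: phi(Mag - phi(|A|)).
std::vector<double> check_update_overlapped(const std::vector<double>& mags) {
    assert(mags.size() >= 2);
    std::vector<double> term(mags.size());
    double mag = 0.0;
    for (std::size_t i = 0; i < mags.size(); ++i) {
        assert(mags[i] > 0.0);
        term[i] = phi(mags[i]);
        mag += term[i];
    }
    std::vector<double> out(mags.size());
    for (std::size_t i = 0; i < mags.size(); ++i) out[i] = phi(mag - term[i]);
    return out;
}
```

For a row of degree d the direct form performs d(d - 1) additions, while the overlapped form needs one shared total plus a single subtraction per thread.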

Page 26

Parallel Partial Reduction

[Figure: global memory holds the messages padded to power-of-two rows: A B C D E 0 0 0 | F G H 0 | I J K L. Each element carries an index pair: (8,0) to (8,4) for A to E and (0,0) for the three padding slots (rowDeg = 8); (4,0) to (4,2) for F to H and (0,0) for the padding slot (rowDeg = 4); (4,0) to (4,3) for I to L. Threads mapped to padding are idle.]

Shared memory of CUDA thread block 0, with φ(|A|) = 0.1, φ(|B|) = 0.2, φ(|C|) = 0.4, φ(|D|) = 0.7, φ(|E|) = 0.3, φ(|F|) = 0.1, φ(|G|) = 0.7, φ(|H|) = 0.4:

Initial:  0.1 0.2 0.4 0.7 0.3 0 0 0 | 0.1 0.7 0.4 0
Step 1:   0.4 0.2 0.4 0.7 0.3 0 0 0 | 0.5 0.7 0.4 0
Step 2:   0.8 0.9 0.4 0.7 0.3 0 0 0 | 1.2 0.7 0.4 0
Step 3:   1.7 0.9 0.4 0.7 0.3 0 0 0 | 1.2 0.7 0.4 0

Log(rowDeg) = 3 steps for the first row. Results: Mag0 = 1.7, Mag1 = 1.2.

Page 27

CUDA Implementation Strategy 3

Check Node Re-ordering
Completely Unrolling

Page 28

Check Node Re-ordering

[Figure: before re-ordering, the rows A B C D E (rowDeg = 8), F G H (rowDeg = 4), and I J K L M (rowDeg = 8) follow check-node order 0, 1, 2 and are split across CUDA thread block 0 and CUDA thread block 1. After re-ordering, shared memory holds A B C D E, I J K L M, F G H, so rows with the same rowDeg are adjacent. Variable nodes 0 to 6 keep their order.]
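One way to obtain such an ordering is a stable sort of the check nodes by padded row degree, sketched below. The helper names are hypothetical, and the choice of "largest padded degree first" is an assumption made to match the order A to E, I to M, F to H in the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Smallest power of two that is >= d; rows are padded to this length.
std::size_t padded_degree(std::size_t d) {
    assert(d > 0);
    std::size_t p = 1;
    while (p < d) p <<= 1;
    return p;
}

// Hypothetical helper: return the check-node order that places rows with the
// same padded degree next to each other (largest first), keeping the original
// relative order inside each group.
std::vector<std::size_t> reorder_check_nodes(const std::vector<std::size_t>& row_deg) {
    std::vector<std::size_t> order(row_deg.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return padded_degree(row_deg[a]) > padded_degree(row_deg[b]);
                     });
    return order;
}
```

For the three rows in the figure (5, 3, and 5 messages) the result is the order 0, 2, 1. Grouping equal degrees lets every thread in a block follow the same reduction path.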

Page 29

Completely unrolling

Loop version: redundant if else & __syncthreads(); branch divergence harms performance

    int i = threadIdx.x;
    int half = rowDeg >> 1;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    do {
        if (rowPos < half) {                 // only the lower half of the row adds
            s_mag[i] += s_mag[i + half];
            s_sign[i] ^= s_sign[i + half];
        }
        half >>= 1;
        __syncthreads();
    } while (half);
    int base = i - rowPos;                   // first element of this thread's row
    myMag = s_mag[base] - myMag;             // row total minus own term
    mySign = s_sign[base] ^ mySign;

Unrolled version: no branch divergence

    int i = threadIdx.x;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    if (rowDeg == 16) {
        s_mag[i] += s_mag[i + 8];  s_sign[i] ^= s_sign[i + 8];
        s_mag[i] += s_mag[i + 4];  s_sign[i] ^= s_sign[i + 4];
        s_mag[i] += s_mag[i + 2];  s_sign[i] ^= s_sign[i + 2];
        s_mag[i] += s_mag[i + 1];  s_sign[i] ^= s_sign[i + 1];
    } else if (rowDeg == 8) {
        …..
    }
    int base = i - rowPos;                   // first element of this thread's row
    myMag = s_mag[base] - myMag;             // row total minus own term
    mySign = s_sign[base] ^ mySign;

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology

Page 30

CUDA Implementation Strategy 4

Combination of VPK and HPK

Page 31

Kernel Launch Overhead

1. Parallelism is broken (implicit inter-block synchronization)
2. Extra global memory traffic

[Timeline: separate launches HPK, VPK, HPK, VPK, HPK, VPK; merged kernel: VPK + HPK = UMK]

NVIDIA CUDA Programming Guide (3.2), Section 5.2.1

Page 32

LDPCA Performance -- foreman sequence (QCIF)

Step                                                        Time          Step speedup    Cumulative speedup
Previous implementation                                     124.47 sec
Strategy 1: reduce φ, texture memory                        52.94 sec     2.35x           2.35x
Strategy 2: PPR in HPK                                      40.66 sec     1.30x           3.06x
Strategy 3: Merge HPK & VPK                                 28.80 sec     1.41x           4.32x
Strategy 4: Check Node Re-ordering & Completely Unrolling   22.29 sec     1.29x           5.58x

Page 33

Outline

Motivation and Introduction
LDPC decoding & LDPCA in DVC
Parallel LDPCA Decoding in CUDA
Early Stop Detection Mechanism Using CUDA (CUDA API)
Evaluation of Decoding Speed
Conclusions and Future Work

Page 34

CUDA Implementation Strategy 5

Early Stop Detection

Page 35

Early Stop Detection in Sum-Product Algorithm

[Timeline without early stop: the GPU runs UMK (horizontal processing + vertical processing) for SPA iteration 1, SPA iteration 2, ..., SPA iteration 100.]

[Timeline with early stop: after each UMK the GPU runs the early stop detection kernel (EDK), then a PCI-E transfer transmits the codeword and decoded info to the CPU, which checks the decode info: iter. 1, check iter. 1, iter. 2, check iter. 2, ...]

Terminated at iteration 30:
1. Successfully decoded
2. Converge to wrong codeword
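The slides do not spell out the stopping test. A common choice for syndrome-based decoding, assumed here, is to stop as soon as the hard decisions reproduce every received syndrome bit. The sketch below is host-side C++ with hypothetical names (`CheckNode`, `syndrome_satisfied`, `decode_with_early_stop`); the callback stands in for one SPA iteration on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One check node: the variable nodes it connects to and its received syndrome bit.
struct CheckNode {
    std::vector<std::size_t> vars;
    int syndrome;
};

// Assumed stopping rule: the XOR of the hard decisions over each check node's
// variables must equal the syndrome bit sent by the encoder.
bool syndrome_satisfied(const std::vector<CheckNode>& checks,
                        const std::vector<int>& hard_bits) {
    for (const CheckNode& c : checks) {
        int parity = 0;
        for (std::size_t v : c.vars) {
            assert(v < hard_bits.size());
            parity ^= hard_bits[v] & 1;
        }
        if (parity != c.syndrome) return false;
    }
    return true;
}

// Hypothetical driver: run at most max_iter iterations and check after each one.
// `iterate(it)` performs SPA iteration `it` and returns the current hard decisions.
// Returns the iteration at which decoding stopped.
template <typename Iterate>
int decode_with_early_stop(const std::vector<CheckNode>& checks, int max_iter,
                           Iterate iterate) {
    for (int it = 1; it <= max_iter; ++it)
        if (syndrome_satisfied(checks, iterate(it))) return it;
    return max_iter;
}
```

A test of this kind only confirms that the syndrome is matched, so by itself it cannot tell a successful decode from convergence to a wrong codeword, the two outcomes listed on the slide.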

Page 36

Combination of EDK and UMK

The SPA algorithm is memory intensive in CUDA
The index data of UMK is also used by early stop detection (EDK)
EDK + UMK = EDUMK
14% additional complexity in terms of execution time

[Figure: factor graph with syndrome bits 0, 1, 1 and soft inputs a to g. Timeline: the GPU runs UMK kernels, with a UMK followed by EDK and a PCI-E transfer to the CPU at each check.]

Page 37

Concurrent Kernel Execution and Data Transfer

Ideal Timeline

[Timeline: on the GPU, EDUMK performs early stop detection for iter. 1 and runs UMK for iter. 2, followed by UMK for iter. 3 and later iterations. Meanwhile a PCI-E transfer lets the CPU receive the decode info & codeword for iter. 1. A later EDUMK performs early stop detection for iter. 5 and runs UMK for iter. 6, with a PCI-E transfer of the decode info & codeword for iter. 5, continuing to iter. 9.]

Page 38

Practical CUDA Implementation for Early Stopping Detection

Use 1 CPU thread, 1 GPU
Use CUDA Driver API instead of Runtime API
Nearly no stream management instructions: cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent()

[Timeline: Stream 0, Stream 1, and Stream 2 each run EDUMK followed by UMK kernels and a PCI-E transfer to the host, staggered in time with #overlap = 3; the host performs explicit synchronization.]

Page 39

Speed-up ratio of early stop detection

Total number of LDPCA iterations: 20000 with fixed iterations, 10000 with early stop detection

Theoretical speedup: 2.0x
Actual speedup: 1.8x (10% overhead)

[Chart labels and values, in slide order: Overhead on CPU; 5%; Overhead on GPU using Runtime API; 20%; 7%; Overhead on GPU using Driver API]
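Reading the figures on this slide as 20000 total iterations with a fixed count and 10000 with early stop detection, and assuming the 10% overhead scales the run time of the early-stop version, the two speedups can be related as follows. The symbols are introduced here for illustration only.

```latex
S_{\text{theoretical}} = \frac{20000}{10000} = 2.0, \qquad
S_{\text{actual}} \approx \frac{20000}{10000 \times (1 + 0.10)} \approx 1.82 \approx 1.8
```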

Page 40

LDPCA Performance -- foreman sequence (QCIF)

Step                                                                       Time          Step speedup    Cumulative speedup
Previous implementation                                                    124.47 sec
Strategy 1 (fix 100 iter): reduce φ, texture memory                        52.94 sec     2.35x           2.35x
Strategy 2 (fix 100 iter): PPR in HPK                                      40.66 sec     1.30x           3.06x
Strategy 3 (fix 100 iter): Merge HPK & VPK                                 28.80 sec     1.41x           4.32x
Strategy 4 (fix 100 iter): Check Node Re-ordering & Completely Unrolling   22.29 sec     1.29x           5.58x
Strategy 5 (max 100 iter): Early Stop Detection (Driver API)               10.86 sec     2.02x           11.27x

449.63x faster than sequential program!

Page 41

Outline

Motivation and Introduction
LDPC decoding & LDPCA in DVC
Parallel LDPCA Decoding in CUDA
Early Stop Detection Mechanism Using CUDA
Evaluation of Decoding Speed
Conclusions and Future Work

Page 42

Test condition

CPU: 12 cores, 24 logical processors, Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

GPU: Tesla M2050
14 (MP) x 32 (Cores/MP) = 448 (Cores)
CUDA capability 2.0
Shared memory: 48K
Maximum threads in block: 1024
Concurrent copy and execution
Concurrent kernel execution

Page 43

Test condition

Test sequences, from high to low motion: Soccer, Foreman, Coastguard, Hall Monitor
QCIF, 15Hz, all frames
GOP size: 8
Qindex: 8
Bitrate and PSNR: only luminance component

Page 44

Speedup Ratio of LDPCA decoder Using CUDA

[Chart: decoding speed per test sequence; values paired by their ratios]

Lower speed    Higher speed    Overall speedup
1.14 fps       15.39 FPS       13.5 ↑
0.96 fps       7.14 FPS        7.43 ↑
0.79 fps       4.99 FPS        6.32 ↑
1.05 fps       10.29 FPS       9.8 ↑

LDPCA speedups shown on the chart, in slide order: 15.35 ↑, 22.51 ↑, 12.88 ↑, 36.91 ↑

0.2% bit rate ↑

Page 45

LDPCA decoding time comparison

GPU            100 iteration (QCIF)    50 iteration (QCIF)    100 iteration (CIF)    50 iteration (CIF)
9800GTX        1.93~1.83ms             1.09~1.27ms            3.26~3.34ms            1.87~2.12ms
Tesla T10      1.23~1.26ms             0.67~0.70ms            2.39~2.52ms            1.27~1.34ms
Tesla C2050    0.55~0.60ms             0.29~0.31ms            1.25~1.34ms            0.65~0.69ms

GPU            100 iteration (QCIF)    50 iteration (QCIF)    100 iteration (CIF)    50 iteration (CIF)
GTX260         35ms                    18ms                   46ms                   24ms

                      GeForce 9800 GTX+    Tesla C1060    GeForce GTX260    Tesla C2050
Compute Capability    1.1                  1.3            1.3               2.0
MP x Cores/MP         16x8                 30x8           27x8              14x32

Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of Wyner-Ziv video codec using GPGPU. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.

Page 46

Realtime Decoding Quality

[Figure: four decoded sequences, each shown next to its original sequence, labelled 27.44 dB, 76 kbps; 39.46 dB, 147.64 kbps; 35.34 dB, 263.52 kbps; 29.21 dB, 93.17 kbps]

Page 47

Conclusion

Fully parallelized LDPCA decoder using CUDA with various features

The proposed early stop detection mechanism reduces the latency between the CPU and the GPU

Surveillance-type sequences (e.g. Hall Monitor) can be decoded in real time with negligible RD performance loss

Page 48

Future Work

Bitplane level parallelization for LDPCA
UV component
Frame level parallelization

Vitor Silva

[Figure: message-passing graph with vertical and horizontal processing applied to three sets of soft inputs (a1 ... g1, a2 ... g2, a3 ... g3), each with its own syndrome bits 0, 1, 0 (subscripts 1 to 3)]

Page 49

Thank You