1 University of Michigan

Architecting an LTE Base Station with Graphics Processing Units

Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti+, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor Mudge*

*University of Michigan, Ann Arbor
+Arizona State University, Tempe

SiPS ’13

Oct 17, 2013

Wireless Base Station

Baseband SDR in a Base Station

GOPS Throughput

Technology Evolution

[Figure: peak data rate (Mbps), logarithmic axis from 1.0E-01 to 1.0E+03]

Processor for Wireless Base Station

Baseband processor
• Flexibility: good programmability
• Performance: high computing throughput

Processor for Wireless Base Station

Baseband processor
• Flexibility: good programmability
• Performance: high computing throughput

GPU

Graphics Processing Unit
• High throughput
  - GOPS/TOPS-level peak throughput
• Good programming support
  - CUDA
  - OpenCL
• High efficiency

Processor               GFLOP/dollar   GFLOP/watt
Nvidia GeForce GTX680   6.192          15.848
Intel Xeon E7-8837      0.037          0.656
Intel Itanium 8350      0.007          0.150
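
To make the efficiency gap in the table concrete, the short sketch below divides the GTX680 figures by each CPU's figures. The numbers are copied from the table; the dictionary and function names are illustrative only.

```python
# Figures copied from the efficiency table: (GFLOP/dollar, GFLOP/watt).
EFFICIENCY = {
    "Nvidia GeForce GTX680": (6.192, 15.848),
    "Intel Xeon E7-8837": (0.037, 0.656),
    "Intel Itanium 8350": (0.007, 0.150),
}


def advantage(gpu, cpu):
    """Return how many times better `gpu` is than `cpu` on each metric."""
    gpu_per_dollar, gpu_per_watt = EFFICIENCY[gpu]
    cpu_per_dollar, cpu_per_watt = EFFICIENCY[cpu]
    return gpu_per_dollar / cpu_per_dollar, gpu_per_watt / cpu_per_watt


for cpu in ("Intel Xeon E7-8837", "Intel Itanium 8350"):
    per_dollar, per_watt = advantage("Nvidia GeForce GTX680", cpu)
    print(f"GTX680 vs {cpu}: {per_dollar:.0f}x per dollar, {per_watt:.0f}x per watt")
```

On these figures the GPU's advantage is larger per dollar than per watt for both CPUs.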

GPU Architecture
• SIMT: Single Instruction, Multiple Threads
  - Thousands of cores on a GPU
  - Exploit data-level parallelism

[Figure: SIMT diagram with threads t0 ... tn, data d0 ... dn, instruction fetch unit, ALU, and memory]

GPU Architecture
• SIMT: Single Instruction, Multiple Threads
  - Thousands of cores on a GPU
  - Exploit data-level parallelism
• Multithreading
  - Hide long memory latency

[Figure: three groups of threads t0 ... tn]

Thread Group 0:
    ld R0 [R1+Offset]
    sub R2, R0, #2
    add R0, R2, R3

Dcache miss

GPU Architecture
• SIMT: Single Instruction, Multiple Threads
  - Thousands of cores on a GPU
  - Exploit data-level parallelism
• Multithreading
  - Hide long memory latency

[Figure: three groups of threads t0 ... tn]

Thread Group 1:
    mul R3,R4,R5,R6
    ld R0 [R1+Offset]
    sub R2, R0, #2
    add R0, R2, R3

GPU Architecture
• SIMT: Single Instruction, Multiple Threads
  - Thousands of cores on a GPU
  - Exploit data-level parallelism
• Multithreading
  - Hide long memory latency

GOPS/TOPS-level peak throughput

[Figure: three groups of threads t0 ... tn, labeled SIMT and Multithreading]
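
The latency-hiding idea can be illustrated with a toy issue-slot model. This is a sketch under stated assumptions, not a model of any real GPU scheduler: one instruction issues per cycle, the scheduler picks the lowest-numbered ready thread group, and the 100-cycle Dcache-miss latency is a hypothetical value. The three-instruction program mirrors the `ld`, `sub`, `add` sequence shown for a thread group.

```python
MISS_LATENCY = 100               # assumed Dcache-miss latency in cycles (hypothetical)
PROGRAM = [MISS_LATENCY, 1, 1]   # per-instruction latency: ld (miss), sub, add


def run(num_groups, program):
    """Simulate one issue slot shared by `num_groups` thread groups.

    `program` lists the latency of each instruction in cycles; a group cannot
    issue its next instruction until its previous one has completed.
    Returns (cycles until the last instruction issues, idle issue cycles).
    """
    pc = [0] * num_groups         # next instruction index for each group
    ready_at = [0] * num_groups   # cycle at which each group may issue again
    cycle = idle = 0
    while any(p < len(program) for p in pc):
        for g in range(num_groups):
            if pc[g] < len(program) and ready_at[g] <= cycle:
                ready_at[g] = cycle + program[pc[g]]
                pc[g] += 1
                break
        else:
            idle += 1             # every unfinished group is still waiting
        cycle += 1
    return cycle, idle


for n in (1, 4, 32):
    cycles, idle = run(n, PROGRAM)
    ipc = n * len(PROGRAM) / cycles
    print(f"{n:2d} groups: {cycles} cycles, {idle} idle, {ipc:.2f} instructions/cycle")
```

In this model, adding thread groups barely lengthens the run because the extra loads issue while earlier loads are still outstanding, so instructions per cycle rise with the number of groups.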

GPU Mapping Challenge
• Core underutilization
  - Over 1000 cores on a commercial GPU
• Pipeline stall
  - Long memory access latency

[Figure: three groups of threads t0 ... tn, labeled SIMT and Multithreading]

Our Contribution
• Previous works
  - Mapped only a single kernel onto a GPU
  - Base station studies on traditional platforms, such as DSP, FPGA, etc.
• In this work
  - Key kernel parallelization
  - Kernel runtime performance
  - Minimum number of GPUs needed
  - System power consumption

Outline
• Motivation
• GPU Architecture
• Key Kernels Parallelization
• Experimental Results
• Conclusion

List of Key Kernels
• PHY Layer
  - SC-FDMA demodulation (FFT)
  - Transform decoder (IDFT)
  - Channel estimation
  - MIMO detection
  - Modulation demapper
• Turbo decoder

Parallelism in PHY Layer Kernels

Parallelism        Description                                                 Number of threads
User-level         Process data from different users in parallel               #thread = Nusr
Antenna-level      Process data from different receiver antennae in parallel   #thread = Nant
Symbol-level       Process SC-FDMA symbols in a subframe in parallel           #thread = Nsym
Subcarrier-level   Process each subcarrier in a symbol in parallel             #thread = Nsub
Algorithm-level    Parallelism inherent in each algorithm                      Varies based on kernels
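
One way to combine the user, antenna, symbol, and subcarrier levels is to give each thread a flat index and decompose it into the four coordinates. The sketch below shows that decomposition in plain Python; the ordering (subcarrier varies fastest), the function name, and the example sizes are assumptions for illustration and are not taken from the slides.

```python
def decompose(tid, n_usr, n_ant, n_sym, n_sub):
    """Map a flat thread id to (user, antenna, symbol, subcarrier).

    The subcarrier index varies fastest, an assumed layout.
    """
    total = n_usr * n_ant * n_sym * n_sub
    if not 0 <= tid < total:
        raise ValueError(f"thread id {tid} outside 0..{total - 1}")
    tid, sub = divmod(tid, n_sub)
    tid, sym = divmod(tid, n_sym)
    usr, ant = divmod(tid, n_ant)
    return usr, ant, sym, sub


# Hypothetical sizes: 2 users, 2 receive antennas, 14 symbols, 1200 subcarriers.
print(decompose(1200, 2, 2, 14, 1200))
```

Every flat index maps to exactly one coordinate tuple, so the total thread count is the product of the four sizes.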

Parallelism in PHY Layer Kernels

Kernel                Parallelism                                              Number of threads
FFT/IFFT              User-, antenna-, symbol-, algorithm-level                Nusr × Nant × Nsym × NFFT
Channel estimation    User-, antenna-, subcarrier-level                        Nusr × Nant × Nsub
MIMO detector         User-, symbol-, subcarrier-level                         Nusr × Nsym × Nsub
Modulation demapper   User-, antenna-, symbol-, subcarrier-, algorithm-level   Nusr × Nant × Nsym × Nsub × NMod
                                                                               Nusr × Nant × Nsym × Nsub × log2(NMod)
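
To get a feel for the magnitudes these products imply, the sketch below evaluates each thread-count formula for one set of parameters. The parameter values (1 user, 2 antennas, 14 symbols, 1200 subcarriers, 2048-point FFT, 64QAM) are hypothetical choices for illustration, and the function name is not from the slides.

```python
def thread_counts(n_usr, n_ant, n_sym, n_sub, n_fft, n_mod):
    """Evaluate the per-kernel thread-count formulas for one configuration."""
    if n_mod < 2 or n_mod & (n_mod - 1):
        raise ValueError("n_mod must be a power of two, e.g. 16 or 64")
    bits_per_symbol = n_mod.bit_length() - 1   # log2(n_mod) for powers of two
    base = n_usr * n_ant * n_sym * n_sub
    return {
        "FFT/IFFT": n_usr * n_ant * n_sym * n_fft,
        "Channel estimation": n_usr * n_ant * n_sub,
        "MIMO detector": n_usr * n_sym * n_sub,
        "Modulation demapper (x NMod)": base * n_mod,
        "Modulation demapper (x log2 NMod)": base * bits_per_symbol,
    }


# Hypothetical configuration, chosen only to show the scale of each product.
for kernel, threads in thread_counts(1, 2, 14, 1200, 2048, 64).items():
    print(f"{kernel:36s} {threads:>9,d}")
```

Even for a single user, every kernel in this configuration exposes thousands of threads or more.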

Parallelism in Turbo Decoder*
• Total number of threads

  $N_{\text{thread}} = N_{\text{packet}} \cdot N_{\text{subblock}} \cdot \text{Thread}_{\text{trellis}}$

• Implementation performance tradeoff

Parallelism scheme   Throughput   Latency     Bit error rate
Packet-level         Better       Worse       No change
Subblock-level       Better       No change   Worse
Trellis-level        Better       No change   No change
Subblock+NII         Worse        No change   Better
Subblock+TS          Worse        No change   Better

*Qi Zheng et al., “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing
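
As a small worked instance of the thread-count product, the sketch below multiplies the three factors and also reports the resulting subblock length. All values are illustrative assumptions: 8 threads per trellis corresponds to one thread per state of an 8-state trellis, and the codeword length of 6144 bits is passed in as a parameter rather than derived here.

```python
def turbo_threads(n_packet, n_subblock, threads_per_trellis, codeword_len=6144):
    """Return (total threads, bits per subblock) for one decoder configuration."""
    if codeword_len % n_subblock:
        raise ValueError("codeword length must divide evenly into subblocks")
    total = n_packet * n_subblock * threads_per_trellis
    return total, codeword_len // n_subblock


# Hypothetical configuration: 2 packets, 512 subblocks, 8 threads per trellis.
threads, subblock_len = turbo_threads(2, 512, 8)
print(threads, "threads,", subblock_len, "bits per subblock")
```

More subblocks raise the thread count but shorten each subblock, which is the lever behind the bit-error-rate tradeoff listed for subblock-level parallelism.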

Experimental Setup
• Nvidia GeForce GTX680
  - 8 Streaming Multiprocessors
  - 1536 Streaming Processors
  - 64 KB L1 cache + shared memory
  - 512 KB L2 cache
  - 2 GB DRAM
• GPU runtime measurement
  - CUDA event record
• GPU power measurement
  - GPU-Z

Experimental Setup

Kernel                Configuration
Turbo decoder         Code rate = 1/3, codeword length = 6144, iteration number = 5
Modulation demapper   16QAM/64QAM
SC-FDMA FFT           2048
Decoding IFFT         1200
MIMO                  1x1, 2x2, 4x4

PHY Layer Kernel Runtime
• One LTE subframe on a GPU
• Different antenna configurations

[Figure: PHY layer kernel runtimes (ms) of an LTE subframe, axis from 0 to 2, for antenna configurations 1x1, 2x2, and 4x4; series: FFT, IFFT, MIMO detector, Channel estimation, Modulation demapper-16QAM, Modulation demapper-64QAM]

Turbo Decoder Performance

Trellis-level scheme   Subblock num   Codeword num   Throughput (Mbps)   Worst-case codeword latency (ms)   BER
State-level            512            2              77.6                0.7                                1.6×10^-3
State-level            256            4              78.2                1.7                                4.1×10^-3
State-level For-Back   256            2              78.3                0.7                                4.1×10^-3
State-level For-Back   128            7              80.6                3.1                                2.0×10^-3

Number of needed GPUs
• Meet latency and throughput requirements
  - $\sum_k t_k \le 1\,\text{ms}$ for a subframe
  - $Th_{\text{turbo}} \ge Th_{\text{sys}}$

[Figure: number of GTX680 GPUs (axis from 0 to 10) versus peak data rate (50, 75, 100, 150, 200, 300 Mbps); series: PHY, Turbo, Total]
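
The two requirements translate into a simple sizing calculation, sketched below. This is an idealized model, not necessarily the authors' procedure: it assumes PHY work and turbo throughput both scale evenly with the number of GPUs. The PHY kernel times in the usage example are made-up values; the 77.6 Mbps per-GPU turbo throughput is the first configuration in the deck's turbo decoder results.

```python
from math import ceil


def gpus_for_phy(kernel_times_ms, budget_ms=1.0):
    """GPUs needed so the summed PHY kernel time fits the subframe budget."""
    return max(1, ceil(sum(kernel_times_ms) / budget_ms))


def gpus_for_turbo(system_mbps, turbo_mbps_per_gpu):
    """GPUs needed so aggregate turbo throughput reaches the system rate."""
    return max(1, ceil(system_mbps / turbo_mbps_per_gpu))


# Hypothetical PHY kernel runtimes (ms) for one subframe.
print("PHY GPUs:", gpus_for_phy([0.4, 0.5, 0.3]))

# Turbo GPUs per peak data rate, assuming 77.6 Mbps per GPU.
for rate in (50, 75, 100, 150, 200, 300):
    print(rate, "Mbps ->", gpus_for_turbo(rate, 77.6), "turbo GPU(s)")
```

Under these assumptions the turbo decoder needs a second GPU as soon as the system rate exceeds what one GPU sustains, independently of how many GPUs the PHY kernels need.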

Power and Energy
• Support 75 Mbps
  - Two GTX680 GPUs + one Intel Core 2 CPU
  - Total power is 188 W

[Figure: energy (J/subframe) per kernel: Turbo decoder 144, SC-FDMA FFT 3.4, Decoding IFFT 5.7, Modulation demapper 26.5, Channel estimation 1.2, MIMO detector 1.3]

[Figure: power (W) per kernel, axis from 40 to 65, for the same six kernels]

Conclusion
• Highly parallel GPU implementations of all key kernels
• Kernel runtimes under different configurations
• Up to four GTX680 GPUs needed for ≤150 Mbps
  - Can fit into a motherboard with low latency
• Dual-GPU solution consumes 188 W for 75 Mbps
  - Competitive with a commercial solution

Thanks!

Any questions?

Backup

Our Contribution
• Previous works
  - Mapped only a single LTE kernel onto a GPU
  - Base station studies on traditional platforms, such as DSP, FPGA, etc.
• In this work
  - Parallel implementations of LTE key signal processing kernels on GPU
  - Kernel runtime performance under different system configurations
  - The number of GPUs needed for the baseband subsystem in an LTE base station
  - Power consumption of the GPU-based solution

Parallelism in PHY Layer Kernels
• User-level parallelism
  - #thread = Nusr
• Antenna-level parallelism
  - #thread = Nant
• Symbol-level parallelism
  - #thread = Nsym
• Subcarrier-level parallelism
  - #thread = Nsub
• Algorithm-level parallelism

Number of needed GPUs
• The minimum number of GTX680 GPUs needed for the baseband system of an LTE base station

Peak data rate (Mbps)   PHY GPUs   Turbo GPUs   Total GPUs
50                      1          1            2
75                      1          1            2
100                     1          2            3
150                     2          2            4
200                     2          3            5
300                     5          4            9

Processor for Wireless Base Station

Good programmability

High computing throughput

Processor for Wireless Base Station

High computing throughput

Good programmability

GPU

Power and Energy
• Support 75 Mbps
  - Two GTX680 GPUs + one Intel Core 2 CPU
  - Total power is 188 W

Kernel                Power (W)   Energy (J/subframe)
Turbo decoder         63.3        144.0
SC-FDMA FFT           56.7        3.4
Decoding IFFT         56.9        5.7
Modulation demapper   56.3        26.5
Channel estimation    61.8        1.2
MIMO detector         57.7        1.3
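
A quick way to read the energy column is to compute each kernel's share of the total. The sketch below does only that arithmetic on the values from the table, in the units the table states; the variable names are illustrative.

```python
# Energy per kernel, copied from the table (units as stated there).
ENERGY = {
    "Turbo decoder": 144.0,
    "SC-FDMA FFT": 3.4,
    "Decoding IFFT": 5.7,
    "Modulation demapper": 26.5,
    "Channel estimation": 1.2,
    "MIMO detector": 1.3,
}

total = sum(ENERGY.values())
share = {kernel: energy / total for kernel, energy in ENERGY.items()}

# Print kernels from largest to smallest share of the total energy.
for kernel, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{kernel:20s} {fraction:6.1%}")
```

The turbo decoder accounts for most of the energy even though its power draw is close to that of the other kernels, which points to its longer runtime as the dominant factor.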
