Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

Chen Zhang¹, Peng Li³, Guangyu Sun¹,², Yijin Guan¹, Bingjun Xiao³, Jason Cong¹,²,³
¹Peking University  ²PKU/UCLA Joint Research Institution  ³University of California, Los Angeles


Page 1: (title slide, as above)

Page 2:

Convolutional Neural Network

(Figures: automotive safety; image descriptions @ Stanford)

• Image classification & object detection
• State-of-the-art algorithm: highest accuracy in the Large Scale Visual Recognition Challenge, 2012, 2013, 2014 & 2015
• Widely used in industry & academia

Page 3:

Convolutional Neural Network

(Figure: an input image passes through alternating convolution layers and max-pooling layers, producing successive sets of feature maps, followed by a classifier that outputs the category. Each convolution layer maps input feature maps to output feature maps; max-pooling is optional.)

Page 4:

Convolutional Neural Network

(Figure: the same pipeline as the previous slide.)

Convolutional layers account for over 90% of the computation. [1][2]

[1] A. Krizhevsky et al. ImageNet classification with deep convolutional neural networks. NIPS 2012.
[2] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. ICANN 2014.

Page 5:

Related Work

1. CNP: An FPGA-based processor for convolutional networks. FPL 2009;

2. A massively parallel coprocessor for convolutional neural networks. ASAP 2009;

3. A programmable parallel accelerator for learning and classification. PACT 2010;

4. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. ISCA 2010;

5. Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM 2011;

6. A massively parallel digital learning processor. NIPS 2008;

7. NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision. CVPRW 2011;

8. Accelerating deep neural networks on mobile processor with embedded programmable logic.

(Poster) NIPS 2013;

9. Memory-Centric Accelerator Design for Convolutional Neural Networks. ICCD 2013.

Page 6:

Hardware Computing on FPGA

(Figure: computation resources (PE-1, PE-2, …, PE-n) connected through an interconnect to on-chip buffers; the on-chip memory reaches external memory over the off-chip bus.)

• Challenge: BRAM limits → Solution: loop tiling
• Challenge: resource utilization → Solution: loop unrolling & pipelining
• Challenge: bandwidth limits → Solution: data reuse
• Challenge: co-optimization, matching 'computation' & 'communication' → Solution: roofline model (main contribution)

Page 7:

FPGA Design in the Roofline Model

(Figure: the same FPGA architecture, annotated with the three roofline quantities.)

• Computational performance: GFLOP/s
• Computation-to-communication (CTC) ratio: FLOP / (memory byte accessed)
• Bandwidth: GB/s
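The roofline relation between these three quantities can be written as a one-liner: attainable performance is the computational roof, capped by the bandwidth roof (CTC ratio times bandwidth). A minimal sketch, not the paper's code; the function name and unit choices are mine:

```c
#include <assert.h>

/* Attainable performance under the roofline model.
 * comp_roof: peak computational performance of the design (GFLOP/s)
 * bw_gbs:    peak off-chip bandwidth (GB/s)
 * ctc:       computation-to-communication ratio (FLOP per DRAM byte) */
static double attainable_gflops(double comp_roof, double bw_gbs, double ctc)
{
    double bw_bound = ctc * bw_gbs;               /* bandwidth roofline */
    return bw_bound < comp_roof ? bw_bound : comp_roof;
}
```

A design with a low CTC ratio lands on the slanted bandwidth roof; raising data reuse slides it right until the flat computational roof takes over.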

Page 8:

FPGA Design in the Roofline Model

(Figure: roofline plot of computational performance (GFLOP/s) versus computation-to-communication (CTC) ratio (FLOP per memory byte accessed); a design point sits below the bandwidth line.)

Page 9:

Attainable Performance

(Figure: roofline plot, GFLOP/s versus FLOP per DRAM byte accessed, with the bandwidth roof and computational roof and design points 1, 2, 3 plus points A and B. Designs 2 and 3 have higher attainable performance than design 1 even though their computational performance is lower; timing bars contrast computation versus communication for the two cases.)

Page 10:

Attainable Performance

(Figure: the roofline plot repeated, attainable performance (GFLOPS) versus CTC ratio (FLOP per DRAM byte accessed), with computational roof, bandwidth roof, and design points 1, 2, 3; designs 2 and 3 attain more than design 1.)

Page 11:

Attainable Performance

(Figure: the same roofline plot as the previous slide.)

Optimization target: find a design with maximized attainable performance.

Page 12:

Loop Tiling

1  for(row=0; row<R; row+=Tr) {                        (Tile loop)
2   for(col=0; col<C; col+=Tc) {                       (Tile loop)
3    for(to=0; to<M; to+=Tm) {                         (Tile loop)
4     for(ti=0; ti<N; ti+=Tn) {                        (Tile loop)
5      for(trr=row; trr<min(row+Tr, R); trr++) {       (Point loop)
6       for(tcc=col; tcc<min(col+Tc, C); tcc++) {      (Point loop)
7        for(too=to; too<min(to+Tm, M); too++) {       (Point loop)
8         for(tii=ti; tii<min(ti+Tn, N); tii++) {      (Point loop)
9          for(i=0; i<K; i++) {                        (Point loop)
10          for(j=0; j<K; j++) {                       (Point loop)
             output_fm[too][trr][tcc] +=
               weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
   }}}}}}
  }}}}

Tile loops → off-chip data transfer: memory access optimization
Point loops → on-chip data: computation optimization

The original (untiled) loop nest:

1 for(row=0; row<R; row++) {
2  for(col=0; col<C; col++) {
3   for(to=0; to<M; to++) {
4    for(ti=0; ti<N; ti++) {
5     for(i=0; i<K; i++) {
6      for(j=0; j<K; j++) {
        output_fm[to][row][col] +=
          weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
   }}}}}}

R, C, M, N, K, and S are all configuration parameters of the convolutional layer.

(Figure: a K×K kernel w_ij^l sliding over the input feature map to produce the output feature map.)
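The equivalence of the two loop nests on this slide can be checked directly. Below is a small self-contained C sketch with toy layer sizes (all constants hypothetical, chosen so the tiles do not divide the bounds evenly); integer data keeps the comparison exact regardless of summation order:

```c
#include <assert.h>
#include <string.h>

/* Toy layer shape; tiles deliberately do not divide the bounds evenly. */
enum { R = 6, C = 6, M = 4, N = 3, K = 3, S = 1,
       Tr = 4, Tc = 4, Tm = 3, Tn = 2 };

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int input_fm[N][S*R + K][S*C + K];   /* padded input feature maps */
static int weights[M][N][K][K];

static void conv_untiled(int out[M][R][C]) {
    memset(out, 0, sizeof(int) * M * R * C);
    for (int row = 0; row < R; row++)
      for (int col = 0; col < C; col++)
        for (int to = 0; to < M; to++)
          for (int ti = 0; ti < N; ti++)
            for (int i = 0; i < K; i++)
              for (int j = 0; j < K; j++)
                out[to][row][col] +=
                    weights[to][ti][i][j] * input_fm[ti][S*row + i][S*col + j];
}

static void conv_tiled(int out[M][R][C]) {
    memset(out, 0, sizeof(int) * M * R * C);
    for (int row = 0; row < R; row += Tr)                   /* tile loops  */
     for (int col = 0; col < C; col += Tc)
      for (int to = 0; to < M; to += Tm)
       for (int ti = 0; ti < N; ti += Tn)
        for (int trr = row; trr < MIN(row + Tr, R); trr++)  /* point loops */
         for (int tcc = col; tcc < MIN(col + Tc, C); tcc++)
          for (int too = to; too < MIN(to + Tm, M); too++)
           for (int tii = ti; tii < MIN(ti + Tn, N); tii++)
            for (int i = 0; i < K; i++)
             for (int j = 0; j < K; j++)
               out[too][trr][tcc] +=
                   weights[too][tii][i][j] * input_fm[tii][S*trr + i][S*tcc + j];
}
```

Tiling only reorders the iterations; every (output, input, kernel) index triple is still visited exactly once, so both versions produce identical results.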

Page 13:

(The same tiled loop nest as the previous slide, highlighting the split: tile loops → off-chip data transfer, memory access optimization; point loops → on-chip data, computation optimization.)

Page 14:

Computation Optimization

Original point loops (tile loops omitted):

…
5  for(trr=row; trr<min(row+Tr, R); trr++)       (Point loop)
6   for(tcc=col; tcc<min(col+Tc, C); tcc++)      (Point loop)
7    for(too=to; too<min(to+Tm, M); too++)       (Point loop)
8     for(tii=ti; tii<min(ti+Tn, N); tii++)      (Point loop)
9      for(i=0; i<K; i++) {
10      for(j=0; j<K; j++) {
         output_fm[too][trr][tcc] +=
           weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
   }}}}}}

After loop interchange, with the innermost loops pipelined and unrolled:

5 for(i=0; i<K; i++) {
6  for(j=0; j<K; j++) {
7   for(trr=row; trr<min(row+Tr, R); trr++) {
8    for(tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
9     for(too=to; too<min(to+Tm, M); too++) {
#pragma HLS UNROLL
10     for(tii=ti; tii<min(ti+Tn, N); tii++) {
#pragma HLS UNROLL
        output_fm[too][trr][tcc] +=
          weights[too][tii][i][j]*input_fm[tii][S*trr+i][S*tcc+j];
   }}}}}}

Page 15:

(The same computation-optimized loop nest as the previous slide. Figure: the fully unrolled too/tii loops form a Tm × Tn array of multiply-accumulate units.)

Page 16:

Computational Performance

             Design 1   Design 2   Design 3   Design 4
Output (Tm)      5         10         20         30
Input  (Tn)      5         10         20         15
DSPs           125        500       2000       2250

(Figure: the Tm × Tn compute engine.)
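The table's DSP counts are exactly 5·Tm·Tn (125 = 5·5·5, …, 2250 = 5·30·15), suggesting roughly five DSP slices per floating-point multiply-accumulate unit on this device; the computational roof of a Tm × Tn engine is then 2·Tm·Tn FLOP per cycle. A sketch under those assumptions (the factor 5 is inferred from the table, not stated on the slide):

```c
#include <assert.h>
#include <math.h>

/* Computational roof of a Tm x Tn compute engine: each cycle it performs
 * Tm*Tn multiply-accumulates, i.e. 2*Tm*Tn floating-point operations. */
static double comp_roof_gflops(int tm, int tn, double freq_ghz)
{
    return 2.0 * tm * tn * freq_ghz;
}

/* DSP budget, assuming ~5 DSP slices per single-precision MAC unit
 * (a ratio inferred from the table above, not a universal constant). */
static int dsp_estimate(int tm, int tn)
{
    return 5 * tm * tn;
}
```

At 100 MHz, Design 4 (Tm = 30, Tn = 15) has a roof of 2·450·0.1 = 90 GFLOP/s; whether it is reachable depends on memory access, which the next slides address.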

Page 17:

Computational Performance

(Figure: bar chart of computational performance, 0-70 GFLOPS, for the four designs below.)

             Design 1   Design 2   Design 3   Design 4
Output (Tm)      5         10         20         30
Input  (Tn)      5         10         20         15
DSPs           125        500       2000       2250

• Consider memory access! How?

Page 18:

Memory Access Optimization

1 for(row=0; row<R; row+=Tr) {
2  for(col=0; col<C; col+=Tc) {
3   for(to=0; to<M; to+=Tm) {
4    for(ti=0; ti<N; ti+=Tn) {
      // load output feature map
      S: foo(output_fm(to, row, col));
      // store output feature map
     }
   }}}

where statement S stands for the point-loop nest:

5 for(trr=row; trr<min(row+Tr, R); trr++)
6  for(tcc=col; tcc<min(col+Tc, C); tcc++)
7   for(too=to; too<min(to+Tm, M); too++)
8    for(tii=ti; tii<min(ti+Tn, N); tii++)
9     for(i=0; i<K; i++) {
10     for(j=0; j<K; j++) {
        output_fm[too][trr][tcc] +=
          weights[too][tii][i][j]*
          input_fm[tii][S*trr+i][S*tcc+j];
   }}}}}}

Page 19:

Memory Access Optimization

Before: the output tile is loaded and stored inside the ti loop.

1 for(row=0; row<R; row+=Tr) {
2  for(col=0; col<C; col+=Tc) {
3   for(to=0; to<M; to+=Tm) {
4    for(ti=0; ti<N; ti+=Tn) {
      // load output feature map
      S: foo(output_fm(to, row, col));
      // store output feature map
     }
   }}}

After: partial sums are accumulated on chip, and the load/store move out of the ti loop.

1 for(row=0; row<R; row+=Tr) {
2  for(col=0; col<C; col+=Tc) {
3   for(to=0; to<M; to+=Tm) {
     // load output feature map
4    for(ti=0; ti<N; ti+=Tn) {
      S: foo(output_fm(to, row, col));
     }
     // store output feature map
   }}}
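The payoff of moving the store out of the ti loop can be counted. A sketch with illustrative word-count traffic counters (my simplification, not the paper's full model):

```c
#include <assert.h>

/* Output-feature-map DRAM traffic (in words) for one conv layer.
 * Naive schedule: the output tile is loaded and stored on every trip of
 * the ti (input-channel tile) loop. Reuse schedule: partial sums stay in
 * on-chip buffers, and each output element is stored exactly once. */
static long out_words_naive(long m, long r, long c, long n, long tn)
{
    long ti_trips = (n + tn - 1) / tn;      /* iterations of the ti loop */
    return 2 * ti_trips * m * r * c;        /* load + store per trip */
}

static long out_words_reused(long m, long r, long c)
{
    return m * r * c;                       /* a single final store */
}
```

For example, with M = 48, R = C = 27, N = 192, Tn = 48, local accumulation cuts output traffic by a factor of 2·(N/Tn) = 8.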

Page 20:

Computation To Communication Ratio

(Figure: roofline plot, computational performance (GFLOP/s) versus CTC ratio (FLOP per DRAM byte accessed), with the four designs from the table below plotted as points 1-4.)

             Design 1   Design 2   Design 3   Design 4
Output (Tm)      5         10         20         30
Input  (Tn)      5         10         20         15
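The x-coordinate of each design point is its CTC ratio: total floating-point operations divided by total DRAM bytes moved. Below is a simplified reconstruction of such a model, assuming 4-byte words, output partial sums kept on chip, and ceil-divided tile-loop trip counts; the buffer-size formulas are my sketch in the spirit of the paper's analysis, not copied from the slide:

```c
#include <assert.h>

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* CTC ratio (FLOP per DRAM byte) of one tiled convolutional layer. */
static double ctc_ratio(long R, long C, long M, long N, long K, long S,
                        long Tr, long Tc, long Tm, long Tn)
{
    double flops = 2.0 * R * C * M * N * K * K;       /* 2 FLOP per MAC */

    /* trip counts of the tile loops */
    long a_in  = ceil_div(M, Tm) * ceil_div(N, Tn)
               * ceil_div(R, Tr) * ceil_div(C, Tc);   /* input & weights */
    long a_out = ceil_div(M, Tm) * ceil_div(R, Tr) * ceil_div(C, Tc);

    /* words moved per trip (on-chip tile/buffer sizes) */
    long b_in  = Tn * (S * Tr + K - S) * (S * Tc + K - S);
    long b_w   = Tm * Tn * K * K;
    long b_out = Tm * Tr * Tc;

    double bytes = 4.0 * ((double)a_in * b_in + (double)a_in * b_w
                          + (double)a_out * b_out);
    return flops / bytes;
}
```

Enlarging the output tile (Tr, Tc) raises the CTC ratio, since input and weight tiles are refetched on fewer tile-loop trips.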

Page 21:

Design Space

• Communication: legal solutions of Tm & Tn: 2097
• × 3 memory access methods
• Total legal solutions: 6291
• (N = 192, N = 128; # of PEs = 450)
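Enumerating a design space of this size is cheap. The sketch below brute-forces (Tm, Tn) under a PE budget and keeps the point with the highest roofline-limited attainable performance; for brevity the CTC ratio is passed in as a constant, whereas in the real flow it is recomputed per design. All numbers are illustrative:

```c
#include <assert.h>
#include <math.h>

struct design { int tm, tn; double gflops; };

/* Brute-force design-space exploration over the unroll factors (Tm, Tn). */
static struct design explore(int m, int n, int pe_budget,
                             double freq_ghz, double bw_gbs,
                             double ctc_flop_per_byte)
{
    struct design best = {0, 0, 0.0};
    for (int tm = 1; tm <= m; tm++)
        for (int tn = 1; tn <= n; tn++) {
            if (tm * tn > pe_budget)
                continue;                            /* PE / DSP constraint */
            double roof = 2.0 * tm * tn * freq_ghz;  /* computational roof */
            double perf = roof;                      /* attainable perf.   */
            if (ctc_flop_per_byte * bw_gbs < perf)
                perf = ctc_flop_per_byte * bw_gbs;   /* bandwidth bound    */
            if (perf > best.gflops)
                best = (struct design){ tm, tn, perf };
        }
    return best;
}
```

With a generous CTC the search saturates the PE budget; with a poor CTC every large design collapses onto the bandwidth roof, so the chosen point changes.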

Page 22:

Design Space Exploration

(Figure: attainable performance (GFLOPS, 0-120) versus CTC ratio (FLOP per DRAM byte accessed, 0-60); all candidate designs plotted under the computational roof (C) and the bandwidth roof, with points A, A', and D marked.)

Page 23:

System Overview

(Figure: on chip, the accelerator with two data-transfer engines and FIFOs, a MicroBlaze, AXI4-Lite and AXI4 buses, an interrupt controller, a timer, a memory interface controller, and a UART; off chip, main memory over a DDR3 interface. Built with Vivado 2013.4 and Vivado HLS 2013.4 on a VC707 board.)

          Estimated   On-board run   Difference
Layer 1    7.32 ms      7.67 ms        ~5%
Layer 2    5.11 ms      5.35 ms        ~5%
Layer 3    3.42 ms      3.79 ms        ~9%
Layer 4    2.59 ms      2.88 ms        ~10%
Layer 5    1.73 ms      1.93 ms        ~10%
Overall   20.17 ms     21.61 ms        ~7%

Page 24:

Accelerator

(Figure: the same system diagram as the previous slide, highlighting the accelerator block.)

Page 25:

Experimental Results: vs. CPU

(Figure: bar charts of energy and speedup for 1 thread -O3, 16 threads -O3, and the FPGA; energy bars labeled 90× and 20×, speedup bars labeled 18× and 5×.)

CPU:  Xeon E5-2430 (32 nm), 16 cores, 2.2 GHz; gcc 4.7.2 -O3, OpenMP 3.0
FPGA: Virtex-7 485t (28 nm), 448 PEs, 100 MHz; Vivado 2013.4, Vivado HLS 2013.4

Page 26:

Experimental Results: vs. Other FPGAs

          FPL2009    ASAP2009   PACT2010   ISCA2010   ICCD2013   Our Impl.
Device    Virtex 4   Virtex 5   Virtex 5   Virtex 5   Virtex 6   Virtex 7
Freq.     125 MHz    115 MHz    125 MHz    200 MHz    150 MHz    100 MHz
Perf.     5.25 GOPS  6.74 GOPS  7 GOPS     16 GOPS    17 GOPS    61.6 GOPS

Page 27:

Experimental Results: vs. Other FPGAs

(Figure: performance per area (GFLOPS/Slice), normalized to our implementation (1.00), for ASAP2009, FPL2009, PACT2010, ISCA2010, ICCD2013, and our implementation; roughly an ~80% performance/area improvement over the best prior design.)

Page 28:

Experimental Results

(Figure: the design-space roofline plot revisited, attainable performance (GFLOPS, 0-120) versus CTC ratio (0-60), with the computational roof C and the bandwidth roof; point A, where communication time is not covered by computation, versus point A', where communication time is fully covered, illustrated with computation/communication timing bars.)

Page 29:

Conclusions

An accelerator for convolutional neural networks.

Contributions:
• Finding the best design with the roofline model
• An accurate analytical model of computation & communication

Results:
• ~3.5× better performance than other FPGA implementations
• ~80% performance/area improvement
• Validated with an on-board implementation

Page 30:

Thank You

Q & A