
Optimizing Frequency Domain Implementation of CNNs on FPGAs

Hanqing Zeng, Ren Chen, Viktor K. Prasanna

Computer Engineering Technical Report Number CENG-2017-03

Ming Hsieh Department of Electrical Engineering - Systems
University of Southern California
Los Angeles, California 90089-2562

July 2017



Optimizing Frequency Domain Implementation of CNNs on FPGAs

Hanqing Zeng, Ren Chen, Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, California 90089
{zengh, renchen, prasanna}@usc.edu

ABSTRACT

Convolutional neural networks (CNNs) have become popular for many machine learning applications. Recently, accelerators, either in ASIC or on FPGA, have been proposed to achieve low latency for classification as well as to reduce energy consumption. However, with the increasing variety of deep learning models and the proliferation of FPGA devices and families, realizing a high-performance design becomes challenging. In this paper, we propose an algorithm-architecture co-design methodology based on the computational characteristics of CNN models and the features of the underlying hardware to realize high performance designs. To speed up various CNN models, our methodology consists of two levels of optimization for convolution in the frequency domain. (1) At the algorithm level, we reduce the total number of operations. We propose a new technique called Concatenate and Pad (CaP). By applying it with its dual operation, Overlap and Add (OaA), we can match various image sizes to any given FFT size. Our hybrid algorithm based on native FFT, OaA-FFT and CaP-OaA-FFT achieves significant computation reduction for a wide range of CNNs. (2) At the architecture level, our goal is to maximize the utilization of limited on-chip resources. We develop highly efficient hardware building blocks to support our algorithmic optimizations. We develop a simplified performance model which captures the constraints of a given hardware device, such as external bandwidth, logic resources and on-chip memory. The performance model leads to a reduced design space, which is explored using a bounded parameter scan to realize a high performance FPGA design for a CNN. We illustrate our methodology by proposing two FPGA designs for AlexNet, using throughput and latency as performance metrics. Experimental results show that the two designs achieve a throughput of 163.4 GOPS and a latency of 12.4 ms, respectively, on the Intel HARP heterogeneous platform. Our designs significantly improve the state-of-the-art implementations of AlexNet on FPGAs.

KEYWORDS

Convolutional Neural Networks, Fast Fourier Transform, FPGA, Overlap and Add, Optimization

1 INTRODUCTION

Convolutional Neural Networks (CNNs) are gaining increasing popularity in a variety of applications, including computer vision, signal processing and natural language processing [14, 24, 26]. To achieve better accuracy, deeper and larger CNN models are being explored. This leads to a high throughput requirement [19]. To embed CNNs into real time applications, single image classification time is critical. This results in a low latency requirement [4]. Furthermore, with the progress in deep learning, there is increasing diversity in terms of CNN models and accelerators. Therefore, a flexible accelerator design for large scale CNNs that realizes high throughput and low latency becomes a crucial problem to address.

In recent years, many CNN accelerators have been implemented on GPU [3, 13, 22], ASIC [10] and FPGA [20, 25, 27]. Energy efficiency, reconfigurability and unprecedented logic density make FPGAs attractive for high performance CNN processing. Yet difficulties exist in fully making use of such platforms in deep learning applications. The first challenge is to deal with the large problem size. Motivated by the huge amount of operations required by spatial convolution (the convolution that performs a sliding window operation of the kernel over the image), alternatives such as the Winograd transform [15] and frequency domain convolution [18] have been proposed. Recent implementations of these algorithms have shown promising results [5, 28]. The other challenge lies in the large variance of the parameters among different convolution layers, such as the image size, kernel size and number of feature maps. For resource efficiency, the same hardware module is reused for different layers. Yet some hardware modules do not have the flexibility to perform well under the computational requirements of different CNN layers. A design containing such modules is very inefficient, because large amounts of runtime reconfiguration will be required by the many layers in a deep CNN. In addition, the design space of feasible hardware designs over all the layers can be huge. Limited design space exploration in [16] takes as long as 5 hours on two desktops.

In this work, we improve the performance of CNNs running on FPGAs by identifying designs that perform well on all convolution layers. We propose an optimization flow with algorithm-architecture co-design based on convolution in the frequency domain. To deal with the variance of image sizes and kernel sizes, we propose a dual operation of Overlap and Add (OaA) [12], called Concatenate and Pad (CaP). We prove that frequency domain convolution using CaP-OaA performs equally well as the native frequency domain convolution in terms of number of operations. Yet the CaP-OaA approach needs no hardware reconfiguration for different layers, while the native frequency domain convolution incurs significant reconfiguration overhead for every layer. We further propose a hybrid algorithm which combines OaA and CaP-OaA with the native frequency domain convolution. The hybrid algorithm is able to specifically optimize either throughput or latency for a given CNN. To address the variance in the number of feature maps, we go to the architecture level by proposing an efficient design space exploration algorithm. We first develop high performance hardware modules to support our hybrid convolution algorithm, and derive an analytical performance model for our architecture. Our design space exploration algorithm is based on the observation that the optimal configuration of a single layer is a good indication of the optimal configuration of the entire network. Thus, the design space for a complete CNN can be bounded by the local optimum of each individual layer.

The main contributions of this work are:

• We analyze the complexity of frequency domain convolution using Overlap and Add (OaA), specifically in the CNN context. We propose a dual operation of OaA, called Concatenate and Pad (CaP), and design a hybrid algorithm that combines OaA, CaP and the native frequency domain convolution.

• We show that the hybrid algorithm significantly reduces the computation complexity for a wide range of CNN models. Compared with spatial convolution, it results in a 57.5% reduction in the number of operations for AlexNet, without the need for runtime reconfiguration.

• We derive a comprehensive performance model which takes into account details of the CNN and hardware constraints, such as peak bandwidth to external memory and available on-chip memory, logic and DSP resources. We propose a methodology to significantly reduce the design space of this performance model.

• We develop a highly-optimized hardware design with each individual module in the form of parameterized soft IP cores. The building blocks, such as the run-time configurable 2D FFT module, can be easily configured for different CNNs and FPGAs.

• We implement our design on the Intel HARP heterogeneous platform, where the FPGA is responsible for our hybrid algorithm and the CPU performs other light-weight operations such as overlapping and padding.

• We demonstrate our design methodology by developing two designs for AlexNet, one optimized for high throughput and the other for low latency. The designs show a throughput of 163.4 GOPS and a latency of 12.4 ms on the Intel HARP platform [1].

2 BACKGROUND AND RELATED WORK

State-of-the-art convolutional neural networks for image classification can achieve very high accuracy over a thousand input categories [14, 24]. A CNN is formed by stacking various layers together. A typical design includes four types of layers: convolution layers, rectified linear unit (ReLU) layers, pooling layers and fully connected layers. Convolution layers extract features out of the input 2D images. Pooling layers enable the network to gradually shift its focus from local features to global features. ReLU layers add non-linearity into the network. Finally, the partially processed features are fed into fully connected layers to complete the classification.

2.1 Convolution in Frequency Domain

Let the input matrix to a convolution layer be I, whose dimension is f_in × l_img × l_img, and the kernel of the layer be K, whose dimension is f_out × f_in × l_kern × l_kern. The convolution layer performs the convolution operation over the last two dimensions of I and K, and accumulates over the f_in dimension. As a result, the output of the layer is of dimension f_out × l′_img × l′_img.

Spatial convolution basically performs a sliding window operation. Alternatively, a more efficient algorithm computes in the frequency domain [18]. Convolution in the spatial domain is equivalent to the Hadamard product (◦) in the frequency domain. The algorithm is summarized by Equation 1, where F is the Fourier transform and F⁻¹ denotes the inverse Fourier transform:

I ∗ K = F⁻¹(F(I) ◦ F(K))    (1)

There are two noteworthy points. First of all, typical convolution is accompanied by a stride S. Striding is applied directly to the sliding window in spatial convolution, and can be achieved in the frequency domain by subsampling the output matrix after the inverse Fourier transform. Secondly, the I and K matrices need to be padded to the same size before the Fourier transform. This ensures that the Hadamard product is valid. Since the input images to the first few convolution layers are quite large, computing the FFT on the full I matrix is not always preferable. We analyze frequency domain convolution with various image sizes in Section 3.
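As an illustration of Equation 1, the following NumPy sketch checks that the Hadamard product in the frequency domain matches a sliding-window convolution, and shows striding as post-IFFT subsampling. The function names (`freq_conv`, `spatial_conv`) are ours for illustration, not from the paper.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def freq_conv(img, kern):
    """Full linear convolution of img with kern via the frequency domain."""
    h = img.shape[0] + kern.shape[0] - 1  # pad both operands to the output size
    w = img.shape[1] + kern.shape[1] - 1
    # F^{-1}(F(I) o F(K)): element-wise (Hadamard) product, then inverse FFT
    return np.real(ifft2(fft2(img, s=(h, w)) * fft2(kern, s=(h, w))))

def spatial_conv(img, kern):
    """Reference sliding-window (full) convolution."""
    h = img.shape[0] + kern.shape[0] - 1
    w = img.shape[1] + kern.shape[1] - 1
    out = np.zeros((h, w))
    for i in range(kern.shape[0]):
        for j in range(kern.shape[1]):
            out[i:i + img.shape[0], j:j + img.shape[1]] += kern[i, j] * img
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((8, 8))
K = rng.standard_normal((3, 3))
assert np.allclose(freq_conv(I, K), spatial_conv(I, K))

# Striding with S: subsample the output after the inverse transform
S = 2
strided = freq_conv(I, K)[::S, ::S]
```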

2.2 Overlap and Add (OaA)

To avoid operating on a large matrix, the idea is to divide the original image matrix into smaller tiles. After performing the frequency domain operation on each tile with the kernel, the resulting tiles are combined to form the final output [2].

The following describes the basic procedure of computing convolution using the Overlap and Add (OaA) technique. Given an l_img × l_img input image I and an l_kern × l_kern kernel K, we convolve them using N-point 2D FFT units (subject to N > l_kern). First, we divide I into tiles T^in_{i,j} of size l_tile × l_tile (where l_tile + l_kern − 1 = N). Then, after zero padding to size N × N, we can compute the intermediate output tiles using Equation 2. To get the final matrix R, we move the output tiles T^out_{i,j} so that pixel (0, 0) of each tile is overlapped with pixel (i · l_tile, j · l_tile) of R. Each pixel in R is the sum of the corresponding pixels in the R_{i,j}. The operation is summarized by Equation 3.

T^out_{i,j} = F⁻¹(F(T^in_{i,j}) ◦ F(K))    (2)

R[p][q] = Σ_{i,j} T^out_{i,j}[p − i · l_tile][q − j · l_tile],
where 0 ≤ p − i · l_tile < N and 0 ≤ q − j · l_tile < N    (3)

In the above equations, the square brackets [∗][∗] indicate the pixel index within the 2D matrix, and all indices i, j, p, q start from 0.

The Overlap and Add technique, together with convolution in the frequency domain, reduces the complexity of spatial convolution from O(l²_img · l²_kern) to O(l²_img · log l_kern) [12] for a single frame. Further analysis of this complexity in the CNN context is conducted in Section 3.
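A minimal NumPy sketch of the OaA procedure above: tile partitioning, per-tile frequency domain convolution as in Equation 2, and the overlap-add of Equation 3. The function name `oaa_conv` and the edge-tile clipping are our illustrative choices, not from the paper.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def oaa_conv(img, kern, N):
    """Full linear convolution of img with kern using N-point 2D FFTs and OaA."""
    lk = kern.shape[0]
    lt = N - lk + 1                          # tile size: l_tile + l_kern - 1 = N
    li = img.shape[0]
    Fk = fft2(kern, s=(N, N))                # F(K), computed once per kernel
    R = np.zeros((li + lk - 1, li + lk - 1))
    for i in range(0, li, lt):
        for j in range(0, li, lt):
            tile = img[i:i + lt, j:j + lt]                     # T_in_{i,j}
            t_out = np.real(ifft2(fft2(tile, s=(N, N)) * Fk))  # Equation 2
            hh = min(N, R.shape[0] - i)      # clip at the image border
            ww = min(N, R.shape[1] - j)
            R[i:i + hh, j:j + ww] += t_out[:hh, :ww]           # Equation 3
    return R

rng = np.random.default_rng(0)
I = rng.standard_normal((13, 13))            # e.g. a CONV4/CONV5-sized image
K = rng.standard_normal((3, 3))

# reference: native frequency domain convolution at full size
n = 13 + 3 - 1
ref = np.real(ifft2(fft2(I, s=(n, n)) * fft2(K, s=(n, n))))
assert np.allclose(oaa_conv(I, K, N=8), ref)
```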

Table 1: AlexNet Specification

Layer   Image Size   Kernel Size   Input Features
CONV1   224 × 224    11 × 11       3 (RGB)
CONV2   55 × 55      5 × 5         96
CONV3   27 × 27      3 × 3         256
CONV4   13 × 13      3 × 3         384
CONV5   13 × 13      3 × 3         384

2.3 CNN Models

The ImageNet competition in recent years has produced many well-designed CNNs for image classification [14, 24]. Popular designs use multiple convolution layers, which gradually extract features from local to global scale when proceeding to deeper layers. The computation is conducted on data of high dimensions (images: Batch × f_in × l_img × l_img; kernels: f_out × f_in × l_kern × l_kern), where Batch specifies the size for batch processing. From the specifications shown in Table 1, we observe how the network extracts more and more features as images are transformed to lower and lower resolution.

2.4 CNN Accelerators

To speed up the inferencing of large scale CNNs, much effort has been spent with the help of dedicated hardware accelerators. Projects in [16, 23, 25, 27] target spatial convolution. They optimize the hardware by applying techniques such as loop tiling/unrolling and data mapping. Little optimization is done from the computation complexity point of view, and the design space of their designs can be huge. On the other hand, works in [5, 9, 28] attempt to apply the Winograd transform or frequency domain convolution to reduce the number of operations. The promising results shown by them inspire further study into these algorithms. Our work proceeds from [28] to further optimize frequency domain convolution on FPGAs with a novel algorithm-architecture co-design.

3 ALGORITHM LEVEL OPTIMIZATION

Performing convolution in the frequency domain reduces the computation complexity from O(l²_img · l²_kern · f_in · f_out) to O(l²_img · f_in · f_out + l²_img · log l_img · (f_in + f_out)) for inference [18]. This analysis is based on the assumption that the FFT size is equal to the image size and the FFT is conducted on the entire input image. We refer to convolution under such an assumption as the native frequency domain convolution. We should note that the above complexity equation sets the theoretical lower bound under an assumption that is hard to satisfy on realistic hardware. Generally, the variance in image sizes is very large across layers, and the target hardware does not have such high flexibility for FFTs of various sizes. Efficient FFT hardware designs, such as the radix-x based architecture [11], can only support a limited number of distinct FFT sizes (e.g., sizes of x^i) under a reasonable amount of runtime reconfiguration. In addition, even if the hardware is able to support all the FFT sizes required by different layers, using such hardware results in low resource efficiency, since the gap between the maximum and minimum image sizes is very large. The problem of hardware runtime inflexibility is addressed by the OaA technique, as tiling mitigates image scaling across layers. Although a recent FPGA implementation of CNNs using OaA has shown good performance [28], a detailed analysis of this technique specifically in the CNN context is needed to optimize the implementation.

This section proposes two general algorithmic optimizations which reduce the computation complexity of frequency domain convolution at low hardware cost. Starting from the OaA complexity analysis when applied to CNNs, we first propose a hybrid algorithm combining frequency domain convolution with and without OaA. We show that this approach performs well for both large and small images, and requires only small FFT sizes. Then, we further improve the algorithm's flexibility and performance by proposing a dual operation of OaA, which we call Concatenate and Pad (CaP). Due to the limited number of FFT sizes supported in hardware, we propose a second hybrid algorithm combining OaA and CaP-OaA with the native frequency domain convolution. It achieves further computation complexity reduction and hardware simplification without requiring any FFT hardware reconfiguration. The proposed algorithm is a key component in our algorithm-architecture co-design.

3.1 Optimization for Images of Different Scales

Image scaling is common in CNNs. Table 1 shows that, from the first convolution layer to the last, images shrink more than 17 times in AlexNet. The kernel sizes are reduced accordingly. A previous study [28] dealt with images of different scales by partitioning the large images using OaA. We further analyze the computation complexity of the OaA approach when applied to CNNs, and show how FFTs of small sizes can reduce the number of operations significantly for all convolution layers.

By using OaA, we perform FFT, Hadamard product and IFFT on each input tile. Following the notation in Section 2.2, the total number of operations needed to convolve a full image can be calculated as:

O_total = (O_tileFFT + O_tileHadamard + O_tileIFFT + O_tileOaA) · ⌈(l_img + l_kern − 1) / (N − l_kern + 1)⌉²    (4)

where
O_tileFFT = C1 · N² · log N · f_in
O_tileIFFT = C2 · N² · log N · f_out
O_tileHadamard = C2 · N² · f_in · f_out
O_tileOaA = C3 · N · (l_kern − 1) · f_out

and the C∗ are constants for the cost of addition or multiplication. Previous work [12, 28] notes that this equation is asymptotically bounded by O(l²_img · log N), and considers it superior to the native FFT, whose complexity is O(l²_img · log l_img). We argue that this conclusion is valid only for single frame convolution, and does not apply to CNNs once f_in, f_out come into the picture. Most of the time, except for the first layer, f_in, f_out are at least as large as N or l_img, so for the domain of the CNN problem, O_tileFFT and

[Figure 1 plot: giga-operations per convolution layer (layers 1 to 5) for Spatial, FFT-hybd, OaA-16, OaA-32, OaA-64 and OaA-128]

Figure 1: AlexNet: effect of FFT size on the total number of operations. "FFT-hybd" is produced by the hybrid algorithm using native FFT convolution and FFT convolution with OaA.

O_tileIFFT are much less than O_tileHadamard (log N ≪ f_in, f_out). Thus, the complexity of OaA when applied to CNNs should instead be O(l²_img · f_in · f_out), and Equation 4 can be approximated as:

O_approx_total ≈ C2 · f_in · f_out · ((l_img + l_kern − 1) / (1 − (l_kern − 1)/N))²    (5)

The domain of N is l_kern − 1 < N ≤ l_img + l_kern − 1. Note that ∂O/∂N = −C4 · N / (N − l_kern + 1)³. Thus, we observe that: (1) the number of operations decreases as N increases; (2) the benefit of increasing N diminishes once N becomes sufficiently larger than (l_kern − 1). By Observation (1), N should not fall in the left end region of its domain. By Observation (2), N should not be in the right end region: there is little benefit in increasing N to match the image size when l_img is very large. Thus, it can be concluded that the middle ground between (l_kern − 1) and (l_img + l_kern − 1) is the most suitable region for N.

The motivation for OaA is to avoid using large-size FFTs when dealing with large images. Thus, for small images, we can simply apply native frequency domain convolution. The number of operations then equals C2 · f_in · f_out · (N′)², where N′ is the smallest FFT size such that N′ ≥ l_img + l_kern − 1. We can then use a hybrid algorithm that uses frequency domain convolution with OaA as well as the native frequency domain convolution.
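The operation-count model of Equations 4 and 5 can be evaluated numerically to compare FFT sizes for a layer. The function names below and the unit constant C2 = 1 are our assumptions for illustration; only relative costs matter.

```python
C2 = 1.0  # cost constant from Equation 5, set to 1 for relative comparison

def ops_oaa(l_img, l_kern, f_in, f_out, N):
    """Approximate OaA op count (Equation 5); valid for l_kern-1 < N <= l_img+l_kern-1."""
    assert l_kern - 1 < N <= l_img + l_kern - 1
    return C2 * f_in * f_out * ((l_img + l_kern - 1) / (1 - (l_kern - 1) / N)) ** 2

def ops_native(l_img, l_kern, f_in, f_out, fft_sizes):
    """Native frequency domain convolution with the smallest adequate FFT size N'."""
    n_prime = min(s for s in fft_sizes if s >= l_img + l_kern - 1)
    return C2 * f_in * f_out * n_prime ** 2

# AlexNet CONV2-like layer (55x55 image, 5x5 kernel, 96 input feature maps;
# 256 output feature maps assumed from Table 1's CONV3 input)
conv2 = dict(l_img=55, l_kern=5, f_in=96, f_out=256)

# Observation (1): the number of operations decreases as N increases ...
assert ops_oaa(N=32, **conv2) < ops_oaa(N=16, **conv2)

# ... Observation (2): with diminishing returns once N >> l_kern - 1
gain_small = ops_oaa(N=8, **conv2) - ops_oaa(N=16, **conv2)
gain_large = ops_oaa(N=16, **conv2) - ops_oaa(N=32, **conv2)
assert gain_large < gain_small
```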

Figure 1 illustrates our analysis for AlexNet. By using the OaA technique, FFTs with small variance in sizes can process images of different scales quite well. The green line shows the performance of the hybrid frequency domain algorithm. The first two layers use OaA based convolution with 32 and 16 points, and the last three layers use native frequency domain convolution with 32, 16 and 16 points, respectively. Compared with spatial convolution, our approach reduces the number of operations by 50.6%, using only two small FFT sizes of 16 and 32.

This section proposed an optimization that applies OaA in the CNN context. The optimization greatly relieves the reconfiguration requirement for FFT hardware, while preserving the native frequency domain convolution's advantage in terms of computation complexity. In the next section, we show our second optimization, which further reduces the number of operations and completely eliminates the need for hardware reconfiguration.

3.2 Optimization for Exploiting a Limited Number of FFT Sizes

The analysis of Section 3.1, on both OaA based and native FFT, neglects the wasted computation incurred by padding images to match FFT sizes. This may result in sub-optimal implementations due to the limitations of realistic FFT hardware. Without a reasonable amount of reconfiguration, high performance FFT hardware is only able to support a limited number of different FFT sizes. As an example, consider the case of a radix-4 FFT design. The FFT sizes of 4, 16, 64 and 256 need to handle images of any size, ranging from ten to several hundred pixels.

The key problem is the mismatch between image sizes and FFT sizes. Since we have little control over the FFT sizes under the hardware constraints, we address this problem by changing image sizes through input data reshaping.

Algorithm 1: Operations by a convolution layer
1:  procedure ConvLayer
2:      I is the input image array
3:      K is the kernel array
4:      for ib = 1 to Batch do
5:          for iout = 1 to fout do
6:              for iin = 1 to fin do
7:                  T_in[ib][iout][iin] ← I[ib][iin] ∗ K[iout][iin]
8:              end for
9:              T_out[ib][iout] ← Σ_{j=1}^{fin} T_in[ib][iout][j]
10:         end for
11:     end for
12: end procedure

The operation of the CNN can be summarized by Algorithm 1. The operation under concern is the convolution in line 7. For the nested loop, we would like to manipulate one of the Batch, f_out, f_in dimensions so that l_img of image I can be resized. Index ib takes effect on I and is independent of K, so Batch is the appropriate dimension. We refer to the manipulation of the Batch and l_img dimensions as the reshaping of input data.

We propose a dual operation of OaA, Concatenate and Pad (CaP), and apply it together with OaA to gain flexibility in the l_img dimension. We need to ensure that reshaping does not affect the output result of convolution, and that it introduces negligible overhead.

We define the CaP operation as follows. Given a set of d² images I of equal size l_img × l_img, we arrange the images in a d × d mesh, with x pixels of zero value between vertically or horizontally adjacent images. CaP outputs a large image I_CaP by concatenating I and the zero pixels in between; x is defined as the size of the zero padding. Figure 2 illustrates this process. The following theorem shows that, by choosing an appropriate x, we can produce the correct output of convolution.

Theorem 3.1. For a given kernel K of size l_kern × l_kern, the convolution using K on a set of images I is identical to performing convolution using K on I_CaP iff the padding x is at least l_kern − 1.

Proof. The proof of x_min = l_kern − 1 can be done from the perspective of either spatial or frequency domain convolution. We prove


Figure 2: Illustration of the CaP algorithm

the theorem by OaA in the frequency domain, to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘(I, l_tile) outputs a set of tiles T after partitioning I into l_tile × l_tile tiles. Define ⊕_k as the concatenation operation: ⊕_k(I) outputs a large image by concatenating the 2D mesh of I with k pixels of margin between adjacent images; ⊕_{−k}(I) outputs an image by concatenating the 2D mesh of I with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixels along the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations, where T_∗ = I ∗ K denotes the set of per-image convolution results:

I_CaP = ⊕_x(I)
I^x = ⊘(I_CaP, l_img + x)
T = F⁻¹(F((I^x)^{l_kern−1}) ◦ F(K^{x+l_img−1})) = I^x ∗ K = (I ∗ K)^x = T_∗^x
I_CaP ∗ K = ⊕_{−(l_kern−1)}(T) = ⊕_{−(l_kern−1)}(T_∗^x) = ⊕_{x−(l_kern−1)}(T_∗) = ⊕_{x−(l_kern−1)}(I ∗ K)

From the ⊕_{x−(l_kern−1)} operator in the last step, it can be concluded that convolution on I_CaP is identical to convolution on the set of I iff x − (l_kern − 1) ≥ 0. Thus, the minimum value of x to avoid aliasing between adjacent images is (l_kern − 1). □

Based on Theorem 3.1, we can CaP the images from a batch, so that an input of Batch × f_in × f_out × l²_img is reshaped to (Batch/d²) × f_in × f_out × (l^CaP_img)², where l^CaP_img = d · l_img + (d − 1) · (l_kern − 1), and d is referred to as the Batch folding factor. We can then apply the OaA technique on I^CaP to complete the convolution. Abbreviate FFT based convolution with OaA as OaA, and FFT based convolution with CaP and OaA as CaP-OaA. We analyze, for a convolution layer, how much improvement can be brought by CaP-OaA compared with OaA and the native approach. We assume that the hardware supports the following FFT sizes: N = [N_1, N_2, ..., N_m], where N_1 < N_2 < ... < N_m. Theorem 3.2 evaluates the optimal complexity for CaP-OaA. Let K_i = (l_img + l_kern − 1) / (N_i − (l_kern − 1)).

Theorem 3.2. The minimum computation complexity of CaP applied to a CNN is C2 · f_in · f_out · K²_m · N²_m, where C2 is some constant. To achieve such complexity, the Batch folding factor should be a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)).
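A small helper (our naming) computing the Batch folding factor prescribed by Theorem 3.2, which makes d · K_m an integer so that the ceiling in the complexity expression is exact:

```python
from math import gcd

def folding_factor(l_img, l_kern, N_m):
    """Smallest Batch folding factor d from Theorem 3.2:
    d must be a multiple of (N_m - (l_kern-1)) / gcd(l_img + l_kern - 1,
    N_m - (l_kern-1)), so that d * K_m is an integer."""
    a = l_img + l_kern - 1       # numerator of K_m
    b = N_m - (l_kern - 1)       # denominator of K_m
    return b // gcd(a, b)

# e.g. a 55x55 image with a 5x5 kernel and a 32-point FFT (illustrative values)
d = folding_factor(l_img=55, l_kern=5, N_m=32)
# with this d, d * (l_img + l_kern - 1) is divisible by N_m - (l_kern - 1),
# so the ceiling in the per-image op count disappears
assert (d * (55 + 5 - 1)) % (32 - (5 - 1)) == 0
```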

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

O_CaP-OaA = C2 · f_in · f_out · ⌈(l^CaP_img + l_kern − 1) / (N_i − (l_kern − 1))⌉² · N²_i / d²
          = C2 · f_in · f_out · ⌈d · K_i⌉² · N²_i / d²    (6)

The minimum computation complexity can thus be obtained from the following inequality:

O_CaP-OaA ≥ C2 · f_in · f_out · K²_i · N²_i ≥ C2 · f_in · f_out · K²_m · N²_m

We can set d to be a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)) to eliminate the ceiling function, and choose N_i = N_m. Then min O_CaP-OaA = C2 · f_in · f_out · K²_m · N²_m.

Thus, O^min_CaP-OaA = C2 · f_in · f_out · K²_m · N²_m. □

Corollary 3.3. The computation complexity of CaP-OaA is always no higher than that of OaA. The minimum ratio of CaP-OaA's number of operations over OaA's number of operations is max_i (K_m/⌈K_i⌉ · N_m/N_i)².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT if N_l/N_m > K_m (where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern − 1). The minimum ratio of CaP-OaA's number of operations over the native frequency domain convolution's number of operations is (K_m · N_m / N_l)².

The corollaries can be easily proven by directly comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native approach complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern − 1.

Figure 3 shows the effect of including CaP in frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm using the native approach and OaA, and plot under two different FFT constraints. The blue points are under the assumption of N = [4, 16, 64, 256]. The red dots show the performance in the case of more hardware support: N = [4, 8, 16, 32, 64, 128, 256]. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach. Most of the time, for both FFT sequences, the extra flexibility brought by CaP benefits the computation complexity significantly.

One additional benefit of applying CaP-OaA is that it makes a one-size FFT sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the domain constraint of N ≤ l_img + l_kern − 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces computation by 57.5% compared with spatial convolution, using only a one-size FFT of size 32.

[Figure 3 plot: x axis Image Size (0 to 240), y axis Reduction in Computation Complexity by CaP (0.2 to 1.0), with two series: FFT Seq 1 and FFT Seq 2]

Figure 3: Effect of the CaP optimization. The y axis is calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1.

3.3 Discussion on the Various Optimization Techniques

The various optimization techniques discussed so far can be viewed in a unified way. By setting the Batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern − 1, OaA reduces to the native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP. CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can perform optimization by a hybrid algorithm that combines all three techniques: the native approach, OaA and CaP-OaA. The three algorithms can be applied separately on different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs with very little hardware.

We should also note that CaP is based on batch processing. In this case, although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm that uses OaA and the native algorithm can result in a low latency design, while further inclusion of CaP-OaA can result in a high throughput design.
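The per-layer selection described above can be sketched as follows: for each supported FFT size, estimate the dominant (Hadamard) term of Equation 4 for OaA, the ideal CaP-OaA cost of Theorem 3.2, and the native cost, then take the cheapest option. All function names are ours, and constants are dropped since only relative cost matters.

```python
import math

def oaa_cost(l_img, l_kern, f_in, f_out, N):
    """Dominant Hadamard term of Equation 4 for OaA with FFT size N."""
    tiles = math.ceil((l_img + l_kern - 1) / (N - l_kern + 1)) ** 2
    return f_in * f_out * N * N * tiles

def cap_oaa_cost(l_img, l_kern, f_in, f_out, N):
    """Per-image CaP-OaA cost with an ideal folding factor (Theorem 3.2)."""
    K = (l_img + l_kern - 1) / (N - (l_kern - 1))
    return f_in * f_out * (K * N) ** 2

def pick(l_img, l_kern, f_in, f_out, fft_sizes, latency_critical=False):
    """Choose the cheapest of {native, OaA, CaP-OaA} over the supported sizes."""
    cands = {}
    for N in fft_sizes:
        if l_kern - 1 < N <= l_img + l_kern - 1:
            cands[('OaA', N)] = oaa_cost(l_img, l_kern, f_in, f_out, N)
            if not latency_critical:         # CaP batches images across frames
                cands[('CaP-OaA', N)] = cap_oaa_cost(l_img, l_kern, f_in, f_out, N)
        if N >= l_img + l_kern - 1:          # one tile: native convolution
            cands[('native', N)] = f_in * f_out * N * N
    return min(cands.items(), key=lambda kv: kv[1])

# Corollary 3.3: for any given N, CaP-OaA never costs more than OaA
for N in (8, 16, 32):
    assert cap_oaa_cost(55, 5, 96, 256, N) <= oaa_cost(55, 5, 96, 256, N)

choice, cost = pick(55, 5, 96, 256, fft_sizes=[4, 16, 64, 256])
```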

4 SYSTEM ARCHITECTURE

This section gives an overview of the FPGA architecture design for performing the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules to perform the FFT, Hadamard product and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT on an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFT on the N rows of the input matrix. In the second phase, we perform N-point 1D FFT on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^{i−1} is required between stages i and (i+1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the matrix transpose of size N × N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutation using the in-place permutation-in-time algorithm, and only requires size N × N single-port memory.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform. Outputs of the MAC are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator can complete the reduction over the f_in dimension. Data parallelism of P_img · P_kern · N² is required to support the full permutation of the set of P_img image tiles and the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus, the ratio of the input and output rates is f_in/f_out. We can easily verify that the MAC module design is work optimal.

Figure 4: Overall System Design. (The figure shows the CPU and DRAM feeding the FPGA; on the FPGA, image tiles pass through a 2D FFT into the image buffer, kernels enter the kernel buffer, MAC units perform the Hadamard products, and a 2D IFFT produces the outputs.)

Table 2: Summary of Resource Consumption and Throughput

             2D FFT                                                    MAC
Logic        2 · log_x N · (N/q_1DFFT − 1) · (N/q_2DFFT) · P_FFT · L   P_img · P_kern · N² · L
Memory       (2 · 3N/f_2DFFT + N²) · P_FFT · D                         —
Throughput   P_FFT · N² / (q_1DFFT · q_2DFFT)                          P_img · P_kern · N² / f_in

We can also fold the MAC module by q_MAC, so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex-number adder and multiplier; D refers to the memory consumption of one complex number.
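Table 2 translates directly into a small cost model. The functions below are our transcription of those expressions, not released code; L and D are the per-complex-word logic and memory costs just defined:

```python
import math

def fft2d_logic(n, x, q1d, q2d, p_fft, L):
    # 2 * log_x(N) * (N/q_1DFFT - 1) * (N/q_2DFFT) * P_FFT * L
    return 2 * round(math.log(n, x)) * (n // q1d - 1) * (n // q2d) * p_fft * L

def fft2d_memory(n, f2d, p_fft, D):
    # (2 * 3N / f_2DFFT + N^2) * P_FFT * D
    return (2 * 3 * n / f2d + n**2) * p_fft * D

def fft2d_throughput(p_fft, n, q1d, q2d):
    # tiles-per-cycle data rate of the 2D FFT module
    return p_fft * n**2 / (q1d * q2d)

def mac_logic(p_img, p_kern, n, L):
    return p_img * p_kern * n**2 * L

def mac_throughput(p_img, p_kern, n, f_in):
    # one reduced output tile set per f_in cycles
    return p_img * p_kern * n**2 / f_in
```

Such a model is what the design space exploration of Section 5 evaluates for each candidate configuration.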

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm given any CNN and target FPGA. The challenges are twofold. On the one hand, the various kinds of hardware modules as well as the ample resources on chip create a huge design space to explore. On the other hand, since the same hardware is deployed for accelerating all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, we consider the input/output rate requirement of an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l′_img × l′_img, then regardless of the underlying hardware, the required ratio of input to output data rate is approximately f_in/f_out. However, this ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem, and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse or locality can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern of the deeply nested loop is complicated, our frequency domain approach simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution: using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product. It is an element-wise operation on the matrices, meaning that all loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to store the kernels directly in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, leaving a larger hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses an N-point FFT. Then, if we transfer the kernels in the frequency domain instead of the original spatial-domain kernels, the total data communication per kernel increases from O(l²_kern) to O(N²). Depending on the values of N and l_kern and on the hardware bandwidth and logic resources, either kernel data flow can be more suitable; hence the optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image-oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.
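The trade-off between the two kernel data flows reduces to comparing per-layer transfer volumes. A hedged sketch of that comparison (an illustrative helper of our own, not from the paper):

```python
def kernel_traffic(f_in, f_out, l_kern, n, in_frequency_domain):
    """Words transferred for one layer's kernels: O(l_kern^2) per kernel
    in the spatial domain vs. O(N^2) when pre-transformed."""
    per_kernel = n**2 if in_frequency_domain else l_kern**2
    return f_in * f_out * per_kernel

def prefer_frequency_kernels(f_in, f_out, l_kern, n):
    """True when shipping frequency-domain kernels costs no extra traffic
    (it then also frees on-chip kernel-FFT resources for MAC units)."""
    return kernel_traffic(f_in, f_out, l_kern, n, True) <= \
           kernel_traffic(f_in, f_out, l_kern, n, False)

# 3x3 kernels with a 16-point FFT: (16*16)/(3*3) ~ 28x more kernel traffic
print(prefer_frequency_kernels(192, 384, 3, 16))  # False
```

This is exactly why the choice depends on N, l_kern, and the available external bandwidth rather than being fixed once for all layers.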

5.2 Analysis of IO and Computation Cost

The analysis in this section assumes that kernels in the frequency domain are loaded onto the FPGA, so there are no kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. The peak throughputs of the FFT, MAC and IFFT modules are denoted T_FFT, T_MAC and T_IFFT (the equations for them are summarized in Table 2).

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). Based on our data reuse scheme, the amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data to the kernel buffer is D_kern = f_in · f_out · N². The IFFT module produces D_IFFT = (f_out/f_in) · D_img amount of output. The IO time t^IO_round is determined by the largest time needed by D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT to external memory, we deal with D_IFFT later in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as the peak throughput of the subsequent FFT module. This is shown in Equation 7.

B_FFT = min{ [D_FFT / (D_FFT + D_kern)] · (B_sys − B_IFFT), T_FFT }
B_kern = (B_sys − B_IFFT) − B_FFT        (7)

The IO time is calculated as t^IO_round = D_FFT/B_FFT ≥ D_kern/B_kern.
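Equation 7 and the resulting I/O time can be expressed directly in code. The sketch below follows our reading of the equation — a workload-proportional split of the leftover bandwidth, capped by the FFT module's peak throughput:

```python
def split_bandwidth(d_fft, d_kern, b_sys, b_ifft, t_fft):
    """Equation 7: allocate the remaining external bandwidth between
    the FFT input stream and the kernel buffer stream."""
    b_fft = min(d_fft / (d_fft + d_kern) * (b_sys - b_ifft), t_fft)
    b_kern = (b_sys - b_ifft) - b_fft
    return b_fft, b_kern

def io_round_time(d_fft, d_kern, b_fft, b_kern):
    # the round's I/O time is set by the slower of the two streams
    return max(d_fft / b_fft, d_kern / b_kern)

b_fft, b_kern = split_bandwidth(d_fft=100.0, d_kern=300.0,
                                b_sys=10.0, b_ifft=2.0, t_fft=50.0)
print(io_round_time(100.0, 300.0, b_fft, b_kern))  # 50.0
```

When the T_FFT cap is not hit, the proportional split makes both streams finish simultaneously, which is why a single t^IO_round suffices in the model.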

The computation time t^comp_round is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules as well as the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

B_IFFT = min{ D_IFFT / t^IO_round, T_MAC, T_IFFT }        (8)

By using Equation 7 as well as the relation among t^IO_round, D_FFT and D_IFFT, we express Equation 8 in the following way:

B_IFFT = min{ T_MAC, T_IFFT, (f_out/f_in) · T_FFT, [Q/(1+Q)] · B_sys }        (9)

where Q = (f_out/f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on the hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N; we deal with the variance of f_in and f_out in Section 5.3.
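Equation 9 is easy to evaluate numerically for a candidate configuration. The sketch below returns both the achievable bandwidth and the limiting term, which is how the per-layer bottleneck column of Table 5 can be read:

```python
def b_ifft(t_mac, t_ifft, t_fft, b_sys, f_in, f_out, m_img, n):
    """Equation 9: achievable system (IFFT output) bandwidth,
    plus the name of the term that limits it."""
    q = (f_out / f_in) * m_img / (m_img + f_in * f_out * n**2)
    terms = {
        "T_MAC": t_mac,
        "T_IFFT": t_ifft,
        "(f_out/f_in)*T_FFT": (f_out / f_in) * t_fft,
        "Q/(1+Q)*B_sys": q / (1 + q) * b_sys,
    }
    bottleneck = min(terms, key=terms.get)
    return terms[bottleneck], bottleneck
```

A layer whose memory term Q/(1+Q)·B_sys is smallest is external-bandwidth bound; shifting f_in and f_out moves the minimum between terms, which is exactly the cross-layer variance discussed next.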

The computation time for one round is given by t^comp_round = D_IFFT / B_IFFT.

In the above design, the image buffer consuming rate never exceeds the producing rate. Thus the total time of a round is t_round = t^comp_round. For the double buffering to completely hide the memory latency,

min{ T_MAC, T_IFFT } ≤ min{ (f_out/f_in) · T_FFT, [Q/(1+Q)] · B_sys }

has to be satisfied.

For the other kind of design, containing the kernel FFT modules, corresponding changes need to be made to the above equations. For example, D_kern is changed to f_in · f_out · l²_kern. We can derive its performance model using the same idea as the above discussion.

5.3 Design Space Exploration

The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

t_layer = [f_in · (l*_img)² / M_img] · t_round = f_out · (l*_img)² / B_IFFT        (10)

where l*_img is the image size after the padding or batch folding operation of the hybrid algorithm discussed in Section 3.
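The two forms of Equation 10 can be checked against each other numerically. The sketch below computes the layer time as rounds × time-per-round and asserts that it matches the closed form:

```python
def t_layer(f_in, f_out, l_star, m_img, b_ifft):
    """Equation 10: total time of one convolution layer."""
    rounds = f_in * l_star**2 / m_img      # image-buffer refills per layer
    d_ifft = (f_out / f_in) * m_img        # output produced per round
    t_round = d_ifft / b_ifft              # round time (IFFT-bound)
    total = rounds * t_round
    # closed form: f_out * (l*_img)^2 / B_IFFT
    assert abs(total - f_out * l_star**2 / b_ifft) < 1e-9 * total
    return total
```

The buffer size M_img cancels out of the product, which is why Equation 10 depends only on f_out, l*_img and B_IFFT.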

The goal of our architecture optimization is to identify the configuration of the different modules such that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image by all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of whether the metric is throughput or latency, we minimize t_CNN. We formulate an optimization problem as follows:

minimize_H   Σ_{layer i} t^i_layer
subject to   LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
             MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes {N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC}. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove three variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image-oriented data reuse; (3) P_img and P_kern always appear in the form P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which kind of hardware resource is the bottleneck of our design. The reason why we cannot evaluate the min function is that for different convolution layers the bottleneck can be different. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single-layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

B_IFFT = T_MAC = T_IFFT = (f_out/f_in) · T_FFT = [Q/(1+Q)] · B_sys        (11)

Thus the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H′ = {M_FFT, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) · T_FFT and [Q/(1+Q)] · B_sys are all of higher value; in this case T_IFFT will be the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optima of the single layers. The single-layer optima are obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2: Design Space Exploration with Indication
 1: procedure DesignSpaceExploration
 2:   ▷ Conduct design space exploration on a complete CNN
 3:   ▷ OPT stores the optimum for each single layer
 4:   ▷ OPT[i] ∈ H
 5:   OPT[] ← NULL
 6:   for i = 0 to Layer L do
 7:     Perform design space exploration over H′
 8:     OPT[i] ← optimum for layer i
 9:   end for
10:   ▷ Range[d] stores the range of values for dimension d
11:   Range[] ← NULL
12:   for d = 0 to Dimension D do
13:     Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:   end for
15:   Design space exploration of the entire CNN with the design space bounded by Range[]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
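The structure of Algorithm 2 can be sketched compactly in Python (this is our own illustration of the bounding idea, not the authors' C++ implementation):

```python
from itertools import product

def explore(space, cost):
    """Exhaustive search over a dict {param: list of candidate values}."""
    names = list(space)
    return min((dict(zip(names, vals)) for vals in product(*space.values())),
               key=cost)

def bounded_cnn_search(layers, space, layer_cost):
    """Algorithm 2: per-layer optima bound the whole-CNN search."""
    # lines 6-9: optimum of each layer in isolation
    per_layer = [explore(space, lambda cfg, l=l: layer_cost(cfg, l))
                 for l in layers]
    # lines 10-14: Range[d] = [min_i OPT[i][d], max_i OPT[i][d]]
    bounded = {}
    for name in space:
        lo = min(opt[name] for opt in per_layer)
        hi = max(opt[name] for opt in per_layer)
        bounded[name] = [v for v in space[name] if lo <= v <= hi]
    # line 15: full-CNN exploration restricted to the bounded space
    return explore(bounded, lambda cfg: sum(layer_cost(cfg, l) for l in layers))

# toy example: per-layer optima q = 1, 2, 3 bound q to [1, 3];
# the CNN-wide optimum inside that range is q = 2
print(bounded_cnn_search([1, 2, 3], {"q": list(range(1, 10))},
                         lambda c, l: (c["q"] - l) ** 2))  # {'q': 2}
```

The heuristic assumption, stated in the text, is that the whole-CNN optimum lies within the per-layer ranges; the exhaustive inner search then runs over a much smaller space.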

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

Figure 5: Target architecture integrating general purpose processors and FPGA. (The figure shows CPU cores with an on-chip cache hierarchy, main memory and other IO devices, connected through a coherent memory interface and on-chip interconnection to the FPGA's LUTs and BRAM.)

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiment we distribute the workload between the CPU and FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations on the input image tiles and the kernel tiles. The CPU is responsible for other lightweight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses 64-point FFT for the first two layers and 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high-throughput design chooses 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high-throughput design requires configuration 1 only, while the low-latency design requires both configurations 1 and 2, switching between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of the MAC is the same for the two configurations; thus the only change needed to reconfigure the MAC module is to change the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design. This is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with the state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                    DATE'16 [21]     FPGA'16 [25]     FPL'16 [17]      FPGA'17 [28]     Ours (Low Latency)  Ours (High Throughput)
FPGA                Virtex-7 VC707   Stratix-V GSD8   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7      Stratix-V GXA7
Frequency (MHz)     160              --               100              200              200                 200
Precision           Fixed 32-bit     Fixed 8-16 bit   Fixed 8-16 bit   Floating 32-bit  Fixed 16-bit        Fixed 16-bit
DSP Utilization     2688 (96%)       234 (91.4%)      256 (100%)       224 (87.5%)      232 (90.6%)         256 (100%)
Logic Utilization   45K (9%)         152K (65%)       121K (52%)       201K (86%)       200K (85%)          199K (85%)
On-chip RAM         543 (53%)        1744 (71%)       1552 (61%)       2000 (80%)       2143 (85%)          2130 (85%)
Latency (ms)        --               20.1             12.75            40.81            12.4                51.2
Throughput (GOPS)   147.82           72.4             114.5            83.0             117.7               163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC·N²/q_MAC   P_FFT·N²/(q_1DFFT·q_2DFFT)   P_IFFT·N²/(q_1DIFFT·q_2DIFFT)
1      2.4          1·64²/4          1·64²/(32·16)                1·64²/(64·16)
2      2.4          4·16²/1          1·16²/(8·4)                  1·16²/(16·4)

CPU computes concurrently with the FPGA for the other operations involved in the CNN. The CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, lower than all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and low-latency designs are very similar, so the same hardware can be reused by both designs, with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput of each convolution layer for the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of the FFT, MAC and IFFT modules and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low Latency Design). Unit of the values: Complex Words/Sec

        T_MAC   T_IFFT   (f_out/f_in)·T_FFT   [Q/(1+Q)]·B_sys   Bottleneck
CONV1   3413    80       1280                 61                B_sys
CONV2   107     80       107                  42                B_sys
CONV3   40      80       60                   32                B_sys
CONV4   27      80       40                   19                B_sys
CONV5   27      80       27                   18                B_sys

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency. The single-image classification latency must be at least the transfer time of the image data, kernel data and output data; this time, t_mem, is easily calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.
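The t_mem bound is a one-line calculation. The 60 MB data volume below is a hypothetical figure chosen for illustration (the paper does not report the exact per-image traffic); at the measured 5 GB/s peak it gives 12 ms, the same order as the measured 12.4 ms latency:

```python
def t_mem(total_bytes, peak_bandwidth_bytes_per_s):
    """Lower bound on latency: total data moved / peak memory bandwidth."""
    return total_bytes / peak_bandwidth_bytes_per_s

print(t_mem(60e6, 5e9))  # 0.012 s for a hypothetical 60 MB of traffic
```

A design whose measured latency sits on this bound is bandwidth-saturated; any remaining compute capacity is idle waiting on memory.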

7 CONCLUSION

In this paper we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations about state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 12.4 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467. http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN), 1–7. https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. 55–64. https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications, 1–7. https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), 1–6. https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240–249. https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257–260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23–32. https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815. http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675–678. https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308. http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45–54. https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 1–8. https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851. http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1393–1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks, Part III (ICANN '10). Springer-Verlag, Berlin, Heidelberg, 82–91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12. https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16–25. https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842. http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161–170. https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. 35–44. https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on. IEEE, 1–6.



KEYWORDS
Convolutional Neural Networks, Fast Fourier Transform, FPGA, Overlap and Add, Optimization

1 INTRODUCTION

Convolutional Neural Networks (CNNs) are gaining increasing popularity in a variety of applications, including computer vision, signal processing and natural language processing [14, 24, 26]. To achieve better accuracy, deeper and larger CNN models are being explored; this leads to a high throughput requirement [19]. To embed CNNs into real time applications, single image classification time is critical; this results in a low latency requirement [4]. Furthermore, with the progress in deep learning, there is increasing diversity in terms of CNN models and accelerators. Therefore, a flexible accelerator design realizing high throughput and low latency for large scale CNNs becomes a crucial problem to address.

In recent years, many CNN accelerators have been implemented on GPUs [3, 13, 22], ASICs [10] and FPGAs [20, 25, 27]. Energy efficiency, reconfigurability and unprecedented logic density make FPGAs attractive for high performance CNN processing. Yet difficulties remain in fully exploiting such platforms for deep learning applications. The first challenge is the large problem size. Motivated by the huge number of operations required by spatial convolution (the convolution that performs a sliding window operation of the kernel over the image), alternatives such as the Winograd transform [15] and frequency domain convolution [18] have been proposed, and recent implementations of these algorithms have shown promising results [5, 28]. The other challenge lies in the large variance of parameters among different convolution layers, such as the image size, kernel size and number of feature maps. For resource efficiency, the same hardware module is reused for different layers, yet some hardware modules lack the flexibility to perform well under the computational requirements of different CNN layers. A design containing such modules is very inefficient, because the many layers of a deep CNN would require large amounts of runtime reconfiguration. In addition, the design space of feasible hardware designs over all the layers can be huge: a limited design space exploration in [16] takes as long as 5 hours on two desktops.

In this work, we improve the performance of CNNs running on FPGAs by identifying designs that perform well on all convolution layers. We propose an optimization flow with algorithm-architecture co-design based on convolution in the frequency domain. To deal with the variance of image sizes and kernel sizes, we propose a dual operation of Overlap and Add (OaA) [12], called Concatenate and Pad (CaP). We prove that frequency domain convolution using CaP-OaA performs equally well as the native frequency domain convolution in terms of number of operations. Yet the CaP-OaA approach needs no hardware reconfiguration for different layers, while the native frequency domain convolution incurs significant reconfiguration overhead for every layer. We further propose a hybrid algorithm which combines OaA and CaP-OaA with the native frequency domain convolution. The hybrid algorithm is able to specifically optimize either throughput or latency for a given CNN. To address the variance in the number of feature maps, we go to the architecture level by proposing an efficient design space

exploration algorithm. We first develop high performance hardware modules to support our hybrid convolution algorithm and derive an analytical performance model for our architecture. Our design space exploration algorithm is based on the observation that the optimal configuration of a single layer is a good indication of the optimal configuration of the entire network. Thus the design space for a complete CNN can be bounded by the local optimum of each individual layer.

The main contributions of this work are:

• We analyze the complexity of frequency domain convolution using Overlap and Add (OaA), specifically in the CNN context. We propose a dual operation of OaA called Concatenate and Pad (CaP), and design a hybrid algorithm that combines OaA, CaP and the native frequency domain convolution.

• We show that the hybrid algorithm significantly reduces the computation complexity for a wide range of CNN models. Compared with spatial convolution, it results in a 57.5% reduction in the number of operations for AlexNet, without the need for runtime reconfiguration.

• We derive a comprehensive performance model which takes into account details of the CNN and hardware constraints, such as peak bandwidth to external memory and available on-chip memory, logic and DSP resources. We propose a methodology to significantly reduce the design space of this performance model.

• We develop a highly-optimized hardware design with each individual module in the form of parameterized soft IP cores. The building blocks, such as the run-time configurable 2D FFT module, can be easily configured for different CNNs and FPGAs.

• We implement our design on the Intel HARP heterogeneous platform, where the FPGA is responsible for our hybrid algorithm and the CPU performs other light-weight operations such as overlapping and padding.

• We demonstrate our design methodology by developing two designs for AlexNet, one optimized for high throughput and the other for low latency. The designs show throughput of 163.4 GOPS and latency of 12.4 ms on the Intel HARP platform [1].

2 BACKGROUND AND RELATED WORK

State-of-the-art convolutional neural networks for image classification can achieve very high accuracy over a thousand input categories [14, 24]. A CNN is formed by stacking various layers together. A typical design includes four types of layers: convolution layers, rectified linear unit (ReLU) layers, pooling layers and fully connected layers. Convolution layers extract features out of the input 2D images. Pooling layers enable the network to gradually shift its focus from local features to global features. ReLU layers add non-linearity into the network. Finally, the partially processed features are fed into fully connected layers to complete the classification.

2.1 Convolution in Frequency Domain

Let the input matrix to a convolution layer be I, whose dimension is f_in × l_img × l_img, and the kernel of the layer be K, whose dimension is f_out × f_in × l_kern × l_kern. The convolution layer performs the convolution operation over the last two dimensions of I and K, and accumulates over the f_in dimension. As a result, the output of that layer is of dimension f_out × l′_img × l′_img.

Spatial convolution basically performs a sliding window operation. Alternatively, a more efficient algorithm is computing in the frequency domain [18]. Convolution in the spatial domain is equivalent to the Hadamard product (⊙) in the frequency domain. The algorithm is summarized by Equation 1, where F is the Fourier transform and F^{-1} denotes the inverse Fourier transform:

I ∗ K = F^{-1}( F(I) ⊙ F(K) )    (1)

There are two noteworthy points. First of all, typical convolution is accompanied by a stride S. Striding is applied directly to the sliding window in spatial convolution, and can be achieved in the frequency domain by subsampling the output matrix after the inverse Fourier transform. Secondly, the I and K matrices need to be padded to the same size before the Fourier transform. This ensures that the Hadamard product is valid. Since the input images to the first few convolution layers are quite large, computing the FFT on the full I matrix is not always preferable. We analyze frequency domain convolution with various image sizes in Section 3.
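As a sanity check of Equation 1, the identity can be demonstrated with numpy's FFT routines. This is a minimal sketch (the function names `freq_conv` and `spatial_conv` are ours, not from the paper); both matrices are zero-padded to the full linear-convolution size so the Hadamard product is valid:

```python
import numpy as np

def freq_conv(img, kern):
    """Linear 2D convolution via the frequency domain (Equation 1)."""
    n = img.shape[0] + kern.shape[0] - 1           # padded FFT size
    Fi = np.fft.fft2(img, s=(n, n))                # F(I)
    Fk = np.fft.fft2(kern, s=(n, n))               # F(K)
    return np.real(np.fft.ifft2(Fi * Fk))          # F^-1(F(I) . F(K))

def spatial_conv(img, kern):
    """Reference sliding-window (full linear) convolution."""
    n = img.shape[0] + kern.shape[0] - 1
    out = np.zeros((n, n))
    for p in range(kern.shape[0]):
        for q in range(kern.shape[1]):
            # each kernel tap contributes a shifted, scaled copy of the image
            out[p:p + img.shape[0], q:q + img.shape[1]] += kern[p, q] * img
    return out
```

The two functions agree to floating point precision for square inputs, which is the identity the rest of the paper builds on.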

2.2 Overlap and Add (OaA)

To avoid operating on a large matrix, the idea is to divide the original image matrix into smaller tiles. After performing the frequency domain operation on each tile with the kernel, the resulting tiles are combined to form the final output [2].

The following describes the basic procedure of computing convolution using the Overlap and Add (OaA) technique. Given an l_img × l_img input image I and an l_kern × l_kern kernel K, we convolve them using N-point 2D FFT units (subject to N > l_kern). First, we divide I into tiles T^in_{i,j} of size l_tile × l_tile (where l_tile + l_kern − 1 = N). Then, after zero padding to size N × N, we can compute the intermediate output tiles using Equation 2. To get the final matrix R, we move the output tiles T^out_{i,j} so that pixel (0, 0) of each tile is overlapped with pixel (i · l_tile, j · l_tile) of R. Each pixel in R is the sum of the corresponding pixels of the overlapping tiles. The operation is summarized by Equation 3.

T^out_{i,j} = F^{-1}( F(T^in_{i,j}) ⊙ F(K) )    (2)

R[p][q] = Σ_{i,j} T^out_{i,j}[p − i · l_tile][q − j · l_tile],
where 0 ≤ p − i · l_tile < N and 0 ≤ q − j · l_tile < N    (3)

In the above equations, the square brackets [∗][∗] indicate the pixel index within the 2D matrix, and all indices i, j, p, q start from 0.

The Overlap and Add technique, together with convolution in the frequency domain, reduces the complexity of spatial convolution from O(l²_img · l²_kern) to O(l²_img · log l_kern) [12] for a single frame. Further analysis of this complexity in the CNN context will be conducted in Section 3.
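The tile-wise procedure of Equations 2 and 3 can be sketched directly in numpy. This is an illustrative software model, not the hardware design (`oaa_conv` is our own name); each l_tile × l_tile tile is convolved with the kernel via N-point FFTs, and the N × N output tiles are added back into the result at stride l_tile:

```python
import numpy as np

def oaa_conv(img, kern, N):
    """Full linear convolution by Overlap-and-Add with N-point 2D FFTs."""
    l_img, l_kern = img.shape[0], kern.shape[0]
    l_tile = N - l_kern + 1                        # l_tile + l_kern - 1 = N
    assert l_tile > 0
    # oversized accumulator, cropped to l_img + l_kern - 1 at the end
    out = np.zeros((l_img + l_kern - 1 + N, l_img + l_kern - 1 + N))
    Fk = np.fft.fft2(kern, s=(N, N))               # F(K), computed once
    for i in range(0, l_img, l_tile):
        for j in range(0, l_img, l_tile):
            tile = img[i:i + l_tile, j:j + l_tile]             # T_in_{i,j}
            t_out = np.real(np.fft.ifft2(np.fft.fft2(tile, s=(N, N)) * Fk))
            out[i:i + N, j:j + N] += t_out         # overlap and add (Eq. 3)
    return out[:l_img + l_kern - 1, :l_img + l_kern - 1]
```

The result matches a single full-image frequency domain convolution, while only ever transforming N × N blocks.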

Table 1: AlexNet Specification

Layer | Image Size | Kernel Size | Input Features
CONV1 | 224 × 224  | 11 × 11     | 3 (RGB)
CONV2 | 55 × 55    | 5 × 5       | 96
CONV3 | 27 × 27    | 3 × 3       | 256
CONV4 | 13 × 13    | 3 × 3       | 384
CONV5 | 13 × 13    | 3 × 3       | 384

2.3 CNN Models

The ImageNet competition in recent years has produced many well-designed CNNs for image classification [14, 24]. Popular designs use multiple convolution layers, which gradually extract features from local to global scale when proceeding to deeper layers. The computation is conducted on data of high dimensions (images: Batch × f_in × l_img × l_img; kernels: f_out × f_in × l_kern × l_kern), where Batch specifies the size for batch processing. From the specifications shown in Table 1, we observe how the network extracts more and more features as images are transformed to lower and lower resolution.

2.4 CNN Accelerators

To speed up the inferencing of large scale CNNs, much effort has been spent with the help of dedicated hardware accelerators. Projects in [16, 23, 25, 27] target spatial convolution. They optimize the hardware by applying techniques such as loop tiling/unrolling and data mapping. Little optimization is done from the computation complexity point of view, and the design space of their designs can be huge. On the other hand, works in [5, 9, 28] attempt to apply the Winograd transform or frequency domain convolution to reduce the number of operations. The promising results shown by them inspire further study into these algorithms. Our work proceeds from [28] to further optimize frequency domain convolution on FPGAs with a novel algorithm-architecture co-design.

3 ALGORITHM LEVEL OPTIMIZATION

Performing convolution in the frequency domain reduces the computation complexity from O(l²_img · l²_kern · f_in · f_out) to O(l²_img · f_in · f_out + l²_img · log l_img · (f_in + f_out)) for inference [18]. This analysis is based on the assumption that the FFT size is equal to the image size and the FFT is conducted on the entire input image. We refer to convolution under such an assumption as the native frequency domain convolution. We should note that the above complexity equation sets the theoretical lower bound under an assumption that is hard to satisfy in realistic hardware. Generally, the variance in image sizes across layers is very large, and the target hardware does not have such high flexibility for FFTs of various sizes. Efficient FFT hardware designs such as the radix-x based architecture [11] can only support a limited number of distinct FFT sizes (e.g., sizes of x^i) under a reasonable amount of runtime reconfiguration. In addition, even if the hardware is able to support all the FFT sizes required by

different layers, using such hardware results in low resource efficiency, since the gap between the maximum and minimum image sizes is very large. The problem of hardware runtime inflexibility is addressed by the OaA technique, as tiling mitigates image scaling across layers. Although a recent FPGA implementation of CNNs using OaA has shown good performance [28], a detailed analysis of this technique specifically in the CNN context is needed to optimize the implementation.

This section proposes two general algorithmic optimizations which reduce the computation complexity of frequency domain convolution at low hardware cost. Starting from the OaA complexity analysis when applied to CNNs, we first propose a hybrid algorithm combining frequency domain convolution with and without OaA. We show that this approach performs well for both large and small images and requires only small FFT sizes. Then we further improve the algorithm's flexibility and performance by proposing a dual operation of OaA, which we call Concatenate and Pad (CaP). Due to the limited number of FFT sizes supported in hardware, we propose a second hybrid algorithm combining OaA and CaP-OaA with the native frequency domain convolution. It achieves further computation complexity reduction and hardware simplification without requiring any FFT hardware reconfiguration. The proposed algorithm is a key component in our algorithm-architecture co-design.

3.1 Optimization for Images of Different Scales

Image scaling is common in CNNs. Table 1 shows that from the first convolution layer to the last, images shrink more than 17 times in AlexNet. The kernel sizes are reduced accordingly. A previous study [28] dealt with images of different scales by partitioning the large images using OaA. We further analyze the computation complexity of the OaA approach when applied to CNNs, and show how FFTs of small sizes can reduce the number of operations significantly for all convolution layers.

By using OaA, we perform FFT, Hadamard product and IFFT on each input tile. Following the notation in Section 2.2, the total number of operations needed to convolve a full image can be calculated as:

O_total = (O_tileFFT + O_tileHadamard + O_tileIFFT + O_tileOaA) · ⌈(l_img + l_kern − 1) / (N − l_kern + 1)⌉²    (4)

where
O_tileFFT = C1 · N² · log N · f_in
O_tileIFFT = C2 · N² · log N · f_out
O_tileHadamard = C2 · N² · f_in · f_out
O_tileOaA = C3 · N · (l_kern − 1) · f_out

Here the C∗ are constants for the cost of addition or multiplication.

Previous work [12, 28] notes that this equation is asymptotically bounded by O(l²_img · log N), and considers it superior to the native FFT, whose complexity is O(l²_img · log l_img). We argue that this conclusion is valid only for single frame convolution and does not apply to CNNs once f_in and f_out come into the picture. Most of the time, except for the first layer, f_in and f_out are at least as large as N or l_img. So, for the domain of the CNN problem, O_tileFFT and

Figure 1: AlexNet: effect of FFT size on the total number of operations per convolution layer (layers 1–5, in GigaOperations; curves for Spatial, FFT-hybd, OaA-16, OaA-32, OaA-64 and OaA-128). "FFT-hybd" is produced by the hybrid algorithm using native FFT convolution and FFT convolution with OaA.

O_tileIFFT are much less than O_tileHadamard (since log N ≪ f_in, f_out). Thus, the complexity of OaA when applied to CNNs should instead be O(l²_img · f_in · f_out), and Equation 4 can be approximated as:

O_approx_total ≈ C2 · f_in · f_out · ( (l_img + l_kern − 1) / (1 − (l_kern − 1)/N) )²    (5)

The domain of N is l_kern − 1 < N ≤ l_img + l_kern − 1. Note that ∂O/∂N = −C4 · N / (N − l_kern + 1)³. Thus we observe that: (1) the number of operations decreases as N increases; (2) the benefit of increasing N diminishes once N becomes sufficiently larger than (l_kern − 1). By Observation (1), N should not fall in the left end region of its domain. By Observation (2), N need not be in the right end region: there is little benefit in increasing N to match the image size when l_img is very large. Thus it can be concluded that the middle ground between (l_kern − 1) and (l_img + l_kern − 1) is the most suitable region for N.

The motivation for OaA is to avoid using large-size FFTs when dealing with large images. Thus, for small images, we can simply apply native frequency domain convolution. Then the number of operations equals C2 · f_in · f_out · (N′)², where N′ is the smallest FFT size such that N′ ≥ l_img + l_kern − 1. We can use a hybrid algorithm that uses frequency domain convolution with OaA as well as the native frequency domain convolution.
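The diminishing-returns behavior predicted by Equation 5 is easy to tabulate. The sketch below (our own helper, with the unknown constant C2 set to 1 and CONV2-like parameters as an assumed example) shows the operation count dropping steeply for small N and flattening once N is well above l_kern − 1:

```python
def approx_ops(l_img, l_kern, f_in, f_out, N, C2=1.0):
    """Approximate OaA operation count per Equation 5.

    Valid for l_kern - 1 < N <= l_img + l_kern - 1; C2 stands in for the
    (unknown) per-operation cost constant.
    """
    assert l_kern - 1 < N <= l_img + l_kern - 1
    return C2 * f_in * f_out * ((l_img + l_kern - 1) / (1 - (l_kern - 1) / N)) ** 2

# CONV2-like parameters: l_img=55, l_kern=5, f_in=96, f_out=256.
ops = {N: approx_ops(55, 5, 96, 256, N) for N in (8, 16, 32, 59)}
gain_8_to_16 = ops[8] / ops[16]      # large win from a modest N increase
gain_16_to_59 = ops[16] / ops[59]    # much smaller win from matching l_img
```

Doubling N from 8 to 16 cuts the count by 2.25x, while growing N all the way to l_img + l_kern − 1 = 59 buys comparatively little, which is why a middle-ground N is preferred.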

Figure 1 illustrates our analysis for AlexNet. By using the OaA technique, FFTs with small variance in sizes can process images of different scales quite well. The green line shows the performance of the hybrid frequency domain algorithm: the first two layers use OaA based convolution with 32 and 16 points, and the last three layers use native frequency domain convolution with 32, 16 and 16 points, respectively. Compared with spatial convolution, our approach reduces the number of operations by 50.6% using only two small FFT sizes of 16 and 32.

This section proposed an optimization applying OaA in the CNN context. The optimization greatly relieves the reconfiguration requirement for FFT hardware while preserving the native frequency domain convolution's advantage in terms of computation complexity. In the next section we show our second optimization, which further reduces the number of operations and completely eliminates the need for hardware reconfiguration.

3.2 Optimization for Exploiting a Limited Number of FFT Sizes

The analysis of Section 3.1, on both OaA based and native FFT, neglects the wasted computation incurred by padding images to match FFT sizes. This may result in sub-optimal implementations due to the limitations of realistic FFT hardware. Without a reasonable amount of reconfiguration, high performance FFT hardware is only able to support a limited number of different FFT sizes. As an example, consider the case of a radix-4 FFT design: the FFT sizes of 4, 16, 64 and 256 need to handle images of any size, ranging from ten to several hundred pixels per side.

The key problem is the mismatch between image sizes and FFT sizes. Since we have little control over the FFT sizes under the hardware constraints, we address this problem by changing image sizes through input data reshaping.

Algorithm 1: Operations by a convolution layer
1:  procedure ConvLayer
2:      ▷ I is the input image array
3:      ▷ K is the kernel array
4:      for i_b = 1 to Batch do
5:          for i_out = 1 to f_out do
6:              for i_in = 1 to f_in do
7:                  T_in[i_b][i_out][i_in] ← I[i_b][i_in] ∗ K[i_out][i_in]
8:              end for
9:              T_out[i_b][i_out] ← Σ_{j=1}^{f_in} T_in[i_b][i_out][j]
10:         end for
11:     end for
12: end procedure

The operation of the CNN can be summarized by Algorithm 1. The operation under concern is the convolution in line 7. For the nested loop, we would like to manipulate one of the Batch, f_out, f_in dimensions so that l_img of image I can be resized. Index i_b takes effect on I and is independent of K, so Batch is the appropriate dimension. We refer to the manipulation of the Batch and l_img dimensions as the reshaping of the input data.

We propose a dual operation of OaA, Concatenate and Pad (CaP), and apply it together with OaA to gain flexibility in the l_img dimension. We need to ensure that reshaping does not affect the output result of convolution and introduces negligible overhead.

We define the CaP operation as follows. Given a set of d² images I of equal size l_img × l_img, we arrange the images in a d × d mesh with x pixels of zero value between vertically or horizontally adjacent images. CaP outputs a large image I_CaP by concatenating I and the zero pixels in between; x is defined as the size of zero padding. Figure 2 illustrates this process. The following theorem shows that by choosing an appropriate x, we can produce the correct output of convolution.

Theorem 3.1. For a given kernel K of size l_kern × l_kern, the convolution using K on a set of images I is identical to performing convolution using K on I_CaP iff the padding x is at least l_kern − 1.

Proof. The proof of x_min = l_kern − 1 can be done from the perspective of spatial or frequency domain convolution. We prove

Figure 2: Illustration of the CaP algorithm (a d × d mesh of images concatenated with x pixels of zero padding between adjacent images).

the theorem by OaA in the frequency domain, to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘_{l_tile}(I) outputs a set of tiles after partitioning I into l_tile × l_tile tiles T. Define ⊕_k as the concatenation operation: ⊕_k(I) outputs a large image by concatenating the 2D mesh of I with k pixels of margin between adjacent images, and ⊕_{−k}(I) outputs an image by concatenating the 2D mesh of I with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixels along the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations:

I_CaP = ⊕_x(I)
I_x = ⊘_{l_img + x}(I_CaP)
T = F^{-1}( F((I_x)^{l_kern − 1}) ⊙ F(K^{x + l_img − 1}) ) = I_x ∗ K = (I ∗ K)^x = T^x_∗
I_CaP ∗ K = ⊕_{−(l_kern − 1)}(T) = ⊕_{−(l_kern − 1)}(T^x_∗) = ⊕_{x − (l_kern − 1)}(T_∗) = ⊕_{x − (l_kern − 1)}(I ∗ K)

From the ⊕_{x − (l_kern − 1)} operator in the last step, it can be concluded that convolution on I_CaP is identical to convolution on the set of I iff x − (l_kern − 1) ≥ 0. Thus the minimum value of x that avoids aliasing between adjacent images is (l_kern − 1). □

Based on Theorem 3.1, we can CaP the images from a batch, so that an input of Batch × f_in × f_out × l²_img is reshaped to (Batch/d²) × f_in × f_out × (l^CaP_img)², where l^CaP_img = d · l_img + (d − 1) · l_kern and d is referred to as the Batch folding factor. We can then apply the OaA technique on the CaP images to complete the convolution. We abbreviate FFT based convolution with OaA as OaA, and FFT based convolution with CaP and OaA as CaP-OaA. We analyze, for a convolution layer, how much improvement can be brought by CaP-OaA compared with OaA and the native approach. We assume that the hardware supports the following FFT sizes: N = [N_1, N_2, ..., N_m], where N_1 < N_2 < ... < N_m. Theorem 3.2 evaluates the optimal complexity for CaP-OaA. Let K_i = (l_img + l_kern − 1) / (N_i − (l_kern − 1)).

Theorem 3.2. The minimum computation complexity of CaP-OaA applied to a CNN is C2 · f_in · f_out · K²_m · N²_m, where C2 is some constant. To achieve such complexity, the Batch folding factor should be a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)).

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

O_CaP-OaA = C2 · f_in · f_out · ⌈(l^CaP_img + l_kern − 1) / (N_i − (l_kern − 1))⌉² · N²_i / d²
          = C2 · f_in · f_out · ⌈d · K_i⌉² · N²_i / d²    (6)

The minimum computation complexity can thus be obtained from the following inequality:

O_CaP-OaA ≥ C2 · f_in · f_out · K²_i · N²_i ≥ C2 · f_in · f_out · K²_m · N²_m

We can set d to a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)) to eliminate the ceiling function, and choose N_i = N_m. Thus O^min_CaP-OaA = C2 · f_in · f_out · K²_m · N²_m. □

Corollary 3.3. The computation complexity of CaP-OaA is always no higher than that of OaA. The minimum ratio of CaP-OaA's number of operations to OaA's number of operations is max_i (K_m · N_m / (⌈K_i⌉ · N_i))².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT if N_l / N_m > K_m (where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern − 1). The minimum ratio of CaP-OaA's number of operations to the native frequency domain convolution's number of operations is (K_m · N_m / N_l)².

These corollaries can be easily proven by directly comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native approach complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern − 1.
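The comparison behind Theorem 3.2 and Corollary 3.3 can be sketched in a few lines. As before we keep only the dominant Hadamard term and set C2 = 1; the function names and the radix-4 size list {4, 16, 64, 256} are illustrative assumptions:

```python
import math

def oaa_ops(l_img, l_kern, f_in, f_out, sizes):
    """Best per-image op count of plain OaA over the supported FFT sizes
    (Equation 4, Hadamard term only). A size with a single tile covering
    the whole image corresponds to the native approach."""
    best = float("inf")
    for N in sizes:
        if N <= l_kern - 1:
            continue  # FFT too small for this kernel
        tiles = math.ceil((l_img + l_kern - 1) / (N - l_kern + 1)) ** 2
        best = min(best, f_in * f_out * tiles * N * N)
    return best

def cap_oaa_ops(l_img, l_kern, f_in, f_out, sizes):
    """Optimal per-image op count of CaP-OaA (Theorem 3.2), plus the Batch
    folding factor d that removes the ceiling in Equation 6."""
    Nm = max(sizes)
    Km = (l_img + l_kern - 1) / (Nm - (l_kern - 1))
    d = (Nm - (l_kern - 1)) // math.gcd(l_img + l_kern - 1, Nm - (l_kern - 1))
    return f_in * f_out * (Km * Nm) ** 2, d
```

For a CONV3-like layer (l_img = 27, l_kern = 3, f_in = 256, f_out = 384) and sizes {4, 16, 64, 256}, CaP-OaA lands below the best plain-OaA choice, consistent with Corollary 3.3, at the cost of a large folding factor d (and hence batching).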

Figure 3 shows the effect of including CaP in frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm using the native approach and OaA. We plot under two different FFT constraints: the blue points are under the assumption of N = {4, 16, 64, 256}, and the red dots show the performance in the case of more hardware support, N = {4, 8, 16, 32, 64, 128, 256}. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach. Most of the time, for both FFT sequences, the extra flexibility brought by CaP benefits the computation complexity significantly.

One additional benefit of applying CaP-OaA is that it makes a one-size FFT sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the domain constraint of N ≤ l_img + l_kern − 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces computation by 57.5% compared with spatial convolution, using only a one-size FFT of size 32.

Figure 3: Effect of the CaP optimization (x axis: image size, 0–240; y axis: reduction in computation complexity by CaP, 0.2–1.0; curves for FFT Seq 1 and FFT Seq 2). The y axis is calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1.

3.3 Discussion of the Various Optimization Techniques

The various optimization techniques discussed so far can be viewed in a unified way. By setting the Batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern − 1, OaA reduces to the native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can perform optimization by a hybrid algorithm that combines all three techniques: the native approach, OaA and CaP-OaA. The three algorithms can be applied separately on different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs at very little hardware cost.

We should also notice that CaP is based on batch processing. In this case, although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm that uses OaA and the native algorithm can result in a low latency design, while further inclusion of CaP-OaA can result in a high throughput design.

4 SYSTEM ARCHITECTURE

This section gives an overview of the architecture design on FPGA to perform the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules to perform the FFT, Hadamard product and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT on an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFT on the N rows of the input matrix. In the second phase, we perform N-point 1D FFT on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.
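The row-column decomposition that the hardware implements can be modeled in a few lines of numpy (a functional sketch only; the streaming, folding and permutation networks are not modeled):

```python
import numpy as np

def fft2_two_phase(x):
    """2D FFT as two phases of 1D row FFTs with a transpose in between,
    mirroring the streaming hardware organization."""
    phase1 = np.fft.fft(x, axis=1)         # phase 1: N-point FFT on each row
    phase2 = np.fft.fft(phase1.T, axis=1)  # transpose, then row FFTs = column FFTs
    return phase2.T                        # transpose back to original orientation
```

The result is bit-for-bit the standard 2D FFT, which is why the hardware only needs 1D FFT pipelines plus a matrix transpose (streaming permutation) between the two phases.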

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^{i−1} is required between stages i and (i+1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the matrix transpose of size N × N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutation by the in-place permutation in time algorithm, and only requires size N × N single-port memory.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform; outputs of the MAC are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator can complete the reduction over the f_in dimension. Data parallelism of P_img · P_kern · N² is required to support the full permutation of the set of P_img image tiles and the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus the ratio of the input and output rates is f_in/f_out. We can easily verify that the MAC module design is work optimal.
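Functionally, the MAC module computes a Hadamard multiply-accumulate over the f_in dimension. The sketch below models that computation in numpy (the function name is ours; hardware details such as cycle-level scheduling and folding are not modeled):

```python
import numpy as np

def mac_module(img_tiles_f, kern_tiles_f):
    """Hadamard multiply-accumulate over f_in, as in the MAC module.

    img_tiles_f : (f_in, N, N) frequency-domain image tiles
    kern_tiles_f: (f_out, f_in, N, N) frequency-domain kernel tiles
    Returns (f_out, N, N) accumulated output tiles. The products are
    element-wise, so there are no loop-carried dependencies across the
    N*N positions and all lanes can run in parallel.
    """
    f_out = kern_tiles_f.shape[0]
    out = np.zeros((f_out,) + img_tiles_f.shape[1:], dtype=complex)
    for i_out in range(f_out):
        for i_in in range(img_tiles_f.shape[0]):   # reduction over f_in
            out[i_out] += img_tiles_f[i_in] * kern_tiles_f[i_out, i_in]
    return out
```

This is exactly line 7 and line 9 of Algorithm 1 restricted to one tile position, with the spatial convolution replaced by the frequency domain product.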

Figure 4: Overall System Design (image and kernel buffers, 2D FFT, MAC and 2D IFFT modules on the FPGA, connected to the CPU and DRAM).

Table 2: Summary of Resource Consumption and Throughput

           | 2D FFT                                                  | MAC
Logic      | 2 · log_x N · (N/q_1DFFT − 1) · (N/q_2DFFT) · P_FFT · L | P_img · P_kern · N² · L
Memory     | (2 · 3N/f_2DFFT + N²) · P_FFT · D                       | —
Throughput | P_FFT · N² / (q_1DFFT · q_2DFFT)                        | P_img · P_kern · N² / f_in

We can also fold the MAC module by q_MAC, so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. Constant L refers to the logic consumption of one complex number adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section, we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm given any CNN and target FPGA. The challenges are twofold. On the one hand, the various kinds of hardware modules, as well as the ample resources on chip, create a huge design space to explore. On the other hand, since the same hardware is deployed for accelerating all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, we consider the input/output rate requirement for an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l′_img × l′_img, then, regardless of the underlying hardware, the required ratio of input to output data rate should be approximately f_in/f_out. However, this ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse or locality can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern in the deep nested loop is complicated, our approach using frequency domain convolution simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution. Using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product. It conducts element-wise operations on the matrices, meaning that all loop carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, so we have a larger hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses N-point FFT. Then, if we transfer to hardware the kernels in the frequency domain instead of the original kernels in the spatial domain, the total data communication for kernels increases from O(l²_kern) to O(N²). Depending on the values of N and l_kern, and on the hardware bandwidth and logic resources, either kind of kernel data flow can be more suitable. Thus there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.

5.2 Analysis of IO and Computation Cost

The analysis in this section is based on the assumption that kernels in the frequency domain are loaded onto the FPGA, so there are no kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely, and call this one round of computation. We denote the peak throughputs of the FFT, MAC and IFFT modules as T_FFT, T_MAC and T_IFFT (the equations for them are summarized in Table 2).

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). The amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data to the kernel buffer is D_kern = f_in · f_out · N², based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out / f_in) · D_FFT amount of output.

The I/O time t_round^IO is determined by the largest time needed by D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT modules to external memory, we deal with D_IFFT later in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern as well as the peak throughput of the subsequent FFT module, as shown in Equation 7:

  B_FFT = min( D_FFT / (D_FFT + D_kern) · (B_sys − B_IFFT), T_FFT )
  B_kern = (B_sys − B_IFFT) − B_FFT            (7)

The I/O time is calculated as t_round^IO = D_FFT / B_FFT, which by construction satisfies D_FFT / B_FFT ≥ D_kern / B_kern.
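The bandwidth split of Equation 7 can be sketched directly (hypothetical helper name; bandwidths, throughputs and data volumes share the units of Table 2):

```python
def allocate_bandwidth(b_sys, b_ifft, t_fft, d_fft, d_kern):
    """Equation 7: the bandwidth left after the IFFT output stream is split
    between the FFT module and the kernel buffer, proportionally to their
    workloads and capped by the FFT module's peak throughput."""
    b_fft = min(d_fft / (d_fft + d_kern) * (b_sys - b_ifft), t_fft)
    b_kern = (b_sys - b_ifft) - b_fft
    t_io = d_fft / b_fft          # >= d_kern / b_kern by construction
    return b_fft, b_kern, t_io
```

When T_FFT is not the cap, the split is exactly proportional to the workloads and both transfers finish at the same time; when T_FFT caps B_FFT, the kernel buffer absorbs the leftover bandwidth.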

The computation time t_round^comp is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules as well as the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

  B_IFFT = min( D_IFFT / t_round^IO, T_MAC, T_IFFT )            (8)

By using Equation 7 as well as the relation among t_round^IO, D_FFT and D_IFFT, we can express Equation 8 as:

  B_IFFT = min( T_MAC, T_IFFT, (f_out / f_in) · T_FFT, Q / (1 + Q) · B_sys )            (9)

where Q = (f_out / f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N; we deal with the variance of f_in and f_out in Section 5.3.
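Equation 9 can be evaluated directly; a minimal sketch (hypothetical function name, with illustrative argument values used only in testing):

```python
def b_ifft(t_mac, t_ifft, t_fft, b_sys, f_in, f_out, m_img, n_fft):
    """Sustained IFFT output bandwidth per Equation 9.

    The four candidates in the min() are the MAC throughput, the IFFT
    throughput, the FFT throughput scaled by f_out/f_in, and the share of
    external bandwidth available to the output stream."""
    q = (f_out / f_in) * m_img / (m_img + f_in * f_out * n_fft ** 2)
    return min(t_mac, t_ifft, (f_out / f_in) * t_fft, q / (1 + q) * b_sys)
```

Whichever term realizes the min identifies the bottleneck resource for that layer, which is exactly the analysis performed later in Table 5.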

The computation time for one round is given by t_round^comp = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate, so the total time of a round is t_round = t_round^comp. For double buffering to completely hide the memory latency,

  min( T_MAC, T_IFFT ) ≤ min( (f_out / f_in) · T_FFT, Q / (1 + Q) · B_sys )

has to be satisfied.

be satisedFor the other kind of design containing the kernel FFT module

corresponding changes need to be made for the above equationsFor

example Dkern is changed to fin middot fout middot l2imд We can derive its

performance model using the similar idea as the above discussion

5.3 Design Space Exploration

Processing a convolution layer involves several rounds. The total time for processing a convolution layer is given by:

  t_layer = ( f_in · (l*_img)² / M_img ) · t_round = f_out · (l*_img)² / B_IFFT            (10)

where l*_img is the image size after the padding or batch-folding operation of the hybrid algorithm discussed in Section 3.
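Equation 10 can be sketched as follows (hypothetical function name; note how the M_img factors cancel, leaving the closed form f_out · (l*_img)² / B_IFFT):

```python
def t_layer(f_in, f_out, l_img_star, m_img, b_ifft):
    """Equation 10: number of rounds per layer times the time per round."""
    rounds = f_in * l_img_star ** 2 / m_img
    t_round = (f_out / f_in) * m_img / b_ifft   # D_IFFT / B_IFFT per round
    return rounds * t_round                     # = f_out * l*^2 / B_IFFT
```

The cancellation of M_img is why, for a fixed B_IFFT, the layer time depends only on f_out and the padded image size.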

The goal of our architecture optimization is to identify the configuration of the different modules so that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image through all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of whether the metric is throughput or latency, we minimize t_CNN. We formulate the optimization problem as follows:

  minimize over H:   Σ_{layer i} t_layer^i
  subject to:        LOGIC_{FFT, MAC, IFFT} ≤ LOGIC_sys
                     MEMORY_{FFT, MAC, IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes N, q1D_FFT, q1D_IFFT, q2D_FFT, q2D_IFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern and q_MAC. The parameters are defined in Sections 4 and 5. LOGIC_{FFT, MAC, IFFT} and MEMORY_{FFT, MAC, IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules; LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove variables by the following observations: (1) N can be chosen by our algorithm-level optimization; (2) M_kern does not affect performance because of the image-oriented data reuse; (3) P_img and P_kern always appear in the product P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. The min function captures which hardware resource is the bottleneck of the design, and we cannot evaluate it in closed form because the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For the single-layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that:

  B_IFFT = T_MAC = T_IFFT = (f_out / f_in) · T_FFT = Q / (1 + Q) · B_sys            (11)

Thus the original optimization problem gains three more equality constraints, and the number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H′ = {M_img, P_img, q1D_FFT, q1D_IFFT, q2D_FFT, q2D_IFFT}.

For the optimization problem on a complete CNN, we can use the optimum of each single convolution layer as an indication. Based on Equation 9 and Table 2, we can observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) · T_FFT and Q/(1+Q) · B_sys all take high values; in this case T_IFFT is the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the per-layer optima. The single-layer optimum is obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2: Design Space Exploration with Indication

 1: procedure DesignSpaceExploration        ▷ explore a complete CNN
 2:   ▷ OPT stores the optimum for each single layer, OPT[i] ∈ H
 3:   OPT[·] ← NULL
 4:   for i = 0 to Layer L do
 5:     Perform design space exploration over H′
 6:     OPT[i] ← optimum for layer i
 7:   end for
 8:   ▷ Range[d] stores the range of values for dimension d
 9:   Range[·] ← NULL
10:   for d = 0 to Dimension D do
11:     Range[d] ← [ min_i OPT[i][d], max_i OPT[i][d] ]
12:   end for
13:   Design space exploration of the entire CNN with the design space bounded by Range[·]
14: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
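The skeleton of Algorithm 2, abstracted over the concrete cost model, might look like the following sketch (`bounded_dse`, `grid` and `t_layer` are our hypothetical names; the actual implementation is in C++ and uses the performance model of Section 5.2 as the cost function):

```python
from itertools import product

def bounded_dse(layers, grid, t_layer):
    """Algorithm 2 skeleton: per-layer optima bound the whole-CNN search.

    `grid` maps each design variable to its candidate values; `t_layer`
    returns the processing time of one layer under a configuration dict."""
    dims = sorted(grid)
    # Step 1: optimum configuration for each layer in isolation (space H').
    opt = []
    for layer in layers:
        best = min(product(*(grid[d] for d in dims)),
                   key=lambda cfg: t_layer(layer, dict(zip(dims, cfg))))
        opt.append(dict(zip(dims, best)))
    # Step 2: bound each dimension by the spread of the per-layer optima.
    bounded = {d: [v for v in grid[d]
                   if min(o[d] for o in opt) <= v <= max(o[d] for o in opt)]
               for d in dims}
    # Step 3: explore the (much smaller) bounded space for the full CNN.
    best = min(product(*(bounded[d] for d in dims)),
               key=lambda cfg: sum(t_layer(l, dict(zip(dims, cfg)))
                                   for l in layers))
    return dict(zip(dims, best))
```

The exhaustive scans in steps 1 and 3 are affordable precisely because step 2 shrinks each dimension to the interval spanned by the single-layer optima.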

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between CPU and FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga-transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

[Figure 5: Target architecture integrating general-purpose processors and FPGA — CPU cores with an on-chip cache hierarchy and an FPGA (BRAM, LUTs, system interface, on-chip interconnection) connected to shared main memory and other I/O devices through coherent memory interfaces.]

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is 6.25 MB.

In our experiments we distribute the workload between CPU and FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations on the input image tiles and the kernel tiles. The CPU is responsible for the other lightweight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses 64-point FFT for the first two layers and 16-point FFT for the last three layers, using the hybrid algorithm described in Section 3.1. The high-throughput design chooses 64-point FFT for all five layers, using the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H, generated by our architectural design space exploration algorithm, are shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. The high-throughput design requires configuration 1 only, while the low-latency design requires both configurations 1 and 2 and switches between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations, so the only change needed to reconfigure the MAC module is the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

|                    | DATE'16 [21]   | FPGA'16 [25]   | FPL'16 [17]    | FPGA'17 [28]   | Ours (Low Latency) | Ours (High Throughput) |
| FPGA               | Virtex-7 VC707 | Stratix-V GSD8 | Stratix-V GXA7 | Stratix-V GXA7 | Stratix-V GXA7     | Stratix-V GXA7         |
| Frequency (MHz)    | 160            | --             | 100            | 200            | 200                | 200                    |
| Precision          | Fixed 32-bit   | Fixed 8-16 bit | Fixed 8-16 bit | Floating 32-bit| Fixed 16-bit       | Fixed 16-bit           |
| DSP Utilization    | 2688 (96%)     | 234 (91.4%)    | 256 (100%)     | 224 (87.5%)    | 232 (90.6%)        | 256 (100%)             |
| Logic Utilization  | 45K (9%)       | 152K (65%)     | 121K (52%)     | 201K (86%)     | 200K (85%)         | 199K (85%)             |
| On-chip RAM        | 543 (53%)      | 1744 (71%)     | 1552 (61%)     | 2000 (80%)     | 2143 (85%)         | 2130 (85%)             |
| Latency (ms)       | --             | 20.1           | 12.75          | 40.81          | 12.4               | 51.2                   |
| Throughput (GOPS)  | 147.82         | 72.4           | 114.5          | 83.0           | 117.7              | 163.4                  |

Table 4: Hardware Configuration of the Two Designs

| Conf | M_img (MB) | P_MAC · N² / q_MAC | P_FFT · N² / (q1D_FFT · q2D_FFT) | P_IFFT · N² / (q1D_IFFT · q2D_IFFT) |
| 1    | 2.4        | 1 · 64² / 4        | 1 · 64² / (32 · 16)              | 1 · 64² / (64 · 16)                 |
| 2    | 2.4        | 4 · 16² / 1        | 1 · 16² / (8 · 4)                | 1 · 16² / (16 · 4)                  |

CPU computes concurrently with the FPGA for the other operations involved in the CNN; the CPU processing time completely overlaps with the FPGA processing time.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, lower than all other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, the best among all listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and low-latency designs are very similar; the same hardware can be reused by both designs with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of FFT, MAC and IFFT and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low-Latency Design). Unit: complex words/sec.

|       | T_MAC | T_IFFT | (f_out/f_in) · T_FFT | Q/(1+Q) · B_sys | Bottleneck |
| CONV1 | 3413  | 80     | 1280                 | 61              | B_sys      |
| CONV2 | 107   | 80     | 107                  | 42              | B_sys      |
| CONV3 | 40    | 80     | 60                   | 32              | B_sys      |
| CONV4 | 27    | 80     | 40                   | 19              | B_sys      |
| CONV5 | 27    | 80     | 27                   | 18              | B_sys      |

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency must be at least the transfer time of the image data, kernel data and output data. This time, t_mem, is easily calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design could be further improved.
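The lower bound t_mem can be sketched as follows (hypothetical function name; the 5 GB/s default is the measured HARP bandwidth from Section 6.1, and the data volumes passed in are purely illustrative, not AlexNet's actual traffic):

```python
def t_mem_ms(image_mb, kernel_mb, output_mb, peak_bw_gbs=5.0):
    """Lower bound on single-image latency: total off-chip traffic (MiB)
    divided by the peak shared-memory bandwidth (GiB/s), in milliseconds."""
    total_bytes = (image_mb + kernel_mb + output_mb) * 2**20
    return total_bytes / (peak_bw_gbs * 2**30) * 1e3
```

If a design's measured latency is close to this bound, as ours is, the system is memory-bandwidth-bound and no further compute-side tuning can help.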

7 CONCLUSION

In this paper we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large-scale CNNs on FPGAs. Our design was based on two observations about state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the numbers of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum of a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 12.4 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES

[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] Wikipedia. 2017. Overlap-add method. https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467. http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN), 1-7. https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. In FPGA '17, 55-64. https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications (FPL), 1-7. https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), 1-6. https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In FPGA '15, 240-249. https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), 257-260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23-32. https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815. http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In MM '14, 675-678. https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308. http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In FPGA '17, 45-54. https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 1-8. https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851. http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In FPGA '16, 26-35. https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1393-1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks (ICANN '10), Part III, 82-91.
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1-12. https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In FPGA '16, 16-25. https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842. http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In FPGA '15, 161-170. https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. In FPGA '17, 35-44. https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), 1-6.


exploration algorithm. We first develop high-performance hardware modules to support our hybrid convolution algorithm and derive an analytical performance model for our architecture. Our design space exploration algorithm is based on the observation that the optimal configuration of a single layer is a good indication of the optimal configuration of the entire network. Thus the design space for a complete CNN can be bounded by the local optimum of each individual layer.

The main contributions of this work are:

• We analyze the complexity of frequency domain convolution using Overlap and Add (OaA), specifically in the CNN context. We propose a dual operation of OaA called Concatenate and Pad (CaP), and design a hybrid algorithm that combines OaA, CaP and native frequency domain convolution.

• We show that the hybrid algorithm significantly reduces the computation complexity for a wide range of CNN models. Compared with spatial convolution, it results in a 57.5% reduction in the number of operations for AlexNet, without the need for runtime reconfiguration.

• We derive a comprehensive performance model which takes into account details of the CNN and hardware constraints such as peak bandwidth to external memory and available on-chip memory, logic and DSP resources. We propose a methodology to significantly reduce the design space of this performance model.

• We develop a highly optimized hardware design with each individual module in the form of a parameterized soft IP core. The building blocks, such as the runtime-configurable 2D FFT module, can easily be configured for different CNNs and FPGAs.

• We implement our design on the Intel HARP heterogeneous platform, where the FPGA is responsible for our hybrid algorithm and the CPU performs other lightweight operations such as overlapping and padding.

• We demonstrate our design methodology by developing two designs for AlexNet, one optimized for high throughput and the other for low latency. The designs achieve throughput of 163.4 GOPS and latency of 12.4 ms, respectively, on the Intel HARP platform [1].

2 BACKGROUND AND RELATED WORK

State-of-the-art convolutional neural networks for image classification can achieve very high accuracy over a thousand input categories [14, 24]. A CNN is formed by stacking various layers together. A typical design includes four types of layers: convolution layers, rectified linear unit (ReLU) layers, pooling layers and fully connected layers. Convolution layers extract features from the input 2D images. Pooling layers enable the network to gradually shift its focus from local features to global features. ReLU layers add non-linearity to the network. Finally, the partially processed features are fed into fully connected layers to complete the classification.

2.1 Convolution in Frequency Domain

Let the input matrix to a convolution layer be I, whose dimension is f_in × l_img × l_img, and the kernel of the layer be K, whose dimension is f_out × f_in × l_kern × l_kern. The convolution layer performs the convolution operation over the last two dimensions of I and K and accumulates over the f_in dimension. As a result, the output of that layer is of dimension f_out × l'_img × l'_img.

Spatial convolution essentially performs a sliding window operation. Alternatively, a more efficient algorithm computes in the frequency domain [18]. Convolution in the spatial domain is equivalent to the Hadamard product (⊙) in the frequency domain. The algorithm is summarized by Equation 1, where F is the Fourier transform and F⁻¹ denotes the inverse Fourier transform:

  I ∗ K = F⁻¹( F(I) ⊙ F(K) )            (1)

There are two noteworthy points. First, typical convolution is accompanied by a stride S. Striding is applied directly to the sliding window in spatial convolution; in the frequency domain it can be achieved by subsampling the output matrix after the inverse Fourier transform. Second, the I and K matrices need to be padded to the same size before the Fourier transform, which ensures that the Hadamard product is valid. Since the input images to the first few convolution layers are quite large, computing the FFT on the full I matrix is not always preferable. We analyze frequency domain convolution with various image sizes in Section 3.
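A minimal NumPy sketch of Equation 1, incorporating the two points above (zero-padding both operands to a common size before the transform, and striding by subsampling after the inverse transform); the function name is ours:

```python
import numpy as np

def fft_conv2d(img, kern):
    """Equation 1: pad both operands to a common size, multiply element-wise
    in the frequency domain, and take the inverse transform. The common size
    l_img + l_kern - 1 makes the circular convolution equal the linear one."""
    n0 = img.shape[0] + kern.shape[0] - 1
    n1 = img.shape[1] + kern.shape[1] - 1
    out = np.fft.ifft2(np.fft.fft2(img, (n0, n1)) *
                       np.fft.fft2(kern, (n0, n1)))
    return np.real(out)

rng = np.random.default_rng(0)
img, kern = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
# Stride S = 2 is applied afterwards by subsampling the spatial-domain output:
strided = fft_conv2d(img, kern)[::2, ::2]
```

The result matches a direct sliding-window linear convolution of the same operands, which is the equivalence the section relies on.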

2.2 Overlap and Add (OaA)

To avoid operating on a large matrix, the idea is to divide the original image matrix into smaller tiles. After performing the frequency domain operation on each tile with the kernel, the resulting tiles are combined to form the final output [2].

The following describes the basic procedure of computing convolution using the Overlap and Add (OaA) technique. Given an l_img × l_img input image I and an l_kern × l_kern kernel K, we convolve them using N-point 2D FFT units (subject to N > l_kern). First, we divide I into tiles T^in_{i,j} of size l_tile × l_tile (where l_tile + l_kern − 1 = N). Then, after zero-padding to size N × N, we compute the intermediate output tiles using Equation 2. To get the final matrix R, we shift the output tiles T^out_{i,j} so that pixel (0, 0) of each tile is aligned with pixel (i · l_tile, j · l_tile) of R; each pixel of R is the sum of the corresponding pixels of the overlapping tiles, as summarized by Equation 3:

  T^out_{i,j} = F⁻¹( F(T^in_{i,j}) ⊙ F(K) )            (2)

  R[p][q] = Σ_{i,j} T^out_{i,j}[p − i · l_tile][q − j · l_tile],
  where 0 ≤ p − i · l_tile < N and 0 ≤ q − j · l_tile < N            (3)

In the above equations, the square brackets [∗][∗] indicate the pixel index within a 2D matrix, and all indices i, j, p, q start from 0.

The Overlap and Add technique, together with convolution in the frequency domain, reduces the complexity of spatial convolution from O(l_img² · l_kern²) to O(l_img² · log l_kern) [12] for a single frame. Further analysis of this complexity in the CNN context is conducted in Section 3.
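The OaA procedure of Equations 2 and 3 can be sketched in NumPy as follows (hypothetical function name; square images are assumed for brevity):

```python
import numpy as np

def oaa_conv2d(img, kern, n):
    """Overlap and Add (Equations 2-3): tile the image, convolve each tile
    with the kernel via N-point 2D FFT, and add the overlapping N x N
    output tiles back into the result at stride l_tile."""
    l_kern = kern.shape[0]
    l_tile = n - l_kern + 1                      # N = l_tile + l_kern - 1
    l_img = img.shape[0]
    l_out = l_img + l_kern - 1
    r = np.zeros((l_out + n, l_out + n))         # slack for the edge tiles
    fk = np.fft.fft2(kern, (n, n))               # kernel FFT computed once
    for i in range(0, l_img, l_tile):
        for j in range(0, l_img, l_tile):
            tile = img[i:i + l_tile, j:j + l_tile]
            t_out = np.real(np.fft.ifft2(np.fft.fft2(tile, (n, n)) * fk))
            r[i:i + n, j:j + n] += t_out         # overlap and add
    return r[:l_out, :l_out]
```

By linearity of convolution, the sum of the per-tile results equals the full linear convolution, while each FFT only ever touches an N × N block.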

Table 1: AlexNet Specification

| Layer | Image Size | Kernel Size | Input Features |
| CONV1 | 224 × 224  | 11 × 11     | 3 (RGB)        |
| CONV2 | 55 × 55    | 5 × 5       | 96             |
| CONV3 | 27 × 27    | 3 × 3       | 256            |
| CONV4 | 13 × 13    | 3 × 3       | 384            |
| CONV5 | 13 × 13    | 3 × 3       | 384            |

2.3 CNN Models

The ImageNet competition in recent years has produced many well-designed CNNs for image classification [14, 24]. Popular designs use multiple convolution layers, which gradually extract features from local to global scale as computation proceeds to deeper layers. The computation is conducted on data of high dimension (images: Batch × f_in × l_img × l_img; kernels: f_out × f_in × l_kern × l_kern), where Batch specifies the size for batch processing. From the specification shown in Table 1, we observe how the network extracts more and more features as images are transformed to lower and lower resolution.

2.4 CNN Accelerators

To speed up the inference of large-scale CNNs, much effort has been spent on dedicated hardware accelerators. The projects in [16, 23, 25, 27] target spatial convolution; they optimize the hardware by applying techniques such as loop tiling/unrolling and data mapping. Little optimization is done from the computation complexity point of view, and the design space of such designs can be huge. On the other hand, the works in [5, 9, 28] apply the Winograd transform or frequency domain convolution to reduce the number of operations. The promising results they show inspire further study of these algorithms. Our work proceeds from [28] to further optimize frequency domain convolution on FPGAs with a novel algorithm-architecture co-design.

3 ALGORITHM LEVEL OPTIMIZATIONPerforming convolution in frequency domain will reduce the com-putation complexity from O

(l2imд middotl

2kern middot fin middot fout

)to O

(l2imд middot fin middot

fout + l2imд middot log limд middot (fin + fout ))

for inference [18] This analysisis based on the assumption that the FFT size is equal to the imagesize and FFT is conducted on the entire input image We refer to theconvolution under such assumption as the native frequency domainconvolution We should note that the above complexity equationsets the theoretical lower bound under the assumption hard to besatised by realistic hardware Generally variance in the imagesizes is very large for dierent layers and the target hardware doesnot have such high exibility for FFT of various sizes Ecient FFThardware designs such as the radix-x based architecture [11] canonly support limited number of distinct FFT sizes (eg sizes of x i )under reasonable amount of runtime reconguration In additioneven if the hardware is able to support all the FFT sizes required by

dierent layers using such hardware results in low resource e-ciency since the gap between the maximum and minimum imagesizes is very large The problem of hardware runtime inexibility isaddressed by the OaA technique as tiling mitigates image scalingacross layers Although recent FPGA implementation of CNNs us-ing OaA has shown good performance [28] a detailed analysis ofthis technique specically in the CNN context is needed to optimizethe implementation

This section proposes two general algorithmic optimizations which reduce the computation complexity of frequency domain convolution at low hardware cost. Starting from the complexity analysis of OaA applied to CNNs, we first propose a hybrid algorithm combining frequency domain convolution with and without OaA. We show that this approach performs well for both large and small images and requires only small FFT sizes. Then we further improve the algorithm's flexibility and performance by proposing a dual operation of OaA, which we call Concatenate and Pad (CaP). Due to the limited number of FFT sizes supported in hardware, we propose a second hybrid algorithm combining OaA and CaP-OaA with the native frequency domain convolution. It achieves further computation complexity reduction and hardware simplification without requiring any FFT hardware reconfiguration. The proposed algorithm is a key component in our algorithm-architecture co-design.

3.1 Optimization for Images of Different Scales

Image scaling is common in CNNs. Table 1 shows that, from the first convolution layer to the last, images shrink more than 17 times in AlexNet. The kernel sizes are reduced accordingly. A previous study [28] dealt with images of different scales by partitioning the large images using OaA. We further analyze the computation complexity of the OaA approach when applied to CNNs, and show how FFTs of small sizes can reduce the number of operations significantly for all convolution layers.

By using OaA, we perform FFT, Hadamard product and IFFT on each input tile. Following the notation in Section 2.2, the total number of operations needed to convolve a full image can be calculated as

  O_total = (O_tileFFT + O_tileHadamard + O_tileIFFT + O_tileOaA) · ⌈(l_img + l_kern − 1) / (N − l_kern + 1)⌉²    (4)

where

  O_tileFFT      = C1 · N² · log N · f_in
  O_tileIFFT     = C2 · N² · log N · f_out
  O_tileHadamard = C2 · N² · f_in · f_out
  O_tileOaA      = C3 · N · (l_kern − 1) · f_out

C_* are constants capturing the cost of one addition or multiplication.

[Figure 1: AlexNet: effect of FFT size on the total number of operations per convolution layer. Curves: Spatial, FFT-hybd, OaA-16, OaA-32, OaA-64, OaA-128. "FFT-hybd" is produced by the hybrid algorithm using native FFT convolution and FFT convolution with OaA.]

Previous work [12, 28] notes that Equation 4 is asymptotically bounded by O(l_img² · log N) and considers it superior to the native FFT, whose complexity is O(l_img² · log l_img). We argue that this conclusion is valid only for single frame convolution and does not apply to CNNs once f_in and f_out come into the picture. Most of the time, except for the first layer, f_in and f_out are at least as large as N or l_img. Hence, in the CNN problem domain, O_tileFFT and O_tileIFFT are much smaller than O_tileHadamard (since log N ≪ f_in, f_out), and the complexity of OaA applied to CNNs should instead be O(l_img² · f_in · f_out). Equation 4 can be approximated as

  O_total^approx ≈ C2 · f_in · f_out · ( (l_img + l_kern − 1) / (1 − (l_kern − 1)/N) )²    (5)

The domain of N is l_kern − 1 < N ≤ l_img + l_kern − 1. Note that ∂O/∂N = −C4 · N / (N − l_kern + 1)³. Thus we observe that (1) the number of operations decreases as N increases, and (2) the benefit of increasing N diminishes once N becomes sufficiently larger than (l_kern − 1). By Observation (1), N should not fall in the left-end region of its domain. By Observation (2), N should not be in the right-end region: there is little benefit in increasing N to match the image size when l_img is very large. Thus it can be concluded that the middle ground between (l_kern − 1) and (l_img + l_kern − 1) is the most suitable region for N.
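These observations can be reproduced numerically. The sketch below evaluates Equation 4 for a hypothetical layer; the constants C1–C3 and the layer dimensions are illustrative placeholders, not values from the paper.

```python
import math

C1 = C2 = C3 = 1.0   # placeholder costs of one addition/multiplication

def oaa_ops(l_img, l_kern, f_in, f_out, N):
    """Total operation count of OaA-based frequency domain convolution (Equation 4)."""
    tiles = math.ceil((l_img + l_kern - 1) / (N - l_kern + 1)) ** 2
    o_fft      = C1 * N * N * math.log2(N) * f_in
    o_ifft     = C2 * N * N * math.log2(N) * f_out
    o_hadamard = C2 * N * N * f_in * f_out
    o_oaa      = C3 * N * (l_kern - 1) * f_out
    return (o_fft + o_hadamard + o_ifft + o_oaa) * tiles

# Scan N over the domain (l_kern - 1, l_img + l_kern - 1] for a mid-network layer.
ops = {N: oaa_ops(l_img=27, l_kern=5, f_in=256, f_out=384, N=N)
       for N in (6, 8, 12, 16, 24, 31)}
```

Mid-range values of N yield the lowest counts, while both the smallest N (too many tiles) and the largest N (wasteful padding per tile) are worse, matching the analysis above.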

The motivation for OaA is to avoid using a large-size FFT when dealing with large images. For small images, we can simply apply native frequency domain convolution; the number of operations then equals C2 · f_in · f_out · (N′)², where N′ is the smallest FFT size such that N′ ≥ l_img + l_kern − 1. We can therefore use a hybrid algorithm that applies frequency domain convolution with OaA to large images and native frequency domain convolution to small ones.

Figure 1 illustrates our analysis for AlexNet. By using the OaA technique, FFTs with a small variance in sizes can process images of different scales quite well. The green line shows the performance of the hybrid frequency domain algorithm. The first two layers use OaA based convolution with 32 and 16 points, and the last three layers use native frequency domain convolution with 32, 16 and 16 points, respectively. Compared with spatial convolution, our approach reduces the number of operations by 50.6%, using only two small FFT sizes of 16 and 32.

This section proposed an optimization that applies OaA in the CNN context. The optimization greatly relieves the reconfiguration requirement on the FFT hardware while preserving the native frequency domain convolution's advantage in terms of computation complexity. In the next section we present our second optimization, which further reduces the number of operations and completely eliminates the need for hardware reconfiguration.

3.2 Optimization for Exploiting a Limited Number of FFT Sizes

The analysis of Section 3.1, for both OaA based and native FFT convolution, neglects the wasted computation incurred by padding images to match FFT sizes. This may result in sub-optimal implementations due to the limitations of realistic FFT hardware. Without a reasonable amount of reconfiguration, high performance FFT hardware can only support a limited number of distinct FFT sizes. As an example, consider a radix-4 FFT design: FFT sizes of 4, 16, 64 and 256 need to handle images of any size ranging from ten to several hundred pixels.

The key problem is the mismatch between image sizes and FFT sizes. Since we have little control over the FFT sizes under the hardware constraints, we address this problem by changing the image sizes through input data reshaping.

Algorithm 1 Operations of a convolution layer
1:  procedure ConvLayer
2:      ▷ I is the input image array
3:      ▷ K is the kernel array
4:      for i_b = 1 to Batch do
5:          for i_out = 1 to f_out do
6:              for i_in = 1 to f_in do
7:                  T_in[i_b][i_out][i_in] ← I[i_b][i_in] ∗ K[i_out][i_in]
8:              end for
9:              T_out[i_b][i_out] ← Σ_{j=1}^{f_in} T_in[i_b][i_out][j]
10:         end for
11:     end for
12: end procedure

The operation of a CNN layer can be summarized by Algorithm 1. The operation of concern is the convolution in line 7. For the nested loop, we would like to manipulate one of the Batch, f_out, f_in dimensions so that l_img of image I can be resized. Index i_b takes effect on I and is independent of K, so Batch is the appropriate dimension. We refer to this manipulation of the Batch and l_img dimensions as the reshaping of input data.
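A direct NumPy transcription of Algorithm 1 makes the loop structure and the f_in reduction of line 9 explicit. This is an illustrative model only: the paper's hardware operates on frequency domain tiles, while here the per-channel convolution of line 7 is realized with a full-size FFT, and T_in is accumulated straight into T_out.

```python
import numpy as np

def fft_conv2d_full(x, k):
    """Full linear 2D convolution via zero-padded FFT."""
    s = (x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1)
    return np.real(np.fft.ifft2(np.fft.fft2(x, s) * np.fft.fft2(k, s)))

def conv_layer(I, K):
    """Algorithm 1 as a NumPy loop nest.
    I: (Batch, f_in, l_img, l_img) input images
    K: (f_out, f_in, l_kern, l_kern) kernels
    Returns T_out: (Batch, f_out, l_out, l_out)."""
    batch, f_in, l_img, _ = I.shape
    f_out, _, l_kern, _ = K.shape
    l_out = l_img + l_kern - 1
    T_out = np.zeros((batch, f_out, l_out, l_out))
    for ib in range(batch):
        for i_out in range(f_out):
            for i_in in range(f_in):   # line 7: per-channel convolution,
                # fused with the line 9 reduction over f_in
                T_out[ib, i_out] += fft_conv2d_full(I[ib, i_in], K[i_out, i_in])
    return T_out
```

Only the i_b index touches I without touching K, which is why the Batch dimension is the one manipulated by CaP.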

We propose a dual operation of OaA, Concatenate and Pad (CaP), and apply it together with OaA to gain flexibility along the l_img dimension. We need to ensure that reshaping does not affect the output of the convolution and introduces negligible overhead.

We define the CaP operation as follows. Given a set of d² images I of equal size l_img × l_img, we arrange the images in a d × d mesh with x pixels of zero value between vertically or horizontally adjacent images. CaP outputs a large image I_CaP by concatenating I and the zero pixels in between; x is defined as the size of the zero padding. Figure 2 illustrates this process. The following theorem shows that, by choosing an appropriate x, we can produce the correct output of convolution.

Theorem 3.1. For a given kernel K of size l_kern × l_kern, convolution using K on a set of images I is identical to performing convolution using K on I_CaP iff the padding x is at least l_kern − 1.

[Figure 2: Illustration of the CaP operation.]

Proof. The proof of x_min = l_kern − 1 can be done from the perspective of either spatial or frequency domain convolution. We prove the theorem via OaA in the frequency domain, to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘(I, l_tile) outputs a set of tiles T after partitioning I into l_tile × l_tile tiles. Define ⊕_k as the concatenation operation: ⊕_k(I) outputs a large image by concatenating the 2D mesh of I with k pixels of margin between adjacent images; ⊕_{−k}(I) outputs an image by concatenating the 2D mesh of I with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixel along each of the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations, where T_∗ denotes the set of full convolution outputs I ∗ K:

  I_CaP = ⊕_x(I)
  I^x = ⊘(I_CaP, l_img + x)
  T̃ = F⁻¹( F((I^x)^(l_kern−1)) ∘ F(K^(x+l_img−1)) ) = I^x ∗ K = (I ∗ K)^x = T_∗^x
  I_CaP ∗ K = ⊕_{−(l_kern−1)}(T̃) = ⊕_{−(l_kern−1)}(T_∗^x) = ⊕_{x−(l_kern−1)}(T_∗) = ⊕_{x−(l_kern−1)}(I ∗ K)

From the ⊕_{x−(l_kern−1)} operator in the last step, it can be concluded that convolution on I_CaP is identical to convolution on the set of images I iff x − (l_kern − 1) ≥ 0. Thus the minimum value of x that avoids aliasing between adjacent images is (l_kern − 1). □
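Theorem 3.1 can be checked numerically. The sketch below is a NumPy model with illustrative sizes (not the paper's code): it builds I_CaP with x = l_kern − 1, performs a single convolution on it, and the disjoint output blocks then equal the individual full convolutions.

```python
import numpy as np

def fft_conv2d_full(x, k):
    """Full linear 2D convolution via zero-padded FFT."""
    s = (x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1)
    return np.real(np.fft.ifft2(np.fft.fft2(x, s) * np.fft.fft2(k, s)))

def cap(images, d, x):
    """Concatenate d*d equal-size images into a d x d mesh with x zero
    pixels between vertically/horizontally adjacent images (CaP)."""
    l = images[0].shape[0]
    big = np.zeros((d * l + (d - 1) * x,) * 2)
    for idx, im in enumerate(images):
        r, c = divmod(idx, d)
        big[r * (l + x): r * (l + x) + l, c * (l + x): c * (l + x) + l] = im
    return big

rng = np.random.default_rng(0)
l_img, l_kern, d = 6, 3, 2
imgs = [rng.random((l_img, l_img)) for _ in range(d * d)]
K = rng.random((l_kern, l_kern))

Y = fft_conv2d_full(cap(imgs, d, x=l_kern - 1), K)  # one conv on I_CaP
blk = l_img + l_kern - 1  # with x = l_kern - 1, output blocks abut exactly
```

With the minimum padding, the per-image output blocks of size (l_img + l_kern − 1)² tile Y with zero overlap, which is the ⊕_0 case of the proof.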

Based on Theorem 3.1, we can CaP the images of a batch so that an input of shape Batch × f_in × f_out × l_img² is reshaped to (Batch/d²) × f_in × f_out × (l_CaP_img)², where l_CaP_img = d · l_img + (d − 1) · (l_kern − 1) and d is referred to as the Batch folding factor. We can then apply the OaA technique on the CaP image to complete the convolution. We abbreviate FFT based convolution with OaA as OaA, and FFT based convolution with both CaP and OaA as CaP-OaA. We analyze, for a convolution layer, how much improvement CaP-OaA brings compared with OaA and the native approach. We assume that the hardware supports the FFT sizes N = [N_1, N_2, …, N_m], where N_1 < N_2 < … < N_m. Theorem 3.2 evaluates the optimal complexity of CaP-OaA. Let K_i = (l_img + l_kern − 1) / (N_i − (l_kern − 1)).

Theorem 3.2. The minimum computation complexity of CaP-OaA applied to a CNN layer is C2 · f_in · f_out · K_m² · N_m², where C2 is some constant. To achieve this complexity, the Batch folding factor d should be a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)).

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

  O_CaP-OaA = C2 · f_in · f_out · ⌈(l_CaP_img + l_kern − 1) / (N_i − (l_kern − 1))⌉² · N_i² / d²
            = C2 · f_in · f_out · ⌈d · K_i⌉² · N_i² / d²    (6)

The minimum computation complexity can thus be obtained from the following inequality:

  O_CaP-OaA ≥ C2 · f_in · f_out · K_i² · N_i² ≥ C2 · f_in · f_out · K_m² · N_m²

We can set d to a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)) to eliminate the ceiling function, and choose N_i = N_m. Then min O_CaP-OaA = C2 · f_in · f_out · K_m² · N_m². □

Corollary 3.3. The computation complexity of CaP-OaA is never higher than that of OaA. The minimum ratio of CaP-OaA's number of operations over OaA's number of operations is max_i ( K_m/⌈K_i⌉ · N_m/N_i )².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT if N_l / N_m > K_m, where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern − 1. The minimum ratio of CaP-OaA's number of operations over the native frequency domain convolution's number of operations is (K_m · N_m / N_l)².

The corollaries can be proven directly by comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern − 1.
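The choice of Batch folding factor in Theorem 3.2 is easy to compute. The helper below is our own illustrative code (the numeric example is hypothetical): it returns the smallest d that makes d · K_m an integer, so the ceiling in Equation 6 becomes exact.

```python
import math

def min_batch_folding(l_img, l_kern, N_m):
    """Smallest d with d*K_m integral (Theorem 3.2):
    d = (N_m - (l_kern-1)) / gcd(l_img + l_kern - 1, N_m - (l_kern-1))."""
    a = l_img + l_kern - 1          # numerator of K_m
    b = N_m - (l_kern - 1)          # denominator of K_m
    return b // math.gcd(a, b)

def cap_oaa_min_ops(l_img, l_kern, f_in, f_out, N_m, C2=1.0):
    """Per-image optimum of Theorem 3.2: C2 * f_in * f_out * (K_m * N_m)^2."""
    K_m = (l_img + l_kern - 1) / (N_m - (l_kern - 1))
    return C2 * f_in * f_out * (K_m * N_m) ** 2

# Hypothetical layer: l_img = 13, l_kern = 3, N_m = 16, so K_m = 15/14
d = min_batch_folding(l_img=13, l_kern=3, N_m=16)
```

Here gcd(15, 14) = 1, so d = 14 images per mesh side are folded before the ceiling disappears; smaller layers with friendlier gcds need far less folding.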

Figure 3 shows the effect of including CaP in frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm using the native approach and OaA, under two different FFT constraints. The blue points assume N = [4, 16, 64, 256]; the red points show the performance with more hardware support, N = [4, 8, 16, 32, 64, 128, 256]. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach. Most of the time, for both FFT sequences, the extra flexibility brought by CaP reduces the computation complexity significantly.

One additional benefit of applying CaP-OaA is that it makes a single FFT size sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the domain constraint N ≤ l_img + l_kern − 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces computation by 57.5% compared with spatial convolution, using only a single FFT size of 32.

[Figure 3: Effect of the CaP optimization. The x axis is the image size; the y axis is the reduction in computation complexity by CaP, calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1. Two curves: FFT Seq 1 and FFT Seq 2.]

3.3 Discussion of the Various Optimization Techniques

The optimization techniques discussed so far can be viewed in a unified way. By setting the Batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern − 1, OaA reduces to the native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can thus optimize using a hybrid algorithm that combines all three techniques: the native approach, OaA and CaP-OaA. The three algorithms can be applied separately to different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs at very low hardware cost.

We should also note that CaP is based on batch processing. Although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm using only OaA and the native algorithm results in a low latency design, while the further inclusion of CaP-OaA results in a high throughput design.

4 SYSTEM ARCHITECTURE

This section gives an overview of the FPGA architecture that executes the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules performing the FFT, Hadamard product and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT of an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFTs on the N rows of the input matrix. In the second phase, we perform N-point 1D FFTs on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.
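The two-phase row-column decomposition can be sketched in a few lines of NumPy (a functional model of the data flow, not the streaming hardware): phase 1 applies 1D FFTs along the rows, the matrix is transposed, and phase 2 applies 1D FFTs along the rows of the transposed matrix, i.e., the columns of phase 1's output.

```python
import numpy as np

def fft2d_row_column(x):
    """2D FFT via two phases of 1D FFTs with a transpose in between."""
    phase1 = np.fft.fft(x, axis=1)         # phase 1: N-point FFT on each row
    phase2 = np.fft.fft(phase1.T, axis=1)  # transpose, then row FFTs = column FFTs
    return phase2.T                        # undo the transpose
```

Because the 2D transform is separable, this is exactly equivalent to a direct 2D FFT, which is what lets the hardware reuse one bank of 1D FFT pipelines for both phases.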

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^(i−1) is required between stages i and (i+1) to support folding [6]. Thus the total memory consumption of the permutation networks across the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the N × N matrix transpose between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutations using the in-place permutation in time algorithm, and requires only N × N single-port memory.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can in turn be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform; outputs are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle must have the same index in the f_in dimension. Then, after f_in cycles, the accumulator completes the reduction over the f_in dimension. A data parallelism of P_img · P_kern · N² is required to support the full permutation of the set of P_img image tiles with the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus the ratio of the input rate to the output rate is f_in/f_out, and we can easily verify that the MAC module design is work optimal.

[Figure 4: Overall System Design. The CPU and DRAM connect to the FPGA, on which the image and kernel buffers feed the 2D FFT, MAC, and 2D IFFT modules.]

Table 2: Summary of Resource Consumption and Throughput

               | 2D FFT                                                  | MAC
    Logic      | 2 · log_x N · (N/q_1DFFT − 1) · (N/q_2DFFT) · P_FFT · L | P_img · P_kern · N² · L
    Memory     | (2 · 3N/f_2DFFT + N²) · P_FFT · D                       | —
    Throughput | P_FFT · N² / (q_1DFFT · q_2DFFT)                        | P_img · P_kern · N² / f_in

We can also fold the MAC module by a factor q_MAC, so that the operating granularity is smaller than an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex number adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm for any given CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules as well as the ample resources on chip create a huge design space to explore. On the other hand, since the same hardware is deployed to accelerate all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, consider the input/output rate requirement of an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l′_img × l′_img, then, regardless of the underlying hardware, the required ratio of input to output data rate is approximately f_in/f_out. This ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique that greatly reduces the design space to be explored.

5.1 Data Flow

There has been much discussion of how to design the data flow so that memory reuse and locality are exploited to the maximum extent. Unlike spatial convolution, where the memory access pattern of the deeply nested loop is complicated, our frequency domain approach simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution. Using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product: it performs element-wise operations on the matrices, meaning that all loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to store the kernels directly in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, leaving a larger hardware resource budget for the MAC module to improve overall system throughput. However, this approach has a negative impact on system bandwidth: if the convolution uses N-point FFTs, transferring kernels in the frequency domain instead of the original spatial domain increases the total data communication for kernels from O(l_kern²) to O(N²). Depending on the values of N and l_kern, and on the hardware bandwidth and logic resources, either kernel data flow can be more suitable; hence the optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.

5.2 Analysis of I/O and Computation Cost

The analysis in this section assumes that kernels in the frequency domain are loaded onto the FPGA, so there are no kernel FFT modules. The analysis can easily be modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely, and call this one round of computation. The peak throughputs of the FFT, MAC and IFFT modules are denoted T_FFT, T_MAC and T_IFFT; their expressions are summarized in Table 2.

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). Based on our data reuse scheme, the amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data transferred to the kernel buffer is D_kern = f_in · f_out · N². The IFFT module produces D_IFFT = (f_out / f_in) · D_FFT amount of output.

The I/O time t_IO_round is determined by the largest time needed for D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT modules to external memory, we account for D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as on the peak throughput of the subsequent FFT module. This is shown in Equation 7:

  B_FFT = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys − B_IFFT), T_FFT }
  B_kern = (B_sys − B_IFFT) − B_FFT    (7)

The I/O time is then calculated as t_IO_round = max{ D_FFT / B_FFT, D_kern / B_kern }.

The computation time t_comp_round is defined as the time to consume one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules as well as by the time for loading the image buffer. The throughput, i.e., the output bandwidth, of the MAC and IFFT modules is given by Equation 8:

  B_IFFT = min{ D_IFFT / t_IO_round, T_MAC, T_IFFT }    (8)

Using Equation 7 as well as the relation among t_IO_round, D_FFT and D_IFFT, we can express Equation 8 as

  B_IFFT = min{ T_MAC, T_IFFT, (f_out / f_in) · T_FFT, (Q / (1 + Q)) · B_sys }    (9)

where Q = (f_out / f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on hardware performance. The difficulty is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N; we deal with the variance of f_in and f_out in Section 5.3.
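Equation 9 is straightforward to evaluate. The sketch below uses purely illustrative numbers (the T and B values are in arbitrary common units, not measurements from the paper's hardware) to show how the binding term of the min identifies the bottleneck for a given layer:

```python
def b_ifft(T_mac, T_fft, T_ifft, B_sys, M_img, f_in, f_out, N):
    """System throughput bound of Equation 9. All throughputs and bandwidths
    must be expressed in the same unit (e.g. words per cycle)."""
    Q = (f_out / f_in) * M_img / (M_img + f_in * f_out * N ** 2)
    return min(T_mac, T_ifft, (f_out / f_in) * T_fft, Q / (1 + Q) * B_sys)

# Hypothetical early layer: small f_in, large f_out/f_in ratio.
early = b_ifft(T_mac=256, T_fft=64, T_ifft=32, B_sys=128,
               M_img=2 ** 20, f_in=3, f_out=96, N=64)
```

For this example the (f_out/f_in) · T_FFT and bandwidth terms are large, so T_IFFT binds, matching the observation below that the IFFT module is typically the bottleneck in the first few layers.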

The computation time for one round is given by t_comp_round = D_IFFT / B_IFFT. In the above design, the image buffer consumption rate never exceeds the production rate, so the total time of a round is t_round = t_comp_round. For double buffering to completely hide the memory latency, min{T_MAC, T_IFFT} ≤ min{ (f_out/f_in) · T_FFT, (Q/(1+Q)) · B_sys } has to be satisfied.

For the other kind of design, which contains the kernel FFT modules, corresponding changes need to be made to the above equations; for example, D_kern becomes f_in · f_out · l_kern². Its performance model can be derived using the same ideas as in the above discussion.

5.3 Design Space Exploration

Processing a convolution layer involves several rounds. The total time for processing a convolution layer is given by

  t_layer = (f_in · (l*_img)² / M_img) · t_round = f_out · (l*_img)² / B_IFFT    (10)

where l*_img is the image size after the padding or Batch folding operation of the hybrid algorithm discussed in Section 3.

The goal of our architecture optimization is to identify the configuration of the different modules such that throughput or latency over the entire CNN is optimized. We define t_CNN as the time to process a single image through all convolution layers. For batch processing, in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, whether the metric is throughput or latency, we minimize t_CNN. We formulate the following optimization problem:

  minimize_H  Σ_{layer i} t_layer^i
  subject to  LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
              MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes {N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC}. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules; LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove variables through the following observations: (1) N is chosen by our algorithm level optimization; (2) M_kern does not affect performance because of the image oriented data reuse; (3) P_img and P_kern always appear as the product P_img · P_kern. The design space then consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which hardware resource is the bottleneck of the design, and it cannot be evaluated in general because the bottleneck can differ across convolution layers. In other words, if the target were to optimize a single layer only, the design space would be much smaller. For the single layer optimization problem, we can prove by contradiction that there exists an optimal configuration in which all hardware resources are fully utilized, such that

  B_IFFT = T_MAC = T_IFFT = (f_out / f_in) · T_FFT = (Q / (1 + Q)) · B_sys    (11)

Thus the single layer optimization problem contains three more equality constraints, and the number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H′ = {M_img, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem over a complete CNN, we can use the optimum of each single convolution layer as an indication. Based on Equation 9 and Table 2, we can observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across the convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) · T_FFT and (Q/(1+Q)) · B_sys all take high values; in this case T_IFFT is the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN, bounding the design space defined by H with the per-layer optima. Each single layer optimum is obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2 Design Space Exploration with Indication
1:  procedure DesignSpaceExploration    ▷ Conduct design space exploration on a complete CNN
2:      ▷ OPT stores the optimum for each single layer, OPT[i] ∈ H
3:      OPT[·] ← NULL
4:      for i = 0 to Layer #l do
5:          Perform design space exploration over H′
6:          OPT[i] ← optimum for layer i
7:      end for
8:      ▷ Range[d] stores the range of values for dimension d
9:      Range[·] ← NULL
10:     for d = 0 to Dimension #D do
11:         Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
12:     end for
13:     Perform design space exploration for the entire CNN with the design space bounded by Range[·]
14: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
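The two-stage procedure of Algorithm 2 can be sketched generically in Python. This is our own illustration with an abstract cost function; the paper's C++ explorer evaluates the full performance model of Section 5.2 instead.

```python
from itertools import product

def bounded_dse(layers, evaluate, candidates):
    """Two-stage design space exploration in the spirit of Algorithm 2.
    `evaluate(layer, config)` is a per-layer cost function; `candidates`
    maps each dimension name to its candidate values."""
    dims = sorted(candidates)
    # Stage 1: find the optimum configuration of each layer independently.
    opt = []
    for layer in layers:
        best = min(product(*(candidates[d] for d in dims)),
                   key=lambda cfg: evaluate(layer, dict(zip(dims, cfg))))
        opt.append(dict(zip(dims, best)))
    # Stage 2: bound each dimension by the min/max over the per-layer optima,
    # then search the whole CNN only inside this much smaller box.
    ranges = {d: [v for v in candidates[d]
                  if min(o[d] for o in opt) <= v <= max(o[d] for o in opt)]
              for d in dims}
    best = min(product(*(ranges[d] for d in dims)),
               key=lambda cfg: sum(evaluate(l, dict(zip(dims, cfg)))
                                   for l in layers))
    return dict(zip(dims, best))
```

The stage-2 search touches only configurations between the per-layer extremes, which is what collapses the exploration from hours to minutes.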

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform with integrated CPU and FPGA is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and the FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

[Figure 5: Target architecture integrating general purpose processors and FPGA. The CPU cores and on-chip cache hierarchy share main memory with the FPGA (BRAM, LUTs) through coherent memory interfaces and an on-chip interconnection; other I/O devices also attach to the CPU.]

The Altera Stratix V FPGA in HARP has 234,720 ALMs, 256 DSP blocks, and 6.25 MB of on-chip memory.

In our experiments we distribute the workload between the CPU and the FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations on the input image tiles and the kernel tiles. The CPU is responsible for the other lightweight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low latency design chooses a 64-point FFT for the first two layers and a 16-point FFT for the last three layers, using the hybrid algorithm described in Section 3.1. The high throughput design chooses a 64-point FFT for all five layers, using the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H, generated by our architectural design space exploration algorithm, are shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. The high throughput design requires configuration 1 only, while the low latency design requires both configurations 1 and 2 and switches between them through runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations, so the only change needed for MAC module reconfiguration is the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency over all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                   | DATE'16 [21]   | FPGA'16 [25]    | FPL'16 [17]     | FPGA'17 [28]    | Our Design (Low Latency) | Our Design (High Throughput)
FPGA               | Virtex-7 VC707 | Stratix-V GSD8  | Stratix-V GXA7  | Stratix-V GXA7  | Stratix-V GXA7           | Stratix-V GXA7
Frequency (MHz)    | 160            | --              | 100             | 200             | 200                      | 200
Precision          | Fixed 32-bit   | Fixed 8-16 bits | Fixed 8-16 bits | Floating 32-bit | Fixed 16-bit             | Fixed 16-bit
DSP Utilization    | 2688 (96%)     | 234 (91.4%)     | 256 (100%)      | 224 (87.5%)     | 232 (90.6%)              | 256 (100%)
Logic Utilization  | 45K (9%)       | 152K (65%)      | 121K (52%)      | 201K (86%)      | 200K (85%)               | 199K (85%)
On-chip RAM        | 543 (53%)      | 1744 (71%)      | 1552 (61%)      | 2000 (80%)      | 2143 (85%)               | 2130 (85%)
Latency (ms)       | --             | 20.1            | 12.75           | 40.81           | 1.24                     | 5.12
Throughput (GOPS)  | 147.82         | 72.4            | 114.5           | 83.0            | 117.7                    | 163.4

Table 4: Hardware Configuration of the Two Designs

Conf | M_img (MB) | P_MAC · N² / q_MAC | P_FFT · N² / (q_1DFFT · q_2DFFT) | P_IFFT · N² / (q_1DIFFT · q_2DIFFT)
1    | 24         | 1 · 64² / 4        | 1 · 64² / (32 · 16)              | 1 · 64² / (64 · 16)
2    | 24         | 4 · 16² / 1        | 1 · 16² / (8 · 4)                | 1 · 16² / (16 · 4)

CPU computes concurrently with the FPGA for the other operations involved in the CNN; the CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 1.24 ms, which is lower than all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and the low-latency designs are very similar; the same hardware can be reused by the two designs, with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of FFT, MAC and IFFT, and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low-Latency Design). Unit of the values: Complex Word/Sec

      | T_MAC | T_IFFT · f_out / f_in | T_FFT · Q / (1 + Q) | B_sys | Bottleneck
CONV1 | 3413  | 80                    | 1280                | 61    | B_sys
CONV2 | 107   | 80                    | 107                 | 42    | B_sys
CONV3 | 40    | 80                    | 60                  | 32    | B_sys
CONV4 | 27    | 80                    | 40                  | 19    | B_sys
CONV5 | 27    | 80                    | 27                  | 18    | B_sys

From Table 5, we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency must be at least the transfer time of the image data, kernel data and output data. This time, t_mem, can be calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 1.24 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.
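The bandwidth-bound lower bound described above is just total traffic divided by peak bandwidth. The sketch below makes that arithmetic explicit; the traffic and bandwidth numbers in the usage line are purely illustrative, not the paper's measurements.

```python
def memory_bound_latency(image_bytes, kernel_bytes, output_bytes, peak_bw_bytes_per_s):
    """Lower bound t_mem on single-image latency when the design is
    external-bandwidth bound: total data moved / peak bandwidth."""
    total = image_bytes + kernel_bytes + output_bytes
    return total / peak_bw_bytes_per_s

# Illustrative (hypothetical) numbers: 30 MB of total traffic over a
# 25 GB/s link gives a 1.2 ms latency floor, regardless of compute power.
t_mem = memory_bound_latency(20e6, 8e6, 2e6, 25e9)
```

A measured latency close to this floor, as in Table 5's analysis, indicates that the accelerator is saturating the memory system rather than starving for compute.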

7 CONCLUSION

In this paper, we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large-scale CNNs on FPGAs. Our design was based on two observations on state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 1.24 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467. http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN), 1–7. https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. 55–64. https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications (FPL), 1–7. https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), 1–6. https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), ACM, New York, NY, USA, 240–249. https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, IEEE, 257–260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23–32. https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815. http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14), ACM, New York, NY, USA, 675–678. https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308. http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), ACM, New York, NY, USA, 45–54. https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 1–8. https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851. http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1393–1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks, Part III (ICANN '10), Springer-Verlag, Berlin, Heidelberg, 82–91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12. https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), ACM, New York, NY, USA, 16–25. https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842. http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), ACM, New York, NY, USA, 161–170. https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. 35–44. https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), IEEE, 1–6.


Table 1: AlexNet Specification

Layer | Image Size | Kernel Size | Input Features
CONV1 | 224 × 224  | 11 × 11     | 3 (RGB)
CONV2 | 55 × 55    | 5 × 5       | 96
CONV3 | 27 × 27    | 3 × 3       | 256
CONV4 | 13 × 13    | 3 × 3       | 384
CONV5 | 13 × 13    | 3 × 3       | 384

2.3 CNN Models

The ImageNet competition in recent years has produced many well-designed CNNs for image classification [14, 24]. Popular designs use multiple convolution layers, which gradually extract features from local to global scale when proceeding to deeper layers. The computation is conducted on data of high dimensions (images: Batch × f_in × l_img × l_img; kernels: f_out × f_in × l_kern × l_kern), where Batch specifies the size for batch processing. From the specification shown in Table 1, we observe how the network extracts more and more features as images are transformed to lower and lower resolution.

2.4 CNN Accelerators

To speed up the inference of large-scale CNNs, much effort has been spent on dedicated hardware accelerators. The projects in [16, 23, 25, 27] target spatial convolution; they optimize the hardware by applying techniques such as loop tiling/unrolling and data mapping. Little optimization is done from the computation complexity point of view, and the design space of these designs can be huge. On the other hand, the works in [5, 9, 28] apply the Winograd transform or frequency domain convolution to reduce the number of operations; the promising results they show inspire further study of these algorithms. Our work proceeds from [28] to further optimize frequency domain convolution on FPGAs with a novel algorithm-architecture co-design.

3 ALGORITHM LEVEL OPTIMIZATION

Performing convolution in the frequency domain reduces the computation complexity from O(l²_img · l²_kern · f_in · f_out) to O(l²_img · f_in · f_out + l²_img · log l_img · (f_in + f_out)) for inference [18]. This analysis is based on the assumption that the FFT size is equal to the image size and that the FFT is conducted on the entire input image; we refer to convolution under this assumption as native frequency domain convolution. We should note that the above complexity sets a theoretical lower bound under an assumption that is hard to satisfy on realistic hardware. Generally, the variance in image sizes across layers is very large, and the target hardware does not have the flexibility to support FFTs of arbitrary sizes. Efficient FFT hardware designs, such as the radix-x based architecture [11], can only support a limited number of distinct FFT sizes (e.g., sizes of x^i) under a reasonable amount of runtime reconfiguration. In addition, even if the hardware were able to support all the FFT sizes required by the different layers, using such hardware would result in low resource efficiency, since the gap between the maximum and minimum image sizes is very large. The problem of hardware runtime inflexibility is addressed by the OaA technique, as tiling mitigates image scaling across layers. Although a recent FPGA implementation of CNNs using OaA has shown good performance [28], a detailed analysis of this technique specifically in the CNN context is needed to optimize the implementation.

This section proposes two general algorithmic optimizations which reduce the computation complexity of frequency domain convolution at low hardware cost. Starting from the complexity analysis of OaA applied to CNNs, we first propose a hybrid algorithm combining frequency domain convolution with and without OaA. We show that this approach performs well for both large and small images, and requires only small FFT sizes. We then further improve the algorithm's flexibility and performance by proposing a dual operation of OaA, which we call Concatenate and Pad (CaP). Due to the limited number of FFT sizes supported in hardware, we propose a second hybrid algorithm combining OaA and CaP-OaA with the native frequency domain convolution. It achieves further computation complexity reduction and hardware simplification without requiring any FFT hardware reconfiguration. The proposed algorithm is a key component in our algorithm-architecture co-design.

3.1 Optimization for Images of Different Scales

Image scaling is common in CNNs. Table 1 shows that, from the first convolution layer to the last, images shrink more than 17 times in AlexNet; the kernel sizes are reduced accordingly. A previous study [28] dealt with images of different scales by partitioning the large images using OaA. We further analyze the computation complexity of the OaA approach when applied to CNNs, and show how small FFT sizes can reduce the number of operations significantly for all convolution layers.
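As a concrete illustration of how OaA decomposes one large convolution into many small, independent tile convolutions, the following 1-D Python sketch convolves each tile directly, standing in for the per-tile FFT → Hadamard → IFFT path of the design; the function names and the block size are ours, not the paper's:

```python
def conv_full(x, k):
    """Direct 'full' linear convolution, used as the reference result."""
    y = [0.0] * (len(x) + len(k) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            y[i + j] += xi * kj
    return y

def conv_overlap_add(x, k, n):
    """Overlap-and-Add with FFT size n: split x into blocks of
    n - len(k) + 1 samples, convolve each block independently, and
    add the overlapping tails of adjacent block outputs together."""
    step = n - len(k) + 1
    y = [0.0] * (len(x) + len(k) - 1)
    for start in range(0, len(x), step):
        block = x[start:start + step]
        for i, v in enumerate(conv_full(block, k)):
            y[start + i] += v   # overlapping regions accumulate
    return y
```

Because convolution is linear, the overlap-added tile results reproduce the full convolution exactly, which is why the FFT size N can be chosen freely (subject to N > l_kern − 1) without changing the output.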

By using OaA, we perform FFT, Hadamard product and IFFT on each input tile. Following the notation in Section 2.2, the total number of operations needed to convolve a full image is

O_total = (O_tileFFT + O_tileHadamard + O_tileIFFT + O_tileOaA) · ⌈(l_img + l_kern − 1) / (N − l_kern + 1)⌉²    (4)

where
O_tileFFT = C1 · N² · log N · f_in
O_tileIFFT = C2 · N² · log N · f_out
O_tileHadamard = C2 · N² · f_in · f_out
O_tileOaA = C3 · N · (l_kern − 1) · f_out
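Equation 4 can be evaluated directly for a given layer and FFT size. The sketch below implements it term by term; the constants C1–C3 are left as parameters, and setting them all to 1 is an arbitrary normalization of ours, not the paper's calibration:

```python
from math import ceil, log2

def oaa_op_count(l_img, l_kern, f_in, f_out, N, C1=1.0, C2=1.0, C3=1.0):
    """Operation count of one convolution layer under OaA (Equation 4).
    The number of tiles is the squared ceiling term; each tile pays the
    FFT, Hadamard, IFFT and overlap-add costs."""
    tiles = ceil((l_img + l_kern - 1) / (N - l_kern + 1)) ** 2
    o_fft      = C1 * N**2 * log2(N) * f_in
    o_ifft     = C2 * N**2 * log2(N) * f_out
    o_hadamard = C2 * N**2 * f_in * f_out
    o_oaa      = C3 * N * (l_kern - 1) * f_out
    return tiles * (o_fft + o_hadamard + o_ifft + o_oaa)
```

Sweeping N over the supported FFT sizes with this function reproduces the trade-off analyzed next: too small an N inflates the tile count, while a very large N buys little once N greatly exceeds l_kern − 1.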

and the C_∗ are constants for the cost of addition or multiplication.

Previous work [12, 28] notes that this equation is asymptotically bounded by O(l²_img · log N) and considers it superior to the native FFT, whose complexity is O(l²_img · log l_img). We argue that this conclusion is valid only for single-frame convolution and does not apply to CNNs once f_in and f_out come into the picture. Most of the time, except for the first layer, f_in and f_out are at least as large as N or l_img; so, for the domain of the CNN problem, O_tileFFT and

[Figure 1: AlexNet: effect of the FFT size on the total number of operations (GigaOperations per convolution layer, for Spatial, FFT-hybd, OaA-16, OaA-32, OaA-64 and OaA-128). "FFT-hybd" is produced by the hybrid algorithm using native FFT convolution and FFT convolution with OaA.]

O_tileIFFT are much less than O_tileHadamard (log N ≪ f_in, f_out). Thus the complexity of OaA applied to a CNN should instead be O(l²_img · f_in · f_out), and Equation 4 can be approximated as

O_approx_total ≈ C2 · f_in · f_out · ((l_img + l_kern − 1) / (1 − (l_kern − 1)/N))²    (5)

The domain of N is l_kern − 1 < N ≤ l_img + l_kern − 1. Note that ∂O/∂N = −C4 · N / (N − l_kern + 1)³. Thus we observe that: (1) the number of operations decreases as N increases; and (2) the benefit of increasing N diminishes once N becomes sufficiently larger than (l_kern − 1). By Observation (1), N should not fall in the left-end region of its domain. By Observation (2), N should not be in the right-end region either: there is little benefit in increasing N to match the image size when l_img is very large. Thus, it can be concluded that the middle ground between (l_kern − 1) and (l_img + l_kern − 1) is the most suitable region for N.

The motivation for OaA is to avoid using a large-size FFT when dealing with large images. For small images, we can simply apply native frequency domain convolution; the number of operations then equals C2 · f_in · f_out · (N′)², where N′ is the smallest FFT size such that N′ ≥ l_img + l_kern − 1. We can therefore use a hybrid algorithm that combines frequency domain convolution with OaA and native frequency domain convolution.

Figure 1 illustrates our analysis for AlexNet. With the OaA technique, FFTs with small variance in size can handle images of very different scales. The green line shows the performance of the hybrid frequency domain algorithm: the first two layers use OaA-based convolution with 32 and 16 points, and the last three layers use native frequency domain convolution with 32, 16 and 16 points, respectively. Compared with spatial convolution, our approach reduces the number of operations by 50.6% using only two small FFT sizes, 16 and 32.

This section proposed an optimization applying OaA in the CNN context. The optimization greatly relieves the reconfiguration requirement on the FFT hardware while preserving the native frequency domain convolution's advantage in computation complexity. In the next section we present our second optimization, which further reduces the number of operations and completely eliminates the need for hardware reconfiguration.

3.2 Optimization for Exploiting a Limited Number of FFT Sizes

The analysis of Section 3.1, for both OaA-based and native FFT, neglects the wasted computation incurred by padding images to match FFT sizes. This may result in sub-optimal implementations due to the limitations of realistic FFT hardware: without a reasonable amount of reconfiguration, high-performance FFT hardware is only able to support a limited number of different FFT sizes. As an example, consider the case of a radix-4 FFT design. The FFT sizes of 4, 16, 64 and 256 must handle images of any size, ranging from around ten to several hundred pixels.

The key problem is the mismatch between image sizes and FFT sizes. Since we have little control over the FFT sizes under the hardware constraints, we address this problem by changing the image sizes through input data reshaping.

Algorithm 1: Operations of a convolution layer
 1: procedure ConvLayer
 2:     ▷ I is the input image array
 3:     ▷ K is the kernel array
 4:     for i_b = 1 to Batch do
 5:         for i_out = 1 to f_out do
 6:             for i_in = 1 to f_in do
 7:                 T_in[i_b][i_out][i_in] ← I[i_b][i_in] ∗ K[i_out][i_in]
 8:             end for
 9:             T_out[i_b][i_out] ← Σ_{j=1}^{f_in} T_in[i_b][i_out][j]
10:         end for
11:     end for
12: end procedure
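A minimal runnable rendering of Algorithm 1 is shown below. The toy `conv2d_full` stands in for the ∗ operator on line 7 (the paper's design performs this step in the frequency domain); the names are ours:

```python
def conv2d_full(img, ker):
    """Toy direct 'full' 2-D convolution, standing in for line 7's *."""
    h, w = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    out = [[0.0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i + a][j + b] += img[i][j] * ker[a][b]
    return out

def conv_layer(I, K):
    """Algorithm 1: for every (batch, f_out) pair, convolve each of the
    f_in input feature maps with its kernel and sum over f_in (line 9)."""
    T_out = []
    for img_set in I:                                  # i_b  = 1 .. Batch
        outs = []
        for kernels in K:                              # i_out = 1 .. f_out
            acc = None
            for img, ker in zip(img_set, kernels):     # i_in  = 1 .. f_in
                t = conv2d_full(img, ker)
                acc = t if acc is None else [
                    [a + b for a, b in zip(ra, rt)] for ra, rt in zip(acc, t)]
            outs.append(acc)
        T_out.append(outs)
    return T_out
```

The loop nest makes the point of the surrounding discussion visible: the index i_b touches only I, never K, which is why the Batch dimension is the safe one to fold when reshaping the input.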

The operation of the CNN can be summarized by Algorithm 1. The operation of concern is the convolution in line 7. In the nested loop, we would like to manipulate one of the Batch, f_out, f_in dimensions so that l_img of image I can be resized. Index i_b affects I and is independent of K, so Batch is the appropriate dimension. We refer to this manipulation of the Batch and l_img dimensions as the reshaping of the input data.

We propose a dual operation of OaA, Concatenate and Pad (CaP), and apply it together with OaA to gain flexibility in the l_img dimension. We need to ensure that reshaping does not affect the output of the convolution and introduces negligible overhead.

We define the CaP operation as follows. Given a set of d² images I of equal size l_img × l_img, we arrange the images in a d × d mesh with x pixels of zero value between vertically or horizontally adjacent images. CaP outputs a large image I_CaP by concatenating I and the zero pixels in between; x is defined as the size of the zero padding. Figure 2 illustrates this process. The following theorem shows that, by choosing an appropriate x, we can produce the correct output of the convolution.

Theorem 3.1. For a given kernel K of size l_kern × l_kern, the convolution using K on a set of images I is identical to performing convolution using K on I_CaP iff the padding x is at least l_kern − 1.

Proof. The proof of x_min = l_kern − 1 can be done from the perspective of spatial or frequency domain convolution. We prove

[Figure 2: Illustration of the CaP algorithm — a 2 × 2 mesh of images concatenated into one large image, with zero padding between adjacent images.]

the theorem via OaA in the frequency domain, to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘(I, l_tile) outputs a set of tiles T after partitioning I into l_tile × l_tile tiles. Define ⊕_k as the concatenation operation: ⊕_k(I) outputs a large image by concatenating the 2D mesh of I with k pixels of margin between adjacent images, and ⊕_{−k}(I) outputs an image by concatenating the 2D mesh of I with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixels along the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations:

I_CaP = ⊕_x(I)
I^x = ⊘(I_CaP, l_img + x)
T = F⁻¹( F((I^x)^{l_kern−1}) ∘ F(K^{x+l_img−1}) ) = I^x ∗ K = (I ∗ K)^x = T_∗^x
I_CaP ∗ K = ⊕_{−(l_kern−1)}(T) = ⊕_{−(l_kern−1)}(T_∗^x) = ⊕_{x−(l_kern−1)}(T_∗) = ⊕_{x−(l_kern−1)}(I ∗ K)

From the ⊕_{x−(l_kern−1)} operator in the last step, it can be concluded that convolution on I_CaP is identical to convolution on the set of images I iff x − (l_kern − 1) ≥ 0. Thus the minimum value of x that avoids aliasing between adjacent images is l_kern − 1. ∎

Based on Theorem 3.1, we can CaP the images from a batch, so that an input of Batch × f_in × f_out × l²_img is reshaped to (Batch/d²) × f_in × f_out × (l^CaP_img)², where l^CaP_img = d · l_img + (d − 1) · (l_kern − 1) and d is referred to as the Batch folding factor. We can then apply the OaA technique on I^CaP_img to complete the convolution. Abbreviate FFT-based convolution with OaA as OaA, and FFT-based convolution with CaP and OaA as CaP-OaA. We analyze how much improvement CaP-OaA brings over OaA and the native approach in a convolution layer. We assume that the hardware supports the FFT sizes N = [N_1, N_2, …, N_m], where N_1 < N_2 < … < N_m. Theorem 3.2 evaluates the optimal complexity of CaP-OaA. Let K_i = (l_img + l_kern − 1) / (N_i − (l_kern − 1)).

Theorem 3.2. The minimum computation complexity of CaP applied to a CNN is C2 · f_in · f_out · K²_m · N²_m, where C2 is some constant. To achieve this complexity, the Batch folding factor should be a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)).
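The folding-factor condition of Theorem 3.2 is a small gcd computation. The sketch below (our own helper names) finds the smallest d that makes d · K_m an integer, which removes the ceiling from Equation 6, and evaluates the resulting optimal per-image complexity:

```python
from math import gcd

def min_batch_folding_factor(l_img, l_kern, N_m):
    """Smallest d such that d * K_m is an integer, i.e. the smallest
    Batch folding factor eliminating the ceiling in Equation 6."""
    num = l_img + l_kern - 1      # numerator of K_m
    den = N_m - (l_kern - 1)      # denominator of K_m
    return den // gcd(num, den)

def cap_oaa_min_ops(l_img, l_kern, f_in, f_out, N_m, C2=1.0):
    """Optimal CaP-OaA complexity per image: C2 * f_in * f_out * K_m^2 * N_m^2."""
    K_m = (l_img + l_kern - 1) / (N_m - (l_kern - 1))
    return C2 * f_in * f_out * (K_m * N_m) ** 2
```

For example, with l_img = 13, l_kern = 3 and N_m = 16, we get K_m = 15/14 and d = 14, so d · K_m = 15 tiles per row of the CaP image, with no padding waste.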

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

O_CaP-OaA = C2 · f_in · f_out · ⌈(l^CaP_img + l_kern − 1) / (N_i − (l_kern − 1))⌉² · N²_i / d²
          = C2 · f_in · f_out · ⌈d · K_i⌉² · N²_i / d²    (6)

The minimum computation complexity can thus be obtained from the following inequality:

O_CaP-OaA ≥ C2 · f_in · f_out · K²_i · N²_i ≥ C2 · f_in · f_out · K²_m · N²_m

We can set d to a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)) to eliminate the ceiling function, and choose N_i = N_m. Then min O_CaP-OaA = C2 · f_in · f_out · K²_m · N²_m. ∎

Corollary 3.3. The computation complexity of CaP-OaA is always no higher than that of OaA. The minimum ratio of CaP-OaA's number of operations to OaA's number of operations is max_i (K_m · N_m / (⌈K_i⌉ · N_i))².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT if N_l / N_m > K_m (where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern − 1). The minimum ratio of CaP-OaA's number of operations to the native frequency domain convolution's number of operations is (K_m · N_m / N_l)².

These corollaries can be proven by directly comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native-approach complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern − 1.

Figure 3 shows the effect of including CaP in the frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm using the native approach and OaA, under two different FFT constraints. The blue points assume N = [4, 16, 64, 256]; the red dots show the performance with more hardware support, N = [4, 8, 16, 32, 64, 128, 256]. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach; most of the time, for both FFT sequences, the extra flexibility brought by CaP reduces the computation complexity significantly.

One additional benefit of applying CaP-OaA is that it makes a single FFT size sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the domain constraint N ≤ l_img + l_kern − 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces computation by 57.5% compared with spatial convolution, using only one FFT size of 32.

[Figure 3: Effect of the CaP optimization — reduction in computation complexity by CaP versus image size, for FFT Seq 1 and FFT Seq 2. The y axis is calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1.]

3.3 Discussion on the Various Optimization Techniques

The optimization techniques discussed so far can be viewed in a unified way. By setting the Batch folding factor to 1, CaP-OaA reduces to OaA; by setting N = l_img + l_kern − 1, OaA reduces to the native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can perform optimization with a hybrid algorithm that combines all three techniques: the native approach, OaA and CaP-OaA. The three algorithms can be applied separately to different layers. A significant reduction in computation complexity can thus be achieved for arbitrary CNNs at very little hardware cost.

We should also note that CaP is based on batch processing; although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm using OaA and the native algorithm results in a low-latency design, while the further inclusion of CaP-OaA results in a high-throughput design.

4 SYSTEM ARCHITECTURE

This section gives an overview of the FPGA architecture that performs the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules that perform the FFT, Hadamard product and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT on an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFTs on the N rows of the input matrix. In the second phase, we perform N-point 1D FFTs on the N columns of phase 1's output matrix. A matrix transpose operation is required between the two phases. We design resource-efficient hardware for both the 1D FFT and the matrix transpose operation.

The radix-x, N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N / f_FFT. A streaming permutation of size m_i = N / x^{i−1} is required between stages i and (i + 1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit reversal in the last stage [7].

To perform the matrix transpose of size N × N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutations using the in-place permutation-in-time algorithm, and requires only a single-port memory of size N × N.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N / q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N / q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform. Outputs of the MAC module are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator can complete the reduction along dimension f_in. Data parallelism of P_img · P_kern · N² is required to support the full pairing of the set of P_img image tiles with the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus, the ratio of the input and output rates is f_in/f_out. We can easily verify that the MAC module design is work optimal.
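The MAC module's computation can be sketched in software as follows: for each output feature map, the element-wise (Hadamard) product of frequency-domain tiles is accumulated over the f_in input channels. The sizes below are illustrative, not the hardware configuration:

```python
import numpy as np

# Illustrative sizes; one N x N tile per (channel) index.
f_in, f_out, N = 3, 2, 8
rng = np.random.default_rng(0)
img_tiles  = rng.random((f_in, N, N)) + 1j * rng.random((f_in, N, N))
kern_tiles = rng.random((f_out, f_in, N, N)) + 1j * rng.random((f_out, f_in, N, N))

out_tiles = np.zeros((f_out, N, N), dtype=complex)
for i_out in range(f_out):
    for i_in in range(f_in):          # f_in cycles of accumulation
        out_tiles[i_out] += img_tiles[i_in] * kern_tiles[i_out, i_in]

# The same reduction over the f_in dimension, vectorized
assert np.allclose(out_tiles, np.einsum('inm,oinm->onm', img_tiles, kern_tiles))
```

The loop body is a pure element-wise multiply-add, which is why no loop-carried dependency limits the parallelism of the hardware MAC units.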

[Figure 4: Overall System Design — CPU, DRAM, and FPGA; on the FPGA, 2D FFT modules feed the image buffer (with optional FFT modules before the kernel buffer), MAC units perform the Hadamard product, and 2D IFFT modules write results back]

Table 2: Summary of Resource Consumption and Throughput

              2D FFT                                                    MAC
Logic         2 · log_x(N) · (N/q_1DFFT − 1) · (N/q_2DFFT) · P_FFT · L  P_img · P_kern · N² · L
Memory        (2 · 3N/f_2DFFT + N²) · P_FFT · D                         ——
Throughput    P_FFT · N² / (q_1DFFT · q_2DFFT)                          P_img · P_kern · N² / f_in

We can also fold the MAC module by q_MAC, so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section, we proceed to the optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm given any CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules as well as the ample resources on chip create a huge design space to explore. On the other hand, since the same hardware is deployed for accelerating all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, we consider the input/output rate requirement for an arbitrary architecture designed for the convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l'_img × l'_img, then regardless of the underlying hardware, the required ratio of input to output data rate should be approximately f_in/f_out. However, this ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse or locality can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern in the deeply nested loop is complicated, our approach using frequency domain convolution simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution. Using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product: it conducts element-wise operations on the matrices, meaning that all the loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, so we have a larger hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses N-point FFT. Then, if we transfer the kernels in the frequency domain to hardware instead of the original kernels in the spatial domain, the total data communication on kernels increases from O(l²_kern) to O(N²). Depending on the values of N and l_kern, and on the hardware bandwidth and logic resources, either kind of kernel data flow can be more suitable. Thus, there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized. To consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.

5.2 Analysis of I/O and Computation Cost

The analysis in this section is based on the assumption that kernels in the frequency domain are loaded on the FPGA; in this case, we do not have the kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. The peak throughputs of the FFT, MAC and IFFT modules, denoted T_FFT, T_MAC and T_IFFT, are given by the equations summarized in Table 2.

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). The amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data to the kernel buffer is D_kern = f_in · f_out · N², based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out/f_in) · D_img amount of output.

The I/O time t^IO_round is determined by the largest of the times needed by D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT modules to external memory, we deal with D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as the peak throughput of the subsequent FFT module. This is shown in Equation 7:

B_FFT = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys − B_IFFT), T_FFT }
B_kern = (B_sys − B_IFFT) − B_FFT    (7)

The I/O time is calculated as t^IO_round = D_FFT / B_FFT ≥ D_kern / B_kern.

The computation time t^comp_round is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules, as well as by the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

B_IFFT = min{ D_IFFT / t^IO_round, T_MAC, T_IFFT }    (8)

By using Equation 7, as well as the relation among t^IO_round, D_FFT and D_IFFT, we express Equation 8 in the following way:

B_IFFT = min{ T_MAC, T_IFFT, (f_out/f_in) · T_FFT, (Q/(1+Q)) · B_sys }    (9)

where Q = (f_out/f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N. We will deal with the variance of f_in and f_out in Section 5.3.
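The bottleneck selection encoded in Equation 9 can be sketched directly; the function below evaluates the four candidate terms and reports the smallest. All numeric values are illustrative placeholders, not measured hardware parameters:

```python
# Hedged sketch of Equation 9: B_IFFT is the minimum over the four terms,
# and the minimizing term names the resource that bottlenecks the layer.
def ifft_bandwidth(T_MAC, T_FFT, T_IFFT, B_sys, M_img, f_in, f_out, N):
    Q = (f_out / f_in) * M_img / (M_img + f_in * f_out * N ** 2)
    terms = {
        "T_MAC": T_MAC,
        "T_IFFT": T_IFFT,
        "(f_out/f_in)*T_FFT": (f_out / f_in) * T_FFT,
        "Q/(1+Q)*B_sys": Q / (1 + Q) * B_sys,
    }
    bottleneck = min(terms, key=terms.get)   # smallest term limits B_IFFT
    return terms[bottleneck], bottleneck

# AlexNet-CONV1-like channel shape (f_in=3, f_out=96) with made-up
# hardware numbers; with these values the IFFT module is the limiter.
b, which = ifft_bandwidth(T_MAC=100, T_FFT=50, T_IFFT=80,
                          B_sys=200, M_img=1e6, f_in=3, f_out=96, N=64)
assert which == "T_IFFT" and b == 80
```

Sweeping f_in and f_out across layers in this way is how a different term of the min can become the bottleneck layer by layer, which is the difficulty addressed in Section 5.3.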

The computation time for one round is given by t^comp_round = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate. Thus, the total time of a round is t_round = t^comp_round. For the double buffering to completely hide the memory latency,

min{ T_MAC, T_IFFT } ≤ min{ (f_out/f_in) · T_FFT, (Q/(1+Q)) · B_sys }

has to be satisfied.

For the other kind of design, which contains the kernel FFT modules, corresponding changes need to be made to the above equations. For example, D_kern changes to f_in · f_out · l²_kern. Its performance model can be derived using the same idea as in the above discussion.

5.3 Design Space Exploration

The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

t_layer = (f_in · (l*_img)² / M_img) · t_round = f_out · (l*_img)² / B_IFFT    (10)

where l*_img is the image size after the padding or batch folding operation of the hybrid algorithm discussed in Section 3.

The goal of our architecture optimization is to identify the configuration of the different modules such that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image by all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of the metric being throughput or latency, we minimize t_CNN. We formulate an optimization problem as follows:

minimize_H    Σ_{layer i} t^i_layer
subject to    LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
              MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove three variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image oriented data reuse; (3) P_img and P_kern always appear in the form P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which kind of hardware resource is the bottleneck of our design. The reason why we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For the single layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

B_IFFT = T_MAC = T_IFFT = (f_out/f_in) · T_FFT = (Q/(1+Q)) · B_sys    (11)

Thus, the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H' = {M_img, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of each single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across the convolution layers. For the first few layers, when f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) · T_FFT and (Q/(1+Q)) · B_sys are all of high value; in this case, T_IFFT will be the bottleneck. Similarly, for deeper layers, the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optimum for each single layer. The single layer optimum is obtained by exploring the design space defined by H'. The procedure is shown in Algorithm 2.

Algorithm 2 Design Space Exploration with Indication
1:  procedure DesignSpaceExploration
2:    ▷ Conduct design space exploration on a complete CNN
3:    ▷ OPT stores the optimum for each single layer
4:    ▷ OPT[i] ∈ H
5:    OPT[] ← NULL
6:    for i = 0 to Layer l do
7:      Perform design space exploration over H'
8:      OPT[i] ← optimum for layer i
9:    end for
10:   ▷ Range[d] stores the range of values for dimension d
11:   Range[] ← NULL
12:   for d = 0 to Dimension D do
13:     Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:   end for
15:   Design space exploration of the entire CNN with the design space bounded by Range[]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform with integrated CPU and FPGA is shown in Figure 5. Intel HARP is a pre-production of the Intel QuickAssist QPI FPGA Platform, where an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. The data communication between CPU and FPGA is through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line width, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29], with 6.4 GT/s theoretical bandwidth¹. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

¹Giga transactions per second.

[Figure 5: Target architecture integrating general purpose processors and FPGA — CPU cores with an on-chip cache hierarchy and main memory, connected through a coherent memory interface to the FPGA (BRAM, LUTs, system interface, on-chip interconnection) and other I/O devices]

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiments, we distribute the workload between CPU and FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations for the input image tiles and the kernel tiles. The CPU is responsible for other lightweight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low latency design chooses 64-point FFT for the first two layers and 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high throughput design chooses 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high throughput design requires configuration 1, while the low latency design requires both configurations 1 and 2, switching between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design, this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations. Thus, the only change needed for MAC module reconfiguration is the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with the state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet.

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                     DATE'16 [21]    FPGA'16 [25]     FPL'16 [17]      FPGA'17 [28]     Our Design (Low Latency)   Our Design (High Throughput)
FPGA                 Virtex-7 VC707  Stratix-V GSD8   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7             Stratix-V GXA7
Frequency (MHz)      160             —                100              200              200                        200
Precision            Fixed 32-bit    Fixed 8-16 bits  Fixed 8-16 bits  Floating 32-bit  Fixed 16-bit               Fixed 16-bit
DSP Utilization      2688 (96%)      234 (91.4%)      256 (100%)       224 (87.5%)      232 (90.6%)                256 (100%)
Logic Utilization    45K (9%)        152K (65%)       121K (52%)       201K (86%)       200K (85%)                 199K (85%)
On-chip RAM          543 (53%)       1744 (71%)       1552 (61%)       2000 (80%)       2143 (85%)                 2130 (85%)
Latency (ms)         —               20.1             12.75            40.81            12.4                       51.2
Throughput (GOPS)    147.82          72.4             114.5            83.0             117.7                      163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC · N²/q_MAC   P_FFT · N²/(q_1DFFT · q_2DFFT)   P_IFFT · N²/(q_1DIFFT · q_2DIFFT)
1      2.4          1·64²/4            1·64²/(32·16)                    1·64²/(64·16)
2      2.4          4·16²/1            1·16²/(8·4)                      1·16²/(16·4)

The host CPU computes concurrently with the FPGA for the other operations involved in the CNN. The CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low latency design achieves a latency of 12.4 ms, which is lower than all the other implementations listed in Table 3. The high throughput design achieves a throughput of 163.4 GOPS, which is the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high throughput and the low latency designs are very similar; the same hardware can be reused by the two designs, with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low latency design. The middle columns show the peak throughput achievable given the hardware configuration of the FFT, MAC and IFFT modules, as well as the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low Latency Design). Unit of the values: complex words/sec.

         T_MAC   T_IFFT   (f_out/f_in)·T_FFT   Q/(1+Q)·B_sys   Bottleneck
CONV1    3413    80       1280                 61              B_sys
CONV2    107     80       107                  42              B_sys
CONV3    40      80       60                   32              B_sys
CONV4    27      80       40                   19              B_sys
CONV5    27      80       27                   18              B_sys

From Table 5, we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency. The single image classification latency should be at least the transfer time of the image data, kernel data and output data; this time, t_mem, can be easily calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.

7 CONCLUSION

In this paper, we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations on state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA, called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low latency implementation achieved a latency of 12.4 ms, and our high throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467. http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN). 1–7. DOI: https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. (2017), 55–64. DOI: https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1–7. DOI: https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. DOI: https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240–249. DOI: https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257–260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23–32. DOI: https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815. http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675–678. DOI: https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308. http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45–54. DOI: https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1–8. DOI: https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851. http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26–35. DOI: https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1393–1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks: Part III (ICANN '10). Springer-Verlag, Berlin, Heidelberg, 82–91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. DOI: https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842. http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161–170. DOI: https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (2017), 35–44. DOI: https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect: Architectural features supporting scalable system architectures. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI). IEEE, 1–6.


[Figure 1: AlexNet — effect of FFT size on the total number of operations (Giga operations vs. convolution layers 1–5, for spatial convolution, FFT-hybd, and OaA with 16, 32, 64 and 128 points). "FFT-hybd" is produced by the hybrid algorithm using native FFT convolution and FFT convolution with OaA.]

O^tile_IFFT are much less than O^tile_Hadamard (log N ≪ f_in, f_out). Thus, the complexity of OaA when applied to CNN should instead be O(l²_img · f_in · f_out). Equation 4 can be approximated as

O^approx_total ≈ C2 · f_in · f_out · ((l_img + l_kern − 1) / (1 − (l_kern − 1)/N))²    (5)

The domain of N is l_kern − 1 < N ≤ l_img + l_kern − 1. Note that ∂O^approx_total / ∂N = −C4 · N / (N − l_kern + 1)³. Thus, we observe that: (1) the number of operations decreases as N increases; (2) the benefit of increasing N diminishes when N becomes sufficiently larger than (l_kern − 1). By Observation (1), N should not fall in the left end region of its domain. By Observation (2), N should not be in the right end region; there is little benefit in increasing N to match the image size when l_img is very large. Thus, it can be concluded that the middle ground between (l_kern − 1) and (l_img + l_kern − 1) is the most suitable region for N.

The motivation for OaA is to avoid using large-size FFTs when dealing with large images. Thus, for small images, we can simply apply native frequency domain convolution; the number of operations then equals C2 · f_in · f_out · (N′)², where N′ is the smallest FFT size such that N′ ≥ l_img + l_kern − 1. We can therefore use a hybrid algorithm that combines frequency domain convolution with OaA and native frequency domain convolution.

Figure 1 illustrates our analysis for AlexNet. By using the OaA technique, FFTs with small variance in size can process images of different scales quite well. The green line shows the performance of the hybrid frequency domain algorithm: the first two layers use OaA based convolution with 32 and 16 points, and the last three layers use native frequency domain convolution with 32, 16 and 16 points, respectively. Compared with spatial convolution, our approach reduces the number of operations by 50.6% using only two small FFT sizes, 16 and 32.

This section proposed an optimization applying OaA in the CNN context. The optimization greatly relieves the reconfiguration requirement of the FFT hardware, while preserving the native frequency domain convolution's advantage in computation complexity. In the next section, we show our second optimization, which further reduces the number of operations and completely eliminates the need for hardware reconfiguration.
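For concreteness, here is a hedged 1D illustration of the OaA technique itself (the paper applies the 2D analogue): a long input is split into tiles, each tile is convolved with the kernel via N-point FFTs, and the partial outputs are overlapped and added. Sizes are illustrative:

```python
import numpy as np

def oaa_convolve(x, k, N):
    """Overlap-and-Add linear convolution of x with kernel k using N-point FFTs."""
    step = N - len(k) + 1                  # valid tile length per N-point FFT
    out = np.zeros(len(x) + len(k) - 1)
    K = np.fft.rfft(k, N)                  # kernel transformed once
    for start in range(0, len(x), step):
        tile = x[start:start + step]
        seg = np.fft.irfft(np.fft.rfft(tile, N) * K, N)  # tile * k, length <= N
        out[start:start + N] += seg[:len(out) - start]   # overlap and add
    return out

x = np.random.rand(100)
k = np.random.rand(5)
assert np.allclose(oaa_convolve(x, k, 16), np.convolve(x, k))
```

The assertion confirms that tiling with N-point FFTs reproduces the full linear convolution, which is why OaA lets small fixed-size FFT hardware handle arbitrarily large inputs.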

3.2 Optimization for Exploiting a Limited Number of FFT Sizes

The analysis of Section 3.1, for both OaA based and native FFT, neglects the wasted computation incurred by padding images to match FFT sizes. However, this may result in sub-optimal implementations due to the limitations of realistic FFT hardware. Without a reasonable amount of reconfiguration, high performance FFT hardware can only support a limited number of different FFT sizes. As an example, consider the case of a radix-4 FFT design: the FFT sizes of 4, 16, 64 and 256 need to handle images of any size, ranging from ten to several hundred pixels.

The key problem is the mismatch between image sizes and FFT sizes. Since we have little control over the FFT sizes under the hardware constraints, we address this problem by changing the image sizes through input data reshaping.

Algorithm 1 Operations performed by a convolution layer
1: procedure ConvLayer
2:   ▷ I is the input image array
3:   ▷ K is the kernel array
4:   for i_b = 1 to Batch do
5:     for i_out = 1 to f_out do
6:       for i_in = 1 to f_in do
7:         T_in[i_b][i_out][i_in] ← I[i_b][i_in] ∗ K[i_out][i_in]
8:       end for
9:       T_out[i_b][i_out] ← Σ_{j=1}^{f_in} T_in[i_b][i_out][j]
10:    end for
11:  end for
12: end procedure

The operation of the CNN can be summarized by Algorithm 1. The operation of concern is the convolution in line 7. In the nested loop, we would like to manipulate one of the Batch, f_out, f_in dimensions so that l_img of image I can be resized. Index i_b affects I and is independent of K, so Batch is the appropriate dimension. We refer to the manipulation of the Batch and l_img dimensions as the reshaping of the input data.
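Algorithm 1 can be sketched at the array level as follows; the shapes and the FFT-based `conv2d_full` helper are illustrative assumptions, not the hardware implementation.

```python
import numpy as np

def conv2d_full(a, b):
    """Full 2D linear convolution of square arrays via zero-padded FFTs."""
    n = a.shape[0] + b.shape[0] - 1
    return np.real(np.fft.ifft2(np.fft.fft2(a, (n, n)) * np.fft.fft2(b, (n, n))))

def conv_layer(I, K, conv2d=conv2d_full):
    """Algorithm 1: I has shape (Batch, f_in, l, l), K has shape
    (f_out, f_in, l_k, l_k); returns (Batch, f_out, l+l_k-1, l+l_k-1)."""
    batch, f_in, l, _ = I.shape
    f_out, _, l_k, _ = K.shape
    T_out = np.zeros((batch, f_out, l + l_k - 1, l + l_k - 1))
    for i_b in range(batch):
        for i_out in range(f_out):
            for i_in in range(f_in):   # line 7: per-feature-map convolution
                T_out[i_b, i_out] += conv2d(I[i_b, i_in], K[i_out, i_in])
    return T_out
```

The accumulation over `i_in` corresponds to the reduction on dimension f_in in line 9 of Algorithm 1.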

We propose a dual operation of OaA, Concatenate and Pad (CaP), and apply it together with OaA to gain flexibility along the l_img dimension. We need to ensure that the reshaping does not affect the output of the convolution and introduces negligible overhead.

We define the CaP operation as follows. Given a set of d² images I of equal size l_img × l_img, we arrange the images in a d × d mesh with x pixels of zero value between vertically or horizontally adjacent images. CaP outputs a large image I_CaP by concatenating I and the zero pixels in between; x is defined as the size of the zero padding. Figure 2 illustrates this process. The following theorem shows that, by choosing an appropriate x, we can produce the correct output of convolution.

Theorem 3.1. For a given kernel K of size l_kern × l_kern, the convolution using K on a set of images I is identical to performing convolution using K on I_CaP iff the padding x is at least l_kern − 1.

Proof. The proof of x_min = l_kern − 1 can be done from the perspective of either spatial or frequency domain convolution. We prove


Figure 2: Illustration of the CaP algorithm.

the theorem by OaA in the frequency domain, to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘(I, l_tile) outputs the set of tiles T obtained by partitioning I into l_tile × l_tile tiles. Define ⊕_k as the concatenation operation: ⊕_k(I) outputs a large image by concatenating the 2D mesh of I with k pixels of margin between adjacent images, and ⊕_{−k}(I) outputs an image by concatenating the 2D mesh of I with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixel along each of the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations:

  I_CaP = ⊕_x (I)
  I_x = ⊘(I_CaP, l_img + x)
  T̃ = F⁻¹( F((I_x)^(l_kern−1)) ∘ F(K^(x+l_img−1)) ) = I_x ∗ K = (I ∗ K)^x = (T_∗)^x
  I_CaP ∗ K = ⊕_{−(l_kern−1)}(T̃) = ⊕_{−(l_kern−1)}((T_∗)^x)
            = ⊕_{x−(l_kern−1)}(T_∗)
            = ⊕_{x−(l_kern−1)}(I ∗ K)

where T_∗ = I ∗ K denotes the set of per-image convolution results and ∘ denotes the Hadamard product.

From the ⊕_{x−(l_kern−1)} operator in the last step, it can be concluded that the convolution on I_CaP is identical to the convolution on the set I iff x − (l_kern − 1) ≥ 0. Thus, the minimum value of x that avoids aliasing between adjacent images is (l_kern − 1). ∎
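As a sanity check of Theorem 3.1, the NumPy sketch below (an array-level model, not the hardware design) builds I_CaP with x = l_kern − 1 and verifies that a single frequency domain convolution of I_CaP recovers every per-image convolution result; the sizes are illustrative.

```python
import numpy as np

def fft_conv2d(img, kern):
    """Full 2D linear convolution via zero-padded FFTs (frequency domain)."""
    n0 = img.shape[0] + kern.shape[0] - 1
    n1 = img.shape[1] + kern.shape[1] - 1
    return np.real(np.fft.ifft2(np.fft.fft2(img, (n0, n1)) *
                                np.fft.fft2(kern, (n0, n1))))

def cap(images, d, x):
    """CaP: concatenate d*d equal-size images into a d x d mesh with
    x zero pixels between adjacent images."""
    l = images[0].shape[0]
    big = np.zeros((d * l + (d - 1) * x, d * l + (d - 1) * x))
    for i, img in enumerate(images):
        r, c = divmod(i, d)
        big[r * (l + x): r * (l + x) + l, c * (l + x): c * (l + x) + l] = img
    return big

# Theorem 3.1: with x = l_kern - 1, convolving the CaP image once is
# identical to convolving each image separately.
rng = np.random.default_rng(0)
l_img, l_kern, d = 8, 3, 2
imgs = [rng.standard_normal((l_img, l_img)) for _ in range(d * d)]
K = rng.standard_normal((l_kern, l_kern))
x = l_kern - 1

out_cap = fft_conv2d(cap(imgs, d, x), K)
step = l_img + x                      # tile pitch inside the CaP output
for i, img in enumerate(imgs):
    r, c = divmod(i, d)
    tile = out_cap[r * step: r * step + step, c * step: c * step + step]
    assert np.allclose(tile, fft_conv2d(img, K))   # per-image result recovered
```

With any smaller x, the full-convolution footprints of adjacent images would overlap, which is exactly the aliasing the theorem rules out.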

Based on Theorem 3.1, we can CaP the images from a batch, so that an input of shape Batch × f_in × f_out × l_img² is reshaped to (Batch/d²) × f_in × f_out × (l_img^CaP)², where l_img^CaP = d · l_img + (d − 1) · (l_kern − 1), and d is referred to as the Batch folding factor. We can then apply the OaA technique on I_img^CaP to complete the convolution. Abbreviate FFT-based convolution with OaA as OaA, and FFT-based convolution with both CaP and OaA as CaP-OaA. We analyze, for a convolution layer, how much improvement CaP-OaA brings compared with OaA and the native approach. We assume that the hardware supports the FFT sizes N = [N₁, N₂, ..., N_m], where N₁ < N₂ < ... < N_m. Theorem 3.2 evaluates the optimal complexity of CaP-OaA. Let

  K_i = (l_img + l_kern − 1) / (N_i − (l_kern − 1)).

Theorem 3.2. The minimum computation complexity of CaP-OaA applied to a CNN layer is C₂ · f_in · f_out · K_m² · N_m², where C₂ is some constant. To achieve this complexity, the Batch folding factor d should be a multiple of

  (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)).

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

  O_CaP-OaA = C₂ · f_in · f_out · ⌈(l_img^CaP + l_kern − 1) / (N_i − (l_kern − 1))⌉² · N_i² / d²
            = C₂ · f_in · f_out · ⌈d · K_i⌉² · N_i² / d²    (6)

The minimum computation complexity can thus be obtained from the following inequality:

  O_CaP-OaA ≥ C₂ · f_in · f_out · K_i² · N_i² ≥ C₂ · f_in · f_out · K_m² · N_m²

We can set d to a multiple of (N_m − (l_kern − 1)) / gcd(l_img + l_kern − 1, N_m − (l_kern − 1)) to eliminate the ceiling function, and choose N_i = N_m. Then

  min O_CaP-OaA = C₂ · f_in · f_out · K_m² · N_m². ∎
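Theorem 3.2 can be turned into a small calculator. The sketch below normalizes C₂ to 1 and uses illustrative arguments; it picks the smallest folding factor d that makes d · K_i integral, then minimizes the per-image cost of Equation 6 over the supported FFT sizes.

```python
from math import gcd

def cap_oaa_ops(l_img, l_kern, f_in, f_out, fft_sizes, C2=1.0):
    """Per-image operation count of CaP-OaA (Equation 6), minimized over
    the supported FFT sizes, with d chosen per Theorem 3.2."""
    L = l_img + l_kern - 1
    best = None
    for N in fft_sizes:
        stride = N - (l_kern - 1)
        if stride <= 0:
            continue
        d = stride // gcd(L, stride)      # smallest d making d * K_i integral
        tiles = -(-d * L // stride)       # ceil(d * K_i), exact integer math
        ops = C2 * f_in * f_out * tiles ** 2 * N ** 2 / d ** 2
        best = ops if best is None else min(best, ops)
    return best
```

With the ceiling eliminated, the optimum always lands on the largest supported FFT size N_m, matching the closed form C₂ · f_in · f_out · K_m² · N_m².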

Corollary 3.3. The computation complexity of CaP-OaA is never higher than that of OaA. The minimum ratio of CaP-OaA's number of operations to OaA's number of operations is

  max_i ( (K_m / ⌈K_i⌉) · (N_m / N_i) )².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT approach if N_l / N_m > K_m, where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern − 1. The minimum ratio of CaP-OaA's number of operations to the native frequency domain convolution's number of operations is

  (K_m · N_m / N_l)².

The corollaries can easily be proven by directly comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern − 1.

Figure 3 shows the effect of including CaP in frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm that uses the native approach and OaA, under two different FFT constraints: the blue points assume N = [4, 16, 64, 256], and the red points show the performance with more hardware support, N = [4, 8, 16, 32, 64, 128, 256]. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach; most of the time, for both FFT sequences, the extra flexibility brought by CaP significantly reduces the computation complexity.

An additional benefit of applying CaP-OaA is that it makes a single FFT size sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the domain constraint N ≤ l_img + l_kern − 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces computation by 57.5% compared with spatial convolution, using only one FFT size of 32.

[Figure 3 plot: x axis Image Size (0–240); y axis Reduction in Computation Complexity by CaP (0.2–1.0); two series, FFT Seq 1 and FFT Seq 2]

Figure 3: Effect of the CaP optimization. The y axis is calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1.

3.3 Discussion on the Various Optimization Techniques

The various optimization techniques discussed so far can be viewed in a unified way. By setting the Batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern − 1, OaA reduces to native frequency domain convolution. Note that no pre-processing overhead is incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can perform optimization with a hybrid algorithm that combines all three techniques: the native approach, OaA and CaP-OaA. The three algorithms can be applied separately to different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs with very little hardware overhead.

We should also note that CaP is based on batch processing: although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm using OaA and the native algorithm results in a low latency design, while the further inclusion of CaP-OaA results in a high throughput design.

4 SYSTEM ARCHITECTURE

This section gives an overview of the FPGA architecture that performs the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules that perform the FFT, Hadamard product and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT on an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFTs on the N rows of the input matrix. In the second phase, we perform N-point 1D FFTs on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.
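The two-phase decomposition can be checked numerically. The sketch below mirrors the row-FFT / transpose / row-FFT organization of the streaming hardware (it models only the math, not the pipelining):

```python
import numpy as np

def fft2_two_phase(mat):
    """2D FFT as two phases of 1D FFTs with a transpose in between."""
    phase1 = np.fft.fft(mat, axis=1)        # N-point FFT on each row
    phase2 = np.fft.fft(phase1.T, axis=1)   # transpose, then FFT each row (= columns)
    return phase2.T

m = np.random.default_rng(1).standard_normal((16, 16))
assert np.allclose(fft2_two_phase(m), np.fft.fft2(m))
```

The transpose between the phases is what the Streaming Permutation Network implements in hardware.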

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^(i−1) is required between stages i and (i+1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the matrix transpose of size N × N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutation using the in-place permutation in time algorithm, and requires only a size N × N single-port memory.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform; outputs are the resulting tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator completes the reduction over dimension f_in. A data parallelism of P_img · P_kern · N² is required to support the full pairing of the set of P_img image tiles with the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus, the ratio of input to output rate is f_in/f_out. We can easily verify that the MAC module design is work optimal.

[Figure 4 block diagram: CPU and DRAM connected to the FPGA, which contains the image buffer, kernel buffer, 2D FFT, MAC and 2D IFFT modules]

Figure 4 Overall System Design

Table 2: Summary of Resource Consumption and Throughput

             2D FFT                                                     MAC
Logic        2 · log_x N · (N/q_1DFFT − 1) · (N/q_2DFFT) · P_FFT · L    P_img · P_kern · N² · L
Memory       (2 · 3N/q_2DFFT + N²) · P_FFT · D                          —
Throughput   P_FFT · N² / (q_1DFFT · q_2DFFT)                           P_img · P_kern · N² / f_in

We can also fold the MAC module by q_MAC so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. Constant L refers to the logic consumption of one complex adder and multiplier; D refers to the memory consumption of one complex word.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm for any given CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules, as well as the ample on-chip resources, create a huge design space to explore. On the other hand, since the same hardware is deployed to accelerate all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, consider the input/output rate requirement of an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l′_img × l′_img, then regardless of the underlying hardware, the required ratio of input to output data rate is approximately f_in/f_out. However, this ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse and locality can be exploited to the maximum extent. Unlike spatial convolution, where the memory access pattern in the deeply nested loop is complicated, our frequency domain approach simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution: using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product. It conducts element-wise operations on the matrices, meaning that all loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. During inference, the kernel data is unchanged, so one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, leaving a larger hardware resource budget for the MAC module and improving the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the frequency domain convolution uses an N-point FFT. If we transfer the kernels in the frequency domain instead of the original kernels in the spatial domain, the total data communication for kernels increases from O(l_kern²) to O(N²). Depending on the values of N and l_kern, and on the hardware bandwidth and logic resources, either kernel data flow can be more suitable; this is why there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.

5.2 Analysis of I/O and Computation Cost

The analysis in this section assumes that kernels in the frequency domain are loaded onto the FPGA, so there are no kernel FFT modules. The analysis can easily be modified for the other kernel data flow.

We consider the time period in which one image buffer is completely consumed; we call this one round of computation. The peak throughput of the FFT, MAC and IFFT modules can be found in Table 2. We denote the peak throughputs as T_FFT, T_MAC and T_IFFT.

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). The amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data transferred to the kernel buffer is D_kern = f_in · f_out · N², based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out/f_in) · D_img amount of output. The I/O time t_round^IO is determined by the largest of the times needed for D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT modules to external memory, we account for D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as on the peak throughput of the subsequent FFT module. This is shown in Equation 7.

  B_FFT = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys − B_IFFT), T_FFT }
  B_kern = (B_sys − B_IFFT) − B_FFT    (7)

The I/O time is then calculated as t_round^IO = D_FFT / B_FFT ≥ D_kern / B_kern.

The computation time t_round^comp is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules, as well as by the time for loading the image buffer. The throughput, i.e., the output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

  B_IFFT = min{ D_IFFT / t_round^IO, T_MAC, T_IFFT }    (8)

By using Equation 7, as well as the relation among t_round^IO, D_FFT and D_IFFT, we express Equation 8 in the following way:

  B_IFFT = min{ T_MAC, T_IFFT, (f_out/f_in) · T_FFT, (Q/(1+Q)) · B_sys }    (9)

where Q = (f_out/f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on the hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N; we deal with the variance of f_in and f_out in Section 5.3.
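Equation 9 is straightforward to evaluate; a sketch with hypothetical, consistent units (any argument values below are illustrative, not measured):

```python
def b_ifft(T_mac, T_ifft, T_fft, B_sys, M_img, f_in, f_out, N):
    """Sustained IFFT output bandwidth per Equation 9."""
    Q = (f_out / f_in) * M_img / (M_img + f_in * f_out * N * N)
    return min(T_mac, T_ifft, (f_out / f_in) * T_fft, Q / (1 + Q) * B_sys)
```

Whichever of the four terms attains the minimum identifies the bottleneck resource for that layer, which is exactly the per-layer analysis reported later in Table 5.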

The computation time for one round is given by t_round^comp = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate, so the total time of a round is t_round = t_round^comp. For the double buffering to completely hide the memory latency, min{T_MAC, T_IFFT} ≤ min{ (f_out/f_in) · T_FFT, (Q/(1+Q)) · B_sys } has to be satisfied.

be satisedFor the other kind of design containing the kernel FFT module

corresponding changes need to be made for the above equationsFor

example Dkern is changed to fin middot fout middot l2imд We can derive its

performance model using the similar idea as the above discussion

5.3 Design Space Exploration

The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

  t_layer = (f_in · (l*_img)² / M_img) · t_round = f_out · (l*_img)² / B_IFFT    (10)

where l*_img is the image size after the padding or Batch folding operation of the hybrid algorithm discussed in Section 3.
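Equation 10 can be checked by substitution; the sketch below keeps both forms of the identity (arguments are illustrative placeholders in consistent units):

```python
def t_layer(f_in, f_out, l_img_star, M_img, B_ifft):
    """Total processing time of one convolution layer (Equation 10)."""
    rounds = f_in * l_img_star ** 2 / M_img        # rounds to drain the layer
    t_round = (f_out / f_in) * M_img / B_ifft      # D_IFFT / B_IFFT per round
    return rounds * t_round                        # = f_out * (l*)^2 / B_ifft
```

The M_img and f_in factors cancel, so the layer time depends only on the output volume f_out · (l*_img)² and the sustained bandwidth B_IFFT.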

The goal of our architecture optimization is to identify the configuration of the different modules such that the throughput or latency for the entire CNN is optimized. We define t_CNN as the time to process a single image through all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of whether the metric is throughput or latency, we minimize t_CNN. We formulate an optimization problem as follows:

  minimize_H   Σ_{layer i} t_layer^i
  subject to   LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
               MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove three variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image oriented data reuse; (3) P_img and P_kern always appear in the form P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which hardware resource is the bottleneck of the design. The reason we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all hardware resources are fully utilized, such that:

  B_IFFT = T_MAC = T_IFFT = (f_out/f_in) · T_FFT = (Q/(1+Q)) · B_sys    (11)

Thus, the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H′ = {M_FFT, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) · T_FFT and (Q/(1+Q)) · B_sys all take higher values; in this case, T_IFFT is the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the per-layer optima. The single layer optimum is obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2 Design Space Exploration with Indication
1: procedure DesignSpaceExploration
2:   ▷ Conduct design space exploration on a complete CNN
3:   ▷ OPT stores the optimum for each single layer
4:   ▷ OPT[i] ∈ H
5:   OPT[·] ← NULL
6:   for i = 0 to Layer l do
7:     Perform design space exploration over H′
8:     OPT[i] ← optimum for layer i
9:   end for
10:  ▷ Range[d] stores the range of values for dimension d
11:  Range[·] ← NULL
12:  for d = 0 to Dimension D do
13:    Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:  end for
15:  Design space exploration of the entire CNN with the design space bounded by Range[·]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
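Algorithm 2 can be sketched as follows. The interfaces (`explore_layer`, `evaluate_cnn`) and the toy cost function are hypothetical stand-ins for the per-layer optimizer over H′ and the whole-CNN performance model; the real implementation is in C++.

```python
from itertools import product

def design_space_exploration(layers, dims, explore_layer, evaluate_cnn):
    """Algorithm 2 sketch: obtain per-layer optima over H', bound each
    dimension of H by their range, then scan the bounded space."""
    opt = [explore_layer(layer) for layer in layers]          # per-layer optima
    ranges = [range(min(o[d] for o in opt), max(o[d] for o in opt) + 1)
              for d in range(dims)]                           # Range[d]
    return min(product(*ranges), key=evaluate_cnn)            # bounded scan

# Toy example: per-layer optima (i, i+1); whole-CNN cost minimized at (2, 3).
cfg = design_space_exploration(
    [1, 2, 3], 2,
    explore_layer=lambda i: (i, i + 1),
    evaluate_cnn=lambda c: (c[0] - 2) ** 2 + (c[1] - 3) ** 2)
assert cfg == (2, 3)
```

Bounding each dimension by the spread of the per-layer optima is a heuristic: it assumes the global optimum for the whole CNN lies between the extremes of the single-layer optima.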

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and the FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s theoretical bandwidth¹. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s. The

¹ Giga transactions per second.

[Figure 5 block diagram: CPU (cores with on-chip cache hierarchy) and FPGA (BRAM, LUTs, on-chip interconnection, system interface) attached to main memory and other I/O devices through coherent memory interfaces]

Figure 5: Target architecture integrating general purpose processors and an FPGA.

Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiments, we distribute the workload between the CPU and the FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations on the input image tiles and the kernel tiles. The CPU is responsible for other light-weight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low latency design chooses a 64-point FFT for the first two layers and a 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high throughput design chooses a 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high throughput design requires only configuration 1, while the low latency design requires both configurations 1 and 2 and switches between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design, this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations; thus the only change needed for MAC module reconfiguration is the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                    DATE'16 [21]    FPGA'16 [25]     FPL'16 [17]      FPGA'17 [28]     Our Design
                                                                                      Low Latency     High Throughput
FPGA                Virtex-7 VC707  Stratix-V GSD8   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7  Stratix-V GXA7
Frequency (MHz)     160             —                100              200              200             200
Precision           Fixed 32-bit    Fixed 8-16 bits  Fixed 8-16 bits  Floating 32-bit  Fixed 16-bit    Fixed 16-bit
DSP Utilization     2,688 (96%)     234 (91.4%)      256 (100%)       224 (87.5%)      232 (90.6%)     256 (100%)
Logic Utilization   45K (9%)        152K (65%)       121K (52%)       201K (86%)       200K (85%)      199K (85%)
On-chip RAM         543 (53%)       1,744 (71%)      1,552 (61%)      2,000 (80%)      2,143 (85%)     2,130 (85%)
Latency (ms)        —               20.1             12.75            40.81            12.4            51.2
Throughput (GOPS)   147.82          72.4             114.5            83.0             117.7           163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC·N²/q_MAC   P_FFT·N²/(q_1DFFT·q_2DFFT)   P_IFFT·N²/(q_1DIFFT·q_2DIFFT)
1      2.4          1·64²/4          1·64²/(32·16)                1·64²/(64·16)
2      2.4          4·16²/1          1·16²/(8·4)                  1·16²/(16·4)

CPU computes concurrently with the FPGA for the other operations involved in the CNN. The CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low latency design achieves a latency of 12.4 ms, lower than all the other implementations listed in Table 3. The high throughput design achieves a throughput of 163.4 GOPS, the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high throughput and low latency designs are very similar; the same hardware can be reused by the two designs, with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low latency design. The middle columns show the peak throughput achievable given the hardware configuration of the FFT, MAC and IFFT modules and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low Latency Design). Unit of the values: complex words/sec.

        T_MAC   T_IFFT   (f_out/f_in)·T_FFT   (Q/(1+Q))·B_sys   Bottleneck
CONV1   3413    80       1280                 61                B_sys
CONV2   107     80       107                  42                B_sys
CONV3   40      80       60                   32                B_sys
CONV4   27      80       40                   19                B_sys
CONV5   27      80       27                   18                B_sys

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single image classification latency must be at least the data transfer time of the image data, kernel data and output data. This time, t_mem, can easily be calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.

7 CONCLUSION

In this paper, we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations about state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low latency implementation achieved a latency of 12.4 ms, and our high throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] 2015. Intel Inc. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. (2017). https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467

[4] B Ahn 2015 Real-time video object recognition using convolutional neuralnetwork In 2015 International Joint Conference on Neural Networks (IJCNN)1ndash7 DOIhttpsdoiorg101109IJCNN20157280718

[5] Utku Aydonat Shane OrsquoConnell Davor Capalija Andrew C Ling and Gordon RChiu 2017 An OpenCLtradeDeep Learning Accelerator on Arria 10 (2017) 55ndash64DOIhttpsdoiorg10114530200783021738

[6] R Chen H Le and V K Prasanna 2013 Energy ecient parameterized FFT ar-chitecture In 2013 23rd International Conference on Field programmable Logicand Applications 1ndash7 DOIhttpsdoiorg101109FPL20136645545

[7] R Chen N Park and V K Prasanna 2013 High throughput energy ecientparallel FFT architecture on FPGAs In 2013 IEEE High Performance ExtremeComputing Conference (HPEC) 1ndash6 DOIhttpsdoiorg101109HPEC20136670343

[8] Ren Chen Sruja Siriyal and Viktor Prasanna 2015 Energy and MemoryEcient Mapping of Bitonic Sorting on FPGA In Proceedings of the 2015ACMSIGDA International Symposium on Field-Programmable Gate Arrays(FPGA rsquo15) ACM New York NY USA 240ndash249 DOIhttpsdoiorg10114526847462689068

[9] R DiCecco G Lacey J Vasiljevic P Chow G Taylor and S Areibi 2016 Caf-feinated FPGAs FPGA Framework For Convolutional Neural Networks ArXive-prints (Sept 2016) arXivcsCV160909671

[10] Cleacutement Farabet Berin Martini Polina Akselrod Selccediluk Talay Yann LeCunand Eugenio Culurciello 2010 Hardware accelerated convolutional neuralnetworks for synthetic vision systems In Proceedings of 2010 IEEE InternationalSymposium on Circuits and Systems IEEE 257ndash260

[11] Mario Garrido J Grajal M A Saacutenchez and Oscar Gustafsson 2013 PipelinedRadix-2K Feedforward FFT Architectures IEEE Trans Very Large Scale IntegrSyst 21 1 (Jan 2013) 23ndash32 DOIhttpsdoiorg101109TVLSI20112178275

[12] Tyler Highlander and Andres Rodriguez 2016 Very Ecient Training of Convo-lutional Neural Networks using Fast Fourier Transform and Overlap-and-AddCoRR abs160106815 (2016) httparxivorgabs160106815

[13] Yangqing Jia Evan Shelhamer Je Donahue Sergey Karayev Jonathan LongRoss Girshick Sergio Guadarrama and Trevor Darrell 2014 Cae ConvolutionalArchitecture for Fast Feature Embedding In Proceedings of the 22Nd ACMInternational Conference on Multimedia (MM rsquo14) ACM New York NY USA675ndash678 DOIhttpsdoiorg10114526478682654889

[14] Alex Krizhevsky Ilya Sutskever and Georey E Hinton 2012 Imagenet clas-sication with deep convolutional neural networks In Advances in NeuralInformation Processing Systems 1097ndash1105

[15] Andrew Lavin 2015 Fast Algorithms for Convolutional Neural Networks CoRRabs150909308 (2015) httparxivorgabs150909308

[16] Yufei Ma Yu Cao Sarma Vrudhula and Jae-sun Seo 2017 Optimizing LoopOperation and Dataow in FPGA Acceleration of Deep Convolutional NeuralNetworks In Proceedings of the 2017 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo17) ACM New York NY USA 45ndash54DOIhttpsdoiorg10114530200783021736

[17] Yufei Ma N Suda Yu Cao J s Seo and S Vrudhula 2016 Scalable and modu-larized RTL compilation of Convolutional Neural Networks onto FPGA In 201626th International Conference on Field Programmable Logic and Applications(FPL) 1ndash8 DOIhttpsdoiorg101109FPL20167577356

[18] Michaeumll Mathieu Mikael Hena and Yann LeCun 2013 Fast Training ofConvolutional Networks through FFTs CoRR abs13125851 (2013) httparxivorgabs13125851

[19] Kalin Ovtcharov Olatunji Ruwase Joo-Young Kim Jeremy Fow-ers Karin Strauss and Eric Chung 2015 Accelerating Deep Con-volutional Neural Networks Using Specialized Hardware (Febru-ary 2015) httpswwwmicrosoftcomen-usresearchpublicationaccelerating-deep-convolutional-neural-networks-using-specialized-hardware

[20] Jiantao Qiu Jie Wang Song Yao Kaiyuan Guo Boxun Li Erjin Zhou JinchengYu Tianqi Tang Ningyi Xu Sen Song Yu Wang and Huazhong Yang 2016Going Deeper with Embedded FPGA Platform for Convolutional Neural Net-work In Proceedings of the 2016 ACMSIGDA International Symposium on

Field-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 26ndash35 DOIhttpsdoiorg10114528472632847265

[21] A Rahman J Lee and K Choi 2016 Ecient FPGA acceleration of Con-volutional Neural Networks using logical-3D compute array In 2016 DesignAutomation Test in Europe Conference Exhibition (DATE) 1393ndash1398

[22] Dominik Scherer Hannes Schulz and Sven Behnke 2010 Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors InProceedings of the 20th International Conference on Articial Neural NetworksPart III (ICANNrsquo10) Springer-Verlag Berlin Heidelberg 82ndash91 httpdlacmorgcitationcfmid=18864361886446

[23] H Sharma J Park D Mahajan E Amaro J K Kim C Shao A Mishra and HEsmaeilzadeh 2016 From high-level deep neural models to FPGAs In 2016 49thAnnual IEEEACM International Symposium on Microarchitecture (MICRO) 1ndash12 DOIhttpsdoiorg101109MICRO20167783720

[24] Karen Simonyan and Andrew Zisserman 2014 Very Deep ConvolutionalNetworks for Large-Scale Image Recognition CoRR abs14091556 (2014)httparxivorgabs14091556

[25] Naveen Suda Vikas Chandra Ganesh Dasika Abinash Mohanty Yufei MaSarma Vrudhula Jae-sun Seo and Yu Cao 2016 Throughput-OptimizedOpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Net-works In Proceedings of the 2016 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 16ndash25DOIhttpsdoiorg10114528472632847276

[26] Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E ReedDragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabi-novich 2014 Going Deeper with Convolutions CoRR abs14094842 (2014)httparxivorgabs14094842

[27] Chen Zhang Peng Li Guangyu Sun Yijin Guan Bingjun Xiao and Jason Cong2015 Optimizing FPGA-based Accelerator Design for Deep Convolutional NeuralNetworks In Proceedings of the 2015 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo15) ACM New York NY USA 161ndash170 DOIhttpsdoiorg10114526847462689060

[28] Chi Zhang and Viktor Prasanna 2017 Frequency Domain Acceleration ofConvolutional Neural Networks on CPU-FPGA Shared Memory System (2017)35ndash44 DOIhttpsdoiorg10114530200783021727

[29] Dimitrios Ziakas Allen Baum Robert A Maddox and Robert J Safranek 2010Intelreg quickpath interconnect architectural features supporting scalable sys-tem architectures In High Performance Interconnects (HOTI) 2010 IEEE 18thAnnual Symposium on IEEE 1ndash6


[Figure 2: Illustration of the CaP algorithm.]

the theorem by OaA in frequency domain to illustrate the dual operation.

Define ⊘ as the partition operation of OaA: ⊘(I, l_tile) outputs a set of tiles {T} after partitioning I into l_tile × l_tile tiles. Define ⊕_k as the concatenation operation: ⊕_k({I}) outputs a large image by concatenating the 2D mesh of {I} with k pixels of margin between adjacent images, and ⊕_{-k}({I}) outputs an image by concatenating the 2D mesh of {I} with k pixels of overlap between adjacent images. Define T^p as the matrix obtained by zero-padding T with p pixels after the last pixels along each of the two dimensions.

Performing convolution after CaP can be expressed by the following set of equations:

    I_CaP = ⊕_x({I})
    {I^x} = ⊘(I_CaP, l_img + x)
    {T} = F⁻¹( F((I^x)^(l_kern-1)) ∘ F(K^(x+l_img-1)) )
        = {I^x ∗ K} = {(I ∗ K)^x} = {T_∗^x}

    I_CaP ∗ K = ⊕_{-(l_kern-1)}({T}) = ⊕_{-(l_kern-1)}({T_∗^x})
              = ⊕_{x-(l_kern-1)}({T_∗})
              = ⊕_{x-(l_kern-1)}({I ∗ K})

where T_∗ denotes I ∗ K and ∘ denotes the Hadamard product. From the ⊕_{x-(l_kern-1)} operator in the last step, it can be concluded that convolution on I_CaP is identical to convolution on the set {I} iff x - (l_kern - 1) ≥ 0. Thus, the minimum value of x to avoid aliasing between adjacent images is l_kern - 1.

Based on Theorem 3.1, we can CaP the images from a batch, so that an input of shape Batch × f_in × f_out × l_img² is reshaped to (Batch/d²) × f_in × f_out × (l^CaP_img)², where l^CaP_img = d·l_img + (d-1)·l_kern and d is referred to as the batch folding factor. We can then apply the OaA technique on I^CaP_img to complete the convolution. We abbreviate FFT-based convolution with OaA as OaA, and FFT-based convolution with both CaP and OaA as CaP-OaA. We analyze, for a convolution layer, how much improvement CaP-OaA brings compared with OaA and the native approach. We assume that the hardware supports the FFT sizes N = [N_1, N_2, ..., N_m], where N_1 < N_2 < ... < N_m. Theorem 3.2 evaluates the optimal complexity of CaP-OaA. Let K_i = (l_img + l_kern - 1) / (N_i - (l_kern - 1)).

Theorem 3.2. The minimum computation complexity of CaP applied to a CNN is C_2 · f_in · f_out · K_m² · N_m², where C_2 is some constant. To achieve this complexity, the batch folding factor d should be a multiple of (N_m - (l_kern - 1)) / gcd(l_img + l_kern - 1, N_m - (l_kern - 1)).

Proof. Based on Equation 4, the complexity of the CaP-OaA approach, averaged over a single image, is

    O_CaP-OaA = C_2 · f_in · f_out · ⌈(l^CaP_img + l_kern - 1) / (N_i - (l_kern - 1))⌉² · N_i² / d²
              = C_2 · f_in · f_out · ⌈d · K_i⌉² · N_i² / d²      (6)

The minimum computation complexity can thus be obtained from the following inequality:

    O_CaP-OaA ≥ C_2 · f_in · f_out · K_i² · N_i² ≥ C_2 · f_in · f_out · K_m² · N_m²

We can set d to a multiple of (N_m - (l_kern - 1)) / gcd(l_img + l_kern - 1, N_m - (l_kern - 1)) to eliminate the ceiling function, and choose N_i = N_m. Then min O_CaP-OaA = C_2 · f_in · f_out · K_m² · N_m².

Thus, O^min_CaP-OaA = C_2 · f_in · f_out · K_m² · N_m². □

Corollary 3.3. The computation complexity of CaP-OaA is always no higher than that of OaA. The minimum ratio of CaP-OaA's number of operations to OaA's number of operations is max_i ( (K_m / ⌈K_i⌉) · (N_m / N_i) )².

Corollary 3.4. The computation complexity of CaP-OaA is lower than that of the native FFT if N_l / N_m > K_m, where N_l denotes the minimum FFT size such that N_l ≥ l_img + l_kern - 1. The minimum ratio of CaP-OaA's number of operations to the native frequency domain convolution's number of operations is (K_m · N_m / N_l)².

The corollaries can be easily proven by directly comparing the optimal CaP-OaA complexity with the optimal OaA complexity and the optimal native-approach complexity. For OaA, since we have no further information to evaluate the effect of the ceiling function, we need to calculate ⌈K_i⌉ · N_i for all N_i ≤ l_img + l_kern - 1.
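The per-image operation counts of the three approaches can be compared programmatically. The sketch below uses our own helper names and drops the common C_2 · f_in · f_out factor; it evaluates the native, OaA, and CaP-OaA costs for a given image size, kernel size, and list of supported FFT sizes.

```python
from math import ceil, gcd

def ops_native(l_img, l_kern, sizes):
    # Native: one FFT of the smallest supported size N_l >= l_img + l_kern - 1.
    n_l = min(n for n in sizes if n >= l_img + l_kern - 1)
    return n_l ** 2

def ops_oaa(l_img, l_kern, sizes):
    # OaA with FFT size N_i: ceil(K_i)^2 tiles, each costing N_i^2.
    best = float("inf")
    for n in sizes:
        if l_kern - 1 < n <= l_img + l_kern - 1:
            k_i = ceil((l_img + l_kern - 1) / (n - (l_kern - 1)))
            best = min(best, (k_i * n) ** 2)
    return best

def ops_cap_oaa(l_img, l_kern, sizes):
    # Theorem 3.2: a suitable batch folding factor d removes the ceiling.
    n_m = max(sizes)
    k_m = (l_img + l_kern - 1) / (n_m - (l_kern - 1))
    d = (n_m - (l_kern - 1)) // gcd(l_img + l_kern - 1, n_m - (l_kern - 1))
    return (k_m * n_m) ** 2, d

sizes = [4, 16, 64, 256]              # supported FFT sizes N_1 < ... < N_m
cap_ops, d = ops_cap_oaa(27, 3, sizes)
print(ops_native(27, 3, sizes), ops_oaa(27, 3, sizes), round(cap_ops, 1), d)
```

Consistent with Corollary 3.3, `cap_ops` never exceeds the OaA cost; the batch folding factor `d` can be large when l_img + l_kern - 1 and N_m - (l_kern - 1) are coprime.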

Figure 3 shows the effect of including CaP in frequency domain convolution. We compare the CaP-OaA algorithm with the hybrid algorithm that uses the native approach and OaA, under two different FFT constraints: the blue points assume N = {4, 16, 64, 256}, and the red points show the performance with more hardware support, N = {4, 8, 16, 32, 64, 128, 256}. The kernel size is assumed to be 3 × 3. There are very few points where CaP-OaA is slightly worse than the native approach. Most of the time, for both FFT sequences, the extra flexibility brought by CaP benefits the computation complexity significantly.

One additional benefit of applying CaP-OaA is that it makes a one-size FFT sufficient for all layers. Images in deeper layers can be CaP-transformed to eliminate the constraint N ≤ l_img + l_kern - 1 discussed in Section 3.1. Evaluation on AlexNet shows that the CaP-OaA technique reduces 57.5% of the computation compared with spatial convolution, using only a one-size FFT of size 32.

[Figure 3: Effect of CaP optimization. The x axis is the image size; the y axis is the reduction in computation complexity by CaP, calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1. Two curves are shown: FFT Seq 1 and FFT Seq 2.]

3.3 Discussion on the Various Optimization Techniques

The various optimization techniques discussed so far can be viewed in a unified way. By setting the batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern - 1, OaA reduces to native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.
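The reduction of OaA to the native approach can be seen concretely in the following sketch (1D, pure Python, our own helper names): a signal is partitioned into tiles, each tile is convolved independently, and the partial outputs are overlap-added; choosing the tile to span the whole signal recovers native convolution.

```python
def conv_full(x, k):
    """Full linear convolution: output length len(x) + len(k) - 1."""
    n = len(x) + len(k) - 1
    return [sum(x[m] * k[i - m]
                for m in range(max(0, i - len(k) + 1), min(len(x), i + 1)))
            for i in range(n)]

def oaa_conv(x, k, tile_len):
    """Overlap-and-Add: convolve tiles independently, then add the
    l_kern - 1 overlapping samples of adjacent partial outputs."""
    out = [0] * (len(x) + len(k) - 1)
    for start in range(0, len(x), tile_len):
        # In the paper, each tile convolution is an FFT of size
        # tile_len + l_kern - 1; plain convolution is used here.
        for j, v in enumerate(conv_full(x[start:start + tile_len], k)):
            out[start + j] += v
    return out

x = list(range(1, 13))   # one row of an "image", size 12
k = [2, 0, -1]           # kernel, l_kern = 3
assert oaa_conv(x, k, 4) == conv_full(x, k)
# tile_len = len(x) makes OaA degenerate to the native single-FFT approach:
assert oaa_conv(x, k, len(x)) == conv_full(x, k)
```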

We can perform optimization with a hybrid algorithm that combines all three techniques: the native approach, OaA, and CaP-OaA. The three algorithms can be applied separately to different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs with very little hardware overhead.

We should also note that CaP is based on batch processing. In this case, although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm that uses only OaA and the native algorithm can result in a low latency design, while further inclusion of CaP-OaA can result in a high throughput design.

4 SYSTEM ARCHITECTURE
This section gives an overview of the FPGA architecture that performs the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules that perform the FFT, Hadamard product, and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module
The 2D FFT on an N × N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFTs on the N rows of the input matrix. In the second phase, we perform N-point 1D FFTs on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.
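The row-column decomposition can be sketched as follows. A naive O(N²) DFT (our own helper, standing in for the streaming radix-x pipeline) is enough to show that row transforms, a transpose, and column transforms reproduce the direct 2D DFT.

```python
import cmath

def dft1d(row):
    """Naive N-point 1D DFT (stands in for the streaming FFT pipeline)."""
    n = len(row)
    return [sum(row[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n)) for k in range(n)]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def fft2d_row_column(mat):
    # Phase 1: 1D DFT on every row. Transpose. Phase 2: 1D DFT on every
    # row of the transposed matrix (= columns of phase 1's output).
    phase1 = transpose([dft1d(r) for r in mat])
    return transpose([dft1d(r) for r in phase1])

def dft2d_direct(mat):
    """Direct 2D DFT by the defining double sum, for comparison."""
    n = len(mat)
    return [[sum(mat[r][c] * cmath.exp(-2j * cmath.pi * (r * u + c * v) / n)
                 for r in range(n) for c in range(n))
             for v in range(n)] for u in range(n)]

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a, b = fft2d_row_column(m), dft2d_direct(m)
assert all(abs(a[i][j] - b[i][j]) < 1e-9 for i in range(4) for j in range(4))
```

The hardware version described below replaces `dft1d` with folded streaming FFT pipelines and `transpose` with the Streaming Permutation Network.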

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^(i-1) is required between stages i and (i+1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + Σ m_i ≈ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the matrix transpose of size N × N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutations using the in-place permutation-in-time algorithm, and only requires an N × N single-port memory.

Our proposed streaming 2D FFT module consists of P_FFT units of N × N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT · N² / (q_1DFFT · q_2DFFT).

4.2 Multiply-and-Accumulate Module
The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform; outputs are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator can complete the reduction along dimension f_in. Data parallelism of P_img · P_kern · N² is required to support the full permutation of the set of P_img image tiles and the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img · P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out / P_kern cycles. Thus, the ratio of the input rate to the output rate is f_in / f_out. We can easily verify that the MAC module design is work optimal.

[Figure 4: Overall System Design, showing the CPU and DRAM, and the FPGA containing the 2D FFT module, image buffer, kernel buffer (with optional FFT modules in dashed lines), MAC modules, and 2D IFFT module.]

Table 2: Summary of Resource Consumption and Throughput

             | 2D FFT                                                  | MAC
Logic        | 2 · log_x N · (N/q_1DFFT - 1) · (N/q_2DFFT) · P_FFT · L | P_img · P_kern · N² · L
Memory       | (2 · 3N/f_2DFFT + N²) · P_FFT · D                       | ——
Throughput   | P_FFT · N² / (q_1DFFT · q_2DFFT)                        | P_img · P_kern · N² / f_in

We can also fold the MAC module by a factor q_MAC, so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img · P_kern · N² / q_MAC.

4.3 Overall System Design
The full system design is shown in Figure 4. The FFT modules drawn in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION
In this section, we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm given any CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules and the ample on-chip resources create a huge design space to explore. On the other hand, since the same hardware is deployed for accelerating all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, consider the input/output rate requirement of an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l'_img × l'_img, then regardless of the underlying hardware, the required ratio of the input data rate to the output data rate should be approximately f_in / f_out. However, this ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytical performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow
There has been much discussion on how to design the data flow so that memory reuse or locality can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern of the deeply nested loop is complicated, our approach using frequency domain convolution simplifies the memory design by a feature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution; using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product. It conducts element-wise operations on the matrices, meaning that all loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, leaving more hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses an N-point FFT. Then, if we transfer the kernels in the frequency domain instead of the original kernels in the spatial domain, the total data communication for the kernels increases from O(l_kern²) to O(N²). Depending on the values of N and l_kern, and on the hardware bandwidth and logic resources, either kernel data flow can be more suitable. Thus, there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.
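The bandwidth trade-off between the two kernel data flows can be estimated with a few lines (our own names and illustrative layer parameters; per the discussion above, spatial-domain kernels cost O(l_kern²) words each and frequency-domain kernels O(N²)).

```python
def kernel_traffic(f_in, f_out, l_kern, n, frequency_domain):
    """Words transferred for one full pass over the kernels."""
    per_kernel = n * n if frequency_domain else l_kern * l_kern
    return f_in * f_out * per_kernel

# Illustrative layer: f_in = 256, f_out = 384, 3x3 kernels, 16-point FFT.
spatial = kernel_traffic(256, 384, 3, 16, frequency_domain=False)
freq = kernel_traffic(256, 384, 3, 16, frequency_domain=True)
assert freq // spatial == (16 * 16) // (3 * 3)  # ~28x more kernel traffic
```

The extra traffic of the frequency-domain flow is traded against the logic and FFT throughput saved by dropping the on-chip kernel FFT modules, which is why both flows appear in the performance model.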

5.2 Analysis of I/O and Computation Cost
The analysis in this section assumes that kernels in the frequency domain are loaded onto the FPGA; we then do not have the kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. We denote the peak throughput of the FFT, MAC and IFFT modules as T_FFT, T_MAC and T_IFFT (the equations for them are summarized in Table 2).

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern), and the IFFT module (B_IFFT). The amount of data transferred to the FFT is D_FFT = M_img, and the amount of data to the kernel buffer is D_kern = f_in · f_out · N², based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out / f_in) · D_FFT amount of output.

The I/O time t^IO_round is determined by the largest time needed by D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT to external memory, we deal with D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as the peak throughput of the subsequent FFT module. This is shown in Equation 7:

    B_FFT = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys - B_IFFT), T_FFT }
    B_kern = (B_sys - B_IFFT) - B_FFT                                    (7)

The I/O time is calculated as t^IO_round = D_FFT / B_FFT ≥ D_kern / B_kern.

The computation time t^comp_round is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules, as well as by the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

    B_IFFT = min{ D_IFFT / t^IO_round, T_MAC, T_IFFT }                   (8)

By using Equation 7, as well as the relation among t^IO_round, D_FFT and D_IFFT, we can express Equation 8 in the following way:

    B_IFFT = min{ T_MAC, T_IFFT, (f_out / f_in) · T_FFT, (Q / (1 + Q)) · B_sys }   (9)

where Q = (f_out / f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on the hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce, or even remove, the variance of N. We deal with the variance of f_in and f_out in Section 5.3.
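Equations 7-9 can be turned into a small executable model. The sketch below (our own function names and toy numbers) computes B_IFFT from the closed form of Equation 9 and then allocates B_FFT and B_kern per Equation 7; by construction the three bandwidths sum to B_sys.

```python
def system_bandwidths(t_fft, t_mac, t_ifft, b_sys, m_img, f_in, f_out, n):
    d_fft = m_img                       # image data per round (D_FFT)
    d_kern = f_in * f_out * n * n       # kernel data per round (D_kern)
    q = (f_out / f_in) * m_img / (m_img + d_kern)
    # Equation 9: the binding term is the system bottleneck.
    b_ifft = min(t_mac, t_ifft, (f_out / f_in) * t_fft, q / (1 + q) * b_sys)
    # Equation 7: split the remaining bandwidth between FFT input and kernels.
    b_fft = min(d_fft / (d_fft + d_kern) * (b_sys - b_ifft), t_fft)
    b_kern = (b_sys - b_ifft) - b_fft
    return b_fft, b_kern, b_ifft

b_fft, b_kern, b_ifft = system_bandwidths(
    t_fft=8.0, t_mac=16.0, t_ifft=4.0, b_sys=10.0,
    m_img=2 ** 20, f_in=96, f_out=256, n=16)
assert abs((b_fft + b_kern + b_ifft) - 10.0) < 1e-9
```

Varying f_in and f_out in this model shows how the binding term of the min in Equation 9 changes from layer to layer, which is exactly the difficulty addressed in Section 5.3.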

The computation time for one round is given by t^comp_round = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate. Thus, the total time of a round is t_round = t^comp_round. For the double buffering to completely hide the memory latency,

    min{ T_MAC, T_IFFT } ≤ min{ (f_out / f_in) · T_FFT, (Q / (1 + Q)) · B_sys }

has to be satisfied.

For the other kind of design, which contains the kernel FFT modules, corresponding changes need to be made to the above equations. For example, D_kern is changed to f_in · f_out · l_kern². We can derive its performance model using the same idea as in the above discussion.

5.3 Design Space Exploration
The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

    t_layer = (f_in · (l*_img)² / M_img) · t_round = f_out · (l*_img)² / B_IFFT      (10)

where l*_img is the image size after the padding or batch folding operation of the hybrid algorithm discussed in Section 3.

The goal of our architecture optimization is to identify the configuration of the different modules so that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image through all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of whether the metric is throughput or latency, we minimize t_CNN. We formulate the optimization problem as follows:

    minimize_H   Σ_{layer i} t^i_layer
    subject to   LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
                 MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes {N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC}. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can reduce the number of free variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image oriented data reuse; (3) P_img and P_kern always appear as the product P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which kind of hardware resource is the bottleneck of our design. The reason we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

    B_IFFT = T_MAC = T_IFFT = (f_out / f_in) · T_FFT = (Q / (1 + Q)) · B_sys      (11)

Thus, the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H' = {M_img, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out / f_in is large, T_MAC, (f_out / f_in) · T_FFT and (Q / (1 + Q)) · B_sys are of higher value; in this case, T_IFFT will be the bottleneck. Similarly, for deeper layers, the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optimum of each single layer. The single layer optimum is obtained by exploring the design space defined by H'. The procedure is shown in Algorithm 2.

Algorithm 2: Design Space Exploration with Indication
 1: procedure DesignSpaceExploration
 2:   ▷ Conduct design space exploration on a complete CNN
 3:   ▷ OPT stores the optimum for each single layer, OPT[i] ∈ H
 4:   OPT[] ← NULL
 5:   for i = 0 to Layer l do
 6:     Perform design space exploration over H'
 7:     OPT[i] ← optimum for layer i
 8:   end for
 9:   ▷ Range[d] stores the range of values for dimension d
10:   Range[] ← NULL
11:   for d = 0 to Dimension D do
12:     Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
13:   end for
14:   Perform design space exploration of the entire CNN with the design space bounded by Range[]
15: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
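The structure of Algorithm 2 can be sketched in a few lines of Python. The cost function below is a stand-in for the analytical model of Section 5.2, and every name here is our own illustrative choice; the point is the three phases: per-layer search over the reduced space, bounding each dimension by the per-layer optima, then a global search over the bounded space.

```python
from itertools import product

def explore(layers, grid, cost):
    """grid: dict mapping each design-space dimension to candidate values.
    cost(config, layer) -> per-layer time under a performance model."""
    dims = sorted(grid)
    configs = [dict(zip(dims, v)) for v in product(*(grid[d] for d in dims))]
    # Phase 1: optimum of each single layer over the reduced space H'.
    per_layer_opt = [min(configs, key=lambda c: cost(c, ly)) for ly in layers]
    # Phase 2: bound every dimension by the per-layer optima (Range[d]).
    bounded = {d: [v for v in grid[d]
                   if min(o[d] for o in per_layer_opt) <= v
                      <= max(o[d] for o in per_layer_opt)]
               for d in dims}
    candidates = [dict(zip(dims, v))
                  for v in product(*(bounded[d] for d in dims))]
    # Phase 3: global search for the whole CNN inside the bounded space.
    return min(candidates, key=lambda c: sum(cost(c, ly) for ly in layers))

# Toy setup: layers are (f_in, f_out) pairs; two hypothetical dimensions.
layers = [(3, 96), (96, 256), (256, 384)]
grid = {"m_img": [1, 2, 4, 8], "p_mac": [64, 128, 256, 512]}
toy_cost = lambda c, ly: ly[0] * ly[1] / c["p_mac"] + ly[1] / c["m_img"]
best = explore(layers, grid, toy_cost)
assert best == {"m_img": 8, "p_mac": 512}   # toy cost is monotone
```

In the real flow, Phase 1 enforces the equality constraints of Equation 11, so the per-layer searches are cheap and the final bounded search stays small.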

6 EXPERIMENTS
6.1 Experimental Setup
We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and FPGA is through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

[Figure 5: Target architecture integrating general purpose processors and FPGA. CPU cores and the on-chip cache hierarchy connect to main memory and other I/O devices; the FPGA (BRAM, LUTs, system interface and on-chip interconnection) attaches through a coherent memory interface.]

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiments, we distribute the workload between the CPU and the FPGA as follows. The FPGA performs the FFT, Hadamard product, and IFFT operations on the input image tiles and kernel tiles. The CPU is responsible for the other light-weight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.

6.2 Low Latency or High Throughput Design
Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses 64-point FFT for the first two layers and 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high-throughput design chooses 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing the kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high-throughput design requires configuration 1, while the low-latency design requires both configurations 1 and 2 and switches between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design, this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of the MAC is the same for the two configurations. Thus, the only step needed for MAC module reconfiguration is to change the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation
We compare our work with state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                   | DATE'16 [21]   | FPGA'16 [25]   | FPL'16 [17]    | FPGA'17 [28]    | Our Design (Low Latency) | Our Design (High Throughput)
FPGA               | Virtex-7 VC707 | Stratix-V GSD8 | Stratix-V GXA7 | Stratix-V GXA7  | Stratix-V GXA7           | Stratix-V GXA7
Frequency (MHz)    | 160            | —              | 100            | 200             | 200                      | 200
Precision          | Fixed 32-bit   | Fixed 8-16 bit | Fixed 8-16 bit | Floating 32-bit | Fixed 16-bit             | Fixed 16-bit
DSP Utilization    | 2688 (96%)     | 234 (91.4%)    | 256 (100%)     | 224 (87.5%)     | 232 (90.6%)              | 256 (100%)
Logic Utilization  | 45K (9%)       | 152K (65%)     | 121K (52%)     | 201K (86%)      | 200K (85%)               | 199K (85%)
On-chip RAM        | 543 (53%)      | 1744 (71%)     | 1552 (61%)     | 2000 (80%)      | 2143 (85%)               | 2130 (85%)
Latency (ms)       | —              | 20.1           | 12.75          | 40.81           | 12.4                     | 51.2
Throughput (GOPS)  | 147.82         | 72.4           | 114.5          | 83.0            | 117.7                    | 163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC*N^2/q_MAC   P_FFT*N^2/(q_1DFFT*q_2DFFT)   P_IFFT*N^2/(q_1DIFFT*q_2DIFFT)
1      2.4          1*64^2/4          1*64^2/(32*16)                1*64^2/(64*16)
2      2.4          4*16^2/1          1*16^2/(8*4)                  1*16^2/(16*4)

CPU computes concurrently with the FPGA for the other operations involved in the CNN. The CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, which is lower than all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, which is the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and the low-latency designs are very similar; the same hardware can be reused by the two designs, with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of the FFT, MAC, and IFFT modules and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low-Latency Design). Unit of the values: complex words/sec.

        T_MAC   T_IFFT   (f_out/f_in)*T_FFT   [Q/(1+Q)]*B_sys   Bottleneck
CONV1   3413    80       1280                 61                B_sys
CONV2   107     80       107                  42                B_sys
CONV3   40      80       60                   32                B_sys
CONV4   27      80       40                   19                B_sys
CONV5   27      80       27                   18                B_sys

From Table 5, we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency must be at least the data transfer time of the image data, kernel data, and output data. This time, t_mem, can easily be calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.
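The t_mem floor is simple arithmetic: total transferred bytes divided by the peak memory bandwidth. A minimal sketch follows; the 10 MB data volume is a hypothetical placeholder for illustration, and only the 5 GB/s peak bandwidth comes from the measurements above:

```python
# Latency floor from memory traffic: t_mem = total bytes / peak bandwidth.
# 10 MB is an illustrative volume, not a measured figure; 5 GB/s is the
# measured peak shared-memory-to-FPGA bandwidth reported in Section 6.1.
def t_mem_ms(total_bytes, bandwidth_bytes_per_s):
    return total_bytes / bandwidth_bytes_per_s * 1e3  # milliseconds

assert abs(t_mem_ms(10e6, 5e9) - 2.0) < 1e-9  # 10 MB at 5 GB/s -> 2 ms floor
```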

7 CONCLUSION

In this paper, we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations about state-of-the-art CNNs: large variance in the image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 12.4 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle the cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. (2017). https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN). 1-7. DOI: https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. (2017). 55-64. DOI: https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1-7. DOI: https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). 1-6. DOI: https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240-249. DOI: https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257-260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23-32. DOI: https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815 (2016). http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675-678. DOI: https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308 (2015). http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45-54. DOI: https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1-8. DOI: https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851 (2013). http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26-35. DOI: https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1393-1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks, Part III (ICANN'10). Springer-Verlag, Berlin, Heidelberg, 82-91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1-12. DOI: https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16-25. DOI: https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161-170. DOI: https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (2017). 35-44. DOI: https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on. IEEE, 1-6.


[Figure 3: Effect of CaP optimization. x-axis: image size (0 to 240); y-axis: reduction in computation complexity by CaP (0.2 to 1.0), calculated by dividing the number of operations of CaP-OaA by the number of operations of the hybrid algorithm discussed in Section 3.1. Two curves are shown: FFT Seq 1 and FFT Seq 2.]

3.3 Discussion on the Various Optimization Techniques

The various optimization techniques discussed so far can be viewed in a unified way. By setting the batch folding factor to 1, CaP-OaA reduces to OaA. By setting N = l_img + l_kern - 1, OaA reduces to the native frequency domain convolution. Note that there is no pre-processing overhead incurred by OaA or CaP, and CaP completely eliminates the need to reconfigure hardware when processing different layers.

We can perform optimization with a hybrid algorithm that combines all three techniques: the native approach, OaA, and CaP-OaA. The three algorithms can be applied separately to different layers. Significant reduction in computation complexity can thus be achieved for arbitrary CNNs with very little hardware overhead.

We should also note that CaP is based on batch processing; although throughput is further optimized, latency may be compromised. Thus, a hybrid algorithm that uses only OaA and the native algorithm results in a low-latency design, while further inclusion of CaP-OaA results in a high-throughput design.
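These reductions can be made concrete with a small NumPy model of OaA-based frequency-domain convolution; setting N = l_img + l_kern - 1 collapses it to the native single-tile case. The tile and layer sizes below are illustrative only, not the paper's configurations:

```python
import numpy as np

def fft_conv_oaa(image, kernel, n_fft):
    """2D linear convolution via Overlap-and-Add with N-point FFT tiles.

    With n_fft = image.shape[0] + kernel.shape[0] - 1 this degenerates to the
    native frequency-domain convolution (a single tile covering the image)."""
    lk = kernel.shape[0]
    tile = n_fft - lk + 1                       # tile stride; outputs overlap by lk-1
    li = image.shape[0]
    out = np.zeros((li + lk - 1, li + lk - 1))
    kf = np.fft.fft2(kernel, (n_fft, n_fft))    # kernel FFT, zero-padded to N x N
    for r in range(0, li, tile):
        for c in range(0, li, tile):
            patch = image[r:r + tile, c:c + tile]
            pf = np.fft.fft2(patch, (n_fft, n_fft))   # tile FFT
            conv = np.real(np.fft.ifft2(pf * kf))     # Hadamard product + IFFT
            h, w = patch.shape
            out[r:r + h + lk - 1, c:c + w + lk - 1] += conv[:h + lk - 1, :w + lk - 1]  # overlap-add
    return out

rng = np.random.default_rng(1)
img, ker = rng.standard_normal((12, 12)), rng.standard_normal((3, 3))
direct = np.zeros((14, 14))
for i in range(3):                               # reference: direct full 2D convolution
    for j in range(3):
        direct[i:i + 12, j:j + 12] += ker[i, j] * img
assert np.allclose(fft_conv_oaa(img, ker, n_fft=8), direct)    # OaA with small tiles
assert np.allclose(fft_conv_oaa(img, ker, n_fft=14), direct)   # native: N = 12 + 3 - 1
```

The CPU-side work described in Section 6.1 (partitioning into tiles, combining via overlap-add) corresponds to the slicing and `+=` accumulation here; the FFT, product, and IFFT in the loop body are what the FPGA executes.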

4 SYSTEM ARCHITECTURE

This section gives an overview of the architecture design on FPGA that performs the hybrid algorithm for frequency domain convolution. The architecture consists of three main modules, which perform the FFT, Hadamard product, and IFFT operations on the image and kernel data.

4.1 Streaming 2D FFT Module

The 2D FFT on an N x N matrix involves two phases of computation. In the first phase, we perform N-point 1D FFTs on the N rows of the input matrix. In the second phase, we perform N-point 1D FFTs on the N columns of phase 1's output matrix. A matrix transpose operation is involved between the two phases. We design resource efficient hardware for both the 1D FFT and the matrix transpose operation.

The radix-x N-point streaming 1D FFT architecture consists of log_x N computation stages. To meet the bandwidth requirement of the input data, we can fold each stage f_FFT times, so that the input parallelism of stage 0 is decreased from N words to N/f_FFT. A streaming permutation of size m_i = N/x^(i-1) is required between stages i and (i+1) to support folding [6]. Thus, the total memory consumption of the permutation networks for the log_x N stages is 2N + sum(m_i) ~ 3N, where the 2N term is due to the bit-reversal in the last stage [7].

To perform the matrix transpose of size N x N between the two phases of the 2D FFT, we utilize the Streaming Permutation Network (SPN) proposed in [8]. The SPN can perform arbitrary stride permutations using the in-place permutation in time algorithm, and requires only a single-port memory of size N x N.
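The two-phase row-column decomposition, with the transpose handled by the SPN in hardware, can be sanity-checked against a library 2D FFT. This is a functional model only; it does not model the folded streaming pipelines:

```python
import numpy as np

def streaming_2d_fft(tile):
    """Two-phase 2D FFT: 1D FFTs on the N rows, a transpose, then 1D FFTs again."""
    phase1 = np.fft.fft(tile, axis=1)          # phase 1: N-point 1D FFT on each row
    transposed = phase1.T                      # matrix transpose (the SPN's job on chip)
    phase2 = np.fft.fft(transposed, axis=1)    # phase 2: 1D FFT on each column of phase 1's output
    return phase2.T                            # transpose back to the original orientation

n = 8
rng = np.random.default_rng(0)
tile = rng.standard_normal((n, n))
assert np.allclose(streaming_2d_fft(tile), np.fft.fft2(tile))
```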

Our proposed streaming 2D FFT module consists of P_FFT units of N x N 2D FFT. Each 2D FFT unit can be folded q_2DFFT times, so that it contains N/q_2DFFT pipelines of N-point 1D FFT. Each 1D FFT pipeline can also be folded q_1DFFT times, so that the data parallelism of each pipeline is N/q_1DFFT. The total data parallelism of the 2D FFT module is P_FFT * N^2 / (q_1DFFT * q_2DFFT).

4.2 Multiply-and-Accumulate Module

The Hadamard product is performed by the Multiply-and-Accumulate (MAC) module. Inputs to the MAC module are the kernel and image tiles after the Fourier transform; outputs of the MAC are the resultant tiles after the Hadamard product computation. The P_img image tiles and P_kern kernel tiles fed into the MAC module in one cycle should have the same index in the f_in dimension. Then, after f_in cycles, the accumulator can complete the reduction over the f_in dimension. Data parallelism of P_img * P_kern * N^2 is required to support the full pairing of the set of P_img image tiles with the set of P_kern kernel tiles. If input is available every cycle, the output rate of the MAC module is P_img * P_kern tiles per f_in cycles. The required input rate for image tiles is P_img tiles per f_out/P_kern cycles. Thus the ratio of the input rate to the output rate is f_in/f_out, and we can easily verify that the MAC module design is work optimal.
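Functionally, the MAC stage accumulates Hadamard products over the f_in dimension for every output feature map. A NumPy model of that computation follows (the tile shapes are illustrative; the hardware streams P_img x P_kern tile pairs per cycle rather than looping):

```python
import numpy as np

def mac_module(img_tiles, kern_tiles):
    """Hadamard product plus accumulation over f_in, as done by the MAC module.

    img_tiles:  (f_in, N, N) image tiles after 2D FFT
    kern_tiles: (f_out, f_in, N, N) kernel tiles after 2D FFT
    returns:    (f_out, N, N) output tiles, still in the frequency domain
    """
    f_out, f_in, n, _ = kern_tiles.shape
    out = np.zeros((f_out, n, n), dtype=complex)
    for fi in range(f_in):        # one f_in index per cycle; reduction done after f_in cycles
        for fo in range(f_out):
            out[fo] += img_tiles[fi] * kern_tiles[fo, fi]  # element-wise (Hadamard) product
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8, 8)) + 1j * rng.standard_normal((4, 8, 8))
ker = rng.standard_normal((6, 4, 8, 8)) + 1j * rng.standard_normal((6, 4, 8, 8))
assert np.allclose(mac_module(img, ker), np.einsum('fij,ofij->oij', img, ker))
```

Because each output element depends only on the matching (row, column) position of its inputs, there are no loop-carried dependencies across tile elements, which is what permits the P_img * P_kern * N^2-wide parallelism described above.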

[Figure 4: Overall System Design. On the FPGA, image tiles pass through a 2D FFT module into an image buffer; a kernel buffer feeds the MAC modules, whose outputs go through 2D IFFT modules back to DRAM. The DRAM is shared with the CPU.]

Table 2: Summary of Resource Consumption and Throughput

             2D FFT                                              MAC
Logic        2*log_x N*(N/q_1DFFT - 1)*(N/q_2DFFT)*P_FFT*L       P_img*P_kern*N^2*L
Memory       (2*3N/f_2DFFT + N^2)*P_FFT*D                        --
Throughput   P_FFT*N^2/(q_1DFFT*q_2DFFT)                         P_img*P_kern*N^2/f_in

We can also fold the MAC module by q_MAC, so that the granularity is not an entire N^2 matrix. The data parallelism after folding is P_img * P_kern * N^2 / q_MAC.

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of the image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex number adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section, we proceed to optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm for any given CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules, as well as the ample resources on chip, create a huge design space to explore. On the other hand, since the same hardware is deployed to accelerate all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, consider the input/output rate requirement of an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in x l_img x l_img and the output is of shape f_out x l'_img x l'_img, then, regardless of the underlying hardware, the required ratio of input to output data rate should be approximately f_in/f_out. However, this ratio differs greatly across layers; for AlexNet, there is a 30x variation between the first and the last layer.

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse, or locality, can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern in the deeply nested loop is complicated, our approach using frequency domain convolution simplifies the memory design by the very nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution; using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product: it conducts element-wise operations on the matrices, meaning that all the loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, so we have more hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses an N-point FFT. If we transfer the kernels in the frequency domain instead of the original kernels in the spatial domain, the total data communication for kernels increases from O(l_kern^2) to O(N^2). Depending on the values of N and l_kern and on the hardware bandwidth and logic resources, either kind of kernel data flow can be more suitable; thus there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.
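The spatial-versus-frequency kernel transfer tradeoff is easy to quantify. The layer shape below is illustrative (loosely AlexNet-like), not a measured configuration from our experiments:

```python
# Kernel communication tradeoff: spatial-domain kernels cost f_in*f_out*l_kern^2
# words, pre-transformed frequency-domain kernels cost f_in*f_out*N^2 words.
def kernel_traffic_words(f_in, f_out, l_kern, n_fft):
    spatial = f_in * f_out * l_kern ** 2    # transfer raw kernels, FFT them on chip
    frequency = f_in * f_out * n_fft ** 2   # transfer kernels already in frequency domain
    return spatial, frequency

# Illustrative 5x5-kernel layer with 16-point FFT tiles: the frequency-domain
# transfer moves (16/5)^2 times more kernel data than the spatial-domain one.
spatial, frequency = kernel_traffic_words(f_in=96, f_out=256, l_kern=5, n_fft=16)
assert frequency > spatial
assert frequency * 5 ** 2 == spatial * 16 ** 2   # exact ratio (N / l_kern)^2
```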

5.2 Analysis of I/O and Computation Cost

The analysis in this section is based on the assumption that kernels in the frequency domain are loaded onto the FPGA, so the design does not contain the kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. The peak throughputs of the FFT, MAC, and IFFT modules can be found in Table 2. We denote them as T_FFT, T_MAC, and T_IFFT (the equations for them are summarized in Table 2).

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern), and the IFFT module (B_IFFT). The amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data transferred to the kernel buffer is D_kern = f_in * f_out * N^2, based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out/f_in) * D_img amount of output.

The I/O time t_round^IO is determined by the largest of the times needed by D_FFT, D_kern, and D_IFFT. Since data is streamed from the MAC through the IFFT to external memory, we deal with D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as on the peak throughput of the subsequent FFT module. This is shown in Equation 7:

    B_FFT = min{ [D_FFT / (D_FFT + D_kern)] * (B_sys - B_IFFT), T_FFT }
    B_kern = (B_sys - B_IFFT) - B_FFT                                          (7)

The I/O time is then calculated as t_round^IO = max{ D_FFT / B_FFT, D_kern / B_kern }.

The computation time t_round^comp is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules, as well as by the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

    B_IFFT = min{ D_IFFT / t_round^IO, T_MAC, T_IFFT }                         (8)

By using Equation 7, as well as the relation among t_round^IO, D_FFT, and D_IFFT, we express Equation 8 in the following way:

    B_IFFT = min{ T_MAC, T_IFFT, (f_out/f_in) * T_FFT, [Q/(1+Q)] * B_sys }     (9)

where Q = (f_out/f_in) * M_img / (M_img + f_in * f_out * N^2).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys, and M_img capture the hardware constraints of logic resources, external bandwidth, and on-chip memory; f_in, f_out, and N reveal the impact of the CNN specification on the hardware performance. The problem is that f_in, f_out, and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N. We deal with the variance of f_in and f_out in Section 5.3.
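The min in Equation 9 can be evaluated per layer to see which resource binds, which is exactly the analysis reported later in Table 5. A small sketch follows; all throughput and bandwidth numbers below are made up for illustration and are not the measured values:

```python
# Evaluate the Equation 9 bound per layer:
#   B_IFFT = min{ T_MAC, T_IFFT, (f_out/f_in)*T_FFT, Q/(1+Q)*B_sys },
#   Q = (f_out/f_in) * M_img / (M_img + f_in*f_out*N^2).
def layer_throughput(t_mac, t_ifft, t_fft, b_sys, m_img, f_in, f_out, n_fft):
    q = (f_out / f_in) * m_img / (m_img + f_in * f_out * n_fft ** 2)
    bounds = {
        "T_MAC": t_mac,
        "T_IFFT": t_ifft,
        "(f_out/f_in)*T_FFT": (f_out / f_in) * t_fft,
        "Q/(1+Q)*B_sys": q / (1 + q) * b_sys,
    }
    bottleneck = min(bounds, key=bounds.get)  # the binding resource for this layer
    return bounds[bottleneck], bottleneck

# Hypothetical layer where external bandwidth is the binding constraint:
value, name = layer_throughput(t_mac=100, t_ifft=80, t_fft=10, b_sys=50,
                               m_img=1000, f_in=4, f_out=8, n_fft=16)
assert name == "Q/(1+Q)*B_sys"
```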

The computation time for one round is given by t_round^comp = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate; thus the total time of a round is t_round = t_round^comp. For the double-buffering to completely hide the memory latency, min{ T_MAC, T_IFFT } <= min{ (f_out/f_in) * T_FFT, [Q/(1+Q)] * B_sys } has to be satisfied.

For the other kind of design, containing the kernel FFT modules, corresponding changes need to be made to the above equations. For example, D_kern is changed to f_in * f_out * l_kern^2. We can derive its performance model using the same ideas as in the above discussion.

5.3 Design Space Exploration

The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

    t_layer = [f_in * (l*_img)^2 / M_img] * t_round = f_out * (l*_img)^2 / B_IFFT     (10)

where l*_img is the image size after the padding or batch folding operation of the hybrid algorithm discussed in Section 3.

The goal of our architecture optimization is to identify the configuration of the different modules so that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image by all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d^2 images in I_CaP. Thus, regardless of the metric being throughput or latency, we minimize t_CNN. We formulate an optimization problem as follows:

    minimize_H    sum over layers i of t_layer^i
    subject to    LOGIC_{FFT,MAC,IFFT} <= LOGIC_sys
                  MEMORY_{FFT,MAC,IFFT} <= MEMORY_sys

where the set of architectural parameters H includes {N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC}. The parameters are defined in Sections 4 and 5.

LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC, and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can remove three variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance, because of the image oriented data reuse; (3) P_img and P_kern always appear in the form P_img * P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which kind of hardware resource is the bottleneck of our design. The reason we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

    B_IFFT = T_MAC = T_IFFT = (f_out/f_in) * T_FFT = [Q/(1+Q)] * B_sys       (11)

Thus the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H' = {M_img, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across the convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in) * T_FFT, and [Q/(1+Q)] * B_sys are of higher value; in this case, T_IFFT will be the bottleneck. Similarly, for deeper layers the bottleneck may change to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optima of the single layers. Each single layer optimum is obtained by exploring the design space defined by H'. The procedure is shown in Algorithm 2.

Algorithm 2: Design Space Exploration with Indication
 1: procedure DesignSpaceExploration
 2:     // Conduct design space exploration on a complete CNN
 3:     // OPT stores the optimum for each single layer,
 4:     // OPT[i] in H
 5:     OPT[] <- NULL
 6:     for i = 0 to layer L do
 7:         Perform design space exploration over H'
 8:         OPT[i] <- optimum for layer i
 9:     end for
10:     // Range[d] stores the range of values for dimension d
11:     Range[] <- NULL
12:     for d = 0 to dimension D do
13:         Range[d] <- [min_i OPT[i][d], max_i OPT[i][d]]
14:     end for
15:     Design space exploration of the entire CNN, with the design space bounded by Range[]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of sub-problems of much smaller size. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
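Algorithm 2's two-phase structure can be sketched as runnable code. The cost model and parameter grids below are toy placeholders; the actual inner exploration runs over H' with the analytical model of Equations 9-11:

```python
# Sketch of Algorithm 2: per-layer optima bound the design space for the whole-CNN search.
# The grid dimensions and the cost function are hypothetical stand-ins.
import itertools

def explore(grid, cost):
    """Exhaustively search a small design space; grid maps dimension -> candidate values."""
    names = list(grid)
    best, best_cost = None, float("inf")
    for point in itertools.product(*(grid[n] for n in names)):
        conf = dict(zip(names, point))
        c = cost(conf)
        if c < best_cost:
            best, best_cost = conf, c
    return best

def dse_with_indication(layers, grid, layer_cost):
    # Phase 1: optimum of each single layer (the OPT[i] of Algorithm 2)
    opts = [explore(grid, lambda conf, l=layer: layer_cost(conf, l)) for layer in layers]
    # Phase 2: Range[d] = [min_i OPT[i][d], max_i OPT[i][d]] for each dimension
    bounded = {
        d: [v for v in grid[d] if min(o[d] for o in opts) <= v <= max(o[d] for o in opts)]
        for d in grid
    }
    # Phase 3: explore the entire CNN inside the bounded space
    return explore(bounded, lambda conf: sum(layer_cost(conf, l) for l in layers))

# Toy model: a layer (f_in, f_out) prefers a buffer size near f_in*f_out (hypothetical).
layers = [(3, 96), (96, 256), (256, 384)]
grid = {"m_img": [256, 1024, 4096, 16384, 65536], "p_mac": [1, 2, 4, 8]}
cost = lambda conf, l: abs(conf["m_img"] - l[0] * l[1]) + 100 / conf["p_mac"]
best = dse_with_indication(layers, grid, cost)
assert best["p_mac"] == 8
```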

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and the FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s theoretical bandwidth1. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s. The

1Giga Transactions per second

[Figure 5: Target architecture integrating general purpose processors and FPGA. CPU cores with an on-chip cache hierarchy and the FPGA (BRAM, LUTs, on-chip interconnection, system interface) connect to the shared main memory through coherent memory interfaces; other I/O devices attach to the CPU.]

Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.


To get deeper understanding of our implementation we analyzethe experiment data based on Equation 9 Table 5 shows the FPGAthroughput for each convolution layer in the case of the low latencydesign The middle columns show the peak throughput achievablegiven the hardware conguration of FFT MAC IFFT and the peakbandwidth to external memory

Table 5 Analysis on the Bottleneck of Our System (Low La-tency Design) Unit of the values ComplexWordSec

TMAC TI F FTfoutfin

TF FTQ1+Q Bsys Bottleneck

CONV1 3413 80 1280 61 BsysCONV2 107 80 107 42 BsysCONV3 40 80 60 32 BsysCONV4 27 80 40 19 BsysCONV5 27 80 27 18 Bsys

From Table 5 we conclude that the external bandwidth is thebottleneck of the system This observation is veried by the mea-sured latency The single image classication latency should be atleast the data transfer time of image data kernel data and outputdata The time tmem can be easily calculated by dividing the totalamount of data with the peak memory bandwidth The measuredlatency of 124ms is very close to tmem meaning that our designsaturates the memory bandwidth Given a system with higher ex-ternal memory bandwidth the performance of our design can befurther improved

7 CONCLUSIONIn this paper we presented a novel algorithm-architecture co-designmethodology to improve the latency or throughput of large scaleCNNs on FPGAs Our design was based on two observations onstate-of-the-art CNNs ndash large variance in image sizes limд and largevariance in the number of feature maps fin and fout Our algorithmoptimization was based on the OaA technique applied to frequencydomain convolution To address the variance of limд we proposeda dual operation of OaA called Concatenate and Pad (CaP) Thistechnique reduces the computation complexity of convolution andeliminate the hardware overhead for runtime reconguration Toaddress the variance of fin and fout we proposed a system designto support our algorithmic optimization We further formulatedthe analytical performance model for our system We derived anecient design space exploration algorithm by reducing the numberof variables using the optimum related with a single convolutionlayer

We demonstrated our co-design methodology by implement-ing two designs of AlexNet on the Intel HARP platform Our low-latency implementation achieved latency of 124ms and our high-throughput implementation achieved throughput of 163 GOPS Ourimplementations outperform state-of-the-art implementations ofAlexNet on FPGAs signicantly

In the future we will further generalize the co-design methodol-ogy to support wider range of CNNs and FPGAs We will implementtechniques such as layer partitioning to handle the cases whereon-chip memory is not large enough for a large fin value We willalso compare our work with GPU and multi-core implementations

REFERENCES[1] 2015 Intel Inc Xeon+FPGA Platform for the Data Center (2015) httpswww

ececmueducalcmcarllibexefetchphpmedia=carl15-guptapdf[2] 2017 Overlap-add method (2017) httpsenwikipediaorgwikiOverlap-add_

method[3] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen

Craig Citro Gregory S Corrado Andy Davis Jerey Dean Matthieu Devin San-jay Ghemawat Ian J Goodfellow Andrew Harp Georey Irving Michael IsardYangqing Jia Rafal Joacutezefowicz Lukasz Kaiser Manjunath Kudlur Josh LevenbergDan Maneacute Rajat Monga Sherry Moore Derek Gordon Murray Chris Olah MikeSchuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul ATucker Vincent Vanhoucke Vijay Vasudevan Fernanda B Vieacutegas Oriol VinyalsPete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng2016 TensorFlow Large-Scale Machine Learning on Heterogeneous DistributedSystems CoRR abs160304467 (2016) httparxivorgabs160304467

[4] B Ahn 2015 Real-time video object recognition using convolutional neuralnetwork In 2015 International Joint Conference on Neural Networks (IJCNN)1ndash7 DOIhttpsdoiorg101109IJCNN20157280718

[5] Utku Aydonat Shane OrsquoConnell Davor Capalija Andrew C Ling and Gordon RChiu 2017 An OpenCLtradeDeep Learning Accelerator on Arria 10 (2017) 55ndash64DOIhttpsdoiorg10114530200783021738

[6] R Chen H Le and V K Prasanna 2013 Energy ecient parameterized FFT ar-chitecture In 2013 23rd International Conference on Field programmable Logicand Applications 1ndash7 DOIhttpsdoiorg101109FPL20136645545

[7] R Chen N Park and V K Prasanna 2013 High throughput energy ecientparallel FFT architecture on FPGAs In 2013 IEEE High Performance ExtremeComputing Conference (HPEC) 1ndash6 DOIhttpsdoiorg101109HPEC20136670343

[8] Ren Chen Sruja Siriyal and Viktor Prasanna 2015 Energy and MemoryEcient Mapping of Bitonic Sorting on FPGA In Proceedings of the 2015ACMSIGDA International Symposium on Field-Programmable Gate Arrays(FPGA rsquo15) ACM New York NY USA 240ndash249 DOIhttpsdoiorg10114526847462689068

[9] R DiCecco G Lacey J Vasiljevic P Chow G Taylor and S Areibi 2016 Caf-feinated FPGAs FPGA Framework For Convolutional Neural Networks ArXive-prints (Sept 2016) arXivcsCV160909671

[10] Cleacutement Farabet Berin Martini Polina Akselrod Selccediluk Talay Yann LeCunand Eugenio Culurciello 2010 Hardware accelerated convolutional neuralnetworks for synthetic vision systems In Proceedings of 2010 IEEE InternationalSymposium on Circuits and Systems IEEE 257ndash260

[11] Mario Garrido J Grajal M A Saacutenchez and Oscar Gustafsson 2013 PipelinedRadix-2K Feedforward FFT Architectures IEEE Trans Very Large Scale IntegrSyst 21 1 (Jan 2013) 23ndash32 DOIhttpsdoiorg101109TVLSI20112178275

[12] Tyler Highlander and Andres Rodriguez 2016 Very Ecient Training of Convo-lutional Neural Networks using Fast Fourier Transform and Overlap-and-AddCoRR abs160106815 (2016) httparxivorgabs160106815

[13] Yangqing Jia Evan Shelhamer Je Donahue Sergey Karayev Jonathan LongRoss Girshick Sergio Guadarrama and Trevor Darrell 2014 Cae ConvolutionalArchitecture for Fast Feature Embedding In Proceedings of the 22Nd ACMInternational Conference on Multimedia (MM rsquo14) ACM New York NY USA675ndash678 DOIhttpsdoiorg10114526478682654889

[14] Alex Krizhevsky Ilya Sutskever and Georey E Hinton 2012 Imagenet clas-sication with deep convolutional neural networks In Advances in NeuralInformation Processing Systems 1097ndash1105

[15] Andrew Lavin 2015 Fast Algorithms for Convolutional Neural Networks CoRRabs150909308 (2015) httparxivorgabs150909308

[16] Yufei Ma Yu Cao Sarma Vrudhula and Jae-sun Seo 2017 Optimizing LoopOperation and Dataow in FPGA Acceleration of Deep Convolutional NeuralNetworks In Proceedings of the 2017 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo17) ACM New York NY USA 45ndash54DOIhttpsdoiorg10114530200783021736

[17] Yufei Ma N Suda Yu Cao J s Seo and S Vrudhula 2016 Scalable and modu-larized RTL compilation of Convolutional Neural Networks onto FPGA In 201626th International Conference on Field Programmable Logic and Applications(FPL) 1ndash8 DOIhttpsdoiorg101109FPL20167577356

[18] Michaeumll Mathieu Mikael Hena and Yann LeCun 2013 Fast Training ofConvolutional Networks through FFTs CoRR abs13125851 (2013) httparxivorgabs13125851

[19] Kalin Ovtcharov Olatunji Ruwase Joo-Young Kim Jeremy Fow-ers Karin Strauss and Eric Chung 2015 Accelerating Deep Con-volutional Neural Networks Using Specialized Hardware (Febru-ary 2015) httpswwwmicrosoftcomen-usresearchpublicationaccelerating-deep-convolutional-neural-networks-using-specialized-hardware

[20] Jiantao Qiu Jie Wang Song Yao Kaiyuan Guo Boxun Li Erjin Zhou JinchengYu Tianqi Tang Ningyi Xu Sen Song Yu Wang and Huazhong Yang 2016Going Deeper with Embedded FPGA Platform for Convolutional Neural Net-work In Proceedings of the 2016 ACMSIGDA International Symposium on

Field-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 26ndash35 DOIhttpsdoiorg10114528472632847265

[21] A Rahman J Lee and K Choi 2016 Ecient FPGA acceleration of Con-volutional Neural Networks using logical-3D compute array In 2016 DesignAutomation Test in Europe Conference Exhibition (DATE) 1393ndash1398

[22] Dominik Scherer Hannes Schulz and Sven Behnke 2010 Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors InProceedings of the 20th International Conference on Articial Neural NetworksPart III (ICANNrsquo10) Springer-Verlag Berlin Heidelberg 82ndash91 httpdlacmorgcitationcfmid=18864361886446

[23] H Sharma J Park D Mahajan E Amaro J K Kim C Shao A Mishra and HEsmaeilzadeh 2016 From high-level deep neural models to FPGAs In 2016 49thAnnual IEEEACM International Symposium on Microarchitecture (MICRO) 1ndash12 DOIhttpsdoiorg101109MICRO20167783720

[24] Karen Simonyan and Andrew Zisserman 2014 Very Deep ConvolutionalNetworks for Large-Scale Image Recognition CoRR abs14091556 (2014)httparxivorgabs14091556

[25] Naveen Suda Vikas Chandra Ganesh Dasika Abinash Mohanty Yufei MaSarma Vrudhula Jae-sun Seo and Yu Cao 2016 Throughput-OptimizedOpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Net-works In Proceedings of the 2016 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 16ndash25DOIhttpsdoiorg10114528472632847276

[26] Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E ReedDragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabi-novich 2014 Going Deeper with Convolutions CoRR abs14094842 (2014)httparxivorgabs14094842

[27] Chen Zhang Peng Li Guangyu Sun Yijin Guan Bingjun Xiao and Jason Cong2015 Optimizing FPGA-based Accelerator Design for Deep Convolutional NeuralNetworks In Proceedings of the 2015 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo15) ACM New York NY USA 161ndash170 DOIhttpsdoiorg10114526847462689060

[28] Chi Zhang and Viktor Prasanna 2017 Frequency Domain Acceleration ofConvolutional Neural Networks on CPU-FPGA Shared Memory System (2017)35ndash44 DOIhttpsdoiorg10114530200783021727

[29] Dimitrios Ziakas Allen Baum Robert A Maddox and Robert J Safranek 2010Intelreg quickpath interconnect architectural features supporting scalable sys-tem architectures In High Performance Interconnects (HOTI) 2010 IEEE 18thAnnual Symposium on IEEE 1ndash6


[Figure 4: Overall System Design — block diagram of the FPGA datapath: image and kernel buffers feed the MAC units, placed between the 2D FFT and 2D IFFT modules, with the FPGA attached to the CPU and DRAM.]

Table 2: Summary of Resource Consumption and Throughput

              2D FFT                                              MAC
Logic         2·log_x(N)·(N/q_1DFFT − 1)·(N/q_2DFFT)·P_FFT·L     P_img·P_kern·N²·L
Memory        (2·3N/f_2DFFT + N²)·P_FFT·D                        ——
Throughput    P_FFT·N²/(q_1DFFT·q_2DFFT)                         P_img·P_kern·N²/f_in

We can also fold the MAC module by q_MAC so that the granularity is not an entire N² matrix. The data parallelism after folding is P_img·P_kern·N²/q_MAC.
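As a concrete reading of Table 2's throughput rows, the sketch below evaluates the peak output rates of the 2D FFT and MAC modules in complex words per cycle. The function names are ours; the sample values are taken from Table 4's configuration 1 and are otherwise illustrative.

```python
# Peak module throughputs following Table 2 (complex words per cycle).
# Parameter names mirror the paper's symbols.
def fft_peak_throughput(P_FFT, N, q_1dfft, q_2dfft):
    # Table 2: P_FFT * N^2 / (q_1DFFT * q_2DFFT)
    return P_FFT * N**2 / (q_1dfft * q_2dfft)

def mac_peak_throughput(P_img, P_kern, N, f_in):
    # Table 2: P_img * P_kern * N^2 / f_in -- each output feature map
    # accumulates contributions from the f_in input feature maps
    return P_img * P_kern * N**2 / f_in

fft_rate = fft_peak_throughput(P_FFT=1, N=64, q_1dfft=32, q_2dfft=16)  # 8.0
```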

4.3 Overall System Design

The full system design is shown in Figure 4. The FFT modules in dashed lines before the kernel buffers will be discussed in Section 5.1. The image tiles are processed by the 2D FFT module before they are stored in the image buffer. The output tiles are transformed back to the spatial domain by the 2D IFFT modules before they are written to external memory. We use the double buffering technique to hide the memory latency of image and kernel tiles.

The resource consumption and throughput of the FFT and MAC modules are summarized in Table 2. The constant L refers to the logic consumption of one complex-number adder and multiplier; D refers to the memory consumption of one complex number.

5 ARCHITECTURE LEVEL OPTIMIZATION

In this section we proceed to the optimization at the architecture level, so that we can perform the most efficient mapping of our optimized algorithm given any CNN and target FPGA. The challenges are two-fold. On the one hand, the various kinds of hardware modules as well as the ample resources on chip create a huge design space to explore. On the other hand, since the same hardware is deployed for accelerating all the convolution layers of a CNN, it is non-trivial to identify a globally optimal configuration for the entire CNN.

As a simple example, we consider the input/output rate requirement for an arbitrary architecture designed for a convolution layer. If the input data is of shape f_in × l_img × l_img and the output is of shape f_out × l'_img × l'_img, then regardless of the underlying hardware, the required ratio of input to output data rate should be approximately f_in/f_out. However, such a ratio differs greatly across layers: for AlexNet, there is a 30x variation between the first and the last layer.
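To make the rate-matching point concrete, the snippet below computes f_in/f_out for a few AlexNet-like layer shapes. The feature map counts are the commonly cited AlexNet values and are an assumption here, so the exact variation factor may differ from the figure quoted above.

```python
# Input-to-output data rate ratio f_in / f_out per layer
# (layer shapes assumed, not taken from the paper's tables).
layers = {            # name: (f_in, f_out)
    "CONV1": (3, 96),
    "CONV3": (256, 384),
    "CONV5": (384, 256),
}
ratios = {name: f_in / f_out for name, (f_in, f_out) in layers.items()}
variation = max(ratios.values()) / min(ratios.values())
```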

Given the complexity of the problem, we derive an analytic performance model for the architecture described in Section 4. We formulate our problem as a non-linear optimization problem and propose a technique to greatly reduce the design space to be explored.

5.1 Data Flow

There has been much discussion on how to design the data flow so that memory reuse or locality can be exploited to the maximum extent. Unlike the case of spatial convolution, where the memory access pattern in the deeply nested loop is complicated, our approach using frequency domain convolution simplifies the memory design by the nature of the algorithm. CaP and OaA are analogous to loop tiling and unrolling [16] in spatial convolution. Using these techniques, we can scale the tile and image sizes up or down based on the available hardware resources. The key difference from spatial convolution is the Hadamard product: it performs element-wise operations on the matrices, meaning that all loop-carried dependencies are automatically eliminated. This property ensures that massive parallelism can be exploited on the FPGA device.
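The element-wise nature of the frequency-domain product can be checked in a few lines of NumPy. This is a sanity check of the algorithmic property, not the paper's hardware path, and all sizes below are made up.

```python
import numpy as np

# An N-point 2D FFT, Hadamard product, and IFFT reproduce linear
# convolution when the operands are zero-padded to N >= l_img + l_kern - 1.
l_img, l_kern, N = 6, 3, 8          # N >= 6 + 3 - 1
rng = np.random.default_rng(0)
img = rng.standard_normal((l_img, l_img))
kern = rng.standard_normal((l_kern, l_kern))

# Frequency-domain path: FFT -> element-wise product -> IFFT
freq = np.real(np.fft.ifft2(np.fft.fft2(img, s=(N, N)) *
                            np.fft.fft2(kern, s=(N, N))))
freq = freq[:l_img + l_kern - 1, :l_img + l_kern - 1]

# Direct 2D linear convolution for reference
direct = np.zeros((l_img + l_kern - 1, l_img + l_kern - 1))
for i in range(l_kern):
    for j in range(l_kern):
        direct[i:i + l_img, j:j + l_img] += kern[i, j] * img

assert np.allclose(freq, direct)
```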

The priority in designing our data flow is to exploit as much parallelism as possible for the MAC module. Since the kernel data is unchanged during inference, one option is to directly store the kernels in the frequency domain and transfer them to the FPGA. This saves FFT computation on the kernel data, so we have more hardware resource budget for the MAC module to improve the overall system throughput. However, this approach has a negative impact on the system bandwidth. Suppose the convolution in the frequency domain uses an N-point FFT. Then, if we transfer the kernels to hardware in the frequency domain instead of the original kernels in the spatial domain, the total data communication for kernels increases from O(l²_kern) to O(N²). Depending on the values of N and l_kern and on the hardware bandwidth and logic resources, either kind of kernel data flow can be more suitable; thus there is an optional FFT module before the kernel buffer in Figure 4. In terms of data reuse, we use an image-oriented design so that image reuse is maximized: to consume one image buffer, we reload the kernel buffer several times to make a full pass over the kernel data.
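A back-of-the-envelope comparison of the two kernel data flows follows the O(l²_kern) vs O(N²) argument above; the function name and the plugged-in layer shape are ours and purely illustrative.

```python
def kernel_traffic(f_in, f_out, l_kern, N, frequency_domain):
    # Spatial-domain kernels cost l_kern^2 words each; pre-transformed
    # frequency-domain kernels cost N^2 words each (Section 5.1).
    per_kernel = N**2 if frequency_domain else l_kern**2
    return f_in * f_out * per_kernel

# e.g. a hypothetical layer with 5x5 kernels under a 16-point FFT:
spatial = kernel_traffic(96, 256, 5, 16, frequency_domain=False)
freq_domain = kernel_traffic(96, 256, 5, 16, frequency_domain=True)
```

Here the frequency-domain flow moves roughly 10x more kernel data, which is why the choice depends on the available external bandwidth.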

5.2 Analysis of I/O and Computation Cost

The analysis in this section is based on the assumption that kernels in the frequency domain are loaded onto the FPGA, so we do not have the kernel FFT modules. The analysis can be easily modified for the other kernel data flow.

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. The peak throughputs of the FFT, MAC and IFFT modules are summarized in Table 2; we denote them by T_FFT, T_MAC and T_IFFT.

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). The amount of data transferred to the FFT is D_FFT = M_img, and the amount of data transferred to the kernel buffer is D_kern = f_in · f_out · N², based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out / f_in) · D_img amount of output.

The I/O time t^IO_round is determined by the largest of the times needed by D_FFT, D_kern and D_IFFT. Since data is streamed from the MAC and IFFT to external memory, we account for D_IFFT later, in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern as well as the peak throughput of the subsequent FFT module, as shown in Equation 7:

    B_FFT  = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys − B_IFFT),  T_FFT }
    B_kern = (B_sys − B_IFFT) − B_FFT                                        (7)

The I/O time is then calculated as t^IO_round = D_FFT / B_FFT ≥ D_kern / B_kern.
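Equation 7's bandwidth split can be written down directly. This is a sketch; the units and the numeric values are arbitrary.

```python
def split_bandwidth(D_FFT, D_kern, B_sys, B_IFFT, T_FFT):
    # Equation 7: the FFT stream gets a workload-proportional share of
    # the bandwidth left after the IFFT, capped by the FFT module's
    # peak throughput; the kernel buffer gets the remainder.
    B_FFT = min(D_FFT / (D_FFT + D_kern) * (B_sys - B_IFFT), T_FFT)
    B_kern = (B_sys - B_IFFT) - B_FFT
    return B_FFT, B_kern

B_FFT, B_kern = split_bandwidth(D_FFT=100, D_kern=100, B_sys=10,
                                B_IFFT=2, T_FFT=100)
t_io_round = 100 / B_FFT   # t^IO_round = D_FFT / B_FFT
```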

The computation time t^comp_round is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules as well as by the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown in Equation 8:

    B_IFFT = min{ D_IFFT / t^IO_round,  T_MAC,  T_IFFT }        (8)

By using Equation 7 as well as the relation among t^IO_round, D_FFT and D_IFFT, we can express Equation 8 in the following way:

    B_IFFT = min{ T_MAC,  T_IFFT,  (f_out / f_in) · T_FFT,  (Q / (1 + Q)) · B_sys }        (9)

where Q = (f_out / f_in) · M_img / (M_img + f_in · f_out · N²).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints: logic resources, external bandwidth and on-chip memory. f_in, f_out and N reveal the impact of the CNN specification on hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N; we deal with the variance of f_in and f_out in Section 5.3.
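Equation 9 as a function; the parameter values in the test are purely illustrative, not a real hardware configuration.

```python
def ifft_bandwidth(T_MAC, T_IFFT, T_FFT, B_sys, f_in, f_out, M_img, N):
    # Equation 9: the achievable IFFT output bandwidth is the tightest
    # of the MAC, IFFT and (scaled) FFT compute bounds and the
    # external-bandwidth bound Q/(1+Q) * B_sys.
    Q = (f_out / f_in) * M_img / (M_img + f_in * f_out * N**2)
    return min(T_MAC, T_IFFT, (f_out / f_in) * T_FFT,
               Q / (1 + Q) * B_sys)
```

Whichever term attains the minimum names the bottleneck, which is exactly how Table 5 is read later.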

The computation time for one round is given by t^comp_round = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate; thus the total time of a round is t_round = t^comp_round. For the double buffering to completely hide the memory latency,

    min{ T_MAC,  T_IFFT }  ≤  min{ (f_out / f_in) · T_FFT,  (Q / (1 + Q)) · B_sys }

has to be satisfied.

For the other kind of design, which contains the kernel FFT modules, corresponding changes need to be made to the above equations. For example, D_kern changes to f_in · f_out · l²_kern. We can derive its performance model using the same idea as in the above discussion.

5.3 Design Space Exploration

The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

    t_layer = (f_in · (l*_img)² / M_img) · t_round = f_out · (l*_img)² / B_IFFT        (10)

where l*_img is the image size after the padding or batch folding operation of the hybrid algorithm discussed in Section 3.
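Equation 10 in code form, confirming that rounds × time-per-round collapses to f_out·(l*_img)²/B_IFFT. All input values are hypothetical.

```python
def layer_time(f_in, f_out, l_star, M_img, B_IFFT):
    # Equation 10: number of rounds times time per round.
    rounds = f_in * l_star**2 / M_img          # passes over the image
    t_round = (f_out / f_in) * M_img / B_IFFT  # D_IFFT / B_IFFT
    return rounds * t_round

t = layer_time(f_in=96, f_out=256, l_star=32, M_img=1024, B_IFFT=8.0)
```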

The goal of our architecture optimization is to identify the configuration of the different modules such that the throughput or latency of the entire CNN is optimized. We define t_CNN as the time to process a single image by all convolution layers. For batch processing in the case of CaP-OaA, t_CNN is averaged over the d² images in I_CaP. Thus, regardless of whether the metric is throughput or latency, we minimize t_CNN. We formulate the optimization problem as follows:

    minimize_H   Σ_{layer i}  t^i_layer
    subject to   LOGIC_{FFT, MAC, IFFT}   ≤  LOGIC_sys
                 MEMORY_{FFT, MAC, IFFT}  ≤  MEMORY_sys

where the set of architectural parameters H includes N, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern and q_MAC. The parameters are defined in Sections 4 and 5. LOGIC_{FFT, MAC, IFFT} and MEMORY_{FFT, MAC, IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules; LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can reduce the number of free variables by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image-oriented data reuse; (3) P_img and P_kern always appear as the product P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which kind of hardware resource is the bottleneck of our design. The reason we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single-layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

    B_IFFT = T_MAC = T_IFFT = (f_out / f_in) · T_FFT = (Q / (1 + Q)) · B_sys        (11)

Thus the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H' = {M_img, P_img, q_1DFFT, q_1DIFFT, q_2DFFT, q_2DIFFT}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we can observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary with the convolution layers. For the first few layers, where f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in)·T_FFT and (Q/(1+Q))·B_sys are of higher value; in this case T_IFFT will be the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optimum for each single layer. The single-layer optimum is obtained by exploring the design space defined by H'. The procedure is shown in Algorithm 2.

Algorithm 2: Design Space Exploration with Indication
1:  procedure DesignSpaceExploration
2:    ▷ Conduct design space exploration on a complete CNN
3:    ▷ OPT stores the optimum for each single layer
4:    ▷ OPT[i] ∈ H
5:    OPT[] ← NULL
6:    for i = 0 to Layer l do
7:      Perform design space exploration over H'
8:      OPT[i] ← optimum for layer i
9:    end for
10:   ▷ Range[d] stores the range of values for dimension d
11:   Range[] ← NULL
12:   for d = 0 to Dimension D do
13:     Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:   end for
15:   Design space exploration of the entire CNN with the design space bounded by Range[]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
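A compact Python rendering of Algorithm 2's bounding idea; `cost` and the candidate grids below are placeholders standing in for the paper's performance model and parameter ranges, so this is an illustrative sketch only.

```python
from itertools import product

def bounded_dse(layers, grids, cost):
    # Algorithm 2 sketch: (1) find the optimum configuration for each
    # layer alone over the full grid, (2) bound every dimension by the
    # span of those per-layer optima, (3) search the reduced space for
    # the complete CNN.
    opts = [min(product(*grids), key=lambda h: cost([l], h))
            for l in layers]
    bounded = [[v for v in g if min(col) <= v <= max(col)]
               for g, col in zip(grids, zip(*opts))]
    return min(product(*bounded), key=lambda h: cost(layers, h))

# Toy model: one dimension, per-layer cost is squared distance to the
# layer id, so each layer's optimum is its own id.
cost = lambda ls, h: sum((h[0] - l) ** 2 for l in ls)
best = bounded_dse([1, 2], [[0, 1, 2, 3, 4, 5]], cost)
```

In the toy run, the bounded space shrinks from six candidates to two before the whole-model search, which is the point of the indication step.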

6 EXPERIMENTS

6.1 Experimental Setup

We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. The data communication between CPU and FPGA is through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29], with 6.4 GT/s (giga-transfers per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

[Figure 5: Target architecture integrating general purpose processors and FPGA — CPU cores and their on-chip cache hierarchy on one side, the FPGA (BRAM, LUTs, on-chip interconnection) on the other, both behind a coherent memory interface to main memory and other I/O devices.]

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiment, we distribute the workload between the CPU and FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations for the input image tiles and the kernel tiles. The CPU is responsible for the other light-weight operations, including partitioning the images into tiles and combining the tiles into the final output, according to the OaA and CaP algorithms discussed in Section 3.
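The CPU-side tile partitioning and recombination follows the standard overlap-and-add scheme. The sketch below emulates it in NumPy, with a direct convolution standing in for the FPGA's FFT → Hadamard → IFFT path; the tile and kernel sizes are made up.

```python
import numpy as np

def conv2d_full(a, b):
    # Direct 2D linear ("full") convolution, used as the reference and
    # as a stand-in for the per-tile frequency-domain pipeline.
    out = np.zeros((a.shape[0] + b.shape[0] - 1,
                    a.shape[1] + b.shape[1] - 1))
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            out[i:i + a.shape[0], j:j + a.shape[1]] += b[i, j] * a
    return out

def oaa_conv(img, kern, T):
    # Overlap-and-Add: convolve each T x T tile independently, then
    # add the partial outputs back at their tile offsets.
    l, lk = img.shape[0], kern.shape[0]
    out = np.zeros((l + lk - 1, l + lk - 1))
    for r in range(0, l, T):
        for c in range(0, l, T):
            tile_out = conv2d_full(img[r:r + T, c:c + T], kern)
            out[r:r + tile_out.shape[0],
                c:c + tile_out.shape[1]] += tile_out
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
kern = rng.standard_normal((3, 3))
assert np.allclose(oaa_conv(img, kern, 4), conv2d_full(img, kern))
```

Because convolution is linear, the overlapped partial sums reconstruct the full output exactly, which is why the CPU-side recombination is cheap relative to the FPGA work.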

6.2 Low Latency or High Throughput Design

Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses a 64-point FFT for the first two layers and a 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high-throughput design chooses a 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high-throughput design requires configuration 1 only, while the low-latency design requires both configurations 1 and 2, switching between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of the MAC is the same for both configurations, so the only action needed for MAC reconfiguration is to change the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation

We compare our work with state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                    DATE'16 [21]     FPGA'16 [25]     FPL'16 [17]      FPGA'17 [28]     Our Design
                                                                                        Low Latency      High Throughput
FPGA                Virtex-7 VC707   Stratix-V GSD8   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7
Frequency (MHz)     160              –                100              200              200              200
Precision           Fixed 32-bit     Fixed 8-16 bits  Fixed 8-16 bits  Floating 32-bit  Fixed 16-bit     Fixed 16-bit
DSP Utilization     2688 (96%)       234 (91.4%)      256 (100%)       224 (87.5%)      232 (90.6%)      256 (100%)
Logic Utilization   45K (9%)         152K (65%)       121K (52%)       201K (86%)       200K (85%)       199K (85%)
On-chip RAM         543 (53%)        1744 (71%)       1552 (61%)       2000 (80%)       2143 (85%)       2130 (85%)
Latency (ms)        –                20.1             12.75            40.81            12.4             51.2
Throughput (GOPS)   147.82           72.4             114.5            83.0             117.7            163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC·N²/q_MAC   P_FFT·N²/(q_1DFFT·q_2DFFT)   P_IFFT·N²/(q_1DIFFT·q_2DIFFT)
1      24           1·64²/4          1·64²/(32·16)                1·64²/(64·16)
2      24           4·16²/1          1·16²/(8·4)                  1·16²/(16·4)

CPU computes concurrently with the FPGA for the other operations involved in the CNN; the CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, lower than all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and low-latency designs are very similar, so the same hardware can be reused by the two designs with optional runtime reconfiguration of the 2D FFT modules.

To gain a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of the FFT, MAC and IFFT modules and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low-Latency Design). Unit of the values: complex words/sec.

         T_MAC   T_IFFT   (f_out/f_in)·T_FFT   (Q/(1+Q))·B_sys   Bottleneck
CONV1    3413    80       1280                 61                B_sys
CONV2    107     80       107                  42                B_sys
CONV3    40      80       60                   32                B_sys
CONV4    27      80       40                   19                B_sys
CONV5    27      80       27                   18                B_sys
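Reading Table 5 programmatically: each layer's achievable rate is the minimum of the four bounds in Equation 9, and for every layer that minimum is the bandwidth term. The values below are copied from the table as reported.

```python
# Columns per row: T_MAC, T_IFFT, (f_out/f_in)*T_FFT, Q/(1+Q)*B_sys
table5 = {
    "CONV1": (3413, 80, 1280, 61),
    "CONV2": (107, 80, 107, 42),
    "CONV3": (40, 80, 60, 32),
    "CONV4": (27, 80, 40, 19),
    "CONV5": (27, 80, 27, 18),
}
# The bandwidth bound (last column) is the smallest entry in every row,
# so B_sys is the system bottleneck for all five layers.
assert all(min(row) == row[3] for row in table5.values())
```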

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency must be at least the transfer time of the image data, kernel data and output data. This time, t_mem, can easily be calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.

7 CONCLUSION

In this paper we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations on state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithm optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implement-ing two designs of AlexNet on the Intel HARP platform Our low-latency implementation achieved latency of 124ms and our high-throughput implementation achieved throughput of 163 GOPS Ourimplementations outperform state-of-the-art implementations ofAlexNet on FPGAs signicantly

In the future we will further generalize the co-design methodol-ogy to support wider range of CNNs and FPGAs We will implementtechniques such as layer partitioning to handle the cases whereon-chip memory is not large enough for a large fin value We willalso compare our work with GPU and multi-core implementations

REFERENCES[1] 2015 Intel Inc Xeon+FPGA Platform for the Data Center (2015) httpswww

ececmueducalcmcarllibexefetchphpmedia=carl15-guptapdf[2] 2017 Overlap-add method (2017) httpsenwikipediaorgwikiOverlap-add_

method[3] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen

Craig Citro Gregory S Corrado Andy Davis Jerey Dean Matthieu Devin San-jay Ghemawat Ian J Goodfellow Andrew Harp Georey Irving Michael IsardYangqing Jia Rafal Joacutezefowicz Lukasz Kaiser Manjunath Kudlur Josh LevenbergDan Maneacute Rajat Monga Sherry Moore Derek Gordon Murray Chris Olah MikeSchuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul ATucker Vincent Vanhoucke Vijay Vasudevan Fernanda B Vieacutegas Oriol VinyalsPete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng2016 TensorFlow Large-Scale Machine Learning on Heterogeneous DistributedSystems CoRR abs160304467 (2016) httparxivorgabs160304467

[4] B Ahn 2015 Real-time video object recognition using convolutional neuralnetwork In 2015 International Joint Conference on Neural Networks (IJCNN)1ndash7 DOIhttpsdoiorg101109IJCNN20157280718

[5] Utku Aydonat Shane OrsquoConnell Davor Capalija Andrew C Ling and Gordon RChiu 2017 An OpenCLtradeDeep Learning Accelerator on Arria 10 (2017) 55ndash64DOIhttpsdoiorg10114530200783021738

[6] R Chen H Le and V K Prasanna 2013 Energy ecient parameterized FFT ar-chitecture In 2013 23rd International Conference on Field programmable Logicand Applications 1ndash7 DOIhttpsdoiorg101109FPL20136645545

[7] R Chen N Park and V K Prasanna 2013 High throughput energy ecientparallel FFT architecture on FPGAs In 2013 IEEE High Performance ExtremeComputing Conference (HPEC) 1ndash6 DOIhttpsdoiorg101109HPEC20136670343

[8] Ren Chen Sruja Siriyal and Viktor Prasanna 2015 Energy and MemoryEcient Mapping of Bitonic Sorting on FPGA In Proceedings of the 2015ACMSIGDA International Symposium on Field-Programmable Gate Arrays(FPGA rsquo15) ACM New York NY USA 240ndash249 DOIhttpsdoiorg10114526847462689068

[9] R DiCecco G Lacey J Vasiljevic P Chow G Taylor and S Areibi 2016 Caf-feinated FPGAs FPGA Framework For Convolutional Neural Networks ArXive-prints (Sept 2016) arXivcsCV160909671

[10] Cleacutement Farabet Berin Martini Polina Akselrod Selccediluk Talay Yann LeCunand Eugenio Culurciello 2010 Hardware accelerated convolutional neuralnetworks for synthetic vision systems In Proceedings of 2010 IEEE InternationalSymposium on Circuits and Systems IEEE 257ndash260

[11] Mario Garrido J Grajal M A Saacutenchez and Oscar Gustafsson 2013 PipelinedRadix-2K Feedforward FFT Architectures IEEE Trans Very Large Scale IntegrSyst 21 1 (Jan 2013) 23ndash32 DOIhttpsdoiorg101109TVLSI20112178275

[12] Tyler Highlander and Andres Rodriguez 2016 Very Ecient Training of Convo-lutional Neural Networks using Fast Fourier Transform and Overlap-and-AddCoRR abs160106815 (2016) httparxivorgabs160106815

[13] Yangqing Jia Evan Shelhamer Je Donahue Sergey Karayev Jonathan LongRoss Girshick Sergio Guadarrama and Trevor Darrell 2014 Cae ConvolutionalArchitecture for Fast Feature Embedding In Proceedings of the 22Nd ACMInternational Conference on Multimedia (MM rsquo14) ACM New York NY USA675ndash678 DOIhttpsdoiorg10114526478682654889

[14] Alex Krizhevsky Ilya Sutskever and Georey E Hinton 2012 Imagenet clas-sication with deep convolutional neural networks In Advances in NeuralInformation Processing Systems 1097ndash1105

[15] Andrew Lavin 2015 Fast Algorithms for Convolutional Neural Networks CoRRabs150909308 (2015) httparxivorgabs150909308

[16] Yufei Ma Yu Cao Sarma Vrudhula and Jae-sun Seo 2017 Optimizing LoopOperation and Dataow in FPGA Acceleration of Deep Convolutional NeuralNetworks In Proceedings of the 2017 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo17) ACM New York NY USA 45ndash54DOIhttpsdoiorg10114530200783021736

[17] Yufei Ma N Suda Yu Cao J s Seo and S Vrudhula 2016 Scalable and modu-larized RTL compilation of Convolutional Neural Networks onto FPGA In 201626th International Conference on Field Programmable Logic and Applications(FPL) 1ndash8 DOIhttpsdoiorg101109FPL20167577356

[18] Michaeumll Mathieu Mikael Hena and Yann LeCun 2013 Fast Training ofConvolutional Networks through FFTs CoRR abs13125851 (2013) httparxivorgabs13125851

[19] Kalin Ovtcharov Olatunji Ruwase Joo-Young Kim Jeremy Fow-ers Karin Strauss and Eric Chung 2015 Accelerating Deep Con-volutional Neural Networks Using Specialized Hardware (Febru-ary 2015) httpswwwmicrosoftcomen-usresearchpublicationaccelerating-deep-convolutional-neural-networks-using-specialized-hardware

[20] Jiantao Qiu Jie Wang Song Yao Kaiyuan Guo Boxun Li Erjin Zhou JinchengYu Tianqi Tang Ningyi Xu Sen Song Yu Wang and Huazhong Yang 2016Going Deeper with Embedded FPGA Platform for Convolutional Neural Net-work In Proceedings of the 2016 ACMSIGDA International Symposium on

Field-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 26ndash35 DOIhttpsdoiorg10114528472632847265

[21] A Rahman J Lee and K Choi 2016 Ecient FPGA acceleration of Con-volutional Neural Networks using logical-3D compute array In 2016 DesignAutomation Test in Europe Conference Exhibition (DATE) 1393ndash1398

[22] Dominik Scherer Hannes Schulz and Sven Behnke 2010 Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors InProceedings of the 20th International Conference on Articial Neural NetworksPart III (ICANNrsquo10) Springer-Verlag Berlin Heidelberg 82ndash91 httpdlacmorgcitationcfmid=18864361886446

[23] H Sharma J Park D Mahajan E Amaro J K Kim C Shao A Mishra and HEsmaeilzadeh 2016 From high-level deep neural models to FPGAs In 2016 49thAnnual IEEEACM International Symposium on Microarchitecture (MICRO) 1ndash12 DOIhttpsdoiorg101109MICRO20167783720

[24] Karen Simonyan and Andrew Zisserman 2014 Very Deep ConvolutionalNetworks for Large-Scale Image Recognition CoRR abs14091556 (2014)httparxivorgabs14091556

[25] Naveen Suda Vikas Chandra Ganesh Dasika Abinash Mohanty Yufei MaSarma Vrudhula Jae-sun Seo and Yu Cao 2016 Throughput-OptimizedOpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Net-works In Proceedings of the 2016 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 16ndash25DOIhttpsdoiorg10114528472632847276

[26] Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E ReedDragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabi-novich 2014 Going Deeper with Convolutions CoRR abs14094842 (2014)httparxivorgabs14094842

[27] Chen Zhang Peng Li Guangyu Sun Yijin Guan Bingjun Xiao and Jason Cong2015 Optimizing FPGA-based Accelerator Design for Deep Convolutional NeuralNetworks In Proceedings of the 2015 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo15) ACM New York NY USA 161ndash170 DOIhttpsdoiorg10114526847462689060

[28] Chi Zhang and Viktor Prasanna 2017 Frequency Domain Acceleration ofConvolutional Neural Networks on CPU-FPGA Shared Memory System (2017)35ndash44 DOIhttpsdoiorg10114530200783021727

[29] Dimitrios Ziakas Allen Baum Robert A Maddox and Robert J Safranek 2010Intelreg quickpath interconnect architectural features supporting scalable sys-tem architectures In High Performance Interconnects (HOTI) 2010 IEEE 18thAnnual Symposium on IEEE 1ndash6

  • Abstract
  • 1 Introduction
  • 2 Background and Related Work
    • 21 Convolution in Frequency Domain
    • 22 Overlap and Add (OaA)
    • 23 CNN Models
    • 24 CNN Accelerators
      • 3 Algorithm Level Optimization
        • 31 Optimization for Images of Different Scales
        • 32 Optimization for Exploiting Limited Number of FFT Sizes
        • 33 Discussion on the Various Optimization Techniques
          • 4 System Architecture
            • 41 Streaming 2D FFT Module
            • 42 Multiply-and-Accumulate Module
            • 43 Overall System Design
              • 5 Architecture Level Optimization
                • 51 Data Flow
                • 52 Analysis of IO and Computation Cost
                • 53 Design Space Exploration
                  • 6 Experiments
                    • 61 Experimental Setup
                    • 62 Low Latency or High Throughput Design
                    • 63 Performance Evaluation
                      • 7 Conclusion
                      • References

We consider the time period in which one image buffer is consumed completely; we call this one round of computation. We denote the peak throughputs of the FFT, MAC and IFFT modules as T_FFT, T_MAC and T_IFFT (the equations for them are summarized in Table 2).

The full external bandwidth (B_sys) is shared by the FFT module (B_FFT), the kernel buffer (B_kern) and the IFFT module (B_IFFT). The amount of data transferred to the FFT module is D_FFT = M_img, and the amount of data to the kernel buffer is D_kern = f_in · f_out · N^2, based on our data reuse scheme. The IFFT module produces D_IFFT = (f_out / f_in) · D_img amount of output.

The I/O time t_round^IO is determined by the largest time needed by D_FFT, D_kern and D_IFFT. Since data is streamed from MAC and IFFT to external memory, we deal with D_IFFT later in the computation cost calculation. The bandwidths B_FFT and B_kern should be assigned based on the workloads D_FFT and D_kern, as well as the peak throughput of the subsequent FFT module. This is shown in Equation 7:

B_FFT = min{ (D_FFT / (D_FFT + D_kern)) · (B_sys − B_IFFT), T_FFT }
B_kern = (B_sys − B_IFFT) − B_FFT    (7)

The I/O time is then calculated as t_round^IO = max{ D_FFT / B_FFT, D_kern / B_kern }.
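As a sanity check on Equation 7, the bandwidth split can be sketched in a few lines of Python (a minimal model of the allocation rule; the function name and the toy numbers are ours, not part of the design):

```python
def split_bandwidth(B_sys, B_IFFT, D_FFT, D_kern, T_FFT):
    """Equation 7: after reserving B_IFFT, divide the remaining external
    bandwidth between the FFT stream and the kernel buffer in proportion
    to their workloads, capping the FFT share at the module's peak rate."""
    proportional = D_FFT / (D_FFT + D_kern) * (B_sys - B_IFFT)
    B_FFT = min(proportional, T_FFT)
    B_kern = (B_sys - B_IFFT) - B_FFT
    return B_FFT, B_kern

# toy numbers (complex words/sec): the FFT side carries 3x the kernel traffic
B_FFT, B_kern = split_bandwidth(B_sys=100, B_IFFT=20, D_FFT=60, D_kern=20, T_FFT=1000)
```

When the FFT module's own peak rate T_FFT is the limit, the leftover share falls to the kernel buffer, which is exactly the second line of Equation 7.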

The computation time t_round^comp is defined as the time period for consuming one image buffer by the MAC and IFFT modules. It is bounded by the peak throughput of the two modules as well as the time for loading the image buffer. The throughput, or output bandwidth, of the MAC and IFFT modules is shown by Equation 8:

B_IFFT = min{ D_IFFT / t_round^IO, T_MAC, T_IFFT }    (8)

By using Equation 7, as well as the relation among t_round^IO, D_FFT and D_IFFT, we express Equation 8 in the following way:

B_IFFT = min{ T_MAC, T_IFFT, (f_out / f_in) · T_FFT, (Q / (1 + Q)) · B_sys }    (9)

where Q = (f_out / f_in) · M_img / (M_img + f_in · f_out · N^2).

Equation 9 shows how the system throughput is determined by the various hardware resources and the CNN specification. T_MAC, T_FFT, T_IFFT, B_sys and M_img capture the hardware constraints of logic resources, external bandwidth and on-chip memory; f_in, f_out and N reveal the impact of the CNN specification on the hardware performance. The problem is that f_in, f_out and N differ across convolution layers. We can use the CaP-OaA technique discussed in Section 3 to reduce or even remove the variance of N. We deal with the variance of f_in and f_out in Section 5.3.
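Equation 9 is straightforward to evaluate numerically. The sketch below (our own helper, with made-up inputs) returns both the achievable B_IFFT and which of the four terms is the bottleneck:

```python
def system_throughput(T_MAC, T_IFFT, T_FFT, B_sys, f_in, f_out, M_img, N):
    """Equation 9: B_IFFT = min{T_MAC, T_IFFT, (f_out/f_in)*T_FFT,
    Q/(1+Q)*B_sys}, with Q = (f_out/f_in)*M_img/(M_img + f_in*f_out*N^2)."""
    Q = (f_out / f_in) * M_img / (M_img + f_in * f_out * N ** 2)
    terms = {
        "T_MAC": T_MAC,
        "T_IFFT": T_IFFT,
        "T_FFT": (f_out / f_in) * T_FFT,
        "B_sys": Q / (1 + Q) * B_sys,
    }
    bottleneck = min(terms, key=terms.get)
    return terms[bottleneck], bottleneck
```

Reporting the arg-min alongside the value mirrors the per-layer bottleneck analysis in Table 5.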

The computation time for one round is given by t_round^comp = D_IFFT / B_IFFT. In the above design, the image buffer consuming rate never exceeds the producing rate; thus the total time of a round is t_round = t_round^comp. For the double-buffering to completely hide the memory latency, min{ T_MAC, T_IFFT } ≤ min{ (f_out / f_in) · T_FFT, (Q / (1 + Q)) · B_sys } has to be satisfied.

For the other kind of design, which contains the kernel FFT module, corresponding changes need to be made to the above equations. For example, D_kern changes to f_in · f_out · l_img^2. We can derive its performance model using the same idea as in the above discussion.

5.3 Design Space Exploration
The processing of a convolution layer involves several rounds. The total time for processing a convolution layer is given by

t_layer = (f_in · (l*_img)^2 / M_img) · t_round = f_out · (l*_img)^2 / B_IFFT    (10)

where l*_img is the image size after the padding or batch folding operation by the hybrid algorithm discussed in Section 3.
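The two forms of Equation 10 agree by construction (the number of rounds times the per-round time collapses to f_out · (l*)^2 / B_IFFT), which a quick numerical check confirms; the values below are illustrative only:

```python
def t_layer(f_in, f_out, l_star, M_img, B_IFFT):
    """Equation 10: the layer needs f_in * l*^2 / M_img image-buffer
    refills (rounds), each taking t_round = D_IFFT / B_IFFT
    = (f_out / f_in) * M_img / B_IFFT."""
    rounds = f_in * l_star ** 2 / M_img
    t_round = (f_out / f_in) * M_img / B_IFFT
    return rounds * t_round

# the product collapses to f_out * l*^2 / B_IFFT
assert abs(t_layer(2, 4, 8, 16, 10) - 4 * 8 ** 2 / 10) < 1e-9
```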

The goal of our architecture optimization is to identify the configuration of the different modules so that throughput or latency for the entire CNN is optimized. We define t_CNN as the time to process a single image by all convolution layers. For batch processing in the case of CaP, t_CNN is averaged over the d^2 images in I_CaP. Thus, regardless of the metric being throughput or latency, we minimize t_CNN. We formulate the optimization problem as follows:

minimize_H   Σ_{layer i} t_layer^i
subject to   LOGIC_{FFT,MAC,IFFT} ≤ LOGIC_sys
             MEMORY_{FFT,MAC,IFFT} ≤ MEMORY_sys

where the set of architectural parameters H includes {N, q_{1D,FFT}, q_{1D,IFFT}, q_{2D,FFT}, q_{2D,IFFT}, P_FFT, P_IFFT, M_img, M_kern, P_img, P_kern, q_MAC}. The parameters are defined in Sections 4 and 5. LOGIC_{FFT,MAC,IFFT} and MEMORY_{FFT,MAC,IFFT} refer to the logic and on-chip memory consumption of the FFT, MAC and IFFT modules. The terms LOGIC_sys and MEMORY_sys refer to the available logic and on-chip memory on the FPGA board.

The design space is of very high dimension. We can shrink it by the following observations: (1) N can be chosen by our algorithm level optimization; (2) M_kern does not affect the performance because of the image oriented data reuse; (3) P_img and P_kern always appear in the form of the product P_img · P_kern. The design space now consists of 9 variables.

The main difficulty in further simplifying this optimization problem is the min operation in the B_IFFT term. This min function captures which hardware resource is the bottleneck of our design. The reason we cannot evaluate the min function is that the bottleneck can differ across convolution layers. In other words, if our target were to optimize a single layer only, the design space would be much smaller. For a single-layer optimization problem, we can prove by contradiction that there exists an optimal configuration where all the hardware resources are fully utilized, such that

B_IFFT = T_MAC = T_IFFT = (f_out / f_in) · T_FFT = (Q / (1 + Q)) · B_sys    (11)

Thus the original optimization problem now contains three more equality constraints. The number of free variables in the design space is reduced from 9 to 6. The new design space is defined by H′ = {M_FFT, P_img, q_{1D,FFT}, q_{1D,IFFT}, q_{2D,FFT}, q_{2D,IFFT}}.

For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across the convolution layers. For the first few layers, when f_in and f_out are both small and the ratio f_out / f_in is large, T_MAC, (f_out / f_in) · T_FFT and (Q / (1 + Q)) · B_sys are of higher value; in this case T_IFFT will be the bottleneck. Similarly, for deeper layers the bottleneck may shift to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optimum for each single layer. The single-layer optimum is obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2 Design Space Exploration with Indication
 1: procedure DesignSpaceExploration
 2:   ▷ Conduct design space exploration on a complete CNN
 3:   ▷ OPT stores the optimum for each single layer
 4:   ▷ OPT[i] ∈ H
 5:   OPT[·] ← NULL
 6:   for i = 0 to layer l do
 7:     perform design space exploration over H′
 8:     OPT[i] ← optimum for layer i
 9:   end for
10:   ▷ Range[d] stores the range of values for dimension d
11:   Range[·] ← NULL
12:   for d = 0 to dimension D do
13:     Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:   end for
15:   design space exploration of the entire CNN, with the design space bounded by Range[·]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of much smaller sub-problems. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
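In Python, the two-phase search of Algorithm 2 might look as follows. This is a schematic sketch: the per-layer explorer, the CNN cost model and the candidate grids are stand-ins for the real models built from Equations 9-11, not the authors' C++ implementation.

```python
import itertools

def bounded_dse(layers, explore_layer, cnn_cost, grids):
    """Algorithm 2: solve each layer alone over H', take the per-dimension
    min/max of the single-layer optima as bounds (Range[d]), then scan
    only the bounded region of H for the whole CNN."""
    opts = [explore_layer(layer) for layer in layers]            # OPT[i]
    dims = len(opts[0])
    ranges = [(min(o[d] for o in opts), max(o[d] for o in opts))
              for d in range(dims)]                              # Range[d]
    bounded = [[v for v in grids[d] if ranges[d][0] <= v <= ranges[d][1]]
               for d in range(dims)]
    # exhaustive scan of the (now much smaller) bounded design space
    return min(itertools.product(*bounded),
               key=lambda cand: cnn_cost(cand, layers))
```

The exhaustive inner scan stays cheap precisely because the per-layer optima clamp each dimension to a narrow interval.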

6 EXPERIMENTS
6.1 Experimental Setup
We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform, with integrated CPU and FPGA, is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. Data communication between the CPU and the FPGA goes through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga-transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

Figure 5: Target architecture integrating general purpose processors and FPGA. The CPU (cores with an on-chip cache hierarchy) and the FPGA (BRAM, LUTs and on-chip interconnection) each attach to the shared main memory through a coherent memory interface; other I/O devices connect via the system interface.

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiments, we distribute the workload between the CPU and the FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations for the input image tiles and the kernel tiles. The CPU is responsible for other light-weight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.
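The CPU-side tiling and recombination follow standard overlap-add. A minimal NumPy sketch of single-channel OaA (our own illustration, not the production host code) is:

```python
import numpy as np

def tile_image(img, tile):
    """Split an image into non-overlapping tile x tile blocks,
    zero-padding the right/bottom border (CPU-side partitioning)."""
    H, W = img.shape
    Hp, Wp = -(-H // tile) * tile, -(-W // tile) * tile   # round up
    padded = np.zeros((Hp, Wp), dtype=img.dtype)
    padded[:H, :W] = img
    return [padded[r:r + tile, c:c + tile]
            for r in range(0, Hp, tile) for c in range(0, Wp, tile)], Wp // tile

def overlap_add(conv_tiles, tile, n_cols, out_shape):
    """Recombine per-tile linear-convolution outputs of size N x N
    (N = tile + kernel - 1): neighbouring results overlap by
    kernel - 1 samples and are summed (OaA)."""
    out = np.zeros(out_shape)
    N = conv_tiles[0].shape[0]
    for i, t in enumerate(conv_tiles):
        r, c = (i // n_cols) * tile, (i % n_cols) * tile
        out[r:r + N, c:c + N] += t
    return out
```

By the linearity and shift-invariance of convolution, summing the per-tile results reproduces the full linear convolution of the whole image; in the real system, the per-tile transforms and Hadamard products run on the FPGA.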

6.2 Low Latency or High Throughput Design
Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses a 64-point FFT for the first two layers and a 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high-throughput design chooses a 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high-throughput design requires configuration 1 only, while the low-latency design requires both configurations 1 and 2; it switches between them by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations; thus the only change needed to reconfigure the MAC module is to update the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design. This is one benefit of using the CaP algorithm.

6.3 Performance Evaluation
We compare our work with the state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet. The host

Table 3: Comparison with Other AlexNet Implementations on FPGAs

|                   | DATE'16 [21]   | FPGA'16 [25]   | FPL'16 [17]    | FPGA'17 [28]    | Ours (Low Latency) | Ours (High Throughput) |
| FPGA              | Virtex-7 VC707 | Stratix-V GSD8 | Stratix-V GXA7 | Stratix-V GXA7  | Stratix-V GXA7     | Stratix-V GXA7         |
| Frequency (MHz)   | 160            | --             | 100            | 200             | 200                | 200                    |
| Precision         | Fixed 32-bit   | Fixed 8-16 bit | Fixed 8-16 bit | Floating 32-bit | Fixed 16-bit       | Fixed 16-bit           |
| DSP Utilization   | 2688 (96%)     | 234 (91.4%)    | 256 (100%)     | 224 (87.5%)     | 232 (90.6%)        | 256 (100%)             |
| Logic Utilization | 45K (9%)       | 152K (65%)     | 121K (52%)     | 201K (86%)      | 200K (85%)         | 199K (85%)             |
| On-chip RAM       | 543 (53%)      | 1744 (71%)     | 1552 (61%)     | 2000 (80%)      | 2143 (85%)         | 2130 (85%)             |
| Latency (ms)      | --             | 20.1           | 12.75          | 40.81           | 12.4               | 51.2                   |
| Throughput (GOPS) | 147.82         | 72.4           | 114.5          | 83.0            | 117.7              | 163.4                  |

Table 4: Hardware Configuration of the Two Designs

| Conf | M_img (MB) | P_MAC · N^2 / q_MAC | P_FFT · N^2 / (q_{1D,FFT} · q_{2D,FFT}) | P_IFFT · N^2 / (q_{1D,IFFT} · q_{2D,IFFT}) |
| 1    | 2.4        | 1 · 64^2 / 4        | 1 · 64^2 / (32 · 16)                    | 1 · 64^2 / (64 · 16)                       |
| 2    | 2.4        | 4 · 16^2 / 1        | 1 · 16^2 / (8 · 4)                      | 1 · 16^2 / (16 · 4)                        |

CPU computes concurrently with the FPGA for the other operations involved in the CNN; the CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, which is lower than that of all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and the low-latency designs are very similar; the same hardware can be reused by both, with optional runtime reconfiguration of the 2D FFT modules.

To get a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of FFT, MAC and IFFT, and the peak bandwidth to external memory.

Table 5: Analysis of the Bottleneck of Our System (Low-Latency Design). Unit of the values: complex words/sec.

|       | T_MAC | T_IFFT | (f_out/f_in) · T_FFT | (Q/(1+Q)) · B_sys | Bottleneck |
| CONV1 | 3413  | 80     | 1280                 | 61                | B_sys      |
| CONV2 | 107   | 80     | 107                  | 42                | B_sys      |
| CONV3 | 40    | 80     | 60                   | 32                | B_sys      |
| CONV4 | 27    | 80     | 40                   | 19                | B_sys      |
| CONV5 | 27    | 80     | 27                   | 18                | B_sys      |

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency must be at least the transfer time of the image data, kernel data and output data. This time, t_mem, is easily calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.
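As a concrete illustration, t_mem (the latency lower bound used above) is just total traffic divided by peak bandwidth. The byte counts below are hypothetical, chosen only to show the arithmetic at HARP's measured 5 GB/s:

```python
def t_mem_ms(image_bytes, kernel_bytes, output_bytes, peak_bw_bytes_per_s):
    """Lower bound on single-image latency: total data moved over the
    shared-memory link divided by its measured peak bandwidth."""
    return (image_bytes + kernel_bytes + output_bytes) / peak_bw_bytes_per_s * 1e3

# illustrative (not AlexNet's actual) byte counts, at 5 GB/s
print(t_mem_ms(48e6, 8e6, 6e6, 5e9))   # milliseconds
```

If the measured latency sits close to this bound, the design is memory-bandwidth limited, which is exactly the conclusion drawn from Table 5.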

7 CONCLUSION
In this paper we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large scale CNNs on FPGAs. Our design was based on two observations on state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm by reducing the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 12.4 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] 2015. Intel Inc. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. (2017). https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN). 1-7. DOI: https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. (2017), 55-64. DOI: https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1-7. DOI: https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). 1-6. DOI: https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240-249. DOI: https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257-260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23-32. DOI: https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815 (2016). http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675-678. DOI: https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308 (2015). http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45-54. DOI: https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1-8. DOI: https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851 (2013). http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26-35. DOI: https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1393-1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks, Part III (ICANN'10). Springer-Verlag, Berlin, Heidelberg, 82-91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1-12. DOI: https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16-25. DOI: https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161-170. DOI: https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (2017), 35-44. DOI: https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect: architectural features supporting scalable system architectures. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI). IEEE, 1-6.


For the optimization problem on a complete CNN, we can use the optimum of a single convolution layer as an indication. Based on Equation 9 and Table 2, we observe how different hardware modules or resources become the performance bottleneck as f_in and f_out vary across the convolution layers. For the first few layers, when f_in and f_out are both small and the ratio f_out/f_in is large, T_MAC, (f_out/f_in)·T_FFT and (Q/(1+Q))·B_sys are of higher value; in this case T_IFFT will be the bottleneck. Similarly, for deeper layers the bottleneck may change to other hardware modules.

We propose an efficient design space exploration algorithm for a complete CNN by bounding the design space defined by H with the optimum for each single layer. The single-layer optimum is obtained by exploring the design space defined by H′. The procedure is shown in Algorithm 2.

Algorithm 2 Design Space Exploration with Indication
 1: procedure DesignSpaceExploration
 2:     Conduct design space exploration on a complete CNN.
 3:     OPT stores the optimum for each single layer,
 4:     OPT[i] ∈ H.
 5:     OPT[] ← NULL
 6:     for i = 0 to Layer l do
 7:         Perform design space exploration over H′
 8:         OPT[i] ← optimum for layer i
 9:     end for
10:     Range[d] stores the range of values for dimension d.
11:     Range[] ← NULL
12:     for d = 0 to Dimension D do
13:         Range[d] ← [min_i OPT[i][d], max_i OPT[i][d]]
14:     end for
15:     Design space exploration of the entire CNN with the design space bounded by Range[]
16: end procedure

The key to our algorithm is that it transforms a large design space exploration problem into a limited number of sub-problems of much smaller size. The design space exploration algorithm, implemented in C++, executes in several minutes on a desktop platform.
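The bounding procedure can be sketched in a few lines of Python. This is a minimal sketch only: the design-space dimensions, the layer descriptions and both cost functions below are illustrative stand-ins, not the performance model of Equation 9.

```python
import itertools

def explore(space, cost):
    """Exhaustively search a small design (sub-)space for the cheapest point."""
    return min(itertools.product(*space), key=cost)

def bounded_dse(layers, space, layer_cost, cnn_cost):
    """Sketch of Algorithm 2: the single-layer optima bound each
    dimension's range for the whole-CNN exploration, shrinking the
    final search space."""
    # Step 1: optimum of each single layer over the per-layer space.
    opts = [explore(space, lambda p, l=l: layer_cost(l, p)) for l in layers]
    # Step 2: per-dimension ranges spanned by the single-layer optima.
    bounded = [
        [v for v in dim_vals
         if min(o[d] for o in opts) <= v <= max(o[d] for o in opts)]
        for d, dim_vals in enumerate(space)
    ]
    # Step 3: explore the complete CNN only within the bounded space.
    return explore(bounded, cnn_cost)

# Toy usage: layers given as (f_in, f_out); the two dimensions are
# hypothetical candidate parallelism degrees for two hardware modules.
layers = [(3, 96), (96, 256), (256, 384), (384, 384), (384, 256)]
space = [[1, 2, 4, 8], [4, 8, 16, 32]]
layer_cost = lambda l, p: abs(l[1] / l[0] - p[0] * p[1])
cnn_cost = lambda p: sum(layer_cost(l, p) for l in layers)
best = bounded_dse(layers, space, layer_cost, cnn_cost)
print(best)
```

Here the first dimension collapses to a single candidate after bounding, so the final whole-CNN search visits only 4 points instead of 16.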

6 EXPERIMENTS

6.1 Experimental Setup
We choose the Intel Heterogeneous Research Platform (HARP) [1] as our primary experimental platform. The target platform with integrated CPU and FPGA is shown in Figure 5. Intel HARP is a pre-production version of the Intel QuickAssist QPI FPGA Platform, in which an Intel Xeon E5-2600 v2 processor and an Altera Stratix V FPGA are integrated. The data communication between CPU and FPGA is through a coherent shared memory workspace. The DRAM access granularity for the FPGA is one cache line width, which is 64 bytes. The CPU and FPGA communicate through the Intel QuickPath Interconnect (QPI) [29] with 6.4 GT/s (giga transactions per second) theoretical bandwidth. The measured peak bandwidth from the shared memory to the FPGA is 5 GB/s.

Figure 5: Target architecture integrating general purpose processors and FPGA. (The CPU cores with their on-chip cache hierarchy and the FPGA fabric of BRAM and LUTs are connected to shared main memory through coherent memory interfaces.)

The Altera Stratix V FPGA in HARP has 234,720 ALMs and 256 DSP blocks. The FPGA on-chip memory is of size 6.25 MB.

In our experiments, we distribute the workload between CPU and FPGA as follows. The FPGA performs the FFT, Hadamard product and IFFT operations for the input image tiles and the kernel tiles. The CPU is responsible for other light-weight operations, including partitioning the images into tiles and combining the tiles into the final output according to the OaA and CaP algorithms discussed in Section 3.
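This division of labor follows the OaA decomposition: transform a tile and the kernel, multiply them point-wise, inverse-transform, and sum the overlapping tile borders into the final output. The NumPy sketch below is a software stand-in for that FFT/Hadamard/IFFT flow; the image, kernel and FFT sizes are illustrative, and square shapes are assumed for brevity.

```python
import numpy as np

def direct_conv2d_full(img, ker):
    """Reference: direct 'full' 2D convolution (square inputs)."""
    L, k = img.shape[0], ker.shape[0]
    out = np.zeros((L + k - 1, L + k - 1))
    for i in range(L):
        for j in range(L):
            out[i:i+k, j:j+k] += img[i, j] * ker
    return out

def oaa_fft_conv2d(img, ker, N):
    """Overlap-and-Add convolution with an N-point 2D FFT.
    Tiles of size T = N - k + 1 are transformed, multiplied with the
    kernel spectrum (Hadamard product), inverse-transformed, and the
    overlapping tile borders are summed (the CPU-side recombination)."""
    L, k = img.shape[0], ker.shape[0]
    T = N - k + 1                       # T + k - 1 = N, so no wraparound
    K = np.fft.fft2(ker, s=(N, N))      # kernel spectrum, computed once
    out = np.zeros((L + k - 1, L + k - 1))
    for r in range(0, L, T):
        for c in range(0, L, T):
            tile = img[r:r+T, c:c+T]    # edge tiles are zero-padded by fft2
            Y = np.real(np.fft.ifft2(np.fft.fft2(tile, s=(N, N)) * K))
            out[r:r+N, c:c+N] += Y[:out.shape[0]-r, :out.shape[1]-c]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((12, 12))
ker = rng.standard_normal((3, 3))
ref = direct_conv2d_full(img, ker)
oaa = oaa_fft_conv2d(img, ker, N=8)
print(np.allclose(ref, oaa))  # True
```

Any FFT size N ≥ k produces the same result, which is what lets the designs in Section 6.2 choose 16-point or 64-point transforms per layer.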

6.2 Low Latency or High Throughput Design
Based on our algorithm-architecture co-design, we propose two implementations of AlexNet on the HARP platform. The low-latency design chooses the 64-point FFT for the first two layers and the 16-point FFT for the last three layers; it uses the hybrid algorithm described in Section 3.1. The high-throughput design chooses the 64-point FFT for all five layers; it uses the hybrid algorithm described in Section 3.2. The data precision is 16-bit fixed point. To save external memory bandwidth, we use the design containing kernel 2D FFT modules. Two sets of hardware configurations H are generated by our architectural design space exploration algorithm, as shown in Table 4. The data parallelism of the kernel 2D FFT module is 4. Of the two configurations, the high-throughput design requires configuration 1, while the low-latency design requires both configurations 1 and 2; the low-latency design switches between the two configurations by runtime reconfiguration. The radix-4 FFT design explained in Section 4.1 can change between 16-point and 64-point without much overhead in time or hardware resources; in our design, this is realized by bypassing the last stage of the 64-point 1D FFT pipeline. For the MAC module, note that in Table 4 the data parallelism of MAC is the same for the two configurations. Thus, the only change needed to reconfigure the MAC module is to update the memory read locations into the image and kernel buffers. Note that no runtime reconfiguration is involved in the high-throughput design; this is one benefit of using the CaP algorithm.

6.3 Performance Evaluation
We compare our work with the state-of-the-art AlexNet implementations on FPGAs, as shown in Table 3. We report the throughput and latency for all five convolution layers of AlexNet.

Table 3: Comparison with Other AlexNet Implementations on FPGAs

                     DATE'16 [21]    FPGA'16 [25]     FPL'16 [17]      FPGA'17 [28]     Ours (Low Latency)  Ours (High Throughput)
FPGA                 Virtex-7 VC707  Stratix-V GSD8   Stratix-V GXA7   Stratix-V GXA7   Stratix-V GXA7      Stratix-V GXA7
Frequency (MHz)      160             --               100              200              200                 200
Precision            Fixed 32-bit    Fixed 8-16 bits  Fixed 8-16 bits  Floating 32-bit  Fixed 16-bit        Fixed 16-bit
DSP Utilization      2688 (96%)      234 (91.4%)      256 (100%)       224 (87.5%)      232 (90.6%)         256 (100%)
Logic Utilization    45K (9%)        152K (65%)       121K (52%)       201K (86%)       200K (85%)          199K (85%)
On-chip RAM          543 (53%)       1744 (71%)       1552 (61%)       2000 (80%)       2143 (85%)          2130 (85%)
Latency (ms)         --              20.1             12.75            40.81            12.4                51.2
Throughput (GOPS)    147.82          72.4             114.5            83.0             117.7               163.4

Table 4: Hardware Configuration of the Two Designs

Conf   M_img (MB)   P_MAC·N²/q_MAC   P_FFT·N²/(q_1DFFT·q_2DFFT)   P_IFFT·N²/(q_1DIFFT·q_2DIFFT)
1      2.4          1·64²/4          1·64²/(32·16)                1·64²/(64·16)
2      2.4          4·16²/1          1·16²/(8·4)                  1·16²/(16·4)

The host CPU computes concurrently with the FPGA for the other operations involved in the CNN. The CPU processing time completely overlaps with the processing time of the FPGA.

The performance of our two designs is as expected. The low-latency design achieves a latency of 12.4 ms, which is lower than all the other implementations listed in Table 3. The high-throughput design achieves a throughput of 163.4 GOPS, which is the best among all the listed implementations. The performance improvement comes from the two-level optimization discussed in the previous sections: the algorithmic optimization reduces the required number of operations, and the architectural optimization identifies the most suitable configuration of the hardware modules. The architectures of the high-throughput and low-latency designs are very similar, so the same hardware can be reused by the two designs, with optional runtime reconfiguration of the 2D FFT modules.

To get a deeper understanding of our implementation, we analyze the experimental data based on Equation 9. Table 5 shows the FPGA throughput for each convolution layer in the case of the low-latency design. The middle columns show the peak throughput achievable given the hardware configuration of FFT, MAC and IFFT, and the peak bandwidth to external memory.

Table 5: Analysis on the Bottleneck of Our System (Low Latency Design). Unit of the values: complex words/sec

        T_MAC   T_IFFT   (f_out/f_in)·T_FFT   (Q/(1+Q))·B_sys   Bottleneck
CONV1   3413    80       1280                 61                B_sys
CONV2   107     80       107                  42                B_sys
CONV3   40      80       60                   32                B_sys
CONV4   27      80       40                   19                B_sys
CONV5   27      80       27                   18                B_sys

From Table 5 we conclude that the external bandwidth is the bottleneck of the system. This observation is verified by the measured latency: the single-image classification latency should be at least the transfer time of the image data, kernel data and output data. This time, t_mem, can be calculated by dividing the total amount of data by the peak memory bandwidth. The measured latency of 12.4 ms is very close to t_mem, meaning that our design saturates the memory bandwidth. Given a system with higher external memory bandwidth, the performance of our design can be further improved.
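The bottleneck identification in Table 5 is simply a per-layer minimum over the peak rates. A short script over the table's values (our reading of the column grouping is assumed) reproduces the last column:

```python
# Per-layer peak rates from Table 5, in complex words/sec:
# MAC module, IFFT module, FFT module scaled by f_out/f_in,
# and the external-bandwidth term (Q/(1+Q))*B_sys.
table5 = {
    "CONV1": {"T_MAC": 3413, "T_IFFT": 80, "ratio_T_FFT": 1280, "B_sys_term": 61},
    "CONV2": {"T_MAC": 107,  "T_IFFT": 80, "ratio_T_FFT": 107,  "B_sys_term": 42},
    "CONV3": {"T_MAC": 40,   "T_IFFT": 80, "ratio_T_FFT": 60,   "B_sys_term": 32},
    "CONV4": {"T_MAC": 27,   "T_IFFT": 80, "ratio_T_FFT": 40,   "B_sys_term": 19},
    "CONV5": {"T_MAC": 27,   "T_IFFT": 80, "ratio_T_FFT": 27,   "B_sys_term": 18},
}

for layer, rates in table5.items():
    # The smallest peak rate limits the achievable layer throughput.
    bottleneck = min(rates, key=rates.get)
    print(layer, bottleneck)  # B_sys_term for every layer
```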

7 CONCLUSION
In this paper, we presented a novel algorithm-architecture co-design methodology to improve the latency or throughput of large-scale CNNs on FPGAs. Our design was based on two observations on state-of-the-art CNNs: large variance in image sizes l_img, and large variance in the number of feature maps f_in and f_out. Our algorithmic optimization was based on the OaA technique applied to frequency domain convolution. To address the variance of l_img, we proposed a dual operation of OaA called Concatenate and Pad (CaP). This technique reduces the computation complexity of convolution and eliminates the hardware overhead of runtime reconfiguration. To address the variance of f_in and f_out, we proposed a system design to support our algorithmic optimization. We further formulated an analytical performance model for our system, and derived an efficient design space exploration algorithm that reduces the number of variables using the optimum associated with a single convolution layer.

We demonstrated our co-design methodology by implementing two designs of AlexNet on the Intel HARP platform. Our low-latency implementation achieved a latency of 12.4 ms, and our high-throughput implementation achieved a throughput of 163.4 GOPS. Our implementations significantly outperform state-of-the-art implementations of AlexNet on FPGAs.

In the future, we will further generalize the co-design methodology to support a wider range of CNNs and FPGAs. We will implement techniques such as layer partitioning to handle the cases where the on-chip memory is not large enough for a large f_in value. We will also compare our work with GPU and multi-core implementations.

REFERENCES
[1] Intel Inc. 2015. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] 2017. Overlap-add method. (2017). https://en.wikipedia.org/wiki/Overlap-add_method
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467
[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN). 1–7. DOI: https://doi.org/10.1109/IJCNN.2015.7280718
[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. (2017), 55–64. DOI: https://doi.org/10.1145/3020078.3021738
[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1–7. DOI: https://doi.org/10.1109/FPL.2013.6645545
[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. DOI: https://doi.org/10.1109/HPEC.2013.6670343
[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240–249. DOI: https://doi.org/10.1145/2684746.2689068
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671
[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257–260.
[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23–32. DOI: https://doi.org/10.1109/TVLSI.2011.2178275
[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815 (2016). http://arxiv.org/abs/1601.06815
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675–678. DOI: https://doi.org/10.1145/2647868.2654889
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308 (2015). http://arxiv.org/abs/1509.09308
[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45–54. DOI: https://doi.org/10.1145/3020078.3021736
[17] Yufei Ma, N. Suda, Yu Cao, J.-s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1–8. DOI: https://doi.org/10.1109/FPL.2016.7577356
[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851 (2013). http://arxiv.org/abs/1312.5851
[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware
[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26–35. DOI: https://doi.org/10.1145/2847263.2847265
[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1393–1398.
[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks: Part III (ICANN'10). Springer-Verlag, Berlin, Heidelberg, 82–91. http://dl.acm.org/citation.cfm?id=1886436.1886446
[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. DOI: https://doi.org/10.1109/MICRO.2016.7783720
[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/2847263.2847276
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842
[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161–170. DOI: https://doi.org/10.1145/2684746.2689060
[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (2017), 35–44. DOI: https://doi.org/10.1145/3020078.3021727
[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on. IEEE, 1–6.

  • Abstract
  • 1 Introduction
  • 2 Background and Related Work
    • 21 Convolution in Frequency Domain
    • 22 Overlap and Add (OaA)
    • 23 CNN Models
    • 24 CNN Accelerators
      • 3 Algorithm Level Optimization
        • 31 Optimization for Images of Different Scales
        • 32 Optimization for Exploiting Limited Number of FFT Sizes
        • 33 Discussion on the Various Optimization Techniques
          • 4 System Architecture
            • 41 Streaming 2D FFT Module
            • 42 Multiply-and-Accumulate Module
            • 43 Overall System Design
              • 5 Architecture Level Optimization
                • 51 Data Flow
                • 52 Analysis of IO and Computation Cost
                • 53 Design Space Exploration
                  • 6 Experiments
                    • 61 Experimental Setup
                    • 62 Low Latency or High Throughput Design
                    • 63 Performance Evaluation
                      • 7 Conclusion
                      • References

Table 3 Comparison with Other AlexNet Implementations on FPGAs

DATErsquo16 [21] FPGArsquo16 [25] FPLrsquo16 [17] FPGArsquo17 [28] Our DesignLow Latency High Throughput

FPGA Virtex-7 VC707 Stratix-V GSD8 Stratix-V GXA7 Stratix-V GXA7 Stratix-V GXA7 Stratix-V GXA7Frequency (MHz) 160 ndash 100 200 200 200

Precision Fixed 32-bit Fixed 8-16 bits Fixed 8-16 bits Floating 32-bit Fixed 16-bit Fixed 16-bitDSP Utilization 2688 (96) 234 (914) 256 (100) 224(875) 232(906) 256 (100)

Logic Utilization 45K (9) 152K (65) 121K (52) 201K (86) 200K (85) 199K (85)On-chip RAM 543 (53) 1744 (71) 1552 (61) 2000 (80) 2143 (85) 2130 (85)Latency (ms) ndash 201 1275 4081 124 512

Throughput (GOPS) 14782 724 1145 830 1177 1634

Table 4 Hardware Conguration of the Two Designs

Conf Mimд (MB) PMAC middotN 2qMAC

PF FT middotN 2q1DF FT middotq2DF FT

PI F FT middotN 2q1DI F FT middotq2DI F FT

1 24 1middot6424

1middot64232middot16

1middot64264middot16

2 24 4middot1621

1middot1628middot4

1middot16216middot4

CPU computes concurrently with FPGA for other operations in-volved in the CNN The CPU processing time completely overlapswith the processing time of the FPGA

The performance of our two designs is as expected The low la-tency design achieves latency of 124ms which is lower than all theother implementations listed in Table 3 The high throughput designachieves throughput of 1634 GOPS which is the best among all thelisted implementations The performance improvement comes fromthe two-level optimization discussed in the previous sections Thealgorithmic optimization reduces the required number of opera-tions and the architectural optimization identies the most suitableconguration of the hardware modules The architectures of thehigh throughput and the low latency designs are very similar Thesame hardware can be reused by the two designs with optionalruntime reconguration of the 2D FFT modules

To get deeper understanding of our implementation we analyzethe experiment data based on Equation 9 Table 5 shows the FPGAthroughput for each convolution layer in the case of the low latencydesign The middle columns show the peak throughput achievablegiven the hardware conguration of FFT MAC IFFT and the peakbandwidth to external memory

Table 5 Analysis on the Bottleneck of Our System (Low La-tency Design) Unit of the values ComplexWordSec

TMAC TI F FTfoutfin

TF FTQ1+Q Bsys Bottleneck

CONV1 3413 80 1280 61 BsysCONV2 107 80 107 42 BsysCONV3 40 80 60 32 BsysCONV4 27 80 40 19 BsysCONV5 27 80 27 18 Bsys

From Table 5 we conclude that the external bandwidth is thebottleneck of the system This observation is veried by the mea-sured latency The single image classication latency should be atleast the data transfer time of image data kernel data and outputdata The time tmem can be easily calculated by dividing the totalamount of data with the peak memory bandwidth The measuredlatency of 124ms is very close to tmem meaning that our designsaturates the memory bandwidth Given a system with higher ex-ternal memory bandwidth the performance of our design can befurther improved

7 CONCLUSIONIn this paper we presented a novel algorithm-architecture co-designmethodology to improve the latency or throughput of large scaleCNNs on FPGAs Our design was based on two observations onstate-of-the-art CNNs ndash large variance in image sizes limд and largevariance in the number of feature maps fin and fout Our algorithmoptimization was based on the OaA technique applied to frequencydomain convolution To address the variance of limд we proposeda dual operation of OaA called Concatenate and Pad (CaP) Thistechnique reduces the computation complexity of convolution andeliminate the hardware overhead for runtime reconguration Toaddress the variance of fin and fout we proposed a system designto support our algorithmic optimization We further formulatedthe analytical performance model for our system We derived anecient design space exploration algorithm by reducing the numberof variables using the optimum related with a single convolutionlayer

We demonstrated our co-design methodology by implement-ing two designs of AlexNet on the Intel HARP platform Our low-latency implementation achieved latency of 124ms and our high-throughput implementation achieved throughput of 163 GOPS Ourimplementations outperform state-of-the-art implementations ofAlexNet on FPGAs signicantly

In the future we will further generalize the co-design methodol-ogy to support wider range of CNNs and FPGAs We will implementtechniques such as layer partitioning to handle the cases whereon-chip memory is not large enough for a large fin value We willalso compare our work with GPU and multi-core implementations

REFERENCES[1] 2015 Intel Inc Xeon+FPGA Platform for the Data Center (2015) httpswww

ececmueducalcmcarllibexefetchphpmedia=carl15-guptapdf[2] 2017 Overlap-add method (2017) httpsenwikipediaorgwikiOverlap-add_

method[3] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen

Craig Citro Gregory S Corrado Andy Davis Jerey Dean Matthieu Devin San-jay Ghemawat Ian J Goodfellow Andrew Harp Georey Irving Michael IsardYangqing Jia Rafal Joacutezefowicz Lukasz Kaiser Manjunath Kudlur Josh LevenbergDan Maneacute Rajat Monga Sherry Moore Derek Gordon Murray Chris Olah MikeSchuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul ATucker Vincent Vanhoucke Vijay Vasudevan Fernanda B Vieacutegas Oriol VinyalsPete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng2016 TensorFlow Large-Scale Machine Learning on Heterogeneous DistributedSystems CoRR abs160304467 (2016) httparxivorgabs160304467

[4] B Ahn 2015 Real-time video object recognition using convolutional neuralnetwork In 2015 International Joint Conference on Neural Networks (IJCNN)1ndash7 DOIhttpsdoiorg101109IJCNN20157280718

[5] Utku Aydonat Shane OrsquoConnell Davor Capalija Andrew C Ling and Gordon RChiu 2017 An OpenCLtradeDeep Learning Accelerator on Arria 10 (2017) 55ndash64DOIhttpsdoiorg10114530200783021738

[6] R Chen H Le and V K Prasanna 2013 Energy ecient parameterized FFT ar-chitecture In 2013 23rd International Conference on Field programmable Logicand Applications 1ndash7 DOIhttpsdoiorg101109FPL20136645545

[7] R Chen N Park and V K Prasanna 2013 High throughput energy ecientparallel FFT architecture on FPGAs In 2013 IEEE High Performance ExtremeComputing Conference (HPEC) 1ndash6 DOIhttpsdoiorg101109HPEC20136670343

[8] Ren Chen Sruja Siriyal and Viktor Prasanna 2015 Energy and MemoryEcient Mapping of Bitonic Sorting on FPGA In Proceedings of the 2015ACMSIGDA International Symposium on Field-Programmable Gate Arrays(FPGA rsquo15) ACM New York NY USA 240ndash249 DOIhttpsdoiorg10114526847462689068

[9] R DiCecco G Lacey J Vasiljevic P Chow G Taylor and S Areibi 2016 Caf-feinated FPGAs FPGA Framework For Convolutional Neural Networks ArXive-prints (Sept 2016) arXivcsCV160909671

[10] Cleacutement Farabet Berin Martini Polina Akselrod Selccediluk Talay Yann LeCunand Eugenio Culurciello 2010 Hardware accelerated convolutional neuralnetworks for synthetic vision systems In Proceedings of 2010 IEEE InternationalSymposium on Circuits and Systems IEEE 257ndash260

[11] Mario Garrido J Grajal M A Saacutenchez and Oscar Gustafsson 2013 PipelinedRadix-2K Feedforward FFT Architectures IEEE Trans Very Large Scale IntegrSyst 21 1 (Jan 2013) 23ndash32 DOIhttpsdoiorg101109TVLSI20112178275

[12] Tyler Highlander and Andres Rodriguez 2016 Very Ecient Training of Convo-lutional Neural Networks using Fast Fourier Transform and Overlap-and-AddCoRR abs160106815 (2016) httparxivorgabs160106815

[13] Yangqing Jia Evan Shelhamer Je Donahue Sergey Karayev Jonathan LongRoss Girshick Sergio Guadarrama and Trevor Darrell 2014 Cae ConvolutionalArchitecture for Fast Feature Embedding In Proceedings of the 22Nd ACMInternational Conference on Multimedia (MM rsquo14) ACM New York NY USA675ndash678 DOIhttpsdoiorg10114526478682654889

[14] Alex Krizhevsky Ilya Sutskever and Georey E Hinton 2012 Imagenet clas-sication with deep convolutional neural networks In Advances in NeuralInformation Processing Systems 1097ndash1105

[15] Andrew Lavin 2015 Fast Algorithms for Convolutional Neural Networks CoRRabs150909308 (2015) httparxivorgabs150909308

[16] Yufei Ma Yu Cao Sarma Vrudhula and Jae-sun Seo 2017 Optimizing LoopOperation and Dataow in FPGA Acceleration of Deep Convolutional NeuralNetworks In Proceedings of the 2017 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo17) ACM New York NY USA 45ndash54DOIhttpsdoiorg10114530200783021736

[17] Yufei Ma N Suda Yu Cao J s Seo and S Vrudhula 2016 Scalable and modu-larized RTL compilation of Convolutional Neural Networks onto FPGA In 201626th International Conference on Field Programmable Logic and Applications(FPL) 1ndash8 DOIhttpsdoiorg101109FPL20167577356

[18] Michaeumll Mathieu Mikael Hena and Yann LeCun 2013 Fast Training ofConvolutional Networks through FFTs CoRR abs13125851 (2013) httparxivorgabs13125851

[19] Kalin Ovtcharov Olatunji Ruwase Joo-Young Kim Jeremy Fow-ers Karin Strauss and Eric Chung 2015 Accelerating Deep Con-volutional Neural Networks Using Specialized Hardware (Febru-ary 2015) httpswwwmicrosoftcomen-usresearchpublicationaccelerating-deep-convolutional-neural-networks-using-specialized-hardware

[20] Jiantao Qiu Jie Wang Song Yao Kaiyuan Guo Boxun Li Erjin Zhou JinchengYu Tianqi Tang Ningyi Xu Sen Song Yu Wang and Huazhong Yang 2016Going Deeper with Embedded FPGA Platform for Convolutional Neural Net-work In Proceedings of the 2016 ACMSIGDA International Symposium on

Field-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 26ndash35 DOIhttpsdoiorg10114528472632847265

[21] A Rahman J Lee and K Choi 2016 Ecient FPGA acceleration of Con-volutional Neural Networks using logical-3D compute array In 2016 DesignAutomation Test in Europe Conference Exhibition (DATE) 1393ndash1398

[22] Dominik Scherer Hannes Schulz and Sven Behnke 2010 Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors InProceedings of the 20th International Conference on Articial Neural NetworksPart III (ICANNrsquo10) Springer-Verlag Berlin Heidelberg 82ndash91 httpdlacmorgcitationcfmid=18864361886446

[23] H Sharma J Park D Mahajan E Amaro J K Kim C Shao A Mishra and HEsmaeilzadeh 2016 From high-level deep neural models to FPGAs In 2016 49thAnnual IEEEACM International Symposium on Microarchitecture (MICRO) 1ndash12 DOIhttpsdoiorg101109MICRO20167783720

[24] Karen Simonyan and Andrew Zisserman 2014 Very Deep ConvolutionalNetworks for Large-Scale Image Recognition CoRR abs14091556 (2014)httparxivorgabs14091556

[25] Naveen Suda Vikas Chandra Ganesh Dasika Abinash Mohanty Yufei MaSarma Vrudhula Jae-sun Seo and Yu Cao 2016 Throughput-OptimizedOpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Net-works In Proceedings of the 2016 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo16) ACM New York NY USA 16ndash25DOIhttpsdoiorg10114528472632847276

[26] Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E ReedDragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabi-novich 2014 Going Deeper with Convolutions CoRR abs14094842 (2014)httparxivorgabs14094842

[27] Chen Zhang Peng Li Guangyu Sun Yijin Guan Bingjun Xiao and Jason Cong2015 Optimizing FPGA-based Accelerator Design for Deep Convolutional NeuralNetworks In Proceedings of the 2015 ACMSIGDA International Symposium onField-Programmable Gate Arrays (FPGA rsquo15) ACM New York NY USA 161ndash170 DOIhttpsdoiorg10114526847462689060

[28] Chi Zhang and Viktor Prasanna 2017 Frequency Domain Acceleration ofConvolutional Neural Networks on CPU-FPGA Shared Memory System (2017)35ndash44 DOIhttpsdoiorg10114530200783021727

[29] Dimitrios Ziakas Allen Baum Robert A Maddox and Robert J Safranek 2010Intelreg quickpath interconnect architectural features supporting scalable sys-tem architectures In High Performance Interconnects (HOTI) 2010 IEEE 18thAnnual Symposium on IEEE 1ndash6

  • Abstract
  • 1 Introduction
  • 2 Background and Related Work
    • 21 Convolution in Frequency Domain
    • 22 Overlap and Add (OaA)
    • 23 CNN Models
    • 24 CNN Accelerators
      • 3 Algorithm Level Optimization
        • 31 Optimization for Images of Different Scales
        • 32 Optimization for Exploiting Limited Number of FFT Sizes
        • 33 Discussion on the Various Optimization Techniques
          • 4 System Architecture
            • 41 Streaming 2D FFT Module
            • 42 Multiply-and-Accumulate Module
            • 43 Overall System Design
              • 5 Architecture Level Optimization
                • 51 Data Flow
                • 52 Analysis of IO and Computation Cost
                • 53 Design Space Exploration
                  • 6 Experiments
                    • 61 Experimental Setup
                    • 62 Low Latency or High Throughput Design
                    • 63 Performance Evaluation
                      • 7 Conclusion
                      • References

REFERENCES

[1] 2015. Intel Inc. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf

[2] 2017. Overlap-add method. (2017). https://en.wikipedia.org/wiki/Overlap-add_method

[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467

[4] B. Ahn. 2015. Real-time video object recognition using convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN). 1–7. DOI: https://doi.org/10.1109/IJCNN.2015.7280718

[5] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL™ Deep Learning Accelerator on Arria 10. (2017), 55–64. DOI: https://doi.org/10.1145/3020078.3021738

[6] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1–7. DOI: https://doi.org/10.1109/FPL.2013.6645545

[7] R. Chen, N. Park, and V. K. Prasanna. 2013. High throughput energy efficient parallel FFT architecture on FPGAs. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. DOI: https://doi.org/10.1109/HPEC.2013.6670343

[8] Ren Chen, Sruja Siriyal, and Viktor Prasanna. 2015. Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 240–249. DOI: https://doi.org/10.1145/2684746.2689068

[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks. ArXiv e-prints (Sept. 2016). arXiv:cs.CV/1609.09671

[10] Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 257–260.

[11] Mario Garrido, J. Grajal, M. A. Sánchez, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (Jan. 2013), 23–32. DOI: https://doi.org/10.1109/TVLSI.2011.2178275

[12] Tyler Highlander and Andres Rodriguez. 2016. Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add. CoRR abs/1601.06815 (2016). http://arxiv.org/abs/1601.06815

[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 675–678. DOI: https://doi.org/10.1145/2647868.2654889

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.

[15] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR abs/1509.09308 (2015). http://arxiv.org/abs/1509.09308

[16] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 45–54. DOI: https://doi.org/10.1145/3020078.3021736

[17] Yufei Ma, N. Suda, Yu Cao, J. s. Seo, and S. Vrudhula. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1–8. DOI: https://doi.org/10.1109/FPL.2016.7577356

[18] Michaël Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851 (2013). http://arxiv.org/abs/1312.5851

[19] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric Chung. 2015. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. (February 2015). https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware

[20] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 26–35. DOI: https://doi.org/10.1145/2847263.2847265

[21] A. Rahman, J. Lee, and K. Choi. 2016. Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1393–1398.

[22] Dominik Scherer, Hannes Schulz, and Sven Behnke. 2010. Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the 20th International Conference on Artificial Neural Networks: Part III (ICANN'10). Springer-Verlag, Berlin, Heidelberg, 82–91. http://dl.acm.org/citation.cfm?id=1886436.1886446

[23] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. DOI: https://doi.org/10.1109/MICRO.2016.7783720

[24] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556

[25] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/2847263.2847276

[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014). http://arxiv.org/abs/1409.4842

[27] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM, New York, NY, USA, 161–170. DOI: https://doi.org/10.1145/2684746.2689060

[28] Chi Zhang and Viktor Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (2017), 35–44. DOI: https://doi.org/10.1145/3020078.3021727

[29] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel® QuickPath Interconnect architectural features supporting scalable system architectures. In High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on. IEEE, 1–6.
