
Accelerating Generative Neural Networks on Unmodified Deep Learning Processors - A Software Approach

Dawen Xu, Member, IEEE, Ying Wang, Member, IEEE, Kaijie Tu, Member, IEEE, Cheng Liu, Member, IEEE, Bingsheng He, Member, IEEE, Lei Zhang, Member, IEEE

Abstract—Generative neural networks are a new category of neural networks and have been widely utilized in applications such as content generation, unsupervised learning, segmentation and pose estimation. They typically involve massive compute-intensive deconvolution operations that cannot be fitted to conventional neural network processors directly. Prior works mainly investigated specialized hardware architectures that require intensive modifications to existing deep learning processors to accelerate deconvolution together with convolution. In contrast, this work proposes a novel deconvolution implementation with a software approach that enables fast and efficient deconvolution execution on legacy deep learning processors. Our method reorganizes the computation of deconvolution and allows the deep learning processors to treat it as standard convolution by splitting the original deconvolution filters into multiple small filters. Compared to prior acceleration schemes, the implemented scheme achieves a 2.41×-4.34× performance speedup and reduces the energy consumption by 27.7%-54.5% on a set of realistic benchmarks. In addition, we also applied the deconvolution computing approach to off-the-shelf commodity deep learning processors, where the deconvolution likewise exhibits significant performance speedup over prior deconvolution implementations.

Index Terms—Generative neural network, deconvolution accelerator, split deconvolution.


1 INTRODUCTION

Deep neural networks have been making continuous breakthroughs in many research territories over the years. In contrast to the conventional convolutional neural networks heavily utilized for object classification and detection, generative neural networks [1] have proved to be superior in a broad range of applications including content generation, unsupervised learning, segmentation and pose estimation. Typically, generative neural networks involve both convolutional layers and deconvolutional layers. Both layers are compute-intensive and are the performance bottleneck of generative neural networks. Therefore, it is desirable to accelerate the backbone architecture of these networks, especially on end-devices for real-time and low-power applications such as real-time deepfake [2] and style transfer [3]. For the exemplary generative neural network benchmarks described in Table 1, the deconvolution layers contribute the majority

• Dawen Xu is with the School of Electronic Science & Applied Physics, Hefei University of Technology, Anhui, China, 230009. E-mail: xdw [email protected]

• Ying Wang is with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100089. E-mail: [email protected]

• Kaijie Tu, Cheng Liu and Lei Zhang are with the Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100089. E-mail: {tukaijie, liucheng, zlei}@ict.ac.cn

• Bingsheng He is with the Department of Computer Science, School of Computing, National University of Singapore, Singapore, 119260. E-mail: [email protected]

Manuscript received December 15, 2019.

of the multiply-and-add operations in each benchmark (the total operation counts refer to the inference phase). The deconvolution operation serves as an indispensable component that restores the condensed feature maps to full size at the top of the networks, an architecture common to generative networks and other popular models used for semantic segmentation and instance detection [4].

Hardware specialization is a popular approach to accelerate the computation of neural-network-based applications. To accelerate generative neural networks with customized hardware rather than general-purpose compute units, researchers have tried a number of approaches from distinct angles. For a more efficient design, an intuitive solution is to reuse the convolution processor and build a unified fully convolutional processor for both convolution and deconvolution operations. In such architectures, the input data of deconvolution can be reorganized by dynamically padding zero activations into the original feature maps, after which the deconvolution is treated as a conventional convolution layer, as presented in Figure 1. Figure 1(a) is an example of the classic deconvolution operation with a stride of 2, while Figure 1(b) is the converted equivalent convolution operation with the stride set to 1. Eventually, the deconvolution can be mapped to the convolution processor without any hardware modification. However, the zero activations induce considerable redundant computing and degrade the performance, as illustrated in [5]. Although many CNN accelerators [5], [6], [7], [8], [9], [10] are able to skip zero activations during computing through additional zero-detection logic, they typically can only skip a portion of the zero activations, especially the ones located on the boundary of the feature maps.


TABLE 1
The number of multiply-add operations in the inference phase.

However, the zero-padding deconvolution approach shown in Figure 1(b) has many zero activations inserted between the non-zero activations, and these are usually difficult to remove due to the aligned computing data flow on the parallel computing units in DNN accelerators.
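To make the zero-insertion conversion of Figure 1 concrete, the sketch below expresses a single-channel transposed convolution both directly (scatter-add, Figure 1(a)) and as a stride-1 convolution over a zero-expanded input (Figure 1(b)). It is a minimal NumPy illustration assuming a square kernel, one channel, and no output cropping; real layers add batch and channel dimensions and framework-specific padding conventions.

```python
import numpy as np

def deconv2d_ref(x, k, s):
    """Reference transposed convolution (stride s, no cropping): each input
    activation scatters a copy of the kernel scaled by its value, as in
    Figure 1(a)."""
    H, W = x.shape
    K = k.shape[0]
    out = np.zeros(((H - 1) * s + K, (W - 1) * s + K))
    for i in range(H):
        for j in range(W):
            out[i * s:i * s + K, j * s:j * s + K] += x[i, j] * k
    return out

def deconv2d_nzp(x, k, s):
    """NZP conversion sketched in Figure 1(b): insert s-1 zeros between input
    activations, pad K-1 zeros on each border, then run a stride-1 convolution
    with the 180-degree rotated filter."""
    H, W = x.shape
    K = k.shape[0]
    expanded = np.zeros(((H - 1) * s + 1 + 2 * (K - 1),
                         (W - 1) * s + 1 + 2 * (K - 1)))
    expanded[K - 1:K - 1 + (H - 1) * s + 1:s,
             K - 1:K - 1 + (W - 1) * s + 1:s] = x
    k_rot = k[::-1, ::-1]                        # 180-degree rotation
    OH = expanded.shape[0] - K + 1
    OW = expanded.shape[1] - K + 1
    out = np.zeros((OH, OW))
    for i in range(OH):                          # stride-1 "valid" convolution
        for j in range(OW):
            out[i, j] = np.sum(expanded[i:i + K, j:j + K] * k_rot)
    return out

x = np.random.randn(4, 4)
k = np.random.randn(3, 3)
assert np.allclose(deconv2d_ref(x, k, 2), deconv2d_nzp(x, k, 2))
```

The density of the expanded input drops to roughly 1/s² for a stride of s, which is the redundancy that zero-skipping hardware struggles to recover on an aligned parallel data flow.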

To improve the computing efficiency of deconvolution, the authors in [11] opted to build independent processing engines for convolution and deconvolution respectively. This approach substantially increases the hardware resources and chip area. Different from the above two approaches, the authors in [5] and [12] proposed to revisit the convolutional processor and change its micro-architecture to support both convolution and deconvolution efficiently in a unified processor. In addition, these methods also need a dedicated data flow scheduler to make good use of the computing engine. For unified architectures, the advantage is better performance and hardware utilization, while the disadvantage is the additional redesign and engineering cost. However, for off-the-shelf CNN processors without specialized deconvolution support such as Diannao [7] and TPU [13], the inefficiency and resource under-utilization induced by the zero-padding approach is an inevitable cost of implementing the deconvolutional layers.

Inspired by the prior work, we seek to support fast and efficient deconvolution layer implementation on general CNN processors like Eyeriss [6], Diannao [7] and TPU [13], some of which are already commercialized and widely used in different areas. For these classic CNN processors, many zero-value activations must be padded into the feature map in order to map the deconvolution layers onto them. Instead of zero-padding, which induces numerous redundant computing operations, we tailor a novel implementation of the deconvolution layer from the software angle and pre-partition the deconvolutional filters into multiple small convolutional filters, so that the deconvolution operations are converted and can be efficiently implemented on any CNN processor without redesigning or replacing it. In our evaluation on classic CNN processors, the performance as well as the energy efficiency of our deconvolution implementation remains competitive with prior work on specialized GAN processors.

Our contributions can be summarized as follows:

• We propose a novel filter partitioning and reorganization approach that converts a general deconvolution operation into multiple standard convolution operations without incurring much computing redundancy, so that deconvolution can be implemented efficiently as convolution.

• We investigate how to reorganize the split deconvolution results efficiently on legacy neural network processors without hardware modification.

• We evaluate the proposed deconvolution approach on a set of representative benchmark networks with comprehensive experiments. The experiments show that the proposed approach achieves competitive performance relative to state-of-the-art deconvolution processors, on both general CNN processors and recently released commodity deep learning processor chips such as Google Edge TPU and Intel Neural Compute Stick 2.

The rest of this paper is organized as follows. Section 2 presents related work on deconvolution acceleration. Section 3 describes the architecture of typical CNN processors. In Section 4, we elaborate the conversion process of generic split deconvolution in detail. Finally, Section 5 presents the evaluation results and Section 6 concludes the paper.

2 RELATED WORK

With the advancement of deep learning, various neural networks have been proposed to address different tasks such as object detection and image classification. Among them, generative neural networks have been demonstrated to be particularly effective for content generation tasks like image style transfer [3], [4], [14], segmentation tasks such as [15], and pose estimation tasks such as [16]. These novel neural networks have attracted a lot of attention. Ledig et al. [17] proposed SRGAN and adopted a perceptual similarity loss to generate detailed images from low-resolution images. By using generative adversarial networks (GANs), high-resolution images of small objects can be generated and utilized to improve target detection accuracy [18]. Generative neural networks can also be applied to sequence data generation as presented in SeqGAN [19] and ORGAN [20]. Additionally, more variants of generative neural networks have been developed and employed in semi-supervised learning and the medical field [21], [22].

However, generative neural networks that consist of both compute-intensive convolution and deconvolution operators cannot be fitted to conventional CNN processors directly [6], [7], [8], [23].

Fig. 1. Computational process of (a) deconvolution and (b) deconvolution with inserted zero-values.


As deconvolution is also compute-intensive and hinders the acceleration of generative neural networks on CNN processors, it is highly desirable to explore hardware acceleration of deconvolution operations. Zhang X et al. in [11] proposed to optimize deconvolution with reverse looping and stride hole skipping. Despite the excellent performance, combining independent convolution and deconvolution components in one processor induces considerable chip area and power consumption. Amir Y et al. in [12] proposed to convert deconvolution to convolution by adding zero padding to the activations and then developed a unified MIMD-SIMD processor for both operations. In addition, it implemented a set of distributed on-chip buffers to avoid the redundant computing brought by the inserted zero activations. Based on [12], the authors further developed an end-to-end template-based solution in [24], which can generate an optimized synthesizable unified processor from a high-level specification of GANs. Instead of adding zeros to the input feature map, Xu et al. in [5] proposed a unified FCN processor on top of a bi-directional systolic array. The FCN processor performs the computing on the original input features; the weights and data of adjacent PEs are shared and passed periodically by taking advantage of small column buffers added to the 2D PE array. Similar to [5], Wang et al. in [25] designed a uniform architecture to support both 2D and 3D deconvolutional neural networks on FPGAs, with multiple FIFOs added to adjacent PEs to deliver the overlapped temporary results. Yan et al. in [26] proposed a cold buffer that stores the overlapped computing results for more efficient data reuse, together with a novel mapping approach to improve the utilization of the computing array for both convolution and deconvolution. Intel NCS2 [27] also has specialized hardware to support native deconvolution, but no technical details are publicly available. In summary, hardware redesigning is typically required to make an existing CNN processor support the deconvolution computing in generative neural networks. Due to the long hardware design cycle, many commodity neural network processors including Google Edge TPU [13] and the Ropal Neural Compute Stick Lightspeeur SPR2801 [28] still do not support raw generative neural networks. Besides the mainstream solutions that accelerate neural networks on ASICs or FPGAs, implementations on Resistive Random Access Memory (ReRAM) are also a recent research hotspot. F. Chen et al. [29] accelerate GANs using a filter deformation method that completely eliminates the inserted zeros in deconvolutional layers based on a 3D horizontal ReRAM architecture. However, due to technological limitations, neural network accelerators on ReRAM are still being explored. The purpose of this paper is to propose an acceleration method for GANs that can be reused on existing mainstream accelerators.

Different from the above works, researchers have also sought to reuse conventional CNN processors for generative neural networks without hardware redesigning. Shi et al. [30] presented a simple example of the transformation from deconvolution to convolution by padding zeros to the input feature maps. However, the fixed zero-padding to the right and bottom of the input features only works for the first partition of the split deconvolution and causes errors when this zero-padding is used for the general deconvolution conversion.

Fig. 2. Dot-production based CNN processor [7], [8], [9], [10]

Fig. 3. Regular 2D array CNN processor [5], [6], [13]

The correct padding must be adapted to the deconvolution partition as well as the output feature cropping strategy to ensure output equivalent to the raw deconvolution. In addition, this work is posted as a blog with limited experiments and has not gone through peer review. Chang et al. [31] utilized filter deformation and proposed an approximate conversion approach targeting super-resolution image reconstruction. While super-resolution image reconstruction can typically tolerate computing errors, so the approximate conversion works fine there, it cannot be applied to general generative neural networks that are not necessarily fault-tolerant. In addition, the approach proposed in [31] needs to rearrange the deconvolutional results on the CPU instead of the CNN processor, which can cause massive data communication between the CPU and the processor. To address the above problems, we aim to develop a software approach that can deploy generative neural networks directly on existing CNN processors without precision penalty or hardware modification.

3 TYPICAL CNN PROCESSORS

This section briefly explains the two mainstream general CNN processor architectures assumed in this paper: the dot-production array processor and the regular 2D array processor. Most prior CNN processors fall into these two typical architectures [5], [6], [7], [8], [9], [10], [13].


3.1 Dot-production array processor

Figure 2 shows the dot-production based CNN processor. It consists of D_out neural processing units, each of which includes D_in multipliers as well as an adder tree and performs a dot production. The same D_in input activations are fed concurrently to every processing unit per cycle while the weights are different: each unit accepts D_in weights per cycle, so D_in × D_out parameters need to be sent to the array per cycle. In each PE, the D_in partial results obtained from the multipliers are consumed by the adder tree, and one dot production completes each cycle thanks to the pipelined processing architecture. Once a filter window is processed, the output result is sent to an activation function unit and then transferred to the output buffer. Each output activation produced by a processing unit belongs to a different output channel. When the weights or data cannot be accommodated by the on-chip buffers, the neural networks are tiled to fit the architecture. Diannao [7], Dadiannao [8], C-Brain [9] and Cnvlutin [10] are typical designs that adopt the dot-production array architecture.
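As a minimal behavioral sketch (not a cycle-accurate model) of the array just described, the following NumPy snippet shows what one cycle computes: the same D_in activations are shared by all D_out processing units, each of which reduces them against its own weight vector. The 16-unit, 16-wide configuration mirrors the setup used later in the experiments; the names and data layout are illustrative.

```python
import numpy as np

D_IN, D_OUT = 16, 16   # array width and number of processing units (illustrative)

def dot_production_cycle(activations, weights):
    """activations: (D_IN,) broadcast to all units; weights: (D_OUT, D_IN),
    one private weight vector per processing unit. Each unit multiplies the
    shared activations with its weights and reduces them through an adder
    tree, yielding one partial sum per output channel."""
    return weights @ activations

acts = np.random.randn(D_IN).astype(np.float32)
w = np.random.randn(D_OUT, D_IN).astype(np.float32)
partial_sums = dot_production_cycle(acts, w)   # shape (D_OUT,)
```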

3.2 Regular 2D array processor

Another typical CNN processor architecture with a regular 2D PE array is illustrated in Figure 3. Compared to the former structure, it mainly differs in the data flow. The data flow assumed in this work is output stationary (OS) according to the definition in Eyeriss [6]: each PE in the array performs all the operations required to yield one output activation. The weights are fed from the first column of the array and flow across the PEs from left to right so that all PEs operate at full scale. The input activations are broadcast to all the PEs in a column, and at most one PE column receives new input activations at a time, alleviating the pressure on on-chip buffer bandwidth. Each row of the PE array produces the output activations of one output feature map along the y-axis, while each PE column produces output activations belonging to different output feature maps but the same pixel position. Under these circumstances, both input activations and weights consume a limited amount of on-chip memory bandwidth. This architecture achieves high reusability and eventually helps boost throughput under a limited bandwidth provision. This feature is handy in reducing the bandwidth demand caused by weight/activation loads and stores, especially when the processor shares the on-chip storage space and bandwidth with other application processors in today's heterogeneous SoCs adopted in mobile and embedded systems. Eyeriss [6], TPU [13] and FCN-Engine [5] are typical designs that adopt the 2D array architecture.
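A behavioral sketch of this output-stationary mapping is given below, again not cycle-accurate: each PE owns exactly one output activation, with PE rows mapped to output channels and PE columns to output pixel positions as described above. The 32×7 array size follows the experimental setup in Section 5; the flattened data layout and the reduction length T are assumptions made for illustration.

```python
import numpy as np

ROWS, COLS = 32, 7        # PE rows (output channels) x PE columns (output pixels)
T = 3 * 3 * 64            # K*K*IC multiply-accumulate steps per output (illustrative)

def os_array_tile(filter_rows, act_cols):
    """filter_rows: (ROWS, T), one flattened filter per output channel (per PE row);
    act_cols: (COLS, T), one flattened receptive field per output pixel (per PE column)."""
    acc = np.zeros((ROWS, COLS), dtype=np.float32)   # one stationary accumulator per PE
    for t in range(T):
        # Step t: the t-th activation of each receptive field is broadcast down its
        # column while the t-th weight of each filter streams across its row; every
        # PE multiply-accumulates into its private output register.
        acc += np.outer(filter_rows[:, t], act_cols[:, t])
    return acc               # equals filter_rows @ act_cols.T

filters = np.random.randn(ROWS, T).astype(np.float32)
fields = np.random.randn(COLS, T).astype(np.float32)
outputs = os_array_tile(filters, fields)
```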

4 THE PROPOSED SPLIT DECONVOLUTION

In Section 4.1, we analyze the correlation between convolution and deconvolution and outline the idea of converting a deconvolution operation into standard convolutions. We then present the detailed conversion steps from generic deconvolution operations to standard convolution operations in Section 4.2.

Fig. 4. (a) Convolutional layer (b) Deconvolutional layer (c) Split deconvolution that converts a deconvolution layer to multiple convolution layers.

4.1 Correlation between Convolution and Deconvolution

Convolution and deconvolution are the major sources of overhead in generative neural networks. Figures 4(a) and 4(b) show the basic computing patterns of the two operations. In convolution, i.e. Figure 4(a), windows of input features are first convolved with the corresponding filters, and the results are added up to obtain one element of the output feature. In deconvolution, i.e. Figure 4(b), each element of the input feature maps is first multiplied with each weight matrix.

Algorithm 1 Convolution & Deconvolution
Require: oh, ow, oc, IC, KH, KW
Ensure: output(oh, ow, oc)
 1: function CONV(oh, ow, oc, IC, KH, KW)
 2:   for (ic = 0; ic < IC; ic++) do
 3:     for (kh = 0; kh < KH; kh++) do
 4:       for (kw = 0; kw < KW; kw++) do
 5:         output(oh, ow, oc) += input(oh × s + kh, ow × s + kw, ic) × w(kh, kw, ic, oc)

 6: function DECONV(oh, ow, oc, IC, KH, KW)
 7:   left = max(0, ceil((ow − KW) / s))
 8:   right = min(IW − 1, left + ceil(KW / s))
 9:   top = max(0, ceil((oh − KH) / s))
10:   bottom = min(IH − 1, top + ceil(KH / s))
11:   for (ic = 0; ic < IC; ic++) do
12:     for (ih = top; ih < bottom; ih++) do
13:       for (iw = left; iw < right; iw++) do
14:         output(oh, ow, oc) += input(ih, iw, ic) × w(oh − ih × s, ow − iw × s, ic, oc)


Algorithm 2 Split Deconvolution
Require: oh, ow, oc, IC, KH, KW
Ensure: output(oh, ow, oc)
 1: function SPLITDECONV(oh, ow, oc, IC, KH, KW)
 2:   // weights of the N identical split
 3:   // convolutions: w(KTH, KTW, IC)
 4:   for (n = 0; n < N; n++) do
 5:     for (ic = 0; ic < IC; ic++) do
 6:       kth ← KTH − 1
 7:       for (kh = floor(n / s); kh < KH; kh += s) do
 8:         ktw ← KTW − 1
 9:         for (kw = n mod s; kw < KW; kw += s) do
10:           w(n, kth, ktw, ic) ← w(kh, kw, ic)
11:           ktw ← ktw − 1
12:         kth ← kth − 1
13:     Conv(oh^n, ow^n, oc^n, IC, KTH, KTW)
14:     Reorganize the obtained nth output activation

Then the products in the overlapped positions are accumulated to form the final output activations. Defined this way, convolution and deconvolution appear to be completely different.

In order to reuse conventional CNN processors for deconvolution operations, we further analyze the computing patterns of convolution and deconvolution. The pseudo code of the two operations for computing one output activation is presented in Algorithm 1. Note that IC and OC denote the input and output channels of the feature maps, IH and IW denote the height and width of the input feature map, OH and OW are the height and width of the output feature map, KH and KW are the height and width of the filter, and s refers to the stride. These notations are used throughout this paper. Basically, convolution can be computed in an element-wise manner, while deconvolution consists of multiple grouped convolutions. Each output activation of convolution, i.e. output(oh, ow, oc), is the accumulation of the products of the input feature window ([oh × s, oh × s + KH), [ow × s, ow × s + KW)) with consecutive weight matrices. For deconvolution, each output activation is also the accumulation of products of an input feature map window and a set of weights using the same Convolution computing function, except that the weights are selected with stride s and reassigned new coordinates in the nth group. Meanwhile, the outputs belonging to different groups need to be reorganized in the final output feature map.
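A direct NumPy transcription of Algorithm 1 is sketched below for one output activation, assuming an HWC input layout input[h, w, c] and a weight layout w[kh, kw, ic, oc]; these layout choices, and the brute-force loop over all input positions in the deconvolution case, are ours for illustration (Algorithm 1 restricts that loop to the contributing window instead).

```python
import numpy as np

def conv_output(inp, w, oh, ow, oc, s):
    """One convolution output activation (the CONV function of Algorithm 1).
    inp: (IH, IW, IC), w: (KH, KW, IC, OC)."""
    KH, KW, IC, _ = w.shape
    acc = 0.0
    for ic in range(IC):
        for kh in range(KH):
            for kw in range(KW):
                acc += inp[oh * s + kh, ow * s + kw, ic] * w[kh, kw, ic, oc]
    return acc

def deconv_output(inp, w, oh, ow, oc, s):
    """One deconvolution output activation (the DECONV function of Algorithm 1).
    An input position (ih, iw) contributes only when its s-scaled kernel window
    covers (oh, ow); Algorithm 1 derives left/right/top/bottom bounds for this,
    here we simply test the condition inside the loop."""
    KH, KW, IC, _ = w.shape
    IH, IW, _ = inp.shape
    acc = 0.0
    for ic in range(IC):
        for ih in range(IH):
            for iw in range(IW):
                if 0 <= oh - ih * s < KH and 0 <= ow - iw * s < KW:
                    acc += inp[ih, iw, ic] * w[oh - ih * s, ow - iw * s, ic, oc]
    return acc
```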

With this observation, we propose a split deconvolution approach, shown in Algorithm 2, which divides the deconvolution filters into multiple smaller filters with stride s. The split filters become consecutive, and each deconvolution operation is converted into multiple standard convolution operations. Accordingly, deconvolution can be deployed on conventional CNN processors without any hardware modification. The filters need to be split and the obtained activations reorganized; the detailed conversion approach is elaborated in the next subsection.

4.2 Generic Deconvolution Conversion

Following the above idea, we convert a generic deconvolution operation into a set of independent convolution operations.

Fig. 5. Conversion steps from deconvolution to convolution. It consists of four steps: 1) the filter is expanded when the filter size is not divisible by the stride; 2) the deconvolution filters are split into multiple small filters according to Equations (4)-(6); 3) the padded input feature maps are convolved with the split filters; 4) the split deconvolution results are reorganized to construct the expected deconvolution output by Equations (10) and (11).

The conversion roughly consists of four steps, as shown in Figure 5.

The first step is the weight preprocessing, in which the original deconvolution filters are expanded with zeros on the top and left sides when their height and width are not divisible by the stride s. This ensures that the deconvolution can be converted into multiple identical convolution operations. The padded zeros expand the output accordingly, while the orientation of the padded zeros guarantees that the center of the expanded output covers the standard deconvolution output. The expanded height and width P_K can be calculated with Equation (1), where K_T is the split filter size (assuming a square filter) obtained from Equation (2).

P_K = s × K_T − K    (1)

K_T = ceil(K / s)    (2)

The second step is to split the deconvolution filters into multiple small filters by sampling and rotation. Figure 6 illustrates the coordinate distribution of the filters before and after the conversion with a small but representative example. To compute one deconvolution output activation with standard convolution operations, the filters need to be sampled with stride s and reorganized into new filters. In addition, each sampled filter needs to be rotated by 180 degrees to ensure correct computing. Equation (3) gives the number of split filters: each deconvolution is split into s² convolution operations, and the stride of the split convolution operations is a constant 1. Without loss of generality, suppose W_n is the nth convolutional filter. It can be obtained with Equations (4)-(8),


Fig. 6. Weight distribution for an output activation in the original deconvolution and the split deconvolution, where the filter is 4 by 4 and the stride is 2.

where W is the deconvolution filter, (y, x) is the original filter coordinate and (y_n, x_n) is the new coordinate.

N = s²    (3)

n = s × mod(y, s) + mod(x, s)    (4)

W_n[y_n][x_n] = W[y][x]    (5)

x_n = K_T − 1 − floor(x / s),  y_n = K_T − 1 − floor(y / s)    (6)

where 0 ≤ x < K + P_K,  0 ≤ y < K + P_K    (7)

0 ≤ x_n < K_T,  0 ≤ y_n < K_T    (8)

n ∈ {0, 1, 2, ..., N − 1}
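Steps 1 and 2 are captured by the sketch below, which expands and splits a single 2D deconvolution filter following Equations (1)-(8) (equivalently, the weight loop of Algorithm 2). It assumes one input/output channel pair and a square filter; channel dimensions and framework-specific weight layouts are omitted.

```python
import numpy as np

def split_deconv_filters(W, s):
    """Split one K x K deconvolution filter into s*s stride-1 convolution
    filters of size KT x KT (Steps 1-2, Equations (1)-(8))."""
    K = W.shape[0]
    KT = -(-K // s)                      # KT = ceil(K / s), Equation (2)
    PK = s * KT - K                      # zeros added on the top/left, Equation (1)
    Wp = np.zeros((K + PK, K + PK), dtype=W.dtype)
    Wp[PK:, PK:] = W                     # Step 1: expand the filter with zeros
    filters = np.zeros((s * s, KT, KT), dtype=W.dtype)
    for y in range(K + PK):
        for x in range(K + PK):
            n = s * (y % s) + (x % s)    # Equation (4): target split filter
            yn = KT - 1 - y // s         # Equation (6): sample with stride s and
            xn = KT - 1 - x // s         #   rotate the sampled filter by 180 degrees
            filters[n, yn, xn] = Wp[y, x]
    return filters, KT, PK

# 4x4 filter with stride 2 (the Figure 6 example): PK = 0, KT = 2,
# and four 2x2 split filters are produced.
filters, KT, PK = split_deconv_filters(np.arange(16, dtype=float).reshape(4, 4), 2)
```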

Steps 1 and 2 basically split the deconvolution filters into multiple small convolution filters. This needs to be done only once and the result can be reused, so both steps can be performed offline in software. Unlike the first two steps, Steps 3 and 4 are performed on the CNN processor for each input feature map. In Step 3, the input feature maps also need to be padded with zeros to obtain an equivalent deconvolution output; otherwise, the output activations on the edge would be missing. P_I columns/rows of zeros are added, where P_I is obtained from Equation (9).

P_I = K_T − 1    (9)

Finally, the N split convolution outputs need to be merged to form the deconvolution output. The reorganization pattern is illustrated in Figure 7 and formulated in Equations (10)-(13).

Fig. 7. Demonstration of the redistribution process for multiple groups of output activations.

Contrary to the filter splitting process, we pick one element from each convolution output to construct an s × s window in the deconvolution output. Note that ConvO_n[x_i][y_i] represents the nth split convolution output and O[x_f][y_f] refers to the expected deconvolution output. Suppose (y_i, x_i) is a coordinate of a split convolution output and (y_f, x_f) is the corresponding coordinate of the deconvolution output. The reorganization here does not need additional hardware as long as the partial convolution outputs can be written to the buffers with stride s, which is usually allowed in generic CNN processors supporting tiling.

O[y_f][x_f] = ConvO_n[y_i][x_i], with
x_f = x_i × s + mod(n, s)    (10)
y_f = y_i × s + floor(n / s)    (11)

where 0 ≤ x_i < I + 2P_I − K_T + 1,  0 ≤ y_i < I + 2P_I − K_T + 1    (12)

0 ≤ x_f < (I + 2P_I + 1) × s − (K + P_K),  0 ≤ y_f < (I + 2P_I + 1) × s − (K + P_K)    (13)
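Steps 3 and 4 can then be written as the following single-channel sketch: pad the input by P_I = K_T − 1 zeros (Equation (9)), run the s² stride-1 convolutions, and interleave their outputs according to Equations (10)-(11). The split_filters argument is the (s², K_T, K_T) array produced by the Step 1-2 preprocessing (for example the split_deconv_filters sketch above); as in that sketch, channels, batching and the boundary cropping conventions of a concrete layer are omitted.

```python
import numpy as np

def sd_deconv_single_channel(x, split_filters, s):
    """Split-deconvolution execution of Steps 3-4 for one channel.
    x: (I, I) input feature map; split_filters: (s*s, KT, KT) from Steps 1-2."""
    KT = split_filters.shape[1]
    PI = KT - 1                                    # Equation (9)
    xp = np.pad(x, PI)                             # Step 3: zero-pad the input
    OT = xp.shape[0] - KT + 1                      # size of each split conv output
    out = np.zeros((OT * s, OT * s), dtype=x.dtype)
    for n in range(s * s):
        conv_n = np.zeros((OT, OT), dtype=x.dtype)
        for i in range(OT):                        # stride-1 convolution with filter n
            for j in range(OT):
                conv_n[i, j] = np.sum(xp[i:i + KT, j:j + KT] * split_filters[n])
        # Step 4, Equations (10)-(11): element (yi, xi) of split output n lands at
        # (yi * s + n // s, xi * s + n % s) of the reorganized output.
        out[n // s::s, n % s::s] = conv_n
    return out
```

The standard deconvolution output sits in the center of this reorganized array; the extra border introduced by the filter expansion and the input padding is cropped away, with the exact cropping window depending on the padding convention of the target layer. On a real processor the interleaved write corresponds to storing each split convolution output with stride s, as discussed above.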

With the above four steps, we can convert generic deconvolution operations into split convolution operations and run deconvolution on an unmodified CNN processor. Despite the hardware compatibility, the proposed split deconvolution approach may extend the filters and input feature maps, which induces additional computing overhead. On the other hand, the padded values are zeros and can potentially be skipped by conventional CNN processor optimizations. A detailed evaluation on realistic benchmarks is presented in the experiments.

5 EXPERIMENTS

This section consists of three parts. First, we list the settings of the selected benchmarks and the experimental environment. Then we evaluate the performance and energy consumption of split deconvolution on the general CNN processors. Finally, the approach is compared on two off-the-shelf processors, i.e. Google Edge TPU and Intel NCS2. The proposed SD algorithm and its deployment on the neural network processors are open sourced and can be found at https://github.com/warmthless/split-deconvolution.


5.1 Experimental setup

To perform a comprehensive evaluation of the proposed split deconvolution computing approach, we conduct experiments on both simulation-based neural network processors and commodity neural network processors provided by chip vendors, and then compare the proposed method with prior deconvolution computing approaches.

For the simulation-based evaluation, we developed cycle-accurate neural network simulators for both the dot-production based neural network processor architecture and the regular 2D array architecture. Both the 8-bit dot-production PE array and the 2D PE array are implemented and synthesized with Synopsys Design Compiler (DC) under the TSMC 40 nm library. The dot-production based architecture includes 16 processing units, and each unit performs a dot production on 16 input activations and weights. The 2D PE array is set to 32 by 7. The I/O buffer size is set to 256 KB and the weight buffer to 416 KB. Both processors run at 800 MHz.

For the commodity neural network processors, we choose two representative ones. One of them is the Edge TPU [13] from Google, which does not support native deconvolution operations; to implement deconvolution on it, we convert the deconvolution to standard convolution using zero padding [6]. The other chip is the latest NCS2 [27] from Intel, which supports native deconvolution, so the deconvolution is applied directly on the optimized architecture of the NCS. The performance on the commodity processors is measured using the system clock.

To evaluate the different deconvolution approaches, we selected a set of advanced neural networks as our benchmarks: ArtGAN [14] on CIFAR-10 (ArtGAN), DCGAN [32] on the Large-scale CelebFaces Attributes dataset (DCGAN), Spectral Normalization for GAN [33] on CIFAR-10 (SNGAN), and GP-GAN on the Transient Attributes database [34] (GP-GAN), all of which generate new data; Unsupervised Monocular Depth Estimation with an FCN on KITTI and Cityscapes [4] (MDE), which targets image segmentation; and Fast-Style-Transfer [3] on COCO 2014 (FST), which applies the style of one image to another.

5.2 Experimental results on general CNN processors

This section illustrates how the proposed split deconvolution improves the performance and efficiency of generative neural networks on the simulated general CNN processors, covering both the dot-production array and the 2D array architectures.

TABLE 2
Comparison of multiply-add operands (deconvolution layers) for three different implementations

TABLE 3
Comparison of weight parameters (deconvolution layers) for three different implementations

5.2.1 Operation number and parameters comparison

Multiply-add (MAC) operations take up the majority of the computing in neural networks, so the number of MACs directly reflects the computing intensity of a neural network and is independent of the underlying computing architecture. We therefore use this metric to compare the different deconvolution computing approaches. Table 2 shows the number of MACs in the original neural networks, in the networks using naive zero padding (NZP) and in the networks using the proposed split deconvolution (SD). It can be observed that NZP incurs a large number of redundant operations compared with the original deconvolution. Compared to NZP, SD brings in much less computing: it does not incur any additional computing overhead in SNGAN, ArtGAN and GP-GAN, and induces only a small amount of additional computing in the rest of the neural networks.
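The kind of per-layer comparison behind Table 2 can be estimated from the layer hyper-parameters alone. The sketch below counts MACs for the original deconvolution, its NZP conversion and its SD conversion for a single square layer, assuming an unpadded transposed convolution; the example hyper-parameters at the end are illustrative rather than taken from a specific benchmark layer.

```python
def deconv_macs(I, K, s, IC, OC):
    """MACs of the original transposed convolution on an I x I input:
    every input activation meets every filter weight once."""
    return I * I * K * K * IC * OC

def nzp_macs(I, K, s, IC, OC):
    """MACs after the zero-insertion (NZP) conversion: a stride-1 K x K
    convolution window is evaluated for every output pixel."""
    O = (I - 1) * s + K                   # output feature map size
    return O * O * K * K * IC * OC

def sd_macs(I, K, s, IC, OC):
    """MACs of split deconvolution: s*s stride-1 convolutions with KT x KT
    filters over the input padded by PI = KT - 1 on each border."""
    KT = -(-K // s)                       # ceil(K / s)
    OT = I + 2 * (KT - 1) - KT + 1        # size of each split convolution output
    return s * s * OT * OT * KT * KT * IC * OC

# Illustrative layer: 8x8 input, 4x4 filter, stride 2, 256 -> 128 channels.
layer = dict(I=8, K=4, s=2, IC=256, OC=128)
print(deconv_macs(**layer), nzp_macs(**layer), sd_macs(**layer))
```

For a stride of 2 the NZP count is roughly 4× the original, in line with the ~75% redundancy reported for the dot-production array in Section 5.2.2, while the SD count exceeds the original only by the boundary terms.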

In theory, SD does not increase the amount of computation. The deformation approach proposed in [29] transforms the filters into different shapes and does not introduce redundant parameters, but some legacy accelerators may not support irregularly shaped filters within a layer. In that case, zeros need to be added so that the deconvolutional layers can be mapped onto general-purpose architectures. When the original filter height or width is not divisible by the stride s of the corresponding neural network, we pad zeros on the top and left sides of the filters to ensure identical filter splitting. These regular zero values can easily be compressed on an accelerator with a suitable data format. Table 3 lists the number of weight parameters of the original neural networks [29], of the general SD approach, and of SD with compressed weight parameters. Although zeros are introduced into the weights of some benchmarks (DCGAN, MDE and FST), most of the redundant values are removed after compression. In addition, the split deconvolution may produce only the center area of the original deconvolution output feature maps, so we must add zero padding to the input feature maps to obtain equivalent deconvolution output feature maps. Thereby, the proposed split deconvolution may add zeros to both the weights and the input activations, and induce more computing depending on the neural network parameters.

The purpose of padding zeros onto the filters and input feature maps is to make the SD approach a more general solution for accelerating GANs on legacy accelerators. Meanwhile, these induced redundant values can be omitted and then have no impact on performance, as analyzed in detail in the next section.


Fig. 8. Performance comparison of the deconvolutional layers in the dot-production PE array.

Fig. 9. Performance comparison of the deconvolutional layers in the regular 2D PE array.

5.2.2 Performance comparison

In this section, we compare the different deconvolution approaches on the typical neural network processors. Although NZP and SD may induce redundant computing, much of it can potentially be squeezed out using sparse-aware optimization techniques that allow the processors to skip zero multiplications. In general, there are three sparse-aware optimization methods: activation sparse optimization (Asparse), weight sparse optimization (Wsparse), and combined activation and weight sparse optimization (AWsparse). We explored the neural network performance on processors with the different optimization methods. Since the processor with the dot-production PE array cannot skip zero weights, we only apply the Asparse method to it. In addition, we also compare with FCN-engine [5], which redesigned the 2D PE array CNN processor.

Figure 8 depicts the normalized performance of the three acceleration schemes on the dot-production PE array. NZP incurs 75% computing redundancy on average on the benchmark neural networks when converting deconvolution to convolution. Unlike NZP, split deconvolution has only marginal zero padding on the boundary in some corner cases. Therefore, it has much less computing redundancy, which is reflected in the 2.5× performance boost of SD over NZP. When the involved input activation lines can be skipped to generate the standard deconvolution output, the performance can be further improved. Notably, SD-Asparse on DCGAN improves by 1.4×. The primary reason is that DCGAN has fewer network layers and smaller input feature maps, so the computing redundancy caused by the padding affects the overall performance more significantly.


Fig. 10. Energy consumption of the deconvolutional layers in the dot-production PE array.

Fig. 11. Energy consumption of the deconvolutional layers in the regular 2D PE array.

On the 2D PE array CNN processor, shown in Figure 9, SD-Asparse and SD-Wsparse illustrate the influence of the filter expansion and the input expansion respectively. Although SD-Wsparse induces some redundant computation due to the padding of the input feature maps, most convolution processors support zero-skipping and can squeeze out this computing redundancy automatically. Compared to SD-Wsparse, SD-WAsparse, which enables zero-skipping, reduces the redundant computation by 22% on average. Similarly, SD-Asparse has zero padding added to the weights, and this redundant computing can also be eliminated on a sparse convolution processor architecture. For workloads like DCGAN, FST and MDE, the filters need to be expanded; in these cases, SD-WAsparse removes 75%-80% of the computing redundancy with zero-skipping. When the split deconvolution is deployed on optimized CNN processors, the performance of SD-WAsparse is on par with that of FCN in all the benchmark neural networks. The deconvolution approach presented in FCN-engine [5] adopts a bi-directional data flow and implements the original deconvolution, in which the input activations are multiplied with each filter and the overlapped products are then accumulated. By taking advantage of the column buffers, it can transmit the partial results for accumulation efficiently. However, the output feature maps on the edge are redundant and need to be cropped, which inevitably induces computing overhead, especially for smaller deconvolution layers. Therefore, SD-WAsparse outperforms FCN-engine on some of the neural networks like DCGAN, as shown in Figure 9.


5.2.3 Energy consumption comparison

Figures 10 and 11 present the relative energy consumption of the different deconvolution approaches on the dot-production PE array and the regular 2D PE array respectively. Compared to NZP, the average energy consumption of SD-Asparse and SD-WAsparse is reduced by 36.15% and 43.63% respectively on the two CNN architectures. Unlike the performance gap, the energy consumption gap is less significant. In general, the deconvolution energy consumption roughly consists of three parts, i.e. the PEs, the on-chip buffers, and DRAM. According to the estimation using CACTI [35], the energy is mostly consumed by DRAM accesses and on-chip buffer accesses. Since the amount of DRAM access is about the same for the different deconvolution approaches, its contribution differs little across them. Despite the dramatic difference in PE activity, the PE energy consumption is too small to affect the overall deconvolution energy consumption. As a result, the energy consumption difference is primarily determined by the amount of on-chip buffer accesses. For example, SD-Asparse induces relatively more weight reads and thus higher energy consumption. Similarly, FCN requires additional on-chip buffers to support the unified convolution and deconvolution, so its overall energy consumption is higher than that of SD-WAsparse in all the benchmark networks, even though their performance is quite close.

5.2.4 SD-based DCGAN Demo

As we need to manipulate the output write instructions of the neural network processor to reorganize the split deconvolution outputs into the equivalent deconvolution output, we apply the proposed SD approach on an edge AI system on which we can access the low-level output write instructions, to demonstrate the use of the proposed SD algorithm. The edge AI system is produced by [36]; it consists of RISC-V cores and a neural network processor fabricated with TSMC 40 nm technology. Each split convolution is executed sequentially on the neural network processor, and the conventional sequential output write instruction is replaced with a strided write instruction, which is widely supported by DMA cores. On this AI system, we implemented a face generation demo using DCGAN, as shown in Figure 12. The end-to-end performance comparison with NZP is consistent with that obtained in Figure 9, which demonstrates the computing efficiency of SD on neural network processors.

5.2.5 Deconvolution Conversion Quality Evaluation

In this section, we evaluate the quality of the deconvolution results produced by the different deconvolution conversion approaches. We compare the results to those generated by the raw deconvolution using the SSIM metric [37], which is widely used to measure the similarity between images. SSIM ranges from 0 to 1, and a higher SSIM indicates higher similarity between the images. The comparison is shown in Table 4. It can be observed that SD produces identical results for both DCGAN and FST, while the methods proposed in [30] and [31] produce different results.

Fig. 12. The DCGAN demo on edge system jeejio JX2 [36]

Moreover, the SSIM of the same deconvolution conversion approach varies across different generative neural network models. In particular, the approach in [30] results in considerable computing errors in DCGAN but only minor errors in FST. This is mainly because the input images in FST are larger, so the influence of the wrong padding on the boundaries is less significant; moreover, the proportion of deconvolution in FST is smaller than that in DCGAN, which also explains the higher SSIM in FST. To further illustrate the effect of the errors on the generated images, we also display the images generated using the different deconvolution approaches in Figures 13 and 14. It can be seen that the quality of the images produced by the different deconvolution conversion approaches is roughly consistent with the SSIM metric. FST using the deconvolution conversion approach proposed in [30] appears visually acceptable, but the generated images in the remaining cases differ dramatically and cannot be tolerated or utilized.
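For reference, the SSIM comparison itself can be reproduced with scikit-image once the generated images are saved to disk; this is a minimal sketch in which the file names are placeholders and the images are compared on grayscale copies.

```python
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

def image_ssim(path_ref, path_test):
    """SSIM between a reference image (raw deconvolution) and a test image
    (converted deconvolution), computed on grayscale copies in [0, 1]."""
    ref = rgb2gray(imread(path_ref))
    test = rgb2gray(imread(path_test))
    return structural_similarity(ref, test, data_range=1.0)

# Placeholder file names for illustration; 1.0 means the images are identical.
score = image_ssim("dcgan_raw_deconv.png", "dcgan_sd.png")
```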

Fig. 13. The generated images of DCGAN [32]. (a) The SD approach. (b) The approach based on [30]. (c) The approach based on [31].

5.3 Experiments on commodity NN processors

TABLE 4
SSIM value comparison


Fig. 14. The generated images of FST [3]. (a) The SD approach. (b) The approach based on [30]. (c) The approach based on [31].

Fig. 15. Performance comparison on Edge TPU.

This section demonstrates the use of SD for generative neural networks on advanced commodity CNN processor chips, including Google Edge TPU, which has no specialized deconvolution support, and Intel NCS2, which supports deconvolution natively. As we cannot access the internal output write instructions of these commodity neural network processors, we cannot reorganize the split convolution results into the equivalent deconvolution outputs directly, even though strided output writes are probably supported. To demonstrate the use of SD on these processors, we move the results generated by each split convolution to the host and let the host reorganize them for the following neural network operations. Since this data movement would not be required given internal data movement support in the neural network processors, we only count the split deconvolution computing time and the data reorganization time as the overall deconvolution execution time in the experiments.

5.3.1 Edge TPU

Edge TPU is a tensor processor with a systolic array architecture and is usually used as a co-processor of a host computer. It does not support native deconvolution operations, so we apply the NZP approach to implement deconvolution on it as the baseline. Meanwhile, we also use split deconvolution (SD) to deploy deconvolution on the TPU. The NZP approach requires zero padding of the input feature maps, and the SD approach needs additional output feature reorganization. Since this computing cannot be performed on the TPU directly, both are done on the host processor.

The normalized performance of the two deconvolution approaches on Edge TPU is illustrated in Figure 15. The proposed SD achieves a 1.51× performance speedup over NZP on average. In particular, FST yields the highest speedup (1.65×) over NZP on Edge TPU. However, the performance improvement is much lower than that on the CPU, and it is not consistent with the numbers of operations listed in Table 2.

Fig. 16. Performance comparison under the same computational efficiency on the host.

To explore the underlying reasons, we further evaluate the computing efficiency of convolution with different input feature map sizes and filter sizes on Edge TPU. The evaluation results are shown in Table 5 and Table 6. The filter size is set to 3×3, a frequent setup for split deconvolution, and we measure the giga multiply-add operations per second (GMACPS) for different input feature maps. As shown in Table 5, when the size of the input feature maps grows from 8×8 to 128×128, the normalized computational efficiency of Edge TPU, i.e. GMACPS, increases significantly. Although little documentation on the detailed computing architecture of Edge TPU is available, the Edge TPU compiler probably parallelizes the convolution operations over the 2D plane of the input features and thus requires larger input feature maps to make good use of its computing resources. Similarly, we also investigate the influence of the filter size on the computing efficiency in Table 6. We set the feature map to 128×128 and change the filter size from 2×2 to 5×5. Comparing the computing efficiency, it can be observed that convolutions with larger filter sizes are clearly more efficient on Edge TPU. For SD, which splits the deconvolution operations into multiple smaller convolution operations, the resulting convolutions are usually less efficient compared to the NZP-based conversion. Basically, the computing efficiency of Edge TPU degrades with smaller feature maps and filter sizes due to its inherent convolution parallelization approach. Thereby, the performance speedup of SD over NZP is lower than that estimated from the number of MACs.

To further verify the above analysis, we run both the NZP and SD models on the host CPU, whose computing efficiency does not vary much with the kernel parameters. The normalized performance of the NZP and SD approaches is illustrated in Figure 16 (the host processor is an Intel Core i7-7700 at 3.6 GHz). The proposed SD achieves a 3.04× performance speedup over NZP on average, which is roughly consistent with the magnitude of the operation reduction presented in Table 2. In particular, the performance speedup goes up to 3.60× on GP-GAN. Similar to Figure 9, the average performance improvement of DCGAN, FST and MDE is relatively lower than that of SNGAN, ArtGAN and GP-GAN due to the additional parameters padded to the filters and the input features during splitting.


TABLE 5
Normalized GMACPS for different input feature map sizes on Edge TPU

TABLE 6
Normalized GMACPS for different filter sizes on Edge TPU

This confirms that SD does reduce the amount of computing compared to NZP, but the converted convolutions with smaller kernel sizes and lower computing efficiency limit the achieved performance speedup. If neural network processors improve their computing efficiency for smaller convolution kernel sizes, the performance speedup of SD over NZP will increase accordingly.

5.3.2 Intel Neural Compute Stick 2 (NCS2)

NCS2 is a neural network processor produced by Intel, and it includes specialized hardware to support native deconvolution operations. We evaluated the deconvolutional layers of the generative neural networks on it with the deconvolution operations implemented using the NZP approach, the SD approach, and the native deconvolution. The experiment is presented in Figure 17. Compared to NZP, the proposed SD achieves a 1.67× performance speedup. Similar to Edge TPU, the speedup is lower than that estimated from the MACs. We therefore also evaluate the influence of different feature map sizes and filter sizes; the results are shown in Table 7 and Table 8 under the same configurations as for Edge TPU. We find that the lower computing efficiency of smaller convolution kernels on NCS2 is the major reason for the lower performance speedup.

Since NCS2 also includes specialized hardware for native deconvolution operations, we further evaluated the deconvolutional layers of the generative neural networks with this optimized native deconvolution. Even compared to the native deconvolution implementation on NCS2, the proposed SD approach still yields a 1.10× performance speedup on average. Despite the degraded computing efficiency of NCS2 on the split convolution kernels, the proposed SD approach still delivers higher performance on NCS2 without any hardware modification.

6 CONCLUSION

Prior generative neural network acceleration schemes either require intensive hardware modification of existing CNN processors or bring in a large amount of redundant computing, because the involved deconvolution operations cannot be fitted to conventional CNN processors directly.

Fig. 17. Performance comparison on Intel Neural Compute Stick 2.

To address this problem, we propose to convert deconvolution to standard convolution with a software approach. The basic idea is to investigate the computing patterns of deconvolution and formulate them as convolution computing patterns. The resulting convolution filters are obtained by splitting the original deconvolutional filters, while the convolution results are reorganized to construct the original deconvolution results. This approach incurs little computing redundancy and thus enables fast and efficient deconvolution execution on legacy deep learning processors. With comprehensive experiments, we demonstrate that SD achieves a 2.41×-4.34× performance speedup over the naive zero padding method and is on par with a prior optimized implementation on a modified fully convolutional neural network processor. Moreover, the proposed approach is also beneficial to commodity neural processors: it yields a 1.51× performance speedup compared to naive zero padding on Google Edge TPU, which does not have native deconvolution support. When compared to the Intel NCS2 chip with native deconvolution support, it still achieves a 1.1× performance speedup on average, even though the computing efficiency of NCS2 degrades on the split convolution kernels.

TABLE 7: Normalized GMACPS for different input feature map sizes on NCS2

TABLE 8: Normalized GMACPS for different filter sizes on NCS2


REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[2] I. Korshunova, W. Shi, J. Dambre, and L. Theis, “Fast face-swap using convolutional neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3677–3685.

[3] L. Engstrom, “Fast style transfer,” https://github.com/lengstrom/fast-style-transfer/, 2016.

[4] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.

[5] D. Xu, K. Tu, Y. Wang, C. Liu, B. He, and H. Li, “FCN-engine: Accelerating deconvolutional layers in classic CNN processors,” in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 22.

[6] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 367–379.

[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM SIGPLAN Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.

[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.

[9] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2016, pp. 1–6.

[10] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 1–13, 2016.

[11] X. Zhang, S. Das, O. Neopane, and K. Kreutz-Delgado, “A design methodology for efficient implementation of deconvolutional neural networks on an FPGA,” arXiv preprint arXiv:1705.02583, 2017.

[12] A. Yazdanbakhsh, K. Samadi, N. S. Kim, and H. Esmaeilzadeh, “GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks,” in Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 2018, pp. 650–661.

[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[14] W. R. Tan, C. S. Chan, H. E. Aguirre, and K. Tanaka, “ArtGAN: Artwork synthesis with conditional categorical GANs,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3760–3764.

[15] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” arXiv preprint arXiv:1412.7062, 2014.

[16] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.

[17] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.

[18] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.

[19] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.


[20] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. Aspuru-Guzik, “Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models,” arXiv preprint arXiv:1705.10843, 2017.

[21] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang, “Triple generative adversarial nets,” in Advances in Neural Information Processing Systems, 2017, pp. 4088–4098.

[22] D. Yang, T. Xiong, D. Xu, Q. Huang, D. Liu, S. K. Zhou, Z. Xu, J. Park, M. Chen, T. D. Tran et al., “Automatic vertebra labeling in large-scale 3D CT using deep image-to-image network with message passing and sparsity regularization,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 633–644.

[23] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family,” in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 110.

[24] A. Yazdanbakhsh, M. Brzozowski, B. Khaleghi, S. Ghodrati, K. Samadi, N. S. Kim, and H. Esmaeilzadeh, “FlexiGAN: An end-to-end solution for FPGA acceleration of generative adversarial networks,” in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 65–72.

[25] D. Wang, J. Shen, M. Wen, and C. Zhang, “Towards a uniform architecture for the efficient implementation of 2D and 3D deconvolutional neural networks on FPGAs,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.

[26] J. Yan, S. Yin, F. Tu, L. Liu, and S. Wei, “GNA: Reconfigurable and efficient architecture for generative network acceleration,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2519–2529, 2018.

[27] “Intel neural compute stick 2,” https://software.intel.com/en-us/neural-compute-stick.

[28] “Ropal neural compute stick, lightspeeur spr2801,” https://www.ropal.com.cn/.

[29] F. Chen, L. Song, H. Li, and Y. Chen, “ZARA: A novel zero-free dataflow accelerator for generative adversarial networks in 3D ReRAM,” in 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.

[30] W. Shi, J. Caballero, L. Theis, F. Huszar, A. Aitken, C. Ledig, and Z. Wang, “Is the deconvolution layer the same as a convolutional layer?” arXiv preprint arXiv:1609.07009, 2016.

[31] J.-W. Chang and S.-J. Kang, “Optimizing FPGA-based convolutional neural networks accelerator for image super-resolution,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 343–348.

[32] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.

[33] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.

[34] H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards realistic high-resolution image blending,” arXiv preprint arXiv:1703.07195, 2017.

[35] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques,” in Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 2011, pp. 694–701.

[36] “Jeejio IoT chip JX2,” https://jeejio.com/.

[37] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.