Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory



Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang, Fengbin Tu, Leibo Liu, Jishen Zhao, Member, IEEE, Cong Xu, Member, IEEE, Shuangcheng Li, Yuan Xie, Fellow, IEEE, and Shaojun Wei

S. Yin, S. Tang, X. Lin, P. Ouyang, F. Tu, L. Liu, and S. Wei are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
J. Zhao is with the University of California, San Diego, CA 92093-0021.
C. Xu, S. Li, and Y. Xie are with the University of California, Santa Barbara, CA 93106.

Manuscript received 20 Aug. 2017; revised 13 June 2018; accepted 22 June 2018. (Corresponding author: Shouyi Yin.) Recommended for acceptance by M. Huebner. Digital Object Identifier no. 10.1109/TPDS.2018.2858230.

Abstract—Recent advances in deep learning (DL) have stimulated increasing interest in neural networks (NN). From the perspective of operation type and network architecture, deep neural networks can be categorized into full convolution-based neural networks (ConvNet), recurrent neural networks (RNN), and fully-connected neural networks (FCNet). Different types of neural networks are usually cascaded and combined into a hybrid neural network (hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementations are memory-intensive with a large number of memory accesses; hence, the performance of a hybrid-NN is often limited by insufficient memory bandwidth. A "3D + 2.5D" integration system, which integrates a high-bandwidth 3D stacked DRAM side-by-side with a highly parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in the 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration that takes the thermal problem of 3D DRAM into consideration. Parana addresses the thermal problem of 3D memory by optimizing both the total number of memory accesses and the memory access behavior. For the memory access behavior, Parana balances the memory bandwidth by spatial-division mapping of the hybrid-NN onto the computing resources, which prevents massive memory accesses from being issued within a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism to maximize the utilization of the on-chip buffer. Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory, without performance degradation.

Index Terms—Neural processor, hybrid neural networks, thermal problem, 3D memory

1 INTRODUCTION

AFTER almost two decades of stagnation, Artificial Intelligence (AI) is now embracing another peak in its history. The fast-developing deep learning (DL) techniques and their wide deployment serve as the main driving force of this new technology wave [1], [2], [3], [4]. Evolved from conventional neural network (NN) and machine learning (ML) algorithms, DL involves many layers with complex structures and/or multiple nonlinear transformations to model a high-level abstraction of data. The ability to learn tasks from examples makes DL particularly powerful in handling so-called cognitive applications such as image understanding, object detection, voice recognition, scene labeling, and video surveillance [1], [2], [3], [4], [5], [6], [7], [8], [9].

From the perspective of operation type and network architecture, deep neural networks can be categorized into full convolution-based neural networks (ConvNet), recurrent neural networks (RNN), and fully-connected neural networks (FCNet). A ConvNet is composed of convolutional layers and pooling layers and is good at visual feature extraction. An FCNet is composed of multiple fully connected layers and is good at classification. An RNN is composed of fully connected layers with feedback paths and is good at sequential data processing. In a practical DL application, several neural networks are usually cascaded as a hybrid neural network (hybrid-NN). For example, AlexNet [1] and VGGNet-16 [10], which are proposed for image classification, are actually the cascade of a ConvNet and an FCNet. LRCN [3] is another representative hybrid-NN for visual recognition and description. Fig. 1 summarizes the network structures of AlexNet and LRCN. PathNet [2], proposed by Google DeepMind, demonstrates general-purpose artificial intelligence with a hybrid-NN. Google DeepMind also proposed a unified hybrid-NN to implement various intelligent tasks [11].

Hybrid-NNs are usually memory-intensive, high-throughput algorithms with large amounts of parallel computation and memory accesses.


The acceleration of a hybrid-NN requires a large on-chip memory and a high memory bandwidth. 3D stacked memory [12] provides the large capacity and the high bandwidth needed for this high-throughput computation. The 3D stacking neural processors Neurocube [13] and TETRIS [14] integrate a 3D DRAM on top of the neural processing unit (NPU). However, vertical 3D stacking integration easily suffers from customization of the through-silicon vias in the 3D DRAM based on the layout of the NPU [15]. Compared to vertical 3D stacking, a "3D + 2.5D" integration system integrates a high-bandwidth 3D stacked DRAM side-by-side with a highly parallel NPU on a silicon interposer. It avoids the customization of the 3D DRAM. Benefiting from the advantages of "3D + 2.5D" integration, high-bandwidth modern CPUs/GPUs are implemented with 2.5D integration, such as AMD's Fiji [16], NVIDIA's Pascal [17], and Intel's Knights Landing [18]. Such "3D + 2.5D" integration has also been explored for neural acceleration, as demonstrated by the Nervana Engine [19]. The "3D + 2.5D" integration system provides a promising solution for accelerating high-throughput neural networks. However, this 3D memory architecture may suffer from a serious thermal problem when integrated with NPUs [15]. A higher temperature increases the cooling cost [15] and reduces the lifetime of 3D memory [20]. For example, Google's latest TPU3 used 3D memory to help address the memory bandwidth problem, but had to use liquid cooling to address the thermal challenge [21]. A straightforward approach to lowering the average temperature of 3D memory is to reduce the number of memory accesses over a certain period of time so that the dynamic energy of accessing memory is decreased. However, the peak temperature is also affected by the transient memory bandwidth. Fig. 2 shows the different peak temperatures over time caused by two different memory access behaviors. The total number of memory accesses is the same in the two cases. With memory accesses evenly distributed across the different phases of execution, the peak temperature stays almost constant over time. In contrast, with the uneven access behavior, the last 20 percent of the execution takes 80 percent of the memory accesses, resulting in a much higher peak temperature. When the temperature exceeds the maximum operating temperature (85 °C), the error rate of DRAM can go up dramatically.

In this paper, we propose a "3D + 2.5D" integrated neural processor called Parana for thermal-efficient hybrid-NN acceleration. Parana integrates 3D memory and an NPU with the "3D + 2.5D" solution and tackles the thermal problem of 3D memory by reducing the total number of memory accesses and reshaping the memory access behavior. Specifically, Parana balances the memory bandwidth with the help of spatial division mapping by avoiding bursty memory traffic. To reduce the total number of 3D memory accesses, we design a novel NPU architecture and optimize the tiling and scheduling policy to maximize the utilization of the on-chip buffer. The key contributions are summarized as follows:

- For the first time, we consider the thermal problem of 3D memory in an NN accelerator, and jointly optimize the total number of memory accesses and the memory access pattern to reduce both the steady temperature and the peak temperature.
- For the first time, we design a spatial division mapping (SDM) mechanism that exploits the complementarity between different NNs, resulting in a smoother memory access pattern. We propose a divisible processing element (PE) array for concurrently accelerating different NNs in parallel.
- We simultaneously use SDM, sparsity, a unified buffer, and dataflow scheduling techniques to achieve joint optimization. We propose an analysis framework to find the most profitable tiling and scheduling solutions.

2 BACKGROUND AND RELATED WORK

This section presents the background on hybrid-NNs, spatial architectures, 3D stacked memory, and "3D + 2.5D" integrated systems, as well as the related work.

2.1 Background

Hybrid Neural Network (Hybrid-NN). A hybrid-NN typically consists of a ConvNet, an FCNet, and an RNN.

- ConvNet: The main operation of a ConvNet is convolution. In each convolutional layer, a 3D filter is used to extract features from the input data. The output feature map is generated by sliding the 3D filter over the input data. For instance, a layer receives n_c feature maps as input data and outputs m_c feature maps. The m_c feature maps correspond to the filtered results of m_c 3D filters. Each 3D filter is operated on the n_c input feature maps to generate one output feature map.
- FCNet: An FCNet mainly performs matrix multiplications. The input data include a vector and a weighting matrix, while the output data is a vector.
- RNN: Generally, Long Short-Term Memory (LSTM) networks are the best-performing RNNs [24]. In LSTM-based RNNs, the primary operations are full connection operations, as in FCNet, and element-wise vector multiplications that perform the gating function. In this paper, we use RNN to refer to LSTM-based RNNs.

Fig. 1. Examples of hybrid neural networks (Hybrid-NN): AlexNet [1] for image classification and LRCN [3] for image understanding.

Fig. 2. The temperature affected by memory access behaviors. The data is calculated with CACTI-3DD [22] and HotSpot [23].

Spatial Architecture. Many prior works [7], [8], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49] have used spatial architectures to accelerate neural networks in ASIC and FPGA implementations. Fig. 3 reviews the spatial architecture in ShiDianNao [29], Eyeriss [8], and ENVISION [50]. The key to the spatial architecture is the PE array, which exploits high compute parallelism by using direct communication between simple PEs. Input, output, and weight buffers are designed as scratch-pad memories (SPM) to save temporal data during the acceleration of a hybrid-NN on the PE array. The NPU loads inputs and weights from DRAM and stores them in the input and weight buffers. The PE array reads them from these buffers, carries out the neural computation, and writes the outputs back to the output buffers. The final outputs are transmitted from the output buffer to DRAM. To meet the high parallelism of the PE array, each SPM buffer is composed of multiple memory banks.

3D Memory and "3D + 2.5D" Integration. 3D memory is a promising solution to meet the high capacity and bandwidth demands of accelerating high-throughput hybrid-NNs. In the early days of neuromorphic computing, 3D wafer-scale integration (3D WSI) was used in neural network processor designs, such as [51]. That work showed that 3D WSI technology can provide a hardware medium for constructing massively parallel computing processors such as neural network accelerators. In more recent applications, 3D integration is generally used with memristors in neuromorphic systems [52]. That work proposed a 3D synaptic architecture based on Ta/TaOx/TiO2/Ti RRAM for analog neuromorphic computation. High Bandwidth Memory (HBM) is a typical 3D stacked memory standard proposed by JEDEC [53] for high-bandwidth applications. HBM is composed of 4 DRAM dies and one logic die, and provides a maximum of 256 GB/s memory bandwidth and 4 GB memory capacity. There are two major integration solutions for 3D memory and an NPU: vertical 3D integration and "3D + 2.5D" integration (depicted in Fig. 4). Vertical 3D integration places the 3D memory directly on top of the processing unit [13], [14], [54], [55], [56], while the "3D + 2.5D" solution integrates wide-interface 3D memory side-by-side with the processing unit on a silicon interposer. The latter has been proved to be a good solution for high-throughput computation [16], [17], [18], and has recently been explored for neural acceleration [19], [57].

2.2 Other Related Work

In addition to the related work described in the previous subsection, other related work on ConvNet accelerators and FCNet/RNN accelerators is briefly described in this subsection.

ConvNet Accelerators. In general, ConvNets are computationally intensive, so ConvNet acceleration has focused on improving the peak performance and the energy efficiency. Some accelerators [7], [9], [25], [29], [30], [31], [39], [42], [43], [46] exploit the dataflow to activate more PEs and improve the computing parallelism. Others [8], [26], [36], [58] exploit the locality in the dataflow to improve energy efficiency.

FCNet/RNN Accelerators. FCNet/RNN acceleration is usually memory-bound. Consequently, most of the prior work on FCNet/RNN acceleration focuses on quantization, sparsity, and 3D stacked designs. Previous studies [7], [9], [28], [40], [59] have retrained the neural network with low-precision quantization. Some accelerators [41], [60] accelerate sparse FCNets with several PEs in SIMD fashion. Because the sparsity of the weight matrix is random, some PEs are always wasted during acceleration. The accelerators in [13], [14], [19] improve the performance of FCNet/RNN with high-bandwidth 3D memory.

Eyeriss [8], EIE [41], Neurocube [13], and TETRIS [14] are the most relevant accelerators. Eyeriss is designed with an optimized dataflow on the PE array and optimizes the memory accesses with a global buffer. EIE supports sparse fully connected networks with a SIMD architecture. Different from Parana, EIE distributes non-zero weights in the local storage of all PEs. The input neurons are broadcast and each PE calculates part of the sparse FC with its local non-zero weights and the broadcast inputs. Because the number of non-zero weights in each PE differs, EIE suffers from load imbalance. Parana overcomes the load imbalance by storing input/output neurons in each PE and evenly spreading the non-zero weights among all PEs. Neurocube integrates 3D memory on the NPU vertically. TETRIS improves the performance of the 3D stacked NPU by trading off the on-chip buffer capacity against the number of used PEs, and also proposes a bypassing scheduling scheme to reduce memory accesses.

Fig. 3. Overview of the spatial architecture. The common spatial architecture is composed of a PE array and three isolated buffers for temporal input, output, and weight in a neural network.

Fig. 4. "3D + 2.5D" system architecture for neural network acceleration.

Parana is different from the existing accelerator designs. In purpose, Parana jointly optimizes the total number of memory accesses and the memory access pattern, instead of merely avoiding massive memory accesses, so that both the steady temperature and the peak temperature can be optimized. In methodology, our work not only adopts data reuse to reduce the number of memory accesses, but also utilizes the complementarity between different NNs to smooth the memory access pattern.

3 MOTIVATION

We motivate our design by discussing the inefficiencies of prior NPU designs.

The thermal issue remains a critical concern for feasible "3D + 2.5D" integrated systems [15]. Higher temperatures can lead to substantially increased cooling and packaging costs in "3D + 2.5D" systems [61]. Furthermore, the increase in temperature can degrade DRAM reliability and performance. For example, JEDEC stipulates that systems running beyond 85 °C need to double the memory self-refresh rate [62]; the rate continues to double for every ~10 °C beyond 85 °C [63]. Hence, to tackle the thermal problem of 3D memory, we need to consider both the total number of memory accesses and the memory access behavior.

State-of-the-art NPU designs focus on improving the peak performance and the energy efficiency. The design choices in state-of-the-art NPUs are not optimized for the thermal problem of 3D memory. We summarize three inefficiencies of NPUs with respect to the thermal problem in "3D + 2.5D" integration, as follows:

First, the time division mapping (TDM) scheme, which is widely adopted by state-of-the-art NPUs [9], [7], [29], can generate a substantial amount of heat when executing FCNet and RNN. A hybrid-NN commonly employs a cascaded network architecture; as such, the ConvNet, FCNet, and RNN are accelerated one-by-one under TDM. While the memory footprint of the ConvNet is relatively small, FCNet and RNN are memory intensive. As a result, such uneven memory access behavior across ConvNet, FCNet, and RNN can create hotspots in the 3D memory of "3D + 2.5D" integrated NPUs, which is harmful to the peak temperature.

Second, most previous designs can lead to substantial underutilization of on-chip storage resources. Most prior spatial architecture designs employ isolated on-chip buffers to store the input, output, and weight of a neural network, respectively [7], [9], [28], [29], [39], [40]. Yet, the memory footprint varies across inputs, outputs, and weights in the various layers of a hybrid-NN. Fig. 6 summarizes typical input, output, and weight sizes of convolutional layers in hybrid-NNs. Due to the disparate sizes, the isolated buffer design cannot fully utilize all on-chip storage resources to reduce 3D memory accesses. A recent study, Eyeriss [8], employs a global buffer design to improve the utilization of on-chip storage resources. Eyeriss also incorporates an analysis framework that optimizes system energy efficiency. However, Eyeriss does not take into account the thermal issue imposed by "3D + 2.5D" integrated NPUs.

Third, sparse fully connected layers are usually accelerated by SIMD-PE-based NPUs [41], [60]. So far, a 2-D PE array based NPU has to store the dense representation in the on-chip buffer by filling the pruned connections in FCNet and RNN with zeros. The dense representation occupies more on-chip buffer, resulting in extra DRAM accesses, which is harmful to the steady temperature of 3D memory.

4 PARANA

In "3D + 2.5D" integration, the proposed neural processor solves the thermal problem by reducing the total number of memory accesses and controlling the memory access behavior. This section presents the Parana architecture and the optimization scheme behind Parana, which reduces the temperature of the 3D memory. We also propose an analysis framework to fully exploit the advantages of the Parana architecture to reduce temperature. In ConvNet, each pooling operation can be treated as a specific convolution by swapping the MAC in the convolution with a MAX comparison. In RNN, the gate indicators are iteratively calculated by operations of the form $Gate_t = \delta(W_x \cdot X_t + W_h \cdot H_{t-1} + bias)$, where, in iteration $t$, $Gate_t$ denotes the input gate, forget gate, or output gate. $X_t$ and $W_x$ are the corresponding input data and weighting matrix, respectively, and $\delta(\cdot)$ represents a nonlinear operation such as sigmoid or tanh. The matrix operation "$\cdot$" can be regarded as a fully connected operation. Therefore, we introduce the implementation of ConvNet, FCNet, and RNN on Parana using only convolution computation (CONV) and fully connected computation (FC).
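For instance, the gate computation above reduces to two FC operations plus an element-wise nonlinearity. The short Python sketch below only illustrates that reduction (array shapes and names are hypothetical); it is not Parana's datapath:

```python
import numpy as np

def lstm_gate(W_x, W_h, bias, x_t, h_prev, nonlinearity=np.tanh):
    """Gate_t = delta(W_x . X_t + W_h . H_{t-1} + bias): each matrix-vector
    product is exactly the fully connected (FC) operation that Parana maps
    onto its (sparse) FC datapath."""
    return nonlinearity(W_x @ x_t + W_h @ h_prev + bias)

# Hypothetical sizes, for illustration only.
hidden, n_in = 4, 3
gate = lstm_gate(np.random.rand(hidden, n_in), np.random.rand(hidden, hidden),
                 np.zeros(hidden), np.ones(n_in), np.zeros(hidden),
                 nonlinearity=lambda z: 1.0 / (1.0 + np.exp(-z)))  # sigmoid gate
print(gate.shape)  # (4,)
```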

4.1 Architecture Overview

Parana reshapes the memory access behavior with spatial division mapping (SDM), which is supported by a divisible and configurable PE array. To reduce the total memory accesses, a tiling and scheduling mechanism for the unified SPM buffer is used to maximize data reuse. Fig. 5 depicts the overview of the Parana architecture. It consists of four major components: a divisible and configurable PE array, a unified SPM buffer, multiple register cache buffers, and a dataflow controller.

Fig. 5. Parana architecture overview.

The Divisible PE Array. To enable SDM, the PE array can be partitioned into two parts, in which one part is used to accelerate CONV and the other part is used to accelerate sparse FC. In the divisible PE array, the network-on-chip (NoC) supports two fixed dataflows: one for CONV and the other for sparse FC. Each PE includes one PE controller (including configuration words), one MAC unit for CONV and FC computation, one MAX comparator for the pooling operation, and five blocks of local memory (LM1-LM5). To support flexible partitioning solutions in SDM, each PE supports two working modes: CONV mode and sparse FC mode. The logic resources and local memories in the PE are reused for the two modes.

The Unified SPM Buffer. It is implemented with multiple SRAM banks to meet the high parallelism of the PE array and occupies an independent address space in the system address space. The key role of the unified SPM buffer is maximizing data reuse in the acceleration of the hybrid-NN. The buffer provides inputs and weights for the computation in the PE array and stores the outputs from the PE array. Inputs, outputs, and weights are saved in different address spaces of the buffer, which are calculated by the Parana compiler in the compiling phase. For different hybrid-NN layers, the boundaries of the different address spaces are allowed to be shifted to maximize the utilization of the SPM buffer capacity.

Register Cache Buffers. Different from the SPM buffer, the cache buffers do not occupy any space in the system address space. The temporal data in the register cache buffers is a subset of the unified SPM buffer. Each register cache buffer is implemented with register files and organized as a FIFO. Although the unified SPM buffer is implemented as multiple banks that provide multiple data accesses per cycle, it is hard to serve requests for different types of data (i.e., inputs/outputs/weights) simultaneously. Register cache buffers are used to bridge the dataflow between the unified SPM buffer and the PE array. For each access to the unified SPM buffer, one register cache buffer is filled by the SPM buffer access, and the register cache buffer then provides the data to the PE array each cycle. The unified SPM buffer services six register cache buffers in time-division multiplexing.

Parana Controller. Parana is a data-driven neural processor. The controller is used to configure the PE array, manage the unified SPM buffer, and schedule the dataflow for efficient hybrid-NN acceleration with a low temperature in 3D memory. The controller first loads the configuration to configure the datapath in the PE array. Then, the controller loads data from the 3D DRAM, saves the different types of data in independent spaces of the unified SPM buffer, and feeds the PE array with the data. When the data is exhausted, the acceleration is finished.

4.2 Configurable PE for Both CONV and Sparse FC Computation

To provide different partitioning solutions in SDM, each PE in Parana can be configured into either CONV mode or sparse FC mode. Fig. 5 depicts the PE architecture, in which the working mode of each PE is decided by the configuration word in the PE and the computation of sparse FC or CONV is controlled by the PE controller. The local memories and logic resources in each PE are shared by both modes.

In sparse FC mode, each PE accelerates the sparse FC as depicted in Fig. 7. Fig. 7a shows the original representation of sparse FC. Fig. 7b presents the sparse FC acceleration on the PE. In sparse FC acceleration, inputs and outputs are saved in LM1 and LM5 of the PE, respectively. LM2, LM3, and LM4 are implemented as FIFOs and are used to buffer the non-zero weights and the corresponding row/column indices of each non-zero weight. The sparse FC computation on the PE is performed in four steps, under the control of the PE controller: (1) the PE reads a non-zero weight (weight) and its column/row coordinates from LM3, LM2, and LM4; (2) the PE uses the column coordinate to index the input operand (in[col]) in LM1 and uses the row coordinate to index the partial sum (out[row]) saved in LM5; (3) the PE executes the multiply-and-accumulate operation out[row] = out[row] + in[col] × weight; (4) the produced partial sum is written back into LM5.

Fig. 6. Input, output and weight sizes of various convolutional layers in AlexNet [1] and VGGNet-16 [10].

Fig. 7. An example of sparse FC acceleration on PE.

Under CONV mode, the computing process of convolution in Parana follows the systolic dataflow [8]. As depicted in Fig. 8, in each PE, LM1 saves one row of the input feature maps and LM2/LM3/LM4 save one row of the weight kernels. The logic resources in the PE take the inputs to calculate one row of partial convolutional sums, and the PE saves the partial sums in LM5 to accumulate them with future results. The CONV computation in each PE is performed in three steps, under the control of the PE controller: (1) the PE loads one kernel weight (weight) from LM2/LM3/LM4, one input (in) from LM1, and one partial sum from LM5; (2) the PE executes the multiply-and-accumulate operation out = out + in × weight; (3) the produced partial sum is written back into LM5.
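As a purely functional illustration of the per-PE behavior just described (not a cycle-accurate or official model; the list arguments are hypothetical stand-ins for LM1-LM5), a minimal Python sketch is:

```python
def pe_sparse_fc_step(lm1_inputs, lm5_psums, weight, row, col):
    """Steps (1)-(4) of sparse FC mode on one PE: index the input by the
    column coordinate, index the partial sum by the row coordinate,
    multiply-accumulate, and write the partial sum back."""
    lm5_psums[row] = lm5_psums[row] + lm1_inputs[col] * weight
    return lm5_psums

def pe_conv_step(lm1_input_row, lm_weight_row, lm5_psums, out_idx, k):
    """Steps (1)-(3) of CONV mode on one PE: one multiply-accumulate of a
    1-D row convolution, following the systolic dataflow."""
    lm5_psums[out_idx] = lm5_psums[out_idx] + lm1_input_row[out_idx + k] * lm_weight_row[k]
    return lm5_psums

# Tiny usage example with hypothetical sizes.
psums = pe_sparse_fc_step([1.0, 2.0, 3.0], [0.0, 0.0], weight=0.5, row=1, col=2)
print(psums)  # [0.0, 1.5]
```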

4.3 Divisible PE Array for Spatial Division Mapping (SDM)

To enable array partitioning, the Parana NoC is configurable and divisible. To enable CONV and sparse FC acceleration on the PE array simultaneously, it provides independent input/output ports along each edge of the PE array and avoids datapath conflicts between the CONV and sparse FC computations.

Fig. 9 demonstrates the SDM partitioning and the dataflows of a hybrid-NN on the partitioned PE array. The demo shows an equal partitioning into a CONV sub-array and a sparse FC sub-array. The CONV sub-array accelerates the ConvNet and the sparse FC sub-array accelerates the FCNet and RNN. Each sub-array works with three independent register cache buffers (input, output, and weight).

To accelerate the sparse fully connected layers on the sparse FC sub-array depicted in Fig. 9, all PEs in column i duplicate the input vector (in_i) and output vector (out_i). The non-zero weights are uniformly divided into three vectors, equal to the number of rows in the PE array. Each row of the PE array computes with one subvector of non-zero weights. The sparse FC acceleration in each PE is the same as depicted in Fig. 7b. The dataflow of the sparse FC computation includes three stages: input operand duplication, sparse FC computation, and the accumulation of outputs. At the first stage, input operands are transmitted into the PE array from the bottom PEs and saved into LM1 in each PE. Input neurons are reused across PEs vertically. At the second stage, the non-zero weights are uniformly divided into three parts and transmitted into the corresponding rows of the PE array one by one. Each PE multiplies a non-zero weight by the saved input operand and accumulates the product with the partial sum saved in the PE. At the third stage, the partial sums for the same output in the same column are accumulated and sent out of the PE array.

The convolution computation with the CONV sub-array follows the systolic dataflow in [8]. The CONV sub-array loads weights from the PEs at the top edge. The inputs are transmitted into the PE array along the vertical direction. It loads inputs from the PEs at the left and top edges and transmits the inputs along the diagonal direction. The outputs are sent out from left to right and written back to the register cache buffer.

To illustrate why SDM can reduce the memory bandwidth in the acceleration of a hybrid-NN on the NPU, we assume that the NPU adopts an optimal scratch-pad-memory buffer (OPT-Buffer) which fully exploits the locality in neural networks, and that the 3D memory provides sufficient bandwidth for the neural computation. Table 1 lists all parameters used in this paper. For the acceleration of an (M × N) fully connected layer in FCNet or RNN, M and N are respectively the numbers of output neurons and input neurons. Assuming the number of PEs is PE_FCNet/RNN, the bandwidth demand of sparse FC is

$$BW_{FC} = \frac{M \cdot N \cdot SPA \cdot DL}{(M \cdot N \cdot SPA)/(PE_{FCNet/RNN} \cdot f)} = DL \cdot f \cdot PE_{FCNet/RNN} \quad (1)$$

in which f is the frequency of the NPU. For the acceleration of an (R, C, M, N, K) convolutional layer in ConvNet, (R, C) is the size of the output feature map, (M, N) are respectively the numbers of output and input feature maps, and K is the kernel size. With PE_ConvNet PEs, the inference time of CONV is

$$T_{CONV} = \frac{R \cdot C \cdot M \cdot N \cdot K^2}{PE_{ConvNet} \cdot f} \quad (2)$$

in which $R \cdot C \cdot M \cdot N \cdot K^2$ is the number of MAC operations in the convolutional layer. The bandwidth demands for input, output, and weight are respectively

$$BW_{Input} = \frac{R \cdot C \cdot N \cdot DL}{T_{CONV}} = \frac{DL \cdot f}{M \cdot K^2} \cdot PE_{ConvNet} \quad (3)$$

$$BW_{Output} = \frac{R \cdot C \cdot M \cdot DL}{T_{CONV}} = \frac{DL \cdot f}{N \cdot K^2} \cdot PE_{ConvNet} \quad (4)$$

$$BW_{Weight} = \frac{M \cdot N \cdot K^2 \cdot DL}{T_{CONV}} = \frac{DL \cdot f}{R \cdot C} \cdot PE_{ConvNet} \quad (5)$$

Hence, the bandwidth demand of CONV is

$$BW_{CONV} = BW_{Input} + BW_{Output} + BW_{Weight} \quad (6)$$

$$= \left(\frac{1}{M \cdot K^2} + \frac{1}{N \cdot K^2} + \frac{1}{R \cdot C}\right) \cdot DL \cdot f \cdot PE_{ConvNet} \quad (7)$$

$$\leq DL \cdot f \cdot PE_{ConvNet}. \quad (8)$$

Fig. 8. An example of CONV acceleration on PE.

Fig. 9. The dataflows of CONV and sparse FC in SDM.

From Equations (1) and (8), we find that the bandwidth demand of CONV is smaller than that of sparse FC with the same number of PEs. Hence SDM, which accelerates ConvNet and sparse FC in parallel, is beneficial for reducing the maximum bandwidth demand of the NPU and the peak temperature of the 3D memory.

Different partitioning solutions of the PE array and the SPM buffer lead to different thermal and performance effects. Section 5.2 introduces the optimization framework, which achieves the most profitable spatial allocation for the lowest peak temperature without performance degradation.
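To make the comparison between Equations (1) and (8) concrete, the following sketch evaluates both bandwidth-demand expressions for a hypothetical layer pair; the numbers are illustrative assumptions, not measurements from the paper:

```python
def bw_sparse_fc(DL, f, pe_fc):
    """Eq. (1): bandwidth demand of sparse FC, independent of layer shape."""
    return DL * f * pe_fc

def bw_conv(R, C, M, N, K, DL, f, pe_conv):
    """Eqs. (3)-(7): sum of input, output and weight bandwidth demands."""
    return (1 / (M * K * K) + 1 / (N * K * K) + 1 / (R * C)) * DL * f * pe_conv

# Illustrative values: 2-byte words, 1 GHz, 512 PEs per sub-array.
DL, f, pes = 2, 1e9, 512
print(bw_sparse_fc(DL, f, pes) / 1e9, "GB/s for sparse FC")             # 1024.0
print(bw_conv(13, 13, 384, 256, 3, DL, f, pes) / 1e9, "GB/s for CONV")  # ~6.8
```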

5 OPTIMIZATION FRAMEWORK

To accelerate a given hybrid-NN on Parana, we propose a joint spatial resource partitioning and tiling/scheduling mechanism to minimize both the peak temperature and the steady temperature of the 3D memory. The spatial resource partitioning can reshape the 3D memory access behavior and results in an even distribution of memory accesses over time, which lowers the peak temperature. The tiling and scheduling of the hybrid-NN can improve the utilization of the on-chip buffer and reduce the number of 3D memory accesses, which lowers the steady temperature.

We first introduce the analysis model that achieves the most profitable tiling/scheduling of the hybrid-NN and resource partitioning of the PE array. Then we present the compiling workflow and the execution workflow of Parana.

5.1 Buffer-Oriented Tiling and Scheduling to Minimize 3D Memory Accesses

Because the configuration of different layers varies greatly in a hybrid-NN, we analyze the tiling and scheduling schemes for CONV and sparse FC to maximize the utilization of the unified SPM buffer and the data reuse in the hybrid-NN. In previous NPUs, tiling is usually used to extract a sub-CONV or sub-FC that can be mapped onto the PE array to optimize the computation flow [9], [39]. For Parana, we propose a buffer-oriented tiling method. It tiles the sub-CONV and sub-FC of the hybrid-NN to match the size of the SPM buffer. For a CONV layer (R, C, M, N) with kernel size K, we tile the CONV layer with parameters (Tr, Tc, Tm, Tn) to ensure that the unified SPM buffer can hold all inputs/outputs/weights of the tiled convolution. A high utilization of the unified SPM buffer capacity can be achieved by optimizing the tiling parameters.

Scheduling is used to maximize the data reuse between the tiled neural network layers. There are three classic reuse patterns used by state-of-the-art NPUs [7], [26], [28], [29], [43]: input reuse (ir), weight reuse (wr), and partial sum reuse for output (or). Take the input reuse of CONV as an example: the inputs of one tiled sub-CONV are kept on chip until all tiled sub-CONVs that need those inputs have been finished by the NPU. Input reuse is suitable for a layer in which the memory footprint of the inputs is larger than that of the outputs and weights. Because different layers have different memory footprints for inputs, outputs, and weights, one reuse pattern cannot suit all layers in a hybrid-NN. A dynamic scheduling mechanism for different layers is beneficial for maximizing the data reuse in the hybrid-NN.

To achieve the most profitable tiling and scheduling for each CONV and fully connected layer, we set up models for CONV and FC to analyze the memory accesses. In the models, Parana works in batch mode. For ConvNet, Parana accelerates a group of inputs one by one in each inference. For FCNet and RNN, a batch of inputs is processed together. The tiling and scheduling models for one CONV and for a batch of FCs are discussed as follows.

Tiling and Scheduling Models for One CONV. Convolutional (CONV) tiling is used to select the most profitable parameters (Tr, Tc, Tm, Tn), which allow the tiled convolution to be saved in the SPM buffer directly. With these tiling parameters, we calculate the SPM demands for CONV inputs, outputs, and weights as follows:

$$SPM_{CONV-i} = Tr \cdot Tc \cdot Tn \quad (9)$$

$$SPM_{CONV-o} = Tr \cdot Tc \cdot Tm \quad (10)$$

$$SPM_{CONV-w} = Tm \cdot Tn \cdot K \cdot K \quad (11)$$

Under the tiling mode, Parana completes the CONV computation by loading different tiled convolutions Rpt times, where Rpt is

$$Rpt = \left\lceil \frac{R}{Tr} \right\rceil \cdot \left\lceil \frac{C}{Tc} \right\rceil \cdot \left\lceil \frac{M}{Tm} \right\rceil \cdot \left\lceil \frac{N}{Tn} \right\rceil \quad (12)$$

In a simple scheduling without data reuse between tiled convolutions, the total number of 3D memory accesses is the product of the SPM buffer size and Rpt (the SPM space for outputs needs to be counted twice, for loading the partial sums and writing back the updated partial sums).

TABLE 1
Notation and Description

Hybrid-NN Parameters
  B, DL: batch size and data length of the hybrid-NN.
  (R, C, M, N), K, L_ConvNet: parameters of CONV; (R, C) are the rows and columns of the feature maps, M is the number of output maps, N is the number of input maps, K is the kernel size, and L_ConvNet is the number of layers in the ConvNet.
  (I, O), SPA, L_FCNet/RNN: parameters of FC; I is the number of input neurons, O is the number of output neurons, SPA is the degree of sparsity, and L_FCNet/RNN is the number of layers in the FCNet/RNN.

Architecture Configuration
  SPM: storage size of the unified SPM buffer.
  PE: the number of PEs in the PE array.

Partition
  SPM_ConvNet, SPM_FCNet/RNN: the SPM buffer sizes assigned to ConvNet and FCNet/RNN in SDM.
  PE_ConvNet, PE_FCNet/RNN: the numbers of PEs assigned to ConvNet and FCNet/RNN in SDM.

Tiling and Scheduling
  (Tb, Tr, Tc, Tm, Tn): tiling parameters for CONV.
  (Tb, Ti, To): tiling parameters for FC.
  CONV-ir, FC-ir: reuse inputs.
  CONV-wr, FC-wr: reuse weights.
  CONV-or, FC-or: reuse partial sums.

Efficient scheduling reduces the 3D memory accesses by keeping one kind of data (e.g., the tiled inputs, weights, or partial sums of the outputs) in the SPM buffer and reusing it across different tiled CONV computations. Hence, the total numbers of memory accesses under input reuse, partial sum reuse, and weight reuse scheduling are respectively

$$MA_{CONV-ir} = R \cdot C \cdot N + (2 \cdot SPM_{CONV-o} + SPM_{CONV-w}) \cdot Rpt \quad (13)$$

$$MA_{CONV-or} = R \cdot C \cdot M + (SPM_{CONV-i} + SPM_{CONV-w}) \cdot Rpt \quad (14)$$

$$MA_{CONV-wr} = M \cdot N \cdot K \cdot K + (SPM_{CONV-i} + 2 \cdot SPM_{CONV-o}) \cdot Rpt \quad (15)$$

in which the first term of each equation is the original data size of the input, output, or weight, and the remaining terms are calculated by multiplying the SPM buffer sizes by Rpt.

The minimal number of 3D memory accesses for the CONV layer is achieved by solving the following optimization problem:

$$\min\; MA_{CONV} = \min(MA_{CONV-ir},\; MA_{CONV-or},\; MA_{CONV-wr}) \quad (16)$$

s.t.

$$SPM_{CONV-i} + SPM_{CONV-o} + SPM_{CONV-w} \leq SPM_{ConvNet} \quad (17)$$

By enumerating all possible tilings under the three reuse patterns, the Parana compiler finds the most profitable tiling and scheduling configuration for the CONV layer with a limited SPM buffer size. The total amount of 3D memory accesses for the ConvNet can be calculated from the MA_CONV of all CONV layers.
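A brute-force version of this enumeration can be sketched as follows; it implements Equations (9)-(17) directly and is meant only to illustrate the search, not the actual Parana compiler, and the step size is an arbitrary assumption to keep the example small:

```python
import math
from itertools import product

def best_conv_tiling(R, C, M, N, K, spm_convnet, step=8):
    """Enumerate (Tr, Tc, Tm, Tn) and the three reuse patterns, returning
    the tiling/scheduling with the fewest DRAM accesses (Eqs. (9)-(17)).
    'step' coarsens the search for illustration."""
    best = (math.inf, None)
    for Tr, Tc, Tm, Tn in product(range(step, R + 1, step),
                                  range(step, C + 1, step),
                                  range(step, M + 1, step),
                                  range(step, N + 1, step)):
        spm_i = Tr * Tc * Tn                      # Eq. (9)
        spm_o = Tr * Tc * Tm                      # Eq. (10)
        spm_w = Tm * Tn * K * K                   # Eq. (11)
        if spm_i + spm_o + spm_w > spm_convnet:   # constraint, Eq. (17)
            continue
        rpt = (math.ceil(R / Tr) * math.ceil(C / Tc) *
               math.ceil(M / Tm) * math.ceil(N / Tn))            # Eq. (12)
        ma = {  # Eqs. (13)-(15)
            'ir': R * C * N + (2 * spm_o + spm_w) * rpt,
            'or': R * C * M + (spm_i + spm_w) * rpt,
            'wr': M * N * K * K + (spm_i + 2 * spm_o) * rpt,
        }
        reuse, accesses = min(ma.items(), key=lambda kv: kv[1])  # Eq. (16)
        if accesses < best[0]:
            best = (accesses, (Tr, Tc, Tm, Tn, reuse))
    return best

# Hypothetical layer and buffer budget, for illustration only.
print(best_conv_tiling(R=27, C=27, M=64, N=32, K=5, spm_convnet=64 * 1024))
```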

Tiling and Scheduling Models for a Batch of FCs. The tiling and scheduling model for FC is similar to that of CONV. The SPM storage demands of the input, output, and weight for FC are respectively

$$SPM_{FC-i} = Ti \cdot Tb \quad (18)$$

$$SPM_{FC-o} = To \cdot Tb \quad (19)$$

$$SPM_{FC-w} = Ti \cdot To \cdot SPA \cdot 3 \quad (20)$$

in which SPA is the degree of sparsity of the FC and the factor of 3 is the storage cost of the sparse FC representation (each non-zero weight is stored together with its row and column indices). The number of repetitions of the tiled FC computation is

$$Rpt = \left\lceil \frac{B}{Tb} \right\rceil \cdot \left\lceil \frac{I}{Ti} \right\rceil \cdot \left\lceil \frac{O}{To} \right\rceil \quad (21)$$

The total numbers of 3D memory accesses under input reuse, partial sum (output) reuse, and weight reuse are respectively

$$MA_{FC-ir} = I \cdot B + (2 \cdot SPM_{FC-o} + SPM_{FC-w}) \cdot Rpt \quad (22)$$

$$MA_{FC-or} = O \cdot B + (SPM_{FC-i} + SPM_{FC-w}) \cdot Rpt \quad (23)$$

$$MA_{FC-wr} = I \cdot O \cdot SPA \cdot 3 + (SPM_{FC-i} + 2 \cdot SPM_{FC-o}) \cdot Rpt \quad (24)$$

The optimization problem for the tiling and scheduling of FC is

$$\min\; MA_{FC} = \min(MA_{FC-ir},\; MA_{FC-or},\; MA_{FC-wr}) \quad (25)$$

s.t.

$$SPM_{FC-i} + SPM_{FC-o} + SPM_{FC-w} \leq SPM_{FCNet/RNN}. \quad (26)$$

With the MA_FC of all fully connected layers, the compiler can calculate the number of 3D memory accesses for FCNet and RNN with the assigned SPM storage.
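For completeness, the FC counterpart of the CONV model above can be sketched in the same style; this is again only an illustration of Equations (18)-(26), with hypothetical parameter names:

```python
import math

def fc_memory_accesses(I, O, B, SPA, Tb, Ti, To):
    """Memory accesses of one sparse FC layer for a given tiling under
    the three reuse patterns (Eqs. (18)-(24)). Also returns the SPM
    footprint, which must fit within SPM_FCNet/RNN (Eq. (26))."""
    spm_i = Ti * Tb                    # Eq. (18)
    spm_o = To * Tb                    # Eq. (19)
    spm_w = Ti * To * SPA * 3          # Eq. (20)
    rpt = math.ceil(B / Tb) * math.ceil(I / Ti) * math.ceil(O / To)  # Eq. (21)
    accesses = {
        'ir': I * B + (2 * spm_o + spm_w) * rpt,            # Eq. (22)
        'or': O * B + (spm_i + spm_w) * rpt,                # Eq. (23)
        'wr': I * O * SPA * 3 + (spm_i + 2 * spm_o) * rpt,  # Eq. (24)
    }
    return accesses, spm_i + spm_o + spm_w
```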

5.2 Spatial Resource Partitioning to Minimize Peak Bandwidth

By partitioning the PE array and the SPM buffer, multiple NNs can be accelerated on Parana in parallel. The neutralization of the peak bandwidths of different NNs then results in a smoother memory access behavior. To find the best resource partitioning, we formulate a bi-objective optimization problem. Because FCNet and RNN are mainly composed of FC layers, in the optimization problem we calculate the inference time of FCNet and RNN by summing up the inference times of all FC layers.

For a CONV layer (R, C, M, N, K), the memory access count under the most profitable tiling and scheduling is MA_CONV, which has been discussed in Section 5.1. Assuming PE_ConvNet PEs are assigned to the CONV layer, the inference time of the CONV layer is

$$T_{CONV} = \frac{R \cdot C \cdot M \cdot N \cdot K^2}{PE_{ConvNet} \cdot f}. \quad (27)$$

Hence, the bandwidth demand of the CONV layer is

$$BW_{CONV} = MA_{CONV} / T_{CONV}. \quad (28)$$

Assuming PE_FCNet/RNN PEs are assigned to the sparse FC/RNN layer, the inference time is

$$T_{FC} = \frac{M \cdot N \cdot SPA}{PE_{FCNet/RNN} \cdot f}. \quad (29)$$

Hence, the bandwidth demand of the FC layer is

$$BW_{FC} = MA_{FC} / T_{FC}. \quad (30)$$

At time t, the CONV layer CONV(t) and the fully connected layer FC(t) are simultaneously accelerated by the PE array. The transient bandwidth BW_t is the sum of BW_CONV(t) and BW_FC(t).

To reduce the peak bandwidth in hybrid-NN acceleration, we formulate the optimization problem as follows:

$$\min\; T_{Hybrid-NN} = \max(T_{ConvNet},\; T_{FCNet} + T_{RNN}) \quad (31)$$

$$\min\; \max\{BW_t \mid 0 \leq t \leq T_{Hybrid-NN}\} \quad (32)$$

s.t.

$$PE_{ConvNet} + PE_{FCNet/RNN} \leq PE \quad (33)$$

$$SPM_{ConvNet} + SPM_{FCNet/RNN} \leq SPM, \quad (34)$$

where T_Hybrid-NN is the inference time of the whole hybrid-NN and T_ConvNet, T_FCNet, and T_RNN are the inference times of the ConvNet, FCNet, and RNN, respectively. They are given by

$$T_{ConvNet} = \sum_{l=1}^{L_{ConvNet}} T_{CONV}[l] \quad (35)$$

$$T_{FCNet} = \sum_{l=1}^{L_{FCNet}} T_{FC}[l] \quad (36)$$

$$T_{RNN} = \sum_{l=1}^{L_{RNN}} T_{FC}[l]. \quad (37)$$

Because the ConvNet and FCNet/RNN are accelerated in parallel, the inference time of the hybrid-NN is the larger of the two. In the optimization problem, the first goal is the performance optimization in Equation (31), which minimizes the inference time of the whole hybrid-NN. The second goal is to minimize the maximum bandwidth demand in hybrid-NN acceleration.

This optimization problem needs to be resolved jointly with the tiling and scheduling optimization problems for CONV and FC in Section 5.1. By resolving this optimization problem, we obtain the best PE array partitioning and SPM buffer allocation for the CONV layers and FC layers.
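A coarse outer search over the partitioning, reusing the layer models sketched earlier, might look as follows; it is a simplified illustration of the bi-objective formulation in Equations (31)-(34), with the per-layer models abstracted behind hypothetical helper callables and the peak bandwidth approximated by a worst-case sum:

```python
def search_partition(conv_layers, fc_layers, PE, SPM, f,
                     conv_model, fc_model, steps=8, slack=1.0):
    """Hedged sketch of the bi-objective search in Eqs. (31)-(34):
    enumerate (PE, SPM) splits, keep those whose hybrid-NN inference time
    is (near-)minimal, then pick the split with the lowest peak bandwidth.
    conv_model/fc_model are assumed helpers returning (time, bandwidth)
    for one layer given its share of PEs and SPM."""
    candidates = []
    for i in range(1, steps):
        for j in range(1, steps):
            pe_c, spm_c = PE * i // steps, SPM * j // steps
            pe_f, spm_f = PE - pe_c, SPM - spm_c
            conv = [conv_model(l, pe_c, spm_c, f) for l in conv_layers]
            fc = [fc_model(l, pe_f, spm_f, f) for l in fc_layers]
            t_hybrid = max(sum(t for t, _ in conv),
                           sum(t for t, _ in fc))            # Eq. (31)
            # Proxy for Eq. (32): worst-case sum of concurrent demands.
            bw_peak = max(bw for _, bw in conv) + max(bw for _, bw in fc)
            candidates.append((t_hybrid, bw_peak, (pe_c, pe_f, spm_c, spm_f)))
    t_min = min(t for t, _, _ in candidates)
    feasible = [c for c in candidates if c[0] <= t_min * slack]
    return min(feasible, key=lambda c: c[1])
```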

5.3 Workflow of Parana

Fig. 10 summarizes the compiling and execution workflows of Parana. In the compiling workflow, the compiler optimizes the resource partitioning (e.g., of the PE array and SPM buffer) and the scheduling for ConvNet, FCNet, and RNN. It loads the parameters of the hybrid-NN (converted from a Caffe prototxt) and the Parana configuration as inputs to construct the scheduling framework for the most profitable resource partitioning and execution scheduling. The core of the scheduling framework is three models: the resource partitioning model for SDM, the tiling and scheduling model for ConvNet, and the tiling and scheduling model for FCNet. The SDM partitioning model is used to guide the partitioning of the computing and memory resources in Parana. With the assigned memory resources, the tiling and scheduling models for ConvNet, FCNet, and RNN are exploited to maximize the usage of the assigned memory and the data reuse in each neural network. With the scheduling framework, the compiler produces the most profitable configuration for hybrid-NN acceleration on Parana.

In the execution workflow, Parana first loads the most profitable configuration. Then, the Parana controller uses the partitioning configuration to configure the partitioning of the PE array and the function of each PE. The controller also manages the memory assignment of the SPM buffer with the SPM buffer partitioning configuration. The scheduling configuration is used to control the dataflow between the NPU and the 3D DRAM.

6 EVALUATION

In this section, we present the experimental setup and the evaluation results on thermal, performance, and energy improvements.

6.1 Experimental Setup

Benchmarks. We select three representative neural networks, AlexNet [1], VGGNet-16 [10], and LRCN [3], as the hybrid-NN benchmarks in our evaluation. The three benchmarks represent various mixes of neural network layers: 1) VGGNet-16 is dominated by CONV computation, 2) LRCN is dominated by FC computation, and 3) AlexNet has similar amounts of CONV and FC computation. Table 2 lists the network structures of our benchmarks, along with the sparsities of FCNet and RNN.

Simulation Framework. We evaluate the temperature of the 3D DRAM in the "3D + 2.5D" integration system on a simulator that integrates a cycle-accurate NPU + HBM simulator, a power model from CACTI-3DD [22], and a thermal model from HotSpot 6.0 [23]. The configuration of the 3D DRAM follows the High Bandwidth Memory stacked memory standard [64]. The DRAM controller adopts the open-page policy because of the high row buffer hit rate. CACTI-3DD is used to generate the detailed energy costs of HBM accesses. The NPU + HBM simulator takes these energy costs and other parameters of the accelerator, HBM, and hybrid-NN to produce the power traces for all modules in the 3D DRAM. HotSpot 6.0 analyzes the temperature with the power traces as input.

Table 3 lists the configuration of the simulation. In this configuration, Parana works at 1 GHz and the size of the PE array is 32 × 32. This is the maximum size that can be supported by the peak bandwidth of the 3D memory [64] when Parana is used to accelerate the memory-intensive FCNet and RNN. Each PE uses 16-bit fixed-point arithmetic units. The quantized representation of inputs and outputs includes 1 sign bit, 7 integer bits, and 8 fractional bits. The representation of weights includes 1 sign bit and 15 fractional bits.
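As an illustration of the number formats just described, the following sketch quantizes values to the stated fixed-point formats; it is not part of the Parana toolchain, only a hedged example of the arithmetic assumed above:

```python
def to_fixed(value, frac_bits, total_bits=16):
    """Quantize a real value to a signed 16-bit fixed-point code with
    'frac_bits' fractional bits, saturating on overflow."""
    scale = 1 << frac_bits
    code = int(round(value * scale))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, code))

# Inputs/outputs: 1 sign bit, 7 integer bits, 8 fractional bits (Q7.8).
act_code = to_fixed(3.1416, frac_bits=8)
# Weights: 1 sign bit, 15 fractional bits (Q0.15).
wgt_code = to_fixed(-0.25, frac_bits=15)
print(act_code, wgt_code)  # 804 -8192
```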

Fig. 10. Compiler workflow and execution workflow.

TABLE 2
Description of the Benchmarks

Hybrid-NN        ConvNet+FCNet+RNN   Spa-FCNet   Spa-RNN
AlexNet [1]      5+3+-               10.12%      -
VGGNet-16 [10]   16+3+-              4.63%       -
LRCN [3]         5+3+3               10.12%      11.2%

The unified SPM buffer includes 64 banks, and each bank provides one 16-bit read/write port. Unless otherwise specified, the default SPM buffer size is 256 KB.

State-of-the-Art NPU Baselines. We compare our design with four state-of-the-art NPUs, including Neurocube [13], EIE [41], Eyeriss [8], and TETRIS [14], which are representative NPU designs. Table 4 summarizes the features of each NPU. In particular, TETRIS [14] supports three scheduling schemes, namely output bypassing (IW), input bypassing (OW), and weight bypassing (IO); we select the one that generates the lowest number of memory accesses.

To make this a fair comparison, we analyze the key schemes adopted by the state-of-the-art NPUs and implement each of the schemes in our simulator. We perform our evaluation with the various schemes under the same chip configuration listed in Table 3.

6.2 Chip Area and Power

We implement our Parana architecture with the configuration in Table 3 and synthesize our design in Taiwan Semiconductor Manufacturing Company (TSMC) 28 nm high-performance compact mobile computing (HPC) technology. For a high working frequency, the function of each PE is implemented with at most six pipeline stages, where local memory reading takes two stages, the multiplier takes two stages, the adder takes one stage, and local memory writing takes one stage. From the synthesis results, we observe that the delay of the critical path in the PE array is 0.96 ns. The delay of the critical path in the 256 KB SPM with 64 banks is 0.85 ns. Therefore, the highest frequency of Parana is 1 GHz. Table 5 breaks down the power and area of Parana by key components under a 0.9 V working voltage. The results show that most of the power and area are dominated by the PE array, which takes over 85.5 percent of the power and 89.8 percent of the area. Our NPU has a total power of 4.9 W. This power is much lower than that of the 3D DRAM, which is 13.6 W as calculated using CACTI-3DD [22].

6.3 Analysis on Temperature, Energy, and Performance

We evaluate Parana against the simulators of Neurocube, Eyeriss, EIE, and TETRIS on all hybrid-NNs listed in Table 2. The required bandwidths (GB/s) for AlexNet, VGGNet-16, and LRCN are 26.92 to 119.07, 24.01 to 128.00, and 63.16 to 126.53, respectively. Figs. 11a, 11b, 11c, and 11d respectively show the results on peak temperature, steady temperature, energy cost, and inference time.

As shown in Figs. 11a and 11b, Parana achieves both a lower peak temperature and a lower steady temperature than Neurocube, Eyeriss, EIE, and TETRIS. This is the combined effect of sparsity, SDM, and the unified SPM buffer with the most profitable tiling and scheduling configuration. Sparsity and the unified SPM buffer reduce the number of memory accesses, so the steady temperature is lowered. SDM optimizes the memory access behavior, so the peak temperature is further reduced. If Parana employed TDM instead of SDM, the peak temperature would rise dramatically, because TDM cannot evenly distribute memory accesses over time.

TABLE 3
Configurations of the Simulation

Parana
  Frequency: 1 GHz
  PE Array Size: 32 × 32
  MAC Precision: 16-bit fixed point
  # of Banks in SPM Buffer: 64
  SPM Buffer Size: 256 KB
  Each Cache Buffer Size: 1 KB

HBM [64]
  Process Technology: 29 nm DRAM process
  Capacity: 8 Gb
  Chip Size: 5.10 mm × 6.91 mm
  # of Stacks: 4 memory dies + 1 logic die
  TSV IO: 1024
  Peak Bandwidth: 128 GB/s
  Supply Voltages: VDD = 1.2 V, VPP = 2.5 V

Energy [22] and Thermal Parameters
  Activation Energy: 3.65 nJ
  Read/Write Energy: 10.11 nJ
  Precharge Energy: 3.44 nJ
  TSV Energy: 0.57 nJ
  Logic Die Energy: 18.52 nJ
  Ambient Temperature: 318.15 K

TABLE 4
Features of Each NPU in Comparison

NPU             SDM   Sparse   U-SPM   Scheduling
Neurocube [13]  No    No       No      No
EIE [41]        No    Yes      No      No
Eyeriss [8]     No    No       Yes     R-stationary
TETRIS [14]     No    No       Yes     IW, OW, or IO
Parana + TDM    No    Yes      Yes     OPT
Parana          Yes   Yes      Yes     OPT

Fig. 11. Peak/steady temperature, energy, and inference time of Neurocube [13], Eyeriss [8], EIE [41], TETRIS [14], Parana+TDM, and Parana, evaluated on the benchmarks AlexNet [1], VGGNet-16 [10], and LRCN [3].


Compared to TDM, SDM leads to a slightly higher steady temperature, because the granularity of resource partitioning in SDM cannot be infinitely fine, which introduces a small loss.
For the peak temperature in Fig. 11a, Neurocube and EIE both reach the highest peak temperature, and Eyeriss with the unified SPM buffer reduces the temperature by only 7.66 °C. Parana achieves the lowest peak temperature, which benefits largely from SDM, because the peak temperature is mainly determined by the maximum bandwidth. SDM merges the ConvNet and FCNet/RNN workloads and reduces the maximum bandwidth by 54.16, 72.5, and 26.4 percent for AlexNet, VGGNet-16, and LRCN, respectively. When accelerating FCNet and RNN, both Neurocube and EIE operate at the maximum bandwidth, and this maximum bandwidth leads to the peak temperature. Although sparsity support in EIE effectively reduces the total energy consumption, as shown in Fig. 11c, EIE still uses the full 3D DRAM bandwidth to load data for FCNet/RNN. Since the unified SPM buffer exploits data reuse in hybrid-NNs, Eyeriss with the unified SPM buffer reduces the maximum bandwidth and thus the peak temperature. However, the row-stationary scheduling in Eyeriss cannot exploit the full capacity of the SPM buffer. Parana improves the scheduling with the most profitable tiling and scheduling for each layer and further reduces the peak temperature. Table 6 takes VGGNet-16 as an example to show the most profitable tiling and scheduling configurations in Parana.
For the steady temperature in Fig. 11b, Neurocube shows the highest temperature on all evaluated benchmarks. The reduction of the steady temperature in Parana, EIE, and Eyeriss comes from their energy reduction. Unlike the peak temperature, the results in Figs. 11b and 11c show that the trend of the steady temperature follows the trend of the energy consumption. Compared to Eyeriss, EIE saves more energy by pruning useless connections in the neural networks. Parana achieves a larger energy reduction than EIE, because the unified SPM buffer with the most profitable tiling and scheduling reduces memory accesses further.

Due to our tiling and scheduling mechanisms, Parana also substantially reduces system energy consumption compared with TETRIS.
Fig. 11d shows the average inference time per image for the different NPU designs. Because the 3D DRAM provides enough bandwidth for PE computation, Neurocube and Eyeriss complete each benchmark in similar time. EIE and Parana achieve a performance improvement from sparsity. EIE achieves higher performance than Parana, because input neurons are also sparse in EIE. In the “3D + 2.5D” integration system, most of the energy is consumed by the intensive memory accesses. For an inference of a hybrid-NN, the energy cost (i.e., the memory accesses) depends mainly on the optimization schemes, whereas the inference time is largely determined by the number of PEs.
To evaluate the effect of each individual optimization scheme, i.e., SDM, sparsity support, the unified SPM, and data-reuse scheduling, we remove one scheme at a time from Parana and compare the resulting temperature reduction with that of the full design. Fig. 12 depicts the results. The baseline is the steady-state temperature reduction achieved by Parana with all optimizations, which is set to 100 percent in the comparison. The benefit loss measures each optimization: a higher benefit loss means that the removed optimization scheme has a larger effect. The results show that the scheduling mechanism and the sparsity support contribute more to the temperature reduction than SDM and the unified SPM buffer. Removing the scheduling mechanism causes a benefit loss of over 50 percent. Removing SDM does not affect the steady-state temperature, because SDM is designed to reduce the peak temperature.
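The benefit-loss metric in Fig. 12 can be read as the fraction of Parana's steady-temperature reduction that disappears when one optimization is removed. The text does not spell out the formula, so the Python sketch below is our assumed definition, and all numbers in it are hypothetical.

    # Hypothetical sketch of the benefit-loss metric used in Fig. 12.
    # delta_full: steady-temperature reduction of full Parana vs. the baseline.
    # delta_without: reduction when one optimization scheme is removed.
    # Both the formula and the numbers are illustrative assumptions.
    def benefit_loss(delta_full: float, delta_without: float) -> float:
        """Fraction of the full temperature reduction lost by removing one scheme."""
        return (delta_full - delta_without) / delta_full

    delta_full = 30.0          # hypothetical reduction of full Parana, in deg C
    ablations = {              # hypothetical reductions with one scheme removed
        "no scheduling":  13.0,
        "no sparsity":    16.0,
        "no unified SPM": 24.0,
        "no SDM":         30.0,  # SDM targets the peak, not the steady, temperature
    }
    for name, delta in ablations.items():
        print(f"{name:>15}: benefit loss = {benefit_loss(delta_full, delta):.0%}")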

6.4 Peak Temperature under SDM and TDM

Fig. 13 depicts the peak temperature for each layer of the three benchmarks under temporal-division-mapping (TDM) and spatial-division-mapping (SDM).

TABLE 5
Power and Area Breakdown of Parana

Components          Power (mW)   Area (μm²)
Controller          10.332       245554.9538
SPM buffer          662.592      1544372.4288
Reg cache buffers   9.396        10642.4638
PE array            4037.12      15866265.0882
Total               4719.44      17666834.9346

TABLE 6
The Most Profitable Tiling and Scheduling for VGG16

CONV layer    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
Tr          224  224  112  112   56   56   56   56   28   28   28   28    1    1    1    1
Tc            8    8    8    8   14   14   14   14   28   28   28   28    1    1    1    1
Tm           64   64  128  128  128  128  128  128  128  128  128  128   76   76   76   76
Tn            1    1    1    1    1    1    1    1    1    1    1    1  171  171  171  171
OPT-Reuse    OR   IR   OR   OR   OR   OR   OR   OR   OR   OR   OR   OR   WR   WR   WR   WR

FC layer     17   18   19
Tb           32   32   64
Ti            1    1    1
To          410  410  200
OPT-Reuse    OR   OR   OR

IR: input reuse, WR: weight reuse, OR: output reuse.

Fig. 12. Benefit losses caused by removing individual optimization schemes from Parana (the higher, the better).


The three TDM curves show that ConvNet runs at a low temperature while FCNet/RNN run at a high temperature. The ConvNets are computation-intensive (e.g., layers 1–5 of AlexNet and LRCN, and layers 1–16 of VGGNet-16 in Fig. 13), so ConvNet can fully exploit all PEs in Parana with only a small part of the 3D DRAM bandwidth, and the low bandwidth leads to a low working temperature. FCNet/RNN are memory-intensive: they greatly increase the bandwidth demand of Parana and result in a much higher temperature than ConvNet. In contrast, SDM partitions the PE array into two parts, which are used to accelerate ConvNet and FCNet/RNN respectively. Hence, SDM merges the bandwidth demands of ConvNet and FCNet/RNN into one balanced demand. Since SDM may execute more than one layer in parallel, each layer's peak temperature in Fig. 13 represents the maximum temperature during that layer's execution period and may include the contribution of other layers. The SDM curves demonstrate that SDM produces a lower peak temperature than TDM on all evaluated benchmarks.
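To make the SDM-versus-TDM contrast concrete, the Python sketch below uses a deliberately simplified bandwidth model of our own (the per-PE bandwidth intensities and the partition size are assumptions, not Parana's actual numbers): under TDM the peak demand is set by the memory-bound FC/RNN phase, while under SDM the two merged demands stay well below it.

    # Simplified, illustrative model of why SDM lowers the peak DRAM bandwidth
    # demand relative to TDM. All per-PE intensities are assumed values.
    DRAM_PEAK_GBPS = 128.0   # HBM peak bandwidth (Table 3)
    N_PES = 1024             # 32 x 32 PE array

    BW_PER_PE_CONV = 0.03    # GB/s per PE on compute-bound ConvNet layers (assumed)
    BW_PER_PE_FC = 0.60      # GB/s per PE on memory-bound FC/RNN layers (assumed)

    def demand(pes_conv: int, pes_fc: int) -> float:
        """Aggregate bandwidth demand of both partitions, capped at the HBM peak."""
        raw = pes_conv * BW_PER_PE_CONV + pes_fc * BW_PER_PE_FC
        return min(raw, DRAM_PEAK_GBPS)

    # TDM: the whole array runs one network type at a time, so the peak is the
    # memory-bound FC/RNN phase.
    tdm_peak = max(demand(N_PES, 0), demand(0, N_PES))

    # SDM: a small partition streams FC/RNN while the rest runs ConvNet, merging
    # the two demands into one balanced, lower peak (96 PEs is an assumed split).
    pes_fc = 96
    sdm_peak = demand(N_PES - pes_fc, pes_fc)

    print(f"TDM peak demand: {tdm_peak:.0f} GB/s")
    print(f"SDM peak demand: {sdm_peak:.0f} GB/s "
          f"({100 * (1 - sdm_peak / tdm_peak):.0f}% lower)")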

6.5 3D Memory Access Analysis on the Most Profitable Data Reuse Scheduling Scheme

Fig. 14 compares the data-reuse scheduling schemes on the three benchmarks and on three representative layers of VGGNet-16 (convolutional layers 1, 4, and 16): layer 1 represents an output-dominated layer, layer 4 a layer dominated by both inputs and outputs, and layer 16 a weight-dominated layer. In the comparison, the tiling parameters use the most profitable configuration. The results show that scheduling with no data reuse produces the most memory accesses, while the most profitable scheduling in Parana always achieves the fewest. For some specific layers, input reuse (IR), weight reuse (WR), or partial-sum (output) reuse (OR) alone yields the fewest memory accesses and is therefore the most profitable scheduling. However, for a given hybrid-NN, no single reuse pattern achieves the fewest memory accesses across all layers. For example, OR suits VGG16-CONV-1 and VGG16-CONV-4, but weight reuse is the better choice for VGG16-CONV-16. Hence, the results demonstrate that choosing the most profitable scheduling for each layer is better than applying a single reuse pattern to all layers. In the “3D + 2.5D” integration system, a smaller number of memory accesses is still beneficial for reducing the temperature of the 3D DRAM.
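Selecting the most profitable scheme per layer amounts to estimating the off-chip traffic of each reuse pattern and keeping the cheapest. The Python sketch below uses a common tiled-loop traffic approximation (our assumption, not Parana's exact cost model) with a VGG16-CONV-1-like layer and tiling factors in the spirit of Table 6.

    import math

    # Hedged sketch: approximate DRAM traffic (in elements) of one convolutional
    # layer under input reuse (IR), weight reuse (WR), and output reuse (OR),
    # then keep the cheapest. The cost model is a standard tiled-loop
    # approximation and is our assumption, not the paper's exact formula.
    def conv_traffic(R, C, M, N, K, Tr, Tc, Tm, Tn, reuse):
        in_size = N * (R + K - 1) * (C + K - 1)   # input feature maps
        w_size = M * N * K * K                    # weights
        out_size = M * R * C                      # output feature maps
        tr, tc = math.ceil(R / Tr), math.ceil(C / Tc)
        tm, tn = math.ceil(M / Tm), math.ceil(N / Tn)
        if reuse == "IR":   # inputs loaded once; weights and partial sums re-fetched
            return in_size + w_size * tr * tc + 2 * out_size * tn
        if reuse == "WR":   # weights loaded once; inputs and partial sums re-fetched
            return w_size + in_size * tm + 2 * out_size * tn
        if reuse == "OR":   # partial sums stay on chip until each output is complete
            return out_size + in_size * tm + w_size * tr * tc
        raise ValueError(reuse)

    # A VGG16-CONV-1-like layer with Table 6-style tiling (K = 3 is assumed).
    layer = dict(R=224, C=224, M=64, N=3, K=3, Tr=224, Tc=8, Tm=64, Tn=1)
    best = min(("IR", "WR", "OR"), key=lambda s: conv_traffic(**layer, reuse=s))
    print("most profitable reuse for this layer:", best)   # OR, matching Table 6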

Fig. 13. Temperature of each layer in AlexNet, VGGNet, and LRCN on Parana with different mapping strategies (SDM and TDM).

Fig. 14. The 3D DRAM accesses normalized to Parana (NO-R: no reuse, IR: input reuse, WR: weight reuse, OR: output reuse, Parana: the most profitable reuse scheduling).

Fig. 15. Sensitivity analysis on the number of PEs and the size of the unified buffer, evaluated on LRCN.


6.6 Sensitivity Study

Fig. 15 depicts the sensitivity of the steady temperature to the number of PEs and the size of the SPM buffer. The results for different numbers of PEs show that the steady temperature rises quickly with a larger PE array: more PEs issue more concurrent 3D memory accesses, so the same energy is dissipated in a shorter period of time and the temperature is higher. The results for different SPM buffer sizes show that a larger unified buffer helps to reduce the steady temperature, because a larger unified buffer increases data reusability and thus reduces the total number of 3D memory accesses; fewer memory accesses consume less energy and finally lower the temperature. As the SPM buffer size increases, the marginal benefit diminishes. Across the sensitivity analysis, Parana always achieves the lowest temperature among the various NPU designs.
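The trend described here follows from a simple lumped steady-state thermal view: the steady temperature tracks average power, so the same access energy spread over less time heats the stack more. The Python sketch below illustrates this with an assumed effective thermal resistance; it is our simplification, not the thermal model used in the paper's evaluation.

    # Minimal lumped steady-state sketch: steady temperature tracks average power.
    # R_TH is an assumed effective thermal resistance of the 3D DRAM stack.
    T_AMBIENT_K = 318.15      # ambient temperature from Table 3
    R_TH_K_PER_W = 2.0        # Kelvin per Watt, assumed

    def steady_temp_k(access_energy_j: float, runtime_s: float) -> float:
        """Steady temperature rises with average power = energy / time."""
        return T_AMBIENT_K + R_TH_K_PER_W * (access_energy_j / runtime_s)

    # Doubling the PE array roughly halves the runtime for the same DRAM energy,
    # so the average power, and with it the steady temperature, goes up.
    print(steady_temp_k(access_energy_j=200.0, runtime_s=20.0))  # smaller array
    print(steady_temp_k(access_energy_j=200.0, runtime_s=10.0))  # larger array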

7 CONCLUSION

This paper proposed a “3D + 2.5D” integration neural processor with 3D-stacked memory for hybrid-NN acceleration. Three key optimization methods are used to tackle the thermal problem in 3D DRAM: spatial division mapping, sparsity support, and buffer-oriented tiling/scheduling. Parana exploits sparsity, the unified SPM, and tiling/scheduling to minimize 3D DRAM accesses and reduce the steady temperature in the 3D DRAM. Spatial division mapping achieves a balanced bandwidth and reduces the peak temperature by accelerating convolutional and sparse fully-connected layers in parallel. The experimental results show that the proposed neural processor reduces the peak temperature by up to 54.72 °C and the steady temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory.

ACKNOWLEDGMENTS

This work was supported in part by the NSFC under Grant 61774094, the China Major S&T Project (No. 2018ZX01031101-002), and the NSF under Grants 1718158, 1725447, and 1719160.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” arXiv preprint arXiv:1701.08734, 2017.
[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2625–2634.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Multimedia, 2014, pp. 675–678.
[5] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135–1143.
[6] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Representations, 2016.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” ACM SIGPLAN Notices, vol. 49, pp. 269–284, 2014.
[8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in Proc. IEEE Int. Solid-State Circuits Conf., 2016, pp. 262–263.
[9] Z. Chen, F. Zhenman, Z. Peipei, P. Peichen, and C. Jason, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” in Proc. Int. Conf. Comput. Aided Des., 2016, Art. no. 12.
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[11] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
[12] G. H. Loh, “3D-stacked memory architectures for multi-core processors,” in Proc. 35th Annu. Int. Symp. Comput. Archit., 2008, pp. 453–464.
[13] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., Jun. 2016, pp. 380–392.
[14] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proc. 22nd ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2017, pp. 751–764.
[15] Y. Xie and J. Zhao, “Die-stacking architecture,” in Synthesis Lectures on Computer Architecture, vol. 2. San Rafael, CA, USA: Morgan & Claypool Publishers, Jun. 2015, pp. 1–127.
[16] AMD, “AMD's next generation GPU and memory architecture,” in Proc. Hot Chips Symp., 2015, pp. 1–26.
[17] NVIDIA, “Pascal GPU architecture,” 2018. [Online]. Available: https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/
[18] Intel, “Knights Landing (KNL): 2nd generation Intel Xeon Phi processor,” in Proc. Hot Chips Symp., 2015, pp. 1–24.
[19] Intel, “Intel Nervana Neural Network Processor,” 2018. [Online]. Available: https://ai.intel.com/intel-nervana-neural-network-processor
[20] A. Das, A. Kumar, and B. Veeravalli, “Temperature aware energy-reliability trade-offs for mapping of throughput-constrained applications on multimedia MPSoCs,” in Proc. Conf. Des., Autom. Test Eur., 2014, Art. no. 102.
[21] I. Thomson, “Meet TPU 3.0: Google teases world with latest math coprocessor for AI,” 2018. [Online]. Available: https://www.theregister.co.uk/2018/05/09/google_tpu_3/
[22] K. Chen, S. Li, N. Muralimanohar, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory,” in Proc. Des. Autom. Test Eur. Conf. Exhib., Mar. 2012, pp. 33–38.
[23] R. Zhang, M. R. Stan, and K. Skadron, “HotSpot 6.0: Validation, acceleration and extension,” Univ. of Virginia, Charlottesville, VA, USA, Tech. Rep. CS-2015-04, 2015.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[25] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” in Proc. ACM SIGARCH Comput. Archit. News, 2010, pp. 247–257.
[26] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for convolutional neural networks,” in Proc. IEEE 31st Int. Conf. Comput. Des., 2013, pp. 13–19.
[27] F. Conti and L. Benini, “A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters,” in Proc. Des. Autom. Test Eur. Conf. Exhib., 2015, pp. 683–688.
[28] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “DaDianNao: A machine-learning supercomputer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, 2014, pp. 609–622.
[29] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” ACM SIGARCH Comput. Archit. News, vol. 43, pp. 92–104, 2015.
[30] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., Jun. 2016, pp. 393–405.
[31] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization,” in Proc. 53rd ACM/EDAC/IEEE Des. Autom. Conf., Jun. 2016, pp. 1–6.


[32] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G. Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 267–278.
[33] S. Park, S. Choi, J. Lee, M. Kim, J. Park, and H. J. Yoo, “14.1 A 126.1mW real-time natural UI/UX processor with embedded deep-learning core for low-power smart glasses,” in Proc. IEEE Int. Solid-State Circuits Conf., Jan. 2016, pp. 254–255.
[34] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, “A digital neurosynaptic core using embedded crossbar memory with 45 pJ per spike in 45 nm,” in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2011, pp. 1–4.
[35] J. S. Seo, B. Brezzo, Y. Liu, B. D. Parker, S. K. Esser, R. K. Montoye, B. Rajendran, J. A. Tierno, L. Chang, D. S. Modha, and D. J. Friedman, “A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons,” in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2011, pp. 1–4.
[36] J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, “14.6 A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems,” in Proc. IEEE Int. Solid-State Circuits Conf., Jan. 2016, pp. 264–265.
[37] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 1–13.
[38] A. Rahman, J. Lee, and K. Choi, “Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array,” in Proc. Des. Autom. Test Eur. Conf. Exhib., 2016, pp. 1393–1398.
[39] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
[40] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
[41] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 243–254.
[42] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for convolutional networks,” in Proc. Int. Conf. Field Program. Logic Appl., 2009, pp. 32–37.
[43] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst. Archit. Processors, Jul. 2009, pp. 53–60.
[44] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s mobile coprocessor for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 696–701.
[45] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in Proc. 26th Int. Conf. Field Program. Logic Appl., Aug. 2016, pp. 1–9.
[46] Y. Shen, M. Ferdman, and P. Milder, “Overcoming resource underutilization in spatial CNN accelerators,” in Proc. 26th Int. Conf. Field Program. Logic Appl., 2016, pp. 1–4.
[47] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “DLAU: A scalable deep learning accelerator unit on FPGA,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 36, no. 3, pp. 513–517, Mar. 2017.
[48] J. Ouyang, “SDA: Software-defined accelerator for large-scale deep learning system,” in Proc. Int. Symp. VLSI Des. Autom. Test, Apr. 2016, pp. 1–1.
[49] L. He, Y. Luo, and Y. Cao, “Accelerator of stacked convolutional independent subspace analysis for deep learning-based action recognition,” in Proc. IEEE 22nd Annu. Int. Symp. Field-Programmable Custom Comput. Mach., May 2014, pp. 104–104.
[50] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI,” in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 246–248.
[51] M. L. Campbell, S. T. Toborg, and S. L. Taylor, “3D wafer stack neurocomputing,” in Proc. IEEE Int. Conf. Wafer Scale Integr., 1993, pp. 67–74.
[52] I. T. Wang, Y. C. Lin, Y. F. Wang, C. W. Hsu, and T. H. Hou, “3D synaptic architecture with ultralow sub-10 fJ energy per spike for neuromorphic computation,” in IEDM Tech. Dig., 2014, pp. 28.5.1–28.5.4.
[53] JEDEC, “JEDEC 3D ICs Interface,” 2016. [Online]. Available: https://www.jedec.org/category/technology-focus-area/3d-ics
[54] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, and K. Flautner, “PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor,” in Proc. 12th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2006, pp. 117–128.
[55] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, “A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy,” in Proc. 43rd Annu. Des. Autom. Conf., 2006, pp. 991–996.
[56] G. H. Loh, “3D-stacked memory architectures for multi-core processors,” in Proc. 35th Annu. Int. Symp. Comput. Archit., 2008, pp. 453–464.
[57] Wave Computing, “Wave Computing's Native Dataflow Technology,” 2018. [Online]. Available: https://wavecomp.ai/technology
[58] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture, 2016, pp. 1–12.
[59] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. B. K. Vrudhula, J. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 16–25.
[60] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in Proc. IEEE/ACM Int. Symp. Microarchitecture, 2016, pp. 1–12.
[61] J. Zhao, X. Dong, and Y. Xie, “Cost-aware three-dimensional (3D) many-core multiprocessor design,” in Proc. 47th Des. Autom. Conf., 2010, pp. 126–131.
[62] JEDEC, DDR3 SDRAM Specification, JESD79-3E, Jul. 2010. [Online]. Available: https://www.jedec.org/standards-documents/docs/jesd-79-3d
[63] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-aware intelligent DRAM refresh,” in Proc. 39th Annu. Int. Symp. Comput. Archit., 2012, pp. 1–12.
[64] D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y. Kim, Y. J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shin, J. H. Cho, K. H. Kwon, M. J. Kim, J. Lee, K. W. Park, B. Chung, and S. Hong, “25.2 A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29 nm process and TSV,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 432–433.

Shouyi Yin received the BS, MS, and PhD degrees in electronic engineering from Tsinghua University, China, in 2000, 2002, and 2005, respectively. He has worked with Imperial College London as a research associate. Currently he is with the Institute of Microelectronics, Tsinghua University, as an associate professor. His research interests include SoC design, reconfigurable computing, and mobile computing. He has published more than 40 refereed papers, and served as a TPC member or reviewer for key international conferences and leading journals.

Shibin Tang received the BS degree in computer science and technology from Shandong University, in 2008, and the PhD degree in computer architecture from the Institute of Computing Technology, Chinese Academy of Sciences, in 2014. He is currently a postdoctoral researcher with the Institute of Microelectronics, Tsinghua University. His research interests include computer architecture, cache coherence, on-chip memory systems, deep learning, and neural network acceleration.


Xinhan Lin received the BS degree in computer science and technology from Sichuan University, Chengdu, China, in 2010, and the MS degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 2016. Currently he is working toward the PhD degree in the Institute of Microelectronics, Tsinghua University, Beijing, China. His research interests include reconfigurable computing and compiler optimization for reconfigurable computing.


Peng Ouyang received the BS degree in electronic and information technology from Central South University, Changsha, Hunan, China, in 2008, and the PhD degree in electronic science and technology from Tsinghua University, in 2014. From 2015 to 2017, he worked as a postdoctoral researcher with the School of Information, Tsinghua University. He is now an assistant professor with Beihang University. His research interests include computer vision, deep learning, and reconfigurable computing.

Fengbin Tu received the BS degree in electronic science and technology from Beijing University of Posts and Telecommunications, Beijing, China, in 2013. He is currently working toward the PhD degree in the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include deep learning, computer architecture, VLSI design, approximate computing, and reconfigurable computing.

Leibo Liu received the BS degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the PhD degree from the Institute of Microelectronics, Tsinghua University, in 2004. He now serves as an associate professor with the Institute of Microelectronics, Tsinghua University. His research interests include reconfigurable computing, mobile computing, and VLSI DSP.

Jishen Zhao received the PhD degree from Pennsylvania State University, in 2014. She is an assistant professor in the Department of Computer Engineering, University of California, San Diego. Her research spans and stretches the boundary between computer architecture and system software, with a particular emphasis on memory and storage systems, domain-specific acceleration, and high-performance computing. Her research is driven by both emerging device/circuit technologies (e.g., 3D integration, nonvolatile memories) and modern applications (e.g., big-data analytics, machine learning, and scientific computing). She is a member of the IEEE and ACM.

Cong Xu received the PhD degree in computer science and engineering from Pennsylvania State University, in 2015, and worked as a postdoc with UCSB before joining Hewlett Packard Labs as a research scientist, in 2016. His research interests include deep learning acceleration, high performance computing, and distributed file systems. He has published more than 30 papers on non-volatile memory and deep learning acceleration. He is a member of the IEEE and ACM.

Shuangchen Li received the PhD degree in ECE from the University of California, Santa Barbara, in 2018. He works on memory-related computer architecture, with emphasis on processing-in-memory architectures, emerging non-volatile technologies, and deep learning accelerators.

Yuan Xie received the PhD degree from Princeton University, and worked with IBM, AMD, and Penn State before joining UCSB, in 2014. He is a professor with UCSB. His research interests include computer architecture, design automation, and VLSI. He is a fellow of the IEEE.

Shaojun Wei received the PhD degree from Faculte Polytechnique de Mons, Belgium, in 1991. He became a professor with the Institute of Microelectronics of Tsinghua University, in 1995. He is a senior member of the Chinese Institute of Electronics. His main research interests include VLSI SoC design, EDA methodology, and ASIC design.

1322" For more information on this or any other computing topic,1323please visit our Digital Library at www.computer.org/publications/dlib.
