
ISSCC 2018 / SESSION 13 / MACHINE LEARNING AND SIGNAL PROCESSING / OVERVIEW

Session 13 Overview: Machine Learning and Signal Processing
DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE

1:30 PM
13.1 A Shift Towards Edge Machine-Learning Processing

O. Temam, Google, Paris, France

The field of machine learning, especially Deep Neural Networks, is advancing at a breathtaking pace, with new functionalities achieved on a monthly basis. In the span of a few years, close to human-level accuracy has been achieved for simple voice commands, then full speech recognition, speech synthesis, translation, and increasing progress has been achieved in language understanding.

Machine-learning researchers largely acknowledge that the current successes of deep neural networks have been fueled by two evolutions: the availability of a large quantity of data for training, and the availability of high-performance computing at low cost, initially enabled by GPUs. Both advances combined to make training times tractable for large neural-network models.

Beyond GPUs, both the broad application span and high computational cost of deep neural networks have made custom machine-learning hardware economically sensible, and such architectures are currently being developed by many hardware, cloud or startup companies. The number of competing companies and the broad dissemination of knowledge on how to design such hardware should help reduce the cost of neural network computing, and make sophisticated machine learning more accessible in the coming years.

As Moore's Law plateaus, one of the main paths forward is increasing customization for increased efficiency. This trend will paradoxically first arise at the edge (vs. in the data center), where hardware efficiency is most critical. Unfortunately, the need for customization/efficiency runs contrary to the very fast evolution of machine-learning algorithms. Traditional architectural approaches for achieving generality, while great for general-purpose computing, may not be best suited for resolving this tension between efficiency, velocity and generality.

Beyond hardware efficiency challenges, the other key challenge remains access to data. Consumer data privacy, corporate data confidentiality, or even regulatory compliance force a shift towards processing data closer to where it exists, i.e., at the edge. For many applications, doing so also provides useful, if not indispensable, latency, bandwidth and connectivity benefits. It even provides an out-of-the-box way to tackle the economic consequences of a plateauing Moore's Law for fast-growing data-center machine-learning applications.

The talk will go over these different trends and their consequences.

Subcommittee Chair: Byeong-Gyu Nam, Chungnam National University, Daejeon, Korea

Architectures supporting machine learning for embedded perception and cognition are continuing their rapid evolution, inspired by modern data analytics and enabled by the low energy cost of CMOS processing. This makes it feasible to migrate data analytics toward edge and wearable devices. To further support increased requirements for multiuser connectivity and sparse data, multi-user MIMO and compressive reconstruction are also required.

This session covers trends in machine learning and signal processing for improved accuracy of speech, image, and video processing for next-generation mobile/edge and data center devices. The session features programmable accelerators for Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Multi-Layer Perceptron (MLP) algorithms, with digital and mixed-signal processing kernels. The session concludes with a link-adaptive massive MIMO detector and robust compressive-sensing reconstruction processors.

Session Chair: Dejan Marković

University of California, Los Angeles, Los Angeles, CA

Associate Chair: Masato Motomura

Hokkaido University, Sapporo, Japan

INVITED PAPER


2:00 PM
13.2 QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS
K. Ueyoshi, Hokkaido University, Sapporo, Japan
In Paper 13.2, Hokkaido University presents a 14.3×8.5mm2 multi-purpose log-quantized deep neural network (DNN) inference engine stacked on a 96MB 3D SRAM using inductive-coupling technology in 40nm. The system features 3-cycle 28.8GB/s memory communication and 7.49TOPS peak performance in binary precision at 1.1V, 300MHz, for cutting-edge DNN workloads.

2:30 PM
13.3 UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision
J. Lee, KAIST, Daejeon, Korea
In Paper 13.3, KAIST describes a DNN accelerator with variable bit precision from 1b to 16b. Using a flexible DNN core architecture, look-up-table-based bit-serial processing, and off-chip memory management, the 16mm2 65nm chip achieves 50.6TOPS/W energy efficiency for 1b data at 10MHz, 0.66V.

3:15 PM
13.4 A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices
S. Choi, KAIST, Daejeon, Korea
In Paper 13.4, KAIST presents a 3D hand-gesture recognition processor for real-time user interaction in smart mobile devices. With a CNN stereo engine, triple ping-pong buffers, and processor-in-memory techniques, the 16mm2 65nm processor achieves real-time 3D hand-gesture recognition with 9.02mW and 4.3mm error at 0.85V, 50MHz.

3:45 PM
13.5 An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS
D. Bankman, Stanford University, Stanford, CA
In Paper 13.5, Stanford University and KU Leuven introduce a mixed-signal binary CNN processor based on near-memory computing. The 2.4×2.4mm2 28nm 0.6V processor features 328KB of on-chip SRAM for a 9-layer CNN, with data parallelism and parameter re-use, achieving 3.8μJ/classification at 86.05% accuracy on the CIFAR-10 dataset.

4:15 PM
13.6 A 1.8Gb/s 70.6pJ/b 128×16 Link-Adaptive Near-Optimal Massive MIMO Detector in 28nm UTBB-FDSOI
W. Tang, University of Michigan, Ann Arbor, MI
In Paper 13.6, the University of Michigan describes a 128×16 massive MIMO detector with link adaptation to meet practical channel conditions with scalable energy. Implemented as a condensed systolic array, the 2mm2 28nm FDSOI chip achieves 1.8Gb/s at 70pJ/b, 569MHz and 4.3dB processing gain with channel data obtained from real-life measurements.

4:45 PM
13.7 A 232-to-1996KS/s Robust Compressive-Sensing Reconstruction Engine for Real-Time Physiological Signals Monitoring
T-S. Chen, National Taiwan University, Taipei, Taiwan
In Paper 13.7, National Taiwan University presents a compressive-sensing reconstruction engine with a parallel atom-searching approach to reduce signal distortion due to measurement noise. The 2.93×2.93mm2 40nm processor achieves up to 1996KS/s with 93mW power consumption at 0.9V, 67.5MHz.


13.2 QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS

Kodai Ueyoshi1, Kota Ando1, Kazutoshi Hirose1, Shinya Takamaeda-Yamazaki1, Junichiro Kadomoto2, Tomoki Miyata2, Mototsugu Hamada2, Tadahiro Kuroda2, Masato Motomura1

1Hokkaido University, Sapporo, Japan; 2Keio University, Yokohama, Japan

A key consideration for deep neural network (DNN) inference accelerators is the need for large and high-bandwidth external memories. Although an architectural concept for stacking a DNN accelerator with DRAMs has been proposed previously, long DRAM latency remains problematic and limits the performance [1]. Recent algorithm-level optimizations, such as network pruning and compression, have shown success in reducing the DNN memory size [2]; however, since networks become irregular and sparse, they induce an additional need for agile random accesses to the memory systems.

Figure 13.2.1 illustrates our proposal: stacking a DNN inference engine, QUEST, with multi-vault SRAMs using inductive-coupling die-to-die wireless communication technology, known as a ThruChip Interface (TCI) [3]. Parallel TCI channels placed in a planar manner provide QUEST with multiple independent high-bandwidth access points to the stacked SRAMs. SRAMs can provide random-access capability with extremely low latency (an order of magnitude lower than DRAMs), whereas 3D stacking helps SRAMs achieve reasonably large memory capacity. In Fig. 13.2.1, QUEST and 8 SRAMs are TCI-stacked as a single 14.3×8.5mm2 3D module. Power/ground are supplied through TSVs. QUEST has 24 processing cores running at 300MHz, where each core is associated with one 32b-width 4MB SRAM vault. Running at 3.6GHz, a TCI channel (7-Tx/5-Rx coils) realizes 9.6Gb/s/vault, combined 28.8GB/s/module, R/W data bandwidth in a source-synchronous manner. The R/W latency including TCI trip time is 3 cycles, which is uniform over the 8 SRAMs. TSV technology, used commonly for die stacking, is known to experience open-contact failure. In our design, however, since all signal transmissions are conducted by wireless TCI channels, the presented 3D module can limit the usage of TSVs to power/ground grids, where numerous parallel connections negate this concern.
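
As a quick editorial sanity check on the figures quoted above (not part of the paper), the short sketch below reproduces the aggregate 28.8GB/s/module bandwidth and 96MB capacity from the per-vault values; the variable names are invented for the example.

```python
# Back-of-the-envelope check of the QUEST 3D-module figures quoted in the text
# (24 cores/vaults, 4 MB per vault, 9.6 Gb/s per TCI channel).

vaults = 24                    # one 4 MB SRAM vault per processing core
per_vault_gbps = 9.6           # Gb/s per TCI channel (7-Tx/5-Rx coils at 3.6 GHz)

module_gbps = vaults * per_vault_gbps   # aggregate bit rate across all vaults
module_gbyte_s = module_gbps / 8        # convert to GB/s
capacity_mb = vaults * 4                # total stacked SRAM capacity

print(f"aggregate bandwidth: {module_gbyte_s:.1f} GB/s")   # 28.8 GB/s
print(f"stacked SRAM capacity: {capacity_mb} MB")          # 96 MB
```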

Figure 13.2.2 shows the overall block diagram of the fabricated QUEST prototype. The 24 cores run in a MIMD-parallel manner, where inter-core communication is handled either with a mesh-structured local link or a tree-structured global network. Each core has a micro-programmed sequencer for setting and controlling the PE array. Synchronization among the cores is managed through a synchronization table when needed. Each core also has a DMAC, which issues memory accesses to intra-core memories (shaded) and the stacked SRAM vault in response to intra/inter-core memory requests. The 32×16 PE array features a bit-serial architecture: the PE conducts binary computation in a single cycle, and N-bit log-quantized ones in N cycles (N<5). Weights double-buffered in W_MEMs are distributed to the PE array in a fully parallel manner, whereas incoming activations, also double-buffered in A_MEMs, are broadcast in a row-parallel manner. The ACT unit at the tail of each column applies the bias (shifted in from B_MEMs), scaling, and activation function, and then writes the output activations into O_MEMs.

Unlike other array-structured DNN accelerators, all PEs receive unique weight bits, whereas the PEs in a row receive an identical activation bit, as detailed in Fig. 13.2.3. In a PE column, partial dot products are first generated in parallel in PEs and then simply shifted towards ACT, where they are accumulated. The pipelined shifts hide behind the PE-parallel, bit-serial, dot-product computations, as shown in the time chart. These mechanisms are key enablers for handling various DNNs on a single homogeneous PE array (whereas [5] and [6] use hybrid cores and hybrid PEs, respectively, for different DNN types): e.g., for a fully connected (FC) layer in a CNN, MLP, or RNN, up to 32 fan-ins for a neuron are mapped onto a PE column at a time and then time multiplexed on the same column. For a convolutional (CONV) layer in a CNN, on the other hand, up to 32 input channels are mapped onto a PE column at a time and then time multiplexed. The filter kernel is stored vertically in the W_MEMs and processed in an element-by-element, kernel-parallel manner.
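
To make the FC/CONV column mapping concrete, the following numpy sketch (an editorial illustration under simplifying assumptions, not the paper's datapath) models a single 32-PE column that time-multiplexes FC fan-ins or CONV input channels; the function names and shapes are invented for the example.

```python
import numpy as np

COL = 32  # PEs per column

def fc_on_column(weights, activations):
    """FC layer: up to 32 fan-ins of a neuron mapped onto the column at a
    time, then time-multiplexed over the remaining fan-ins."""
    acc = 0.0
    for start in range(0, len(activations), COL):
        # partial dot products from the PEs, accumulated in ACT
        acc += np.dot(weights[start:start + COL], activations[start:start + COL])
    return acc

def conv_on_column(kernel, patch):
    """CONV layer: up to 32 input channels mapped onto the column; the kernel
    is processed element-by-element in a kernel-parallel manner."""
    C, KH, KW = kernel.shape
    acc = 0.0
    for ky in range(KH):
        for kx in range(KW):
            for start in range(0, C, COL):
                acc += np.dot(kernel[start:start + COL, ky, kx],
                              patch[start:start + COL, ky, kx])
    return acc

# consistency checks against dense reference computations
rng = np.random.default_rng(0)
w, a = rng.standard_normal(100), rng.standard_normal(100)
assert np.isclose(fc_on_column(w, a), np.dot(w, a))
k, p = rng.standard_normal((64, 3, 3)), rng.standard_normal((64, 3, 3))
assert np.isclose(conv_on_column(k, p), np.sum(k * p))
```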

Figure 13.2.4 presents a log-quantized neural datapath. The log-quantization method [4] is superior to linear quantization in two ways: 1) its “denser the finer” approach allows it to represent weight/activation distributions better, and 2) resource-consuming multiply operations are reduced to additions. Dot-products are computed in PEs by “log” bit-serial addition and “linear” accumulation. ACT accumulates the dot products and adds a bias in “linear”, then applies a scaling/activation function such as ReLU in “log”. The lightweight PE architecture has enabled the dense PE array to be tightly coupled (bit-by-bit) with W_MEMs and A_MEMs (Fig. 13.2.2), achieving versatile parallel NN computation (Fig. 13.2.3). Log-quantization inference accuracy is evaluated on AlexNet (for ImageNet) and on LeNet-5 (for MNIST). Log-4 (4b log-quantized) AlexNet shows only marginal degradation compared with FP-32, whereas Log-3 was destructive. Even binary can attain reasonable accuracy for LeNet-5: the performance-accuracy trade-off is also indicated in the figure.
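
As a hedged illustration of the log-quantized arithmetic described above (not the chip's exact number format), the sketch below quantizes weights to signed powers of two so that each multiply reduces to a scaling by 2^e (a shift in hardware) followed by linear accumulation; the exponent range and rounding rule are assumptions made for the example.

```python
import numpy as np

def log_quantize(x, bits=4):
    """Quantize to sign * 2**e, with e clipped to [-(2**(bits-1)), 0]; 0 maps to 0."""
    sign = np.sign(x)
    e = np.clip(np.round(np.log2(np.abs(x) + 1e-12)), -(2 ** (bits - 1)), 0)
    return sign, e.astype(int)

def log_mac(acts, w_sign, w_exp):
    """'Multiply' each activation by a log-quantized weight: scale by 2**e
    (a shift in hardware) and accumulate linearly, as in the ACT unit."""
    return np.sum(w_sign * acts * (2.0 ** w_exp))

rng = np.random.default_rng(1)
w = rng.uniform(-1, 1, size=64)
a = rng.uniform(0, 1, size=64)

s, e = log_quantize(w, bits=4)                  # Log-4 weights
print("exact dot product     :", np.dot(w, a))
print("log-quantized estimate:", log_mac(a, s, e))
```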

Figure 13.2.5 depicts AlexNet mapped on the QUEST 3D module, where 24 parallel cores process the inference in a layer-by-layer manner, producing/reading intermediate results to/from the SRAMs, respectively. For a CONV layer, an output channel is mapped spatially among the cores so that a “cluster” of cores can share the same input channels. For an FC layer, output neurons are mapped onto all the cores evenly, requiring all-to-all shuffling data-distribution patterns. In both cases, computation in a core must read activations from another core’s SRAM vaults, which are delivered through the TCI channels and the on-chip networks. The accesses are scattered across individual memory spaces, and burst lengths are very short (1 to 4 for this mapping). Fig. 13.2.5 summarizes the performance of the Log-4 AlexNet, which occupies 39% of the 3D SRAM, as well as Log-4 and binary VGG11 (for CIFAR-10). It is shown that for AlexNet, having more than 2.9MB of on-chip memory is crucial for sustaining above 90% effective/peak performance. The 3-cycle short random-access latency of the 3D SRAM, on the other hand, is also indispensable for effective performance: assuming burst memory access with a 30-cycle initial latency, which mimics modern DRAM latency, effective VGG11 performance degrades drastically for the Log-4 and binary cases. Larger DNNs such as ResNet, moreover, require aggressive pruning to fit the limited memory space, where the presented random-access capability of the 3D module will become even more indispensable.

Figure 13.2.6 compares recently reported multi-purpose (CNN/FC/RNN, etc.) DNN accelerators [5] and [6], using LUT-based and linear quantization, respectively, with this work. Those works integrated a limited amount of on-chip SRAM (around 300KB), and did not include external memory for power estimation. QUEST, on the other hand, integrates a large 7.68MB on-chip SRAM (sufficient for AlexNet on-chip buffering) in addition to the 96MB 3D SRAM. It achieves 5× better effective performance on the AlexNet benchmark at 4b precision. Since external memory accesses are responsible for the majority of the power dissipation, and since the 3D SRAM can substantially reduce external memory power in comparison to DRAMs, system-level energy efficiency favors the proposed solution.

Figure 13.2.7 shows a QUEST prototype microphotograph with a specification table. To summarize, QUEST targets rapidly evolving, highly compressed (bit-reduced, pruned, etc.) DNNs with three main architectural features: 1) 3D integration with large-capacity/bandwidth yet low-latency random-access SRAM, 2) flexible dataflow support in the PE array for CONV/FC and other types of DNN layers, and 3) a bit-serial PE architecture for binarized and log-quantized DNN representations.

Acknowledgements:
This work was funded by JST ACCEL Grant Number JPMJAC1502, Japan. The authors thank Profs. T. Asai, M. Ikebe, E. Sano, and M. Arita from Hokkaido University and the colleagues at UltraMemory Inc. for their invaluable support.

References:
[1] M. Gao, et al., “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” ACM ASPLOS, pp. 751-764, 2017.
[2] A. Parashar, et al., “SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks,” ACM ISCA, pp. 27-40, 2017.
[3] D. Ditzel, et al., “Low-Cost 3D Chip Stacking with ThruChip Wireless Connections,” IEEE Hot Chips, pp. 1-37, 2014.
[4] D. Miyashita, et al., “Convolutional Neural Networks Using Logarithmic Data Representation,” arXiv:1603.01025 [cs.NE], 2016.
[5] D. Shin, et al., “DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks,” ISSCC, pp. 240-241, 2017.
[6] S. Yin, et al., “A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications,” IEEE Symp. VLSI Circuits, 2017.


Figure 13.2.1: QUEST module overview and 3-cycle R/W latency on 3D SRAM.

Figure 13.2.2: Overall architecture of the proposed DNN inference engine (QUEST).

Figure 13.2.3: Detailed computation in a PE column (top). FC/CONV dataflow on a homogeneous PE array (bottom).

Figure 13.2.4: A log-quantized neural datapath in PE and ACT, and accuracy evaluation on realistic networks.

Figure 13.2.5: AlexNet mapped on the QUEST (left). Throughput across different memory capacity/latency (right).

Figure 13.2.6: Comparison with state-of-the-art multi-purpose DNN accelerators.


Figure 13.2.7: A microphotograph of the QUEST prototype, along with the chip specification summary.


13.3 UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision

Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, Hoi-Jun Yoo

KAIST, Daejeon, Korea

Deep neural network (DNN) accelerators [1-3] have been proposed to accelerate deep learning algorithms, from face recognition to emotion recognition, in mobile or embedded environments [3]. However, most works accelerate only the convolutional layers (CLs) or fully-connected layers (FCLs), and different DNNs, such as those containing recurrent layers (RLs) (useful for emotion recognition), have not been supported in hardware. A combined CNN-RNN accelerator [1], separately optimizing the computation-dominant CLs and the memory-dominant RLs or FCLs, was reported to increase overall performance; however, the number of processing elements (PEs) for CLs and RLs was limited by their area and, consequently, performance was suboptimal in scenarios requiring only CLs or only RLs. Although the PEs for RLs can be reconfigured into PEs for CLs or vice versa, only a partial reconfiguration was possible, resulting in marginal performance improvement. Moreover, previous works [1-2] supported a limited set of weight bit precisions, such as either 4b, 8b or 16b. However, lower weight bit-precisions can achieve better throughput and higher energy efficiency, and the optimal bit-precision can vary according to different accuracy/performance requirements. Therefore, a unified DNN accelerator with fully-variable weight bit-precision is required for the energy-optimal operation of DNNs within a mobile environment.

In this paper, we present a unified neural processing unit (UNPU) supporting CLs, RLs, and FCLs with fully-variable weight bit-precision from 1b to 16b. As shown in Fig. 13.3.1, the reuse of input features (IFs) is more efficient than the reuse of weights under low weight bit-precision, and the operations of CLs become identical to those of RLs and FCLs when the IFs of the CLs are vectorized into a 1-dimensional vector, so that the hardware can be fully shared in the UNPU by IF reuse. Moreover, the lookup-table-based bit-serial PE (LBPE) is implemented for energy-optimal DNN operations with variable weight bit-precisions from 1b to 16b through iterations of 1b weight operations. Furthermore, an aligned feature loader (AFL) minimizes the amount of off-chip memory accesses required to fetch IFs by exploiting the data locality among convolution operations.
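
As an editorial sketch of the observation that a vectorized CONV layer reduces to the same matrix-vector form as an FC layer (a functional illustration only, not the UNPU datapath), the numpy code below flattens each filter and each input patch so that one inner operation serves both layer types; the names and shapes are invented for the example.

```python
import numpy as np

def conv_as_matvec(ifmap, kernels):
    """ifmap: (C, H, W); kernels: (F, C, KH, KW) -> output (F, H-KH+1, W-KW+1)."""
    C, H, W = ifmap.shape
    F, _, KH, KW = kernels.shape
    OH, OW = H - KH + 1, W - KW + 1
    W_mat = kernels.reshape(F, -1)          # each filter flattened to one row
    out = np.empty((F, OH, OW))
    for y in range(OH):
        for x in range(OW):
            patch_vec = ifmap[:, y:y+KH, x:x+KW].reshape(-1)  # vectorized IF
            out[:, y, x] = W_mat @ patch_vec                  # same op as an FC layer
    return out

# check against a direct sliding-window reference
rng = np.random.default_rng(2)
ifmap = rng.standard_normal((8, 6, 6))
kern = rng.standard_normal((4, 8, 3, 3))
ref = np.zeros((4, 4, 4))
for f in range(4):
    for y in range(4):
        for x in range(4):
            ref[f, y, x] = np.sum(ifmap[:, y:y+3, x:x+3] * kern[f])
assert np.allclose(conv_as_matvec(ifmap, kern), ref)
```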

Figure 13.3.2 shows the overall architecture of the UNPU. It consists of 4 DNN cores, an aggregation core, a 1D SIMD core, and a RISC controller. All of these components are connected to an on-chip network for communication. Each DNN core has 6 LBPEs, 6 AFLs (64×6), a weight memory (48KB), an instruction decoder and a controller. The LBPE receives aligned IF as an input operand through AFLs and calculates 576 (4×12×12) multiplications in parallel in a bit-serial manner. The partial-sums (Psums) calculated by each DNN core are aggregated to an output feature (OF) in the aggregation core. The 1D SIMD core performs the remaining operations, such as non-linear activation or pooling, and the results are stored in off-chip memory through the external gateways.

Figure 13.3.3 elaborates on workload allocation. For RLs and FCLs, the 1D IF is mapped one-to-one (48×1) onto the AFLs and sent to a PE. The weights are loaded from 12 channels of OF (48×12b) to calculate multiple channels of Psums with the same IF. For a CL, IFs distributed over multiple input channels are concatenated into a 1D row vector and loaded into the AFLs, as is done with RLs and FCLs. The weights of CLs are converted into 1D column vectors, and then the Psums are calculated by multiplying with the 1D IF row vector. 4 LBPEs in a DNN core calculate the product between 48 pairs of IFs and weights, and each LBPE corresponds to 12 IF-weight pairs. The IF is reused for multiple column vectors from other channels. The Psums from each PE are accumulated by 12 adder trees. The weights are reused among the 6 LBPEs for better energy efficiency. For example, in RLs and FCLs, 6 different IFs are assigned to the 6 LBPEs in parallel with the same weights if batch-wise parallelism is possible. For a CL, 6 consecutive IFs in the same channel are multiplied with the same weights in 6 different LBPEs in parallel. Peak performance for CLs and RLs (or FCLs) is increased by 1.15× and 13.8×, respectively, compared to [1], owing to the higher compute density of the unified DNN core.

Figure 13.3.4 shows the architecture of the LBPE. The key idea of the LBPE is that partial-sums are repeatedly calculated during the weight bit-serial MAC operation. An LBPE consists of 4 PE clusters, adder trees to accumulate the results of each PE cluster, and shift-and-add logic for bit-serial multiplications. Each PE cluster contains 4 look-up-table (LUT) modules and a controller that determines whether the value from the LUTs is added or subtracted. In the LUT module, a table with 8 entries is used, supporting 3-way MAC for multi-bit multiplication and 4-way MAC for 1b multiplication. The LUT is updated after IFs are loaded into the AFLs, and the IF values are reused for all output channels of the layer currently being processed. The 1b weight Psums are fetched from the LUT prepared in advance and accumulated for the MAC operation. The LUT can fetch 12 Psums in parallel so that a total of 48×12 Psums (64×12 for the 1b case) can be calculated simultaneously on an LBPE in 1 cycle. With the help of table-based operations, the LBPE improves energy efficiency more than conventional bit-serial PEs [4]. When IFs are reused 1024 times, the energy consumption of the LBPEs, including the LUT update, is reduced by 23.1%, 27.2%, 41.0%, and 53.6% for the case of 16b, 8b, 4b, and 1b weight operations, respectively, compared with fixed-point MAC units under the same throughput conditions.
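
The LUT idea above is essentially distributed arithmetic; the following sketch (an editorial illustration with made-up bit widths, not the chip's exact PE) precomputes the 8 possible partial sums of a 3-feature group once and then indexes that table with one weight bit-plane per cycle, accumulating with shifts.

```python
import numpy as np

def build_lut(ifs3):
    """8-entry table: entry b holds the sum of the features whose bit in b is set."""
    return [sum(f for f, bit in zip(ifs3, (b >> 2 & 1, b >> 1 & 1, b & 1)) if bit)
            for b in range(8)]

def lut_bitserial_mac(ifs3, weights3, wbits=4):
    """Bit-serial MAC of 3 (feature, unsigned weight) pairs using the LUT."""
    lut = build_lut(ifs3)                          # filled once, reused every cycle
    acc = 0
    for k in range(wbits):                         # one weight bit-plane per cycle
        idx = sum(((w >> k) & 1) << (2 - i) for i, w in enumerate(weights3))
        acc += lut[idx] << k                       # shift-and-add accumulation
    return acc

rng = np.random.default_rng(3)
feats = rng.integers(0, 16, size=3).tolist()
wts = rng.integers(0, 16, size=3).tolist()         # 4b unsigned weights for simplicity
assert lut_bitserial_mac(feats, wts) == sum(f * w for f, w in zip(feats, wts))
print(lut_bitserial_mac(feats, wts))
```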

Figure 13.3.5 explains the AFL. 6 AFL-LBPE pairs are integrated in a DNN core and each AFL has 64 entries. The data in the AFL can be shifted diagonally across AFL boundaries, as well as shifted inside the AFL itself. In the case of CLs with 3×3 kernels and stride 1, 8 entries of IF from Ch. 1 are loaded on AFL 0 at first. At the next cycle, the 7 top entries of AFL 0, except the top-most entry, are shifted diagonally to AFL 1, while the 8 entries from Ch. 2 are concatenated below the remaining 3 entries of Ch. 1 on AFL 0. And then, 8 entries from Ch. 3 are concatenated below the remaining 3 entries of Ch. 2 on AFL 0, while 6 entries on AFL 1 from Ch. 1 are shifted diagonally to AFL 2, and 7 entries from Ch. 2 are shifted diagonally to concatenate below the remaining 3 entries from Ch. 1 on AFL 1. Iterations of diagonal shifts allocate a 3×3 kernel to each AFL or LBPE so that parallel multiplication is possible to accelerate convolution. Varied stride sizes are supported via the application of multiple shifts. The AFL keeps the PE utilization high, unlike an architecture that moves data only between PEs. In addition, it can skip zeros by an upward shift within the buffer. When the AFL is applied to AlexNet and VGG-16, external memory access operations for IF load are reduced by 57.2% and 55%, respectively.

Figure 13.3.6 shows measurement results for the fabricated UNPU. The UNPU can operate at a 0.63-to-1.1V supply voltage with a maximum 200MHz clock frequency. The power consumption at 0.63V and 1.1V is 3.2mW and 297mW, respectively. The power efficiency, as measured on CLs (5×5 kernels) with consideration of PE utilization, is 3.08, 11.6, and 50.6TOPS/W for the case of 16b, 4b, and 1b weights, respectively. The architecture supports any weight bit-precision from 1b to 16b for optimal DNN operation and shows 1.43× higher power efficiency for CLs at 4b weight compared to [1]. When operating on a 1b-weight network, it achieves 8.43× higher efficiency and 7.4× higher peak performance as compared to [6].

The UNPU is fabricated using 65nm CMOS technology and occupies 16mm2 die area, as shown in Fig. 13.3.7. The UNPU has been demonstrated successfully on facial expression recognition and dialogue generation tasks for human-computer interaction, using the FER2013 and Twitter dialogue databases, respectively.

References:
[1] D. Shin, et al., "DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks," ISSCC, pp. 240-241, 2017.
[2] B. Moons, et al., "Envision: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI," ISSCC, pp. 246-247, 2017.
[3] K. Bong, et al., "A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector," ISSCC, pp. 248-249, 2017.
[4] P. Judd, et al., "Stripes: Bit-Serial Deep Neural Network Computing," IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 80-83, 2017.
[5] S. Yin, et al., "A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications," IEEE Symp. VLSI Circuits, 2017.
[6] K. Ando, et al., "BRein Memory: A 13-Layer 4.2K Neuron/0.8M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS," IEEE Symp. VLSI Circuits, 2017.


Figure 13.3.1: Fully reconfigurable unified DNN accelerator with bit-serial PEs.

Figure 13.3.2: Overall architecture.

Figure 13.3.3: Workload allocation on the unified DNN core.

Figure 13.3.4: LUT-based bit-serial processing elements.

Figure 13.3.5: Aligned feature loader for reduction of off-chip memory accesses.

Figure 13.3.6: Measurement results and performance comparison table.


Figure 13.3.7: Chip micrograph and performance summary.


13.4 A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices

Sungpill Choi, Jinsu Lee, Kyuho Lee, Hoi-Jun Yoo

KAIST, Daejeon, Korea

Recently, 3D hand-gesture recognition (HGR) has become an important feature in smart mobile devices, such as head-mounted displays (HMDs) or smartphones for AR/VR applications. A 3D HGR system in Fig. 13.4.1 enables users to interact with virtual 3D objects using depth sensing and hand tracking. However, a previous 3D HGR system, such as Hololens [1], utilized a power-consuming time-of-flight (ToF) depth sensor (>2W), limiting 3D HGR operation to less than 3 hours. Even though stereo matching was used instead of ToF for depth sensing with low power consumption [2], it could not provide interaction with virtual 3D objects because depth information was used only for hand segmentation. An HGR-based UI system in smart mobile devices, such as HMDs, must have low power consumption (<10mW) while maintaining real-time operation (<33.3ms). A convolutional neural network (CNN) can be adopted to enhance the accuracy of the low-power stereo matching. The CNN-based HGR system comprises two 6-layer CNNs (stereo) without any pooling layers to preserve geometrical information and an iterative-closest-point/particle-swarm-optimization-based (ICP-PSO) hand tracking to acquire 3D coordinates of a user’s fingertips and palm from the hand depth. The CNN learns the skin color and texture to detect the hand accurately, comparable to ToF, in the low-power stereo matching system irrespective of variations in external conditions [3]. However, it requires >1000× more MAC operations than previous feature-based stereo depth sensing, which is difficult to achieve in real time on a mobile CPU; therefore, a dedicated low-power CNN-based stereo matching SoC is required.

In this paper, we describe an accurate, low-power (<10mW), and real-time (<33.3ms) 3D HGR processor for smart mobile devices with 3 key features: 1) a pipelined CNN processing element (PE) with a shift MAC operation for high throughput by maximizing core utilization; 2) triple ping-pong buffers with workload balancing for fast line streaming by reducing external accesses; and 3) nearest-neighbor searching (NNS) processing-in-memory (PIM) for high energy efficiency by reducing the number of bitlines requiring pre-charge in SRAM.

Figure 13.4.2 shows the overall architecture of the HGR processor, which consists of a CNN-stereo engine (CSE) and an ICP-PSO engine (IPE). The CSE contains two line-streaming CNN cores with 4 locally distributed memories and one matching core. The CNN core has one pipelined CNN PE and a local DMA with a forwarding/backwarding (FWD/BWD) unit to balance workloads between the CNN cores. The IPE consists of an NNS unit with 16-way parallel NNS PIMs and a hand-tracking unit.

Figure 13.4.3 shows the pipelined CNN PE architecture with shift MAC operation, which performs 1D convolution, for a line-streaming CNN. The entire 2-D convolution is performed by repetition of the shift MAC. The shift MAC operation with a 3×3 filter in Fig. 13.4.3 consists of three stages: 1) shifting feature maps and filters, 2) element-wise multiplication, and 3) partial-sum accumulation. First, input feature maps and filters are loaded into shift registers, and both are shifted by 1 index in every clock cycle. Then, the active weights of each channel are multiplied with active features element-by-element. Finally, multiplication results are accumulated to obtain 1D convolution results in 3 cycles. The line-streaming CNN operation is accelerated by the 7-stage pipelined CNN PE that processes 48 MACs per cycle with 96% core utilization. Moreover, the pipelined architecture enables line-streaming processing, as well as memory-access latency hiding, to achieve 1.80TOPS/W at 60MHz, 0.9V.
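
As a behavioral sketch of the shift MAC described above (an editorial illustration, not cycle-accurate RTL), the code below produces one row of 1-D convolution results by accumulating one filter tap per "cycle" against a shifted view of the feature row; register widths and channel handling are omitted as simplifying assumptions.

```python
import numpy as np

def shift_mac_1d(feat_row, filt_row):
    """Valid 1-D convolution (sliding dot product) of feat_row with filt_row,
    accumulated one filter tap per cycle, as in the 3-cycle shift MAC."""
    n_out = len(feat_row) - len(filt_row) + 1
    out = np.zeros(n_out)
    for cycle, w in enumerate(filt_row):             # one tap per cycle
        out += w * feat_row[cycle:cycle + n_out]     # 'active' slice of the shift register
    return out

row = np.arange(8, dtype=float)
filt = np.array([1.0, 0.5, -1.0])
assert np.allclose(shift_mac_1d(row, filt), np.correlate(row, filt, mode="valid"))
print(shift_mac_1d(row, filt))
```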

Memory management of the CNN core is shown in Fig. 13.4.4. The hardware utilizes triple ping-pong memories to store feature maps, where each memory is accessed simultaneously to feed pipeline inputs, write back pipeline outputs, and access an external interface, respectively. Instead of storing the entire feature maps on the chip, the line-streaming processing with only 3-to-5 lines of feature maps reduces the data that must be fetched from/to off-chip memory by 90.1%. As a result, the triple ping-pong operation hides the external access time behind CNN computation, and the hand-tracking system does not need external accesses to fetch intermediate feature maps. In addition, the FWD/BWD unit balances workloads between the two CNN cores automatically. As shown in the top-right of Fig. 13.4.4, data in each core becomes unbalanced due to the reduced size of feature maps after convolution, especially in a distributed memory architecture. The data transaction time between CNN cores must be well defined to balance their workloads [4]. The FWD/BWD units keep CNN core workloads identical throughout CNN processing and, as shown in Fig. 13.4.4, exchange feature-map boundary data with one another when local feature maps are fetched. Moreover, the internal data transaction time for workload balancing can be hidden behind the CNN pipeline. As a result, the triple ping-pong buffers with the FWD/BWD unit reduce overall CNN processing time by 23.9%.

Figure 13.4.5 shows the PIM architecture specialized for NNS to track a user’s hands in the IPE. Hand tracking requires a >360K-node k-d tree NNS between the 46-sphere model in the memory and the depth input from the CSE. In the proposed PIM, the NNS operation is composed of 2 half cycles: it performs NNS on a parent node in the first half cycle, 1 read operation for fetching the next parent address, and NNS on a selected child in the remaining half cycle. The proposed PIM is composed of 4 cell arrays (CAs) for a parent node, two child nodes, and a next searching address. Each CA has 36×16 8T-SRAM cells, and the bitlines are separated into read bitlines (RBLs) and write bitlines (WBLs). The 3 CAs for the parent and 2 child nodes contain ripple-carry comparators that output “COUT = Sign(WBL-RBL) OR CIN” to the comparison bitline (CBL). The parent CBL is connected to an address decoder to activate the selected child CA by changing the LSB of the address decoder in the latter half cycle of NNS. The proposed PIM achieves a 6× speed-up compared with a conventional SRAM design, which requires 6 operations to complete NNS. Moreover, it can skip 1/3 of the read operations, so that redundant pre-charging power consumption on global bitlines is reduced by 63.9% (a 2.8× energy-efficiency enhancement).
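
As an editorial illustration of the per-node work the NNS PIM accelerates (not the chip's exact traversal, which also backtracks), the sketch below greedily descends a tiny k-d-tree-like structure: each step compares one query coordinate against the parent's split value and selects one of the two children, mirroring the parent-compare/child-select folded into the PIM's two half cycles. The tree layout and field names are invented for the example.

```python
import numpy as np

def greedy_descent(tree, query):
    """tree: dict node_id -> (split_dim, split_val, left_id, right_id), with
    None children at leaves. Returns the leaf node id reached by the descent."""
    node = 0
    while True:
        dim, val, left, right = tree[node]
        nxt = left if query[dim] < val else right   # parent compare -> child select
        if nxt is None:
            return node
        node = nxt

# tiny 3-level example over (x, y) coordinates
tree = {
    0: (0, 0.5, 1, 2),
    1: (1, 0.3, 3, 4), 2: (1, 0.7, 5, 6),
    3: (0, 0.2, None, None), 4: (0, 0.2, None, None),
    5: (0, 0.8, None, None), 6: (0, 0.8, None, None),
}
print(greedy_descent(tree, np.array([0.6, 0.9])))   # reaches leaf 6
```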

Measurement results of hand-depth sensing and hand tracking in the 3D HGR processor are shown in Fig. 13.4.6. Thanks to the CNN stereo, the processor can acquire accurate hand depth, showing distinguishable depth of fingertips and low disparity error. In the 20cm-to-40cm active range, the average hand-tracking accuracy is 4.3mm with 5cm-separated VGA stereo cameras, and it achieves mm-scale accuracy of 3D HGR. In addition, the proposed processor consumes 9.02mW at 50MHz, 0.85V for real-time 1-hand 3D HGR, which is 14× less than the state-of-the-art UI processor [2]. As a result, the 3D HGR processor can satisfy the required power budget (<10mW) with 30ms latency. Moreover, the CNN-stereo engine achieves 1.80TOPS/W, which is 1.45× more energy efficient than a state-of-the-art distributed memory architecture [4]. The line-streaming CNN architecture uses 781.5KB of on-chip memory, while [4-6] needed ~MB of memory, which is impossible to realize as on-chip memory.

The 3D HGR processor for smart mobile devices is fabricated in 65nm CMOS technology, and it occupies 4×4mm2, integrating 781.5KB of SRAM. Its maximum and average hand-tracking errors are 10.6mm and 4.3mm, respectively, whereas that of a ToF system is ~5mm on average. The highly accurate 3D HGR processor consumes only 9.02mW with 30ms system latency.

References:
[1] Hololens Hardware Details. Available: https://developer.microsoft.com/en-us/windows/mixed-reality/hololens_hardware_details
[2] S. Park, et al., "A 126.1mW Real-Time Natural UI/UX Processor with Embedded Deep-Learning Core for Low-Power Smart Glasses," ISSCC, pp. 254-255, 2016.
[3] W. Luo, et al., "Efficient Deep Learning for Stereo Matching," CVPR, pp. 5695-5703, 2016.
[4] K. Bong, et al., "A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector," ISSCC, pp. 248-249, 2017.
[5] Y. H. Chen, et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," ISSCC, pp. 262-263, 2016.
[6] D. Shin, et al., "DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks," ISSCC, pp. 240-241, 2017.


Figure 13.4.1: 3D hand-gesture recognition in mobile smart devices.

Figure 13.4.2: Overall architecture.

Figure 13.4.3: Pipelined CNN PE architecture with shift MAC.

Figure 13.4.4: Triple ping-pong buffer architecture with workload balancing.

Figure 13.4.5: Nearest-neighbor searching processing-in-memory (PIM).

Figure 13.4.6: Measurement results and comparison table.


Figure 13.4.7: Chip photograph and performance summary.


13.5 An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS

Daniel Bankman1, Lita Yang1, Bert Moons2, Marian Verhelst2, Boris Murmann1

1Stanford University, Stanford, CA; 2KU Leuven, Leuven, Belgium

The trend of pushing deep learning from cloud to edge due to concerns of latency, bandwidth, and privacy has created demand for low-energy deep convolutional neural networks (CNNs). The single-layer classifier in [1] achieves sub-nJ operation, but is limited to moderate accuracy on low-complexity tasks (90% on MNIST). Larger CNN chips provide dataflow computing for high-complexity tasks (AlexNet) at mJ energy [2], but edge deployment remains a challenge due to off-chip DRAM access energy. This paper describes a mixed-signal binary CNN processor that performs image classification of moderate complexity (86% on CIFAR-10) and employs near-memory computing to achieve a classification energy of 3.8μJ, a 40× improvement over TrueNorth [3]. We accomplish this using (1) the BinaryNet algorithm for CNNs with weights and activations constrained to +1/-1 [4], which drastically simplifies multiplications (XNOR) and allows integrating all memory on-chip; (2) an energy-efficient switched-capacitor (SC) neuron that addresses BinaryNet’s challenge of wide vector summation; (3) architectural parallelism, parameter reuse, and locality.
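
As a small editorial illustration of the BinaryNet arithmetic mentioned above (not code from the paper), the sketch below shows how a +1/-1 dot product can be computed from 0/1 bit vectors with an XNOR and a popcount, which is what lets a multiply collapse to a single gate.

```python
import numpy as np

def binary_dot(w_pm1, a_pm1):
    """Reference +1/-1 dot product."""
    return int(np.dot(w_pm1, a_pm1))

def xnor_popcount_dot(w_bits, a_bits):
    """Same result on 0/1 encodings (1 -> +1, 0 -> -1):
    dot = 2 * popcount(XNOR(w, a)) - N."""
    xnor = ~(w_bits ^ a_bits) & 1            # 1 where the sign bits match
    return 2 * int(xnor.sum()) - len(w_bits)

rng = np.random.default_rng(4)
w = rng.choice([-1, 1], size=1024)
a = rng.choice([-1, 1], size=1024)
w_bits, a_bits = (w > 0).astype(np.uint8), (a > 0).astype(np.uint8)
assert binary_dot(w, a) == xnor_popcount_dot(w_bits, a_bits)
print(binary_dot(w, a))
```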

Figure 13.5.1 illustrates the function and network topology of our design. By enforcing structural regularity, we allow the physical architecture to maximally exploit the locality of the CNN algorithm. Each CNN layer carries out a multi-channel, multi-filter convolution. The number of filters in each convolutional layer is restricted to 256, the filter size is 2×2, and the number of channels is 256. The circuit benefits of this regularity are short wires and arrayed, low fan-out de-multiplexers, which minimize path loading between memory and logic.

Figure 13.5.2 shows the top-level architecture, which supports up to 9 layers with a customized instruction set for input-output actions, CNN and fully-connected (FC) layers. The processor reads an RGB image, converts the channels to 85-level thermometer codes, and stacks them into a 256-channel image as the CNN input. At the output, an FC layer digitally computes the 4b class label. For a CNN layer, east and west SRAM banks alternate roles between input and output in a ping-pong fashion. These SRAM banks are 256b wide, each word representing a 256-channel pixel. Computation of a filter is completed inside a neuron, eliminating partial sums. The weights are transferred from SRAM to local neuron memory (latches) and reused, while the filter traverses the image. A data-parallel array of 64 neurons processes a patch of the input image, amortizing the input image SRAM read energy per filter computation by 64×. The input DEMUX block interfaces between SRAM (which loads a pixel) and the neuron array (which receives a patch). For the FC layer, weights are loaded from a separate SRAM bank 64 channels at a time, and the multiply-accumulate operation is performed sequentially in the digital domain.

Figure 13.5.3 shows how locality translates to reduced loading. The input DEMUX is an array of 1-to-4 de-multiplexers with output registers. Each pixel of the input image can be reused in the processing of two overlapping patches, amortizing the input image SRAM read energy per filter computation by 2×. A 2-by-2 crossbar interchanges pixel pairs at the neuron array input. Filter weights are transferred over a 4b-per-neuron bus, split into north and south halves to reduce the loading of weight transfers by 2×. To minimize neuron-array-to-memory wiring, each neuron writes to the same 4 output channels (1 per filter group) in each CNN layer, allowing implementation of the output DEMUX block as an array of 1-to-4 de-multiplexers. Max pooling occurs incrementally during convolution by first reading a bit in the output image SRAM, and then writing back its logical OR with a neuron output.

Figure 13.5.4 shows the neuron schematic, which computes the weighted sum of a filter with a patch of the input image. With memory energy amortized by parallelism and reuse, and multiplication reduced to XNOR, high-fan-in addition becomes the main bottleneck. However, in the employed SC neuron, the energy cost of addition is reduced by the small voltage swing at the charge-conservation node. In contrast, a digital adder tree would involve rail-to-rail swings along its stages and exhibits a larger amount of switched capacitance. The neuron’s dominant noise source is the comparator, but its energy cost is amortized over 1024 weights and the CNN can tolerate some noise. As a result, the SC neuron is amenable to low-voltage operation, and uses a combined 0.6V digital supply/analog reference and a 0.8V comparator supply. Because the SC neuron performs a weighted sum with data-dependent switching (apart from the comparator), its energy scales with activity, like static CMOS. The SC neuron uses a capacitive DAC (CDAC) with four sections: a 1024b thermometer section for applying a filter, a binary-weighted section for the neuron’s bias, a threshold section (comparator), and a common-mode (CM) setting section to compensate for parasitics at the charge-conservation node. Comparator offset is digitized using calibration at startup, stored in a local register, and subtracted from the bias loaded from SRAM during weight transfer. In environments where large temperature changes may induce significant offset drift, calibration can be performed periodically (e.g., once per second) at negligible cost in average energy per classification and throughput. Behavioral Monte Carlo simulations were run to determine the amount of comparator noise, offset, and unit-capacitor mismatch that the CNN can tolerate without degradation in classification accuracy, resulting in a comparator designed for 4.6mV offset and a 1fF unit capacitor. Because the voltage representing the weighted sum is developed at the charge-conservation node, top- and bottom-plate parasitics do not affect linearity. During convolution, the CDAC is periodically cleared (sampling 0V) as required by leakage at the top plates. To prevent drawing excessive charge from the supply, the unit-capacitor bottom-plate nodes are discharged by asserting CLR before the top plate is discharged via CLRe. To prevent asymmetric charge injection, the top-plate switches are opened before the bottom-plate voltages resume their values set by filter weights, image inputs, and biases.
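
In the spirit of the behavioral Monte Carlo analysis mentioned above, the following hedged sketch (an editorial illustration, not the paper's model) treats one SC neuron as a weighted sum on the charge-conservation node followed by a sign comparator, with unit-capacitor mismatch and comparator offset/noise as the non-idealities; all sigma values are made-up placeholders, and the offset is redrawn per evaluation purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(5)

def sc_neuron(weights_pm1, acts_pm1, bias,
              cap_sigma=0.01, offset_sigma=0.005, noise_sigma=0.002):
    """Return the binary activation (+1/-1) of one noisy SC-neuron evaluation."""
    caps = 1.0 + rng.normal(0.0, cap_sigma, size=weights_pm1.shape)  # unit-cap mismatch
    v_sum = np.sum(caps * weights_pm1 * acts_pm1) / len(weights_pm1) + bias
    offset = rng.normal(0.0, offset_sigma)   # comparator offset (placeholder sigma)
    noise = rng.normal(0.0, noise_sigma)     # per-decision comparator noise
    return 1 if v_sum + offset + noise > 0 else -1

w = rng.choice([-1.0, 1.0], size=1024)       # 1024-tap binary filter, as in the CDAC
a = rng.choice([-1.0, 1.0], size=1024)
ideal = 1 if np.dot(w, a) > 0 else -1
flips = sum(sc_neuron(w, a, bias=0.0) != ideal for _ in range(1000))
print(f"decisions differing from the ideal comparator: {flips}/1000")
```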

Figure 13.5.5 shows the measured results at room temperature. Ten different chips were measured to evaluate the accuracy spread due to thermal noise and mismatch in the SC neuron. At nominal supply voltages (VDD=VMEM=1.0V, VNEU=0.6V, VCOMP=0.8V), the chips operate up to 380frames/s (FPS) and achieve 5.4μJ/classification. Lowering VDD and VMEM to 0.8V leads to 3.8μJ/classification (a 1.43× reduction) at 237 FPS. The mean classification accuracy is 86.05% (see histogram), the same as observed in a perfect digital model. The histogram spread is solely caused by the noise and mismatch in the SC neuron (which can notably lead to a higher classification accuracy than in the perfect digital model). The 95% confidence interval in mean classification accuracy is 86.01% to 86.10%, measured over 10 chips, 30 runs each through the 10,000-image CIFAR-10 test set. Not included in these energy figures is the 1.8V chip I/O energy, which amounts to 0.43μJ (a small fraction of the core energy).

To explore further energy savings, we reduced VDD to 0.6V and set VMEM to 0.53V, 0.52V, 0.51V and 0.50V to show the impact of bit errors. The mean accuracy degrades to 85.7%, 85.2%, 84.2% and 82.5%, respectively. The large error bars for the lower voltages (see Fig. 13.5.5) are due to SRAM VMIN variations across the 10 chips. At VDD=0.6V and VMEM=0.53V (borderline practical), the chip consumes 2.61μJ/classification, a 2.1× reduction versus nominal supplies. From the breakdown in Fig. 13.5.5, we see that neuron energy increases due to leakage at the lower FPS imposed by voltage scaling. However, this increase is small compared to the logic and memory savings.

Figure 13.5.6 compares this work with prior art, and Fig. 13.5.7 shows a photo of the 2.44mm×2.44mm die. On the same benchmark dataset (CIFAR-10), we achieve a 40-60× improvement in energy per classification over [3], which does not exploit locality and thus suffers from high interconnect activity. The binarized DNN accelerator in [5] has all memory on-chip, but cannot exploit weight reuse, attaching the energy cost of an SRAM bit load to each XNOR operation. The spiking LCA network in [6] exhibits low energy, but has a relatively low accuracy for a lower-complexity task (MNIST).

Acknowledgements:
Silicon fabrication was provided by the TSMC university shuttle program. We thank Scott Liao for design support. This work was funded in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six STARnet Centers, sponsored by MARCO and DARPA.

References:
[1] J. Zhang, et al., “A Machine-Learning Classifier Implemented in a Standard 6T SRAM Array,” IEEE Symp. VLSI Circuits, 2016.
[2] Y. H. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” ISSCC, pp. 262-263, 2016.
[3] S. Esser, et al., “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing,” Proc. Natl. Acad. Sci. USA, vol. 113, no. 41, pp. 11441-11446, 2016.
[4] M. Courbariaux, et al., “Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1,” arXiv preprint:1602.02830v3, 2016.
[5] K. Ando, et al., “BRein Memory: A 13-Layer 4.2K Neuron/0.8M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS,” IEEE Symp. VLSI Circuits, 2017.
[6] F. Buhler, et al., “A 3.43TOPS/W 48.9pJ/Pixel 50.1nJ/Classification 512 Analog Neuron Sparse Coding Neural Network with On-Chip Learning and Classification in 40nm CMOS,” IEEE Symp. VLSI Circuits, 2017.


Figure 13.5.1: System design and binary CNN topology.

Figure 13.5.2: Top-level architecture with 64 neurons.

Figure 13.5.3: Locality in logic design and physical architecture.

Figure 13.5.4: Switched-capacitor neuron using charge redistribution for wide vector summation.

Figure 13.5.5: Measured energy and accuracy on CIFAR-10 image classification.

Figure 13.5.6: Comparison to state of the art.


Figure 13.5.7: Die photo.


13.6 A 1.8Gb/s 70.6pJ/b 128×16 Link-Adaptive Near-Optimal Massive MIMO Detector in 28nm UTBB-FDSOI

Wei Tang1, Hemanth Prabhu2, Liang Liu2, Viktor Öwall2, Zhengya Zhang1

1University of Michigan, Ann Arbor, MI; 2Lund University, Lund, Sweden

This work presents a 2.0mm2 128×16 massive MIMO detector IC that provides 21dB array gain and 16× multiplexing gain at the system level. The detector implements iterative expectation-propagation detection (EPD) for up to 256-QAM modulation. Tested with measured channel data [1], the detector achieves 4.3dB processing gain over state-of-the-art massive MIMO detectors [2, 3], enabling a 2.7× reduction in transmit power for battery-powered mobile terminals. The IC uses link-adaptive processing to meet a variety of practical channel conditions with scalable energy consumption. The design is realized as a condensed systolic array with approximate moment-matching circuitry to reach 1.8Gb/s at 70.6pJ/b. The performance and energy efficiency can be tuned over a wide range by UTBB-FDSOI body bias.

Real-time detection for massive MIMO is compute-intensive and power-hungry due to massive matrix dimensions and fast-varying channels. Previous works [2, 3] demonstrated low-complexity massive MIMO detectors based on the independent and identically distributed (i.i.d.) channel assumption in massive MIMO. The i.i.d. channel assumption is impractical, and these simplified detectors suffer from significant performance losses when tested on measured massive MIMO channels, especially in cases of high user load. In designing a practical massive MIMO detector, we select EPD, which leverages iterative interference cancellation [4] to offer near-optimal performance even in unfavorable channel conditions, while its complexity is limited to O(K³) per iteration, where K is the number of users. We exploit EPD’s iterative processing to adapt the processing effort to the channel so as to achieve the required BER at the lowest energy. The EPD design incorporates explicit matrix inversion, so it could be reused for both uplink and downlink processing. Evaluated using measured massive MIMO channels, the EPD outperforms a linear MMSE detector by 0.7dB, 4.3dB, and 3.5dB in i.i.d., non-line-of-sight (NLOS), and line-of-sight (LOS) conditions, respectively, as shown in Fig. 13.6.1.
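
For context on the linear MMSE baseline that EPD is compared against, the following numpy sketch (an editorial illustration, not the EPD algorithm or the chip's fixed-point datapath) forms the regularized Gram matrix and matched-filter output for a 128×16 uplink and solves for the symbol estimates; the QPSK model and noise normalization are simplifying assumptions.

```python
import numpy as np

B, K = 128, 16                                    # base-station antennas, users
rng = np.random.default_rng(6)

H = (rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))) / np.sqrt(2)
x = rng.choice([-1, 1], size=K) + 1j * rng.choice([-1, 1], size=K)   # QPSK symbols
n0 = 0.1
y = H @ x + np.sqrt(n0 / 2) * (rng.standard_normal(B) + 1j * rng.standard_normal(B))

G = H.conj().T @ H + (n0 / 2.0) * np.eye(K)       # Gram matrix + MMSE regularization
y_mf = H.conj().T @ y                             # matched-filter output (yMF)
x_hat = np.linalg.solve(G, y_mf)                  # linear MMSE estimate

detected = np.sign(x_hat.real) + 1j * np.sign(x_hat.imag)
print("symbol errors:", int(np.sum(detected != x)))
```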

The EPD architecture is shown in Fig. 13.6.2. The Gram and yMF memory buffers the incoming channel and matched-filtered uplink streams, and the memory supports the flexible access patterns required for reconfiguration. The MMSE parallel interference cancellation (MMSE-PIC) filter cancels the inter-user interference from the uplink user data. The moment-matching unit refines the symbol estimates by incorporating constellation information. The detection-control unit dynamically adjusts the per-iteration processing effort and detects early convergence. Updated symbol estimates from the moment-matching unit are buffered in the symbol-estimate memory and fed back to the MMSE-PIC filter for iterative refinement. The architecture is configurable to support vector processing of different lengths to facilitate dynamic dimension reduction, i.e., when a batch of estimates is determined to be reliable, those estimates are frozen and the corresponding users are removed from future iterations. As shown in Fig. 13.6.2, dynamic dimension reduction enables 40-90% complexity reduction owing to the reduced number of users and iteration count. With appropriate threshold choices, the possibility of premature freezing is minimized and the SNR loss is negligible. In the silicon prototype, we combine this adaptive architecture with coarse-grained clock gating to save 49.3% power.
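To make the dataflow concrete, the following Python/NumPy sketch mimics the iterative loop described above: an MMSE-PIC-style refinement, per-user moment matching, freezing of reliable users (dimension reduction), and early termination. The function names, the confidence threshold, and the simplified filter form are illustrative assumptions; the chip follows the EPD formulation of [4].

```python
import numpy as np

def qam_constellation(m=16):
    """Square m-QAM points normalized to unit average energy (helper for the sketch)."""
    side = int(np.sqrt(m))
    levels = np.arange(-(side - 1), side, 2)
    pts = np.array([complex(i, q) for i in levels for q in levels])
    return pts / np.sqrt((np.abs(pts) ** 2).mean())

def epd_detect(H, y, n0, n_iter=4, conf_thresh=0.95):
    """Simplified EPD-flavoured loop: MMSE-PIC-style refinement followed by
    per-user moment matching, with freezing of confident users (dynamic
    dimension reduction) and early-convergence detection. Illustrative only."""
    K = H.shape[1]
    pts = qam_constellation()
    x_mean = np.zeros(K, dtype=complex)
    x_var = np.full(K, (np.abs(pts) ** 2).mean())   # prior symbol variance
    active = np.ones(K, dtype=bool)
    G = H.conj().T @ H                               # Gram matrix
    y_mf = H.conj().T @ y                            # matched-filter output
    for _ in range(n_iter):
        # MMSE-PIC-style estimate given the current means/variances
        A = G + np.diag(n0 / np.maximum(x_var, 1e-9))
        z = x_mean + np.linalg.solve(A, y_mf - G @ x_mean)
        # Moment matching: posterior mean/variance over constellation points
        for k in np.where(active)[0]:
            p = np.exp(-np.abs(z[k] - pts) ** 2 / n0)
            p /= p.sum()
            x_mean[k] = p @ pts
            x_var[k] = p @ np.abs(pts - x_mean[k]) ** 2
            if p.max() > conf_thresh:    # freeze reliable users
                active[k] = False
        if not active.any():             # early convergence: all users frozen
            break
    return x_mean
```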

One of the most compute-intensive and accuracy-critical parts of the EPD is the matrix-inversion block in the MMSE-PIC filter. A systolic array is often used to implement the LDL decomposition for highly accurate matrix inversion. The systolic-array architecture features a regular structure, efficient routing, and simple control. However, the hardware utilization of a systolic array is only 33.3% [5] due to the need for zero-padding inputs. In this work, we implement a condensed LDL systolic array, which merges under-utilized PE circuitry to improve the hardware utilization to 90% for a 16×16 array, while reducing the interconnect overhead by more than 70%. As shown in Fig. 13.6.3, a PE in a regular systolic array performs division (PE0), multiplication (PE1), or MAC (PE2 and PE3) operations and passes its output to the neighboring PEs. In our condensed systolic array, every three PEs in a row are merged. The merging shortens data movements in the systolic array: rather than passing data through many stages of unused operations, our condensed array limits data movements using holding buffers to maximize data reuse. The data reuse is especially advantageous in our design, as it requires a relatively long 28b data bit width to support a wide range of channel conditions. The condensed array architecture reduces silicon area by 62% compared to the regular systolic array. Moreover, the condensed array shortens data-movement delay and dedicates a larger fraction of each clock period to data processing.
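For reference, the LDL decomposition that the condensed systolic array realizes can be expressed compactly in software. The sequential NumPy sketch below shows the arithmetic carried out by the merged PEs and how the resulting factors yield the matrix inverse; it is a software reference, not a model of the systolic mapping itself.

```python
import numpy as np

def ldl_inverse(A):
    """Sequential reference for LDL^H-based inversion of a Hermitian
    positive-definite Gram matrix (illustrative software only)."""
    n = A.shape[0]
    L = np.eye(n, dtype=complex)
    d = np.zeros(n, dtype=complex)
    for j in range(n):
        d[j] = A[j, j] - (L[j, :j] * d[:j]) @ L[j, :j].conj()
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * d[:j]) @ L[j, :j].conj()) / d[j]
    # A = L D L^H  =>  A^{-1} = L^{-H} D^{-1} L^{-1}
    L_inv = np.linalg.inv(L)
    return L_inv.conj().T @ np.diag(1.0 / d) @ L_inv
```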

The moment-matching unit computes the likelihood of each constellation point to refine the mean and variance of the current symbol estimates. The computational complexity is proportional to the product of the modulation size and the number of simultaneously served users; for a 256-QAM, 128×16 massive MIMO system, the complexity is prohibitive. We implement approximate moment-matching (AMM) circuitry that cuts 90% of the computation at the cost of a limited 0.5dB SNR loss. AMM makes the complexity independent of the modulation size by exploiting the symmetry of the QAM constellation in computing the mean and variance estimates, making the approach favorable for a flexible detector that supports a wide range of modulation schemes. Complexity is further reduced with a piecewise-linear approximation of the mean and variance updates: the mean update is reduced to a hard decision on the input soft symbol, and the variance update is fitted to a first-order polynomial function of the input mean and variance. As shown in Fig. 13.6.4, compared to a brute-force moment-matching implementation using 2 dividers, 65 MACs, and 16 exponential evaluations, the AMM circuitry uses only 2 MACs. AMM also eliminates costly exponentiation and division and reduces intermediate bit-width requirements. The technique cuts the silicon area of the moment-matching unit by more than 90%.
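The two moment-matching variants can be contrasted in a few lines. The sketch below shows the brute-force likelihood computation next to an AMM-style update in which the mean becomes a hard decision and the variance a first-order polynomial; the coefficients and the exact functional form are placeholders, not the fitted values used on the chip.

```python
import numpy as np

def exact_moment_match(z, var_in, pts, n0):
    """Brute-force moment matching: evaluate the likelihood of every
    constellation point (the exp/divide-heavy baseline of Fig. 13.6.4)."""
    p = np.exp(-np.abs(z - pts) ** 2 / (n0 + var_in))
    p /= p.sum()
    mean = p @ pts
    var = p @ np.abs(pts - mean) ** 2
    return mean, var

def approx_moment_match(z, var_in, pts, a=0.5, b=0.1):
    """AMM-style update: the mean collapses to a hard decision on the soft
    symbol, and the variance is a first-order function of the input
    statistics. Coefficients a, b are hypothetical, not the fitted values."""
    mean = pts[np.argmin(np.abs(z - pts))]         # hard decision
    var = a * var_in + b * np.abs(z - mean) ** 2   # first-order fit (assumed form)
    return mean, var
```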

An EPD test chip is fabricated in ST 28nm UTBB-FDSOI technology, occupying a 2.0mm2 core area, as shown in Fig. 13.6.7. Measurement results at different core voltages and body biases at room temperature are shown in Fig. 13.6.5. At a nominal voltage of 1.0V, the EPD chip runs at 512MHz, delivering a system throughput of 1.6Gb/s. By applying a forward body bias of 0.4V, a maximum operating frequency of 569MHz is achieved, corresponding to an 11% boost in detection throughput to 1.8Gb/s. The corresponding core power consumption is 127mW, translating to an energy efficiency of 70.6pJ/b. For low-power applications, a reverse body bias of 0.2V and voltage scaling to 0.7V can be applied to reduce the power consumption to 23.4mW at a throughput of 754Mb/s. Compared to the prior MIMO detector designs shown in Fig. 13.6.6, our EPD chip provides flexibility in terms of modulation and channel adaptation, supports both uplink and downlink processing, and achieves a high processing gain, while maintaining competitive energy and area efficiency. Note that the MPD chip in [2] takes advantage of the assumed diagonal dominance of i.i.d. channels using a low-complexity, 13b implementation without explicit matrix inversion. However, the MPD encounters an early error floor and fails to provide sufficient processing gain in practical but unfavorable channels such as LOS. In comparison, our EPD chip obtains 4.3dB processing gain in highly correlated channels, equivalent to a 2.7× boost in link margin that can be used to significantly lower the TX power and relax the front-end requirements.

Acknowledgements:
The work was supported in part by NSF CCF-1054270, Intel, Silicon Labs, and the System Design on Silicon (SoS) Center. Chip fabrication donation was provided by STMicroelectronics. We thank Ove Edfors, Rakesh Gangarajaiah, Babak Mohammadi, Shiming Song, and Teyuh Chou for advice.

References:
[1] S. Malkowsky, et al., "The World's First Real-Time Testbed for Massive MIMO: Design, Implementation, and Validation," IEEE Access, vol. 5, pp. 9073-9088, 2017.
[2] W. Tang, et al., "A 0.58mm2 2.76Gb/s 79.8pJ/b 256-QAM Massive MIMO Message-Passing Detector," IEEE Symp. VLSI Circuits, pp. 1-2, 2016.
[3] H. Prabhu, et al., "A 60pJ/b 300Mb/s 128×8 Massive MIMO Precoder-Detector in 28nm FD-SOI," ISSCC, pp. 60-61, 2017.
[4] J. Céspedes, et al., "Expectation Propagation Detection for High-Order High-Dimensional MIMO Systems," IEEE Trans. Commun., vol. 62, no. 8, pp. 2840-2849, 2014.
[5] S. J. Bellis, et al., "Alternative Systolic Array for Non-Square-Root Cholesky Decomposition," IEE Proc. Comput. Digit. Technol., vol. 144, no. 2, pp. 57-64, 1997.
[6] C.-H. Chen, et al., "A 2.4mm2 130mW MMSE-Nonbinary LDPC Iterative Detector-Decoder for 4×4 256-QAM MIMO in 65nm CMOS," ISSCC, pp. 338-339, 2015.


ISSCC 2018 / February 13, 2018 / 4:15 PM

Figure 13.6.1: A multi-user massive MIMO system and BER for different channels. Insets are the Gram matrices |H^H H|.
Figure 13.6.2: Link-adaptive EPD architecture and efficiency gains from dimension reduction and early termination.
Figure 13.6.3: Condensed LDL systolic array with enhanced utilization and merged PE designs.
Figure 13.6.4: Circuitry implementations and complexities of the original and approximate moment-matching.
Figure 13.6.5: Measured frequency and power consumption for different core voltages and body biases.
Figure 13.6.6: Comparison with state-of-the-art MIMO detector implementations.


ISSCC 2018 PAPER CONTINUATIONS

Figure 13.6.7: Chip features and microphotograph.


ISSCC 2018 / SESSION 13 / MACHINE LEARNING AND SIGNAL PROCESSING / 13.7

13.7 A 232-to-1996KS/s Robust Compressive-Sensing Reconstruction Engine for Real-Time Physiological Signals Monitoring

Ting-Sheng Chen, Hung-Chi Kuo, An-Yeu Wu

National Taiwan University, Taipei, Taiwan

Compressive sensing (CS) techniques enable new reduced-complexity designs for sensor nodes and help reduce overall transmission power in wireless sensor networks [1-2]. Prior CS reconstruction chip designs have been described in [3-4]. However, for real-time monitoring of physiological signals, the orthogonal matching pursuit (OMP) algorithms they employ are sensitive to measurement-noise interference and suffer from a slow convergence rate. This paper presents a new CS reconstruction engine fabricated in 40nm CMOS with the following features: 1) a sparsity-estimation framework to suppress measurement-noise interference at the sensing nodes, achieving at least 8dB signal-to-noise ratio (SNR) gain under the same success rate for robust reconstruction; 2) a new flexible indices-updating VLSI architecture, inspired by the gradient-descent method [5], that supports arbitrary signal dimensions, (Lnew, M), for CS reconstruction with a high sparsity level (Kmax); 3) parallel searching, indices bypassing, and functional blocks that automatically group processing elements (PEs), designed to reduce the total CS reconstruction cycle latency by 84%. Compared with prior state-of-the-art designs, this CS reconstruction engine achieves 10× higher throughput and 4.2× better energy efficiency at the minimum-energy point (MEP).

In blind reconstruction algorithms that operate without a priori knowledge of the sparsity level, such as OMP [3-4] and stochastic gradient pursuit (SGP) [5], measurement noise destroys the sparsity (K) of the received signals, degrading reconstruction quality and speed [5]. The subspace pursuit (SP) algorithm offers excellent recovery quality and convergence under noisy scenarios; however, it is classified as non-blind reconstruction because it needs an explicit a priori sparsity level. We propose a two-phase sparsity-estimation subspace pursuit (SE-SP) CS reconstruction algorithm, as shown in Fig. 13.7.1. SE-SP copes with measurement noise in two phases. Phase-I (P1) performs blind reconstruction similar to OMP: it selects indices up to the maximum number, Kmax, in order to obtain all potential indices, then estimates the effective sparsity level, K, as the number of elements whose amplitudes exceed a noise-distortion level derived from the residual norm. Phase-II (P2) applies SP with the output of P1 and the estimated K to obtain the K-best sparse solution. SE-SP still performs blind reconstruction like OMP, but it possesses all of the advantages of the non-blind SP, such as robustness to measurement noise. Hence, it achieves 8dB gain (for time-sparse signals) in terms of success rate with only 10% iteration-count overhead and without any a priori sparsity information.
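The two-phase flow can be summarized in a short NumPy sketch: Phase I runs a blind OMP-like search up to Kmax and derives K from a residual-norm threshold, and Phase II runs subspace pursuit with that K. The threshold constant, the plain-OMP Phase I, and the stopping rule are assumptions made for illustration rather than the paper's exact procedure.

```python
import numpy as np

def omp_phase1(Phi, y, k_max):
    """Phase-I blind search (plain OMP used here for illustration)."""
    residual, support = y.copy(), []
    for _ in range(k_max):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coeffs
    return np.array(support), coeffs, residual

def se_sp(Phi, y, k_max, max_iter=10, c=2.0):
    """Sketch of SE-SP: estimate the effective sparsity K in Phase I,
    then run subspace pursuit with that K in Phase II. The constant c
    and other details are assumptions for illustration."""
    M, N = Phi.shape
    # Phase I: blind reconstruction + sparsity estimation
    support, coeffs, residual = omp_phase1(Phi, y, k_max)
    noise_level = c * np.linalg.norm(residual) / np.sqrt(M)
    K = max(1, int(np.sum(np.abs(coeffs) > noise_level)))
    # Phase II: subspace pursuit seeded with the K strongest Phase-I indices
    S = support[np.argsort(-np.abs(coeffs))[:K]]
    for _ in range(max_iter):
        ls, *_ = np.linalg.lstsq(Phi[:, S], y, rcond=None)
        proxy = Phi.T @ (y - Phi[:, S] @ ls)                 # correlation proxy
        cand = np.union1d(S, np.argsort(-np.abs(proxy))[:K]) # expanded candidate set
        ls_c, *_ = np.linalg.lstsq(Phi[:, cand], y, rcond=None)
        S_new = cand[np.argsort(-np.abs(ls_c))[:K]]          # keep K-best support
        if set(S_new) == set(S):
            break
        S = S_new
    final, *_ = np.linalg.lstsq(Phi[:, S], y, rcond=None)
    x = np.zeros(N)
    x[S] = final
    return x
```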

Figure 13.7.2 shows the least-mean-squares (LMS)-based architecture used to implement the SP algorithm in our design. It benefits from local data updating and is therefore free of global communication overhead and wiring cost. To implement the SP algorithm, we need to add/estimate K supports through a least-squares (LS) computation in each iteration. However, a direct implementation of LS is unable to simultaneously realize the updating of multiple indices and amplitudes and the configurability features in [4] for handling variable measurement dimensions (M). Furthermore, it requires backward- and forward-substitution (BS, FS) operations, which incur additional global communication cost and high iteration counts. Although SGP uses LMS to enhance OMP, cache overhead and a fixed step size limit its scalability and convergence. This chip uses a global buffer to transpose the columns of the chosen indices into an on-chip cache for the LMS updating process, resulting in a 5× area reduction of the cache. It approximates the target sparse solution (x) with the following advantages: 1) speed-up of the SP algorithm: arbitrary support, Lnew ≤ 128, can be added at each iteration, after which the sorting engine finds the K-best solution for enhanced reconstruction quality; 2) support for reconfigurable design: the line-buffer-based LMS datapath is adjustable to arbitrary sizes (e.g., signal dimensions M and K), reaching 100% configurability with only 0.5% area overhead; 3) local LMS updating enabling scalable designs: it removes the limitation of global BS/FS operations, so a larger signal sparsity level (K=256) and 3× higher clock rates are achieved in this chip.
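The idea of replacing the explicit LS solve (and its BS/FS steps) with local iterative updates can be illustrated with a block-LMS/gradient-descent amplitude update over the chosen support, as sketched below; the step-size rule and iteration count are assumptions.

```python
import numpy as np

def lms_amplitude_update(Phi_S, y, n_steps=32, mu=None):
    """Iterative (block-LMS / gradient-descent) approximation of the
    least-squares amplitudes over the chosen support, avoiding explicit
    back/forward substitution. Software sketch with an assumed step size."""
    M, K = Phi_S.shape
    x = np.zeros(K)
    if mu is None:
        mu = 1.0 / (np.linalg.norm(Phi_S, 2) ** 2)   # conservative, stable step size
    for _ in range(n_steps):
        error = y - Phi_S @ x             # residual under the current amplitudes
        x = x + mu * (Phi_S.T @ error)    # local gradient step toward the LS solution
    return x
```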

Figure 13.7.3 shows the block diagram of the overall chip architecture. To achieve higher area efficiency, the tasks of the SE-SP are mapped onto a folded architecture. 768KB of on-chip memory stores multiple sensing matrices for flexible reconstruction, configured either as a single matrix for large signal dimensions or as 4 matrices for small ones. The 256 configurable PEs, 192KB cache, and multi-task buffers can reconstruct a sparse signal with 100% flexibility. A sorting engine performs the task of finding the K-best indices/support in Phase-II. The data representation is 32b fixed point. The engine supports any integer setting of (N, M, K) up to (2048, 512, 256), representing a larger sparsity level than prior art.

Figure 13.7.4 shows three architectural optimization techniques that reduce the operating cycle count. 1) Dynamic PE grouping: because the index-searching (IS) task features high complexity but low data dependency, the architecture can be either unfolded or folded according to the measurement size. With larger M, the PEs use multiple cycles to complete a correlation operation in IS; when M is small, the PEs are grouped dynamically to reduce the total IS cycle count by 25% to 50%. 2) Chosen-indices bypassing: indices chosen in past iterations can be bypassed during IS. The sorting engine checks the chosen indices before loading the column from the sensing matrix, eliminating unnecessary correlation operations and reducing the IS cycle count by 10%. 3) Parallel sparsity estimation: P1 provides a sparsity-order estimate for SP to screen the support, rather than reconstructing signals directly. Therefore, this chip accelerates the P1 operation by choosing multiple (2-8) indices in each iteration, which reduces the total iterations by 79%. The multi-task buffer is also designed to transpose up to 8 chosen columns directly. Together, these optimizations remove 85% of the total cycles, enhancing throughput and energy efficiency by 6.3× for CS reconstruction.
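Optimizations 2) and 3) amount to a modified index-search step: already-chosen columns are masked out of the correlation, and several of the strongest remaining indices are returned at once. The short sketch below illustrates this; the function name and the n_pick parameter are hypothetical.

```python
import numpy as np

def index_search(Phi, residual, chosen, n_pick=4):
    """Index search with chosen-indices bypassing and multi-index (parallel)
    selection; n_pick plays the role of the 2-8 indices chosen per Phase-I
    iteration. Illustrative only."""
    corr = np.abs(Phi.T @ residual)    # correlate every column with the residual
    corr[list(chosen)] = -np.inf       # bypass columns already in the support
    return np.argsort(-corr)[:n_pick]  # the n_pick strongest new candidates
```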

Figure 13.7.5 shows the chip measurement results. The measured MEP is at 40MHz under a 0.65V supply. Inspired by [4], we found that the operating point for best energy efficiency lies near the MEP; therefore, this chip reduces area cost by using a single global VDD for both logic and memory. Since SE-SP possesses a noise-tolerance feature, when reconstructing physiological signals (ECG, EMG, EEG, and PPG in our measurements) under noisy conditions, it realizes at least 8dB higher SNR gain than OMP-based designs at the same reconstruction SNR (RSNR).

Figure 13.7.6 shows a comparison with state-of-the-art designs. This CS reconstruction engine provides 232-1996KS/s for reconstructing the physiological signals of multiple patients, while offering full reconfigurability with 100% flexibility to support arbitrary signal dimensions (M, N) and robustness to measurement-noise interference. By operating at a higher clock rate but with fewer cycles, this chip achieves a 7-19× throughput enhancement and 3-7× higher energy efficiency compared with prior work. The power consumption is larger than prior art due to the 3× higher operating frequency. The radar chart shows that this chip supports a larger sparsity level, better energy efficiency, and a higher throughput rate. Figure 13.7.7 shows the chip micrograph and summary. In conclusion, the 3.06mm2 CS reconstruction engine can provide timely physiological-signal reconstruction for data collected from CS-based wireless biosensors under noisy conditions, making intelligent patient monitoring a reality.

Acknowledgements:
Thanks to Prof. Chia-Hsiang Yang and Prof. Tsung-Te Liu for useful discussions. The authors would like to thank the National Chip Implementation Center, Taiwan, for support with chip fabrication and measurements. This work is supported by the Ministry of Science and Technology of Taiwan under Grant MOST 106-2221-E-002-204-MY3. The first two authors contributed equally.

References:
[1] A. Dixon, et al., "Compressed Sensing System Considerations for ECG and EMG Wireless Biosensors," IEEE Trans. Biomed. Circuits Syst., vol. 6, no. 2, pp. 156-166, 2012.
[2] D. Gangopadhyay, et al., "Compressed Sensing Analog Front-End for Bio-Sensor Applications," IEEE JSSC, vol. 49, no. 2, pp. 426-438, 2014.
[3] Y.-C. Cheng, et al., "Matrix-Inversion-Free Compressed Sensing with Variable Orthogonal Multi-Matching Pursuit Based on Prior Information for ECG Signals," IEEE Trans. Biomed. Circuits Syst., vol. 10, no. 4, pp. 864-873, 2016.
[4] F. Ren, et al., "A Configurable 12-to-237KS/s 12.8mW Sparse-Approximation Engine for Mobile ExG Data Aggregation," ISSCC, pp. 334-335, 2015.
[5] Y. M. Lin, et al., "Low-Complexity Stochastic Gradient Pursuit Algorithm and Architecture for Robust Compressive Sensing Reconstruction," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 638-650, 2017.



ISSCC 2018 / February 13, 2018 / 4:45 PM

Figure 13.7.1: Proposed two-phase sparsity-aware reconstruction.
Figure 13.7.2: LMS architecture for implementing ℓ2 minimization.
Figure 13.7.3: Block diagram of the chip architecture.
Figure 13.7.4: Architecture-level optimization.
Figure 13.7.5: Measured results: Shmoo plot, power consumption, and reconstruction quality vs. SNR.
Figure 13.7.6: Comparison with prior chip implementations.


ISSCC 2018 PAPER CONTINUATIONS

Figure 13.7.7: Chip micrograph and summary.