TRANSCRIPT
A4H Field Day, Sandia National Labs, April 18, 2019
Deep In-memory Architectures
Naresh Shanbhag, University of Illinois at Urbana-Champaign
Sandia Collaborator – Craig Vineyard
Naresh Shanbhag, University of Illinois at Urbana-Champaign
• machines can beat humans in specific cognitive tasks
• game of Go is complex → huge search space: ~250^150 (Go) vs. ~35^80 (Chess)
• AlphaGo machine: 1,202 CPUs + 176 GPUs
• HUGE energy cost: ~10,000× more than the human brain
The Energy Cost of Intelligence
The Economist, March 2016
Naresh Shanbhag, University of Illinois at Urbana-Champaign
fundamental question
how do we design intelligent machines that
operate at the limits of energy-delay-accuracy?
Communication Receiver (stochastic channel) / The Brain (stochastic neural fabric)
STOCHASTIC COMPONENTS + STATISTICAL MODELS of COMPUTATION
Operating Machines @ the Limits (Energy-Latency-Accuracy)
How to do this in a principled manner?
make errors (for efficiency) then correct (for accuracy)
• Proceedings of IEEE, Special Issue on Non von Neumann Computing, January 2019.
Shannon-Inspired Statistical Computing for the Nanoscale Era
By NARESH R. SHANBHAG, Fellow IEEE, NAVEEN VERMA, Member IEEE, YONGJUNE KIM, Member IEEE, AMEYA D. PATIL, Student Member IEEE, and LAV R. VARSHNEY, Senior Member IEEE
ABSTRACT | Modern day computing systems are based on the von Neumann architecture proposed in 1945 but face dual challenges of: 1) unique data-centric requirements of emerging applications and 2) increased nondeterminism of nanoscale technologies caused by process variations and failures. This paper presents a Shannon-inspired statistical model of computation (statistical computing) that addresses the statistical attributes of both emerging cognitive workloads and nanoscale fabrics within a common framework. Statistical computing is a principled approach to the design of non-von Neumann architectures. It emphasizes the use of information-based metrics; enables the determination of fundamental limits on energy, latency, and accuracy; guides the exploration of statistical design principles for low signal-to-noise ratio (SNR) circuit fabrics and architectures such as the deep in-memory architecture (DIMA) and the deep in-sensor architecture (DISA); and thereby provides a framework for the design of computing systems that approach the limits of energy efficiency, latency, and accuracy.
From its early origins, Shannon-inspired statistical computing has grown into a concrete design framework validated extensively via both theory and laboratory prototypes in both CMOS and beyond. The framework continues to grow at both of these levels, yielding new ways of connecting systems through architectures, circuits, and devices, for the semiconductor roadmap to march into the nanoscale era.

Manuscript received March 11, 2018; revised July 5, 2018; accepted September 1, 2018. This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by the Microelectronics Advanced Research Corporation (MARCO) and the Defense Advanced Research Projects Agency (DARPA). (Corresponding author: Naresh R. Shanbhag.) N. R. Shanbhag, Y. Kim, A. D. Patil, and L. R. Varshney are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]). N. Verma is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.

Digital Object Identifier 10.1109/JPROC.2018.2869867
KEYWORDS | Artificial intelligence; computing; information theory; machine learning; nanoscale devices; statistical computing
I. INTRODUCTION
The invention of the Turing machine [1] (1937) and its realization via the von Neumann architecture [2] (1945) helped spawn the modern information age. The discovery of the transistor by Bardeen, Brattain, and Shockley (1948), and subsequently, the integrated circuit by Kilby (1958), provided an efficient and cost-effective substrate to realize complex computing systems. Since then, computing systems, through various generations of microarchitecture and semiconductor technology [Moore's Law [3] (1965)], have fueled growth in nearly all industries and transformed health, energy, transportation, education, finance, entertainment, and leisure.
Computing systems, however, now face two major challenges. The first is the emergence of nondeterminism in nanoscale fabrics due to increased stochasticity (variations and failures) as well as diminishing energy-delay benefits via CMOS scaling. Stochasticity in nanoscale fabrics is contrary to the deterministic Boolean switch required by the traditional von Neumann architecture. While many beyond-CMOS devices have been invented, none surpass the silicon MOSFET under metrics for a deterministic switch [4], [5]. The second is the emergence of data-centric cognitive applications that emphasize exploration-, learning-, and association-based computations using statistical representations and processing of massive data volumes [6]-[8]. The data-centric nature of these applica-
Shannon-inspired Model of Computing
1. use information-based metrics, e.g., mutual information I(Y; Ŷ)
2. design low-SNR fabrics, e.g., deep in-memory architecture (DIMA)
3. develop statistical error-compensation (SEC) techniques
(noisy channel)
DIMA
[ISSCC’18]
Timeline
2014 DIMA concept paper (ICASSP) [UIUC]
2016 AdaBoost DIMA IC (VLSI) [Princeton]
2017 Multifunction & Random Forest DIMA ICs (ESSCIRC) [UIUC]
2018 In-memory Special Sessions/Forums (ISSCC, VLSI, ICCAD); SGD-SVM DIMA IC (ISSCC) [UIUC]; Scalable DIMA IC (VLSI) [Princeton]; In-memory ICs (ISSCC) [MIT, NTHU, NTHU-ASU]; DIMA US Patent [UIUC]; In-memory ICs (VLSI) [Columbia-ASU]
2019 In-memory ICs (ISSCC) [UMich, NTHU-ASU-GIT, SU (China)]; ISSCC Forum; CICC Forum; …
The Deep In-memory Architecture
[Shanbhag (UIUC) (ICASSP'14, 3×JSSC'18); Verma (Princeton) (VLSI'16, JSSC'17, VLSI'18)]
The Energy Cost of Data Movement
[Horowitz, ISSCC’14]
[Figure: Top-1 accuracy (%) vs. operations (G-Ops) for AlexNet, BN-AlexNet, BN-NIN, ENet, Inception-v3, Inception-v4, ResNet-152, VGG-16, and VGG-19; marker size = # of parameters (35M-155M)] [Canziani, Arxiv'16]
[Figure: computation energy vs. memory energy (45 nm)]
E_memory / E_compute ≈ ~100× (SRAM) → several 100× (DRAM) → ~1000× (Flash)
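The slide's ratio can be turned into a back-of-the-envelope model of where inference energy goes. The sketch below is illustrative only: `E_MAC`, the memory ratios, and the AlexNet-scale operation counts are assumed placeholder values, not measured numbers from the talk.

```python
# Illustrative model: total inference energy = compute + memory fetches,
# with a memory access costing ~100x (SRAM) to ~1000x (DRAM, assumed)
# relative to a multiply-accumulate (MAC).
E_MAC = 1.0                            # energy per MAC (pJ), assumed
RATIO = {"SRAM": 100, "DRAM": 1000}    # assumed E_memory / E_compute ratios

def inference_energy_pj(n_mac, n_fetch, mem="SRAM"):
    """Total energy (pJ) for n_mac MACs plus n_fetch memory accesses."""
    return n_mac * E_MAC + n_fetch * RATIO[mem] * E_MAC

# AlexNet-scale example (assumed): ~0.7 G-MACs, ~61 M weight fetches.
total = inference_energy_pj(0.7e9, 61e6, "SRAM")
compute = 0.7e9 * E_MAC
print(f"memory share of total energy = {100 * (total - compute) / total:.0f}%")
```

Even with every weight fetched only once from SRAM, data movement dominates, which is the motivation for computing inside the memory array itself.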
The DIMA Approach: read functions of data, never the raw data
• standard 6T SRAM BCA (preserves storage density)
• multi-row functional READ (reads multiple bits/row/precharge)
• bitline processing (BLP): SIMD analog processing
• cross bitline processing (CBLP): analog averaging enhances SNR
• reduced-complexity, bandwidth-efficient digital output → inference/decisions
[Architecture: bitcell array with precharge/column mux/Y-DEC and X-DEC; a row of BLPs feeds a cross bitline processor and residual digital unit; analog, mixed-signal low-SNR processing]
DIMA's 1st Stage - Functional READ
• matrix-vector product/read cycle
• column-major format
• multi-row access/read cycle
• modulated access pulses
Fig. 1: Deep in-memory architecture (DIMA) [7], [9], [11]: (a) single bank showing functional read (FR), bitline processing (BLP), cross BLP (CBLP) and sense amplifier (SA), where wordline (WL) drivers shaded in RED are turned on simultaneously, (b) column architecture and bitcell, and (c) idealized FR using pulse-width modulated (PWM) WL access pulses during a 4-bit word (W = 0000b) read-out. The WL pulses can be overlapped in time but are shown as being non-overlapped to enhance clarity.
A. DIMA Processing Stages

DIMA employs a standard SRAM BCA and read circuitry (at the bottom) to preserve conventional read functionality in normal read mode, as shown in Fig. 1(a), where write circuitry is omitted for simplicity. In-memory computing mode, on the other hand, processes both memory access and computation via embedded analog processors (at the top). For this mode, the B bits of the scalar weight W are pre-stored in a column-major format vs. the row-major format used in conventional read mode. A DIMA based on an Nrow × Ncol BCA has the following features:

• Multi-row functional read (FR): fetches a word W by reading a function of B rows per BL precharge (read cycle) from each column (Fig. 1(c)). Thus, Ncol words are read simultaneously per read cycle.
• BL processing (BLP): computes the scalar distance (SD) between W and X per column with an analog BLP block operating with low voltage swing based on a charge-transfer mechanism [9]. The Ncol BLP blocks operate in parallel in a single-instruction multiple-data (SIMD) manner.
• Cross BL processing (CBLP): aggregates the Ncol analog SD results from the BLP blocks by charge-sharing to obtain the vector distance (VD).
• Analog-to-digital converter (ADC) and residual digital logic (RDL): processes a thresholding/decision function f(·) and other miscellaneous functions.

DIMA is well-matched to the data-flow (Table I) intrinsic to commonly encountered ML algorithms. The benefits of DIMA are described next.
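The FR stage above can be sketched numerically: WL i is pulsed for a duration proportional to 2^i, so each stored bit discharges the BL in proportion to its binary weight, and the whole B-bit word is read out as a single analog value. A minimal idealized sketch (noise-free; `DV_LSB` is an assumed LSB discharge step):

```python
# Idealized multi-row functional read (FR): WL i is pulsed for T0 * 2^i,
# so bit w_i discharges the bitline by (2^i) * DV_LSB. The B bits of a
# column-major stored word thus appear as one analog BL discharge.
DV_LSB = 0.01  # assumed LSB bitline discharge step (volts)

def functional_read(bits):
    """bits = [w_{B-1}, ..., w_1, w_0], stored one per row of a column."""
    B = len(bits)
    return sum(w * (2 ** (B - 1 - i)) * DV_LSB for i, w in enumerate(bits))

# Reading W = 1011b (decimal 11) yields a BL discharge proportional to 11:
print(functional_read([1, 0, 1, 1]))
```

In the actual circuit the discharge is bounded by ΔV_BL,max; the sketch ignores such saturation and all noise/distortion terms.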
B. Energy and Throughput Benefits

The throughput of memory-intensive ML inference is often limited by memory access, since the complexity of the computation is low. In this case, the computation can be processed during the memory access of the next data in a pipelined fashion. The SRAM in a conventional digital system requires an L:1 column mux ratio (typically L = 4-32) to accommodate large-area sense amplifiers (SAs) as shown in Fig. 1(a), which limits the number of bits per access to Ncol/L in standard SRAM compared to Ncol·B in DIMA's FR. Therefore, DIMA needs LB times fewer precharge cycles to read the same number of bits, leading to throughput gains. As a result, the throughput improvement factor S of DIMA can be expressed as:

S = LB (T_CONV / T_DIMA)    (5)

where T_CONV is the delay of a single read access in the conventional system, and T_DIMA is the combined delay of a single FR and the following BLP and CBLP stages. T_DIMA is larger than T_CONV by a factor of 4-6 due to the access of multiple rows and the subsequent analog computations. Hence, S = 3.2-12.8 is easily achievable, e.g., for typical values of B = 4, L = 4, 8, 16, and T_DIMA = 5·T_CONV.

DIMA also achieves energy savings due to the fewer precharge cycles and low-swing computations.
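Equation (5) and the quoted range of S can be checked directly; a small sketch using the text's typical values (T_DIMA = 5·T_CONV is the assumption stated above):

```python
# Throughput improvement factor from eq. (5): S = L * B * (T_CONV / T_DIMA).
# DIMA reads Ncol*B bits per (slower) read cycle vs. Ncol/L bits per
# conventional read cycle.
def throughput_gain(L, B, t_ratio):
    """t_ratio = T_DIMA / T_CONV (typically 4-6)."""
    return L * B / t_ratio

for L in (4, 8, 16):
    print(f"L={L:2d}, B=4: S = {throughput_gain(L, 4, 5):.1f}")
```

This reproduces the S = 3.2-12.8 range quoted in the text.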
ΔV_BL(W) ∝ Σ_{i=0}^{B-1} 2^i · w_i   (per-column dot-product via binary-weighted BL discharge)
fundamental energy-delay vs. SNR trade-off
Bit-line Analog Processors
[Circuit detail: multi-row functional read drives BL/BLB via replica cells (ΔVBL ∝ D+P, ΔVBLB ∝ D+P); 256 columns (W0…W255, BL0…BL255) are precharged and processed in parallel; BLP outputs are charge-shared (ø1/ø2) into the cross BL processor (CBLP) to form the sum]
Kernel | BLP | CBLP
Manhattan distance | subtract-compare | aggregation
Euclidean distance | subtract-square | aggregation
Dot product | multiply | weighted aggregation
Hamming distance | XOR | aggregation
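The kernel list reads as a two-stage map-reduce: a per-column BLP op followed by CBLP aggregation across columns. A digital emulation sketch of that structure (the function names and pure-Python framing are mine, not from the talk):

```python
# Digital emulation of DIMA's two-stage compute: a per-column BLP op,
# then CBLP aggregation by charge-sharing (modeled here as a sum).
BLP_OPS = {
    "manhattan": lambda w, x: abs(w - x),    # subtract-compare
    "euclidean": lambda w, x: (w - x) ** 2,  # subtract-square
    "dot":       lambda w, x: w * x,         # multiply
    "hamming":   lambda w, x: w ^ x,         # XOR (on bits)
}

def dima_kernel(kernel, W, X):
    blp_out = [BLP_OPS[kernel](w, x) for w, x in zip(W, X)]  # SIMD BLP
    return sum(blp_out)                                      # CBLP aggregation

print(dima_kernel("dot", [1, 2, 3], [4, 5, 6]))        # 32
print(dima_kernel("manhattan", [1, 2, 3], [4, 5, 6]))  # 9
```

Swapping the per-column op while keeping the aggregation fixed is exactly what makes the hardware multi-functional across these kernels.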
charge-redistribution circuits
[Layout: 6T bitcell (2.11 µm × 0.915 µm); BL processor (BLP) subtractor (15.4 µm)]
tight pitch-matching constraints for BLPs
Pitch-matching in BL Processors: area overhead < 10% (512×256 BCA); BLP area overhead reduces with BCA size
6T SRAM DIMA Prototypes
8b compute in 65 nm CMOS (UIUC): 100× EDP reduction over von Neumann equivalent* @ iso-accuracy
[Die photos (1.2 mm × 1.2 mm): SRAM bitcell array, normal R/W interface, BLP-CBLP, CTRL, ADC, 64b bus, on-chip TRAINER, test block]
• AdaBoost [VLSI'16, JSSC'17] (Princeton)
• multi-functional & random forest [ESSCIRC'17, JSSC'18]
• on-chip learning [ISSCC'18, JSSC'18]
• DNN/CNN [VLSI'18] Verma (Princeton)
* 8b fixed-function digital architecture with identical SRAM size
DIMA - A Shannon-inspired Architecture
DIMA vs. von Neumann Architecture
von Neumann: BCA(W) surrounded by peripherals (L:1 col. mux); a read returns y = W
DIMA: BCA(W) with embedded analog BLP/CBLP; a read cycle returns y = Wᵀ·X + η (η: noise)
similar delay & energy per read cycle, but DIMA computes a matrix-vector multiply
• fundamental EDP vs. SNR trade-off (no free lunch)
• system accuracy derived from its Shannon-inspired roots (affordable lunch)
DIMA → accurate inferences with stochastic computes
von Neumann → accurate inferences with deterministic computes
DIMA's Compute SNR: V_T variations in the functional READ dominate
Figure 1: A comparison of: (a) DIMA showing functional read (FR), bitline processing (BLP), cross BLP (CBLP) and sense amplifier (SA), and (b) the conventional architecture. Wordline (WL) drivers shaded in RED are turned on simultaneously.
Figure 2: Data-flow of typical inference algorithms: (a) data-flow diagram, and (b) computations required in support vector machine (SVM), template matching (TM), k-nearest neighbor (k-NN), and matched filter (MF) algorithms.
the BCA size Nrow × Ncol, the BL precharge voltage VPRE, and the maximum BL swing ΔVBL,max are identical in both architectures.

Since the SRAM in the digital architecture requires an L:1 column multiplexer (typically L = 4, …, 16) to accommodate large-area sense amplifiers (SAs) as shown in Fig. 1(a), the number of bits per read cycle is limited to Ncol/L compared to Ncol·B in DIMA's FR. Therefore, DIMA needs LB times fewer read cycles to read the same number of bits. However, the read cycle time of DIMA is larger than that of the digital architecture since DIMA's read cycle includes both data read and compute functions via the FR, BLP, and CBLP stages. Ignoring the compute delay of the digital architecture, we find that the delay reduction factor in reading a fixed number of bits is given by:

ρ_d = LB (T_digital / T_DIMA) = LB / α    (1)

where T_digital and T_DIMA are the read cycle times of the digital architecture and DIMA, respectively, and α = T_DIMA/T_digital ≈ 3 or 6 (in 65 nm CMOS), depending on the BLP function being computed. Previous work [9] has shown that DIMA can realize B ≤ 6, with B = 4 being comfortably realized. Hence, ρ_d = 5×-to-21× is easily achievable with typical values of B = 4, L = 4, 8, 16, and α = 3.
The dominant sources of energy consumption in an SRAM array are the dynamic energy consumed to precharge the large BL capacitances CBL every read cycle, and leakage. The energy consumed in reading B bits in the digital architecture and DIMA can be expressed as:

E_digital = LB · C_BL · ΔV_BL,max · V_PRE + E_lk-digital    (2)

E_DIMA = γ · C_BL · ΔV_BL,max · V_PRE + E_lk-digital / ρ_d    (3)

where CBL is the BL capacitance, VPRE is the BL precharge voltage, ΔVBL is the BL (or BLB) voltage discharge caused by WL activation, E_lk-digital is the leakage energy of the digital architecture, and 1 ≤ γ < 2 is an empirical factor determined by whether DIMA's FR is a pure B-bit read or incorporates a 2B-bit computation. The leakage energy of DIMA is reduced by a factor of ρ_d since the array can be placed into standby mode after a duration of T_DIMA.

Since the first term in both (2) and (3) is the dominant component of the energy consumption during active mode (CBL is the largest node capacitance in either architecture), their ratio provides the energy reduction factor ρ_e as follows:

ρ_e = E_digital / E_DIMA = LB / γ    (4)

with ρ_e = 8×-to-32× easily achievable for typical values of B = 4, L = 4, 8, 16, and γ = 2.
Hence, from (1) and (4), the EDP reduction over a digital architecture enabled by DIMA is given by:

ρ_edp = ρ_e · ρ_d = (LB)² / (γα)    (5)

which ranges from 2.6×-to-2048×, of which the prototype ICs in [?] have achieved 100× in the laboratory. This clearly indicates that there is significant room to improve upon DIMA's EDP gains achieved thus far. We identify different methods to achieve this gain in later sections. If the digital architecture's compute energy were included in the above analysis, we would find that DIMA's low-swing analog computation is approximately 10× lower in energy than that of the digital architecture.

[Figure: energy per word (pJ) vs. ΔVBL (V) for the conventional architecture and DIMA with NROW = 256, 512, 1024.]

Fig. 3: Defining noise and distortion in DIMA.

The plots of the energy models (2)-(3) in Fig. 2 show that energy consumption increases linearly with ΔVBL and Nrow for both architectures due to the increased precharge energy. However, DIMA achieves aggressive energy efficiency by amortizing the precharge energy over B×Ncol bits compared to Ncol/L bits in the digital architecture. Figure 2 also shows that the silicon-measured energy consumption reported in [9] for Nrow = 512 and ΔVBL = 500 mV closely matches the values obtained from the energy models (2)-(3) (11% modeling error), further validating the energy-delay analysis in this section.
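The delay, energy, and EDP factors (1), (4), and (5) can be tabulated to reproduce the quoted ranges; a sketch with α and γ at the typical values used in the text:

```python
# Gain models: rho_d = LB/alpha (eq. 1), rho_e = LB/gamma (eq. 4),
# rho_edp = (LB)^2 / (gamma * alpha) (eq. 5).
def dima_gains(L, B, alpha=3.0, gamma=2.0):
    rho_d = L * B / alpha        # delay reduction factor
    rho_e = L * B / gamma        # energy reduction factor
    return rho_d, rho_e, rho_d * rho_e

for L in (4, 8, 16):
    rho_d, rho_e, rho_edp = dima_gains(L, B=4)
    print(f"L={L:2d}: rho_d={rho_d:5.1f}x  rho_e={rho_e:5.1f}x  rho_edp={rho_edp:6.1f}x")
```

With B = 4 and L = 4-16 this gives ρ_d ≈ 5.3-21.3, ρ_e = 8-32, and ρ_edp ≈ 43-683, consistent with the ranges above.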
C. Noise and Distortion Models

Though DIMA provides significant gains in EDP, this comes at the expense of the SNR of its analog computations [?]. To see this trade-off, consider the computation of an N-dimensional dot product shown below:

y = WᵀX = Σ_{i=1}^{N} w_i x_i    (6)

where W = [w_1, w_2, …, w_N]ᵀ and X = [x_1, x_2, …, x_N]ᵀ are two N-dimensional real-valued vectors of precision B_W and B_X, respectively. The dot product computation in (6) is pervasive and the most compute-intensive kernel in machine learning algorithms.

In DIMA, the dot product in (6) is modified to:

y = Σ_{i=1}^{N} [(w_i + η_f,i) x_i + η_b,i] + η_c    (7)

where ΔV_BL,i = w_i + η_f,i and ΔV_BLP,i = (w_i + η_f,i) x_i + η_b,i are the observed noisy outputs of the FR and BLP stages of
TABLE I: Noise and distortion in DIMA stages [9].

                        FR (η_f)     BLP (η_b)            CBLP (η_c)
                                     DP         SAD
% distortion (μ)        2.6 (1)      2.1 (1)    2.5 (1)   0.8 (3)
% noise (σ/μ)           5.3-22 (2)   2.8 (2)    3.2 (2)   0.2 (3)

(1) silicon-measured values
(2) Monte Carlo simulations with 0.4 V ≤ VWL ≤ 0.8 V
(3) estimated from the capacitor sizes in [9]

the i-th column, respectively, and ΔV_CBLP = η_c is the thermal noise at the output of the CBLP (see Fig. 1(b)). In practice, we find that σ²_f,i ≫ σ²_b,i ≫ σ²_c, as shown in Table I.
Distortion refers to the percentage difference between an ideal straight-line transfer function of the block and the mean of the measured values over a large number of cells. The distortion error magnitudes for FR and BLP were obtained from silicon measurements in [9]. A dedicated test column was used to obtain measured FR outputs, while the BLP output distortion was back-calculated from measured values of the CBLP output.

The noise magnitudes for both FR (Vt variation in the access transistors) and BLP (TBD) were estimated via Monte Carlo circuit simulations. The CBLP distortion (charge injection) and noise (thermal) values were calculated from knowledge of the capacitances (25 fF) used in [9].

Note that the noise and distortion error magnitudes for SAD are greater than those of DP because SAD [9] is computed by applying FR to both W and X, while DP is computed by applying FR only to W.

It can be seen that the FR stage generates the highest distortion and noise error magnitudes compared to BLP and CBLP. The random error in the FR stage is also the dominant random error source, as the BL is discharged by minimum-sized transistors in the bitcell operating in the near-threshold voltage regime with low VWL.
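The error model (7) can be exercised in simulation to see why DIMA tolerates a low-SNR FR stage: the column-wise errors largely average out in the CBLP aggregation. A Monte Carlo sketch (the Gaussian error model and σ values are illustrative, only loosely ordered as in Table I):

```python
import random

# Monte Carlo sketch of eq. (7): y = sum_i[(w_i + n_f)x_i + n_b] + n_c,
# with per-stage sigmas chosen (illustratively) so that FR >> BLP >> CBLP.
def noisy_dot(W, X, s_f=0.05, s_b=0.03, s_c=0.002, rng=random.Random(0)):
    y = sum((w + rng.gauss(0, s_f)) * x + rng.gauss(0, s_b)
            for w, x in zip(W, X))
    return y + rng.gauss(0, s_c)

N = 128
W = [0.5] * N
X = [0.5] * N
ideal = sum(w * x for w, x in zip(W, X))   # noise-free dot product = 32.0
trials = [noisy_dot(W, X) for _ in range(200)]
rel_err = sum(abs(t - ideal) for t in trials) / len(trials) / ideal
print(f"mean relative error ~ {rel_err:.3%}")
```

Per-column relative errors of a few percent shrink to around a percent at the aggregated output, which is what the aggregation principle exploits.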
In a digital architecture, the precision B_Y of y is usually set as:

B_Y = B_X + B_W + log₂(N)    (8)

where B_X and B_W are the precisions of the inputs x_i and weights w_i, respectively. Equation (8) reflects the typical bit-growth expected in digital arithmetic. In DIMA, by contrast, the BL discharge ΔV_BL represents the B bits of the weight w_i as a single analog quantity.
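The bit-growth rule (8) for a digital dot product is easy to evaluate; a one-line sketch assuming full-precision (lossless) accumulation:

```python
import math

# Eq. (8): output precision of an N-term dot product of B_X-bit inputs
# and B_W-bit weights, with full-precision accumulation.
def output_bits(b_x, b_w, n):
    return b_x + b_w + math.ceil(math.log2(n))

# e.g., 8b activations, 4b weights, N = 128 terms:
print(output_bits(8, 4, 128))  # 19
```

DIMA avoids carrying these 19 bits around digitally: the FR/BLP/CBLP chain keeps partial results analog until the single final decision.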
III. DIMA'S SHANNON-THEORETIC ATTRIBUTES

This section explains DIMA design principles based on SNR budgeting. ML algorithms have inherent error resiliency for the following reasons:

1) The thresholding operation into a fixed number of classes provides an algorithmic noise margin, as small errors do not change the classification result;
2) The aggregation of a large number of elements, widely used for dimensionality reduction in ML algorithms, makes the system insensitive to component noise;
3) The ability to learn error statistics (e.g., via retraining) effectively reduces the impact of noise.
Fig. 4: Simplified data-flow in: (a) DIMA and (b) the conventional architecture.
This inherent noise immunity can be exploited to achieve aggressive energy and throughput benefits in hardware implementations. Traditionally, supply voltage scaling has been widely employed to achieve energy efficiency, but at the cost of significantly degraded throughput (e.g., throughput degrades quadratically with voltage scaling [22]). Moreover, errors tend to occur at the MSBs in digital processors due to the carry-ripple behavior of arithmetic operations [23]. Another potential approach to exploit ML's error immunity is to reduce ΔVBL to save memory read energy. However, the conventional architecture with separate memory and processor does not allow much ΔVBL scaling. This is because ΔVBL needs to be converted to full swing by the SA before being processed in digital logic, which acts as a bit-wise early decision (Fig. 4(a)). If early decision errors occur at the MSBs due to ΔVBL scaling, the error magnitude can grow beyond the application's error immunity, leading to catastrophic failure of the inference system (e.g., peak signal-to-noise ratio (PSNR) < 10 dB with MSB errors [4], [5]). Aware of this issue, several low-power SRAM techniques [4], [5], [24] were proposed for inference applications with selective MSB protection, achieving limited energy savings in the memory, but not in the processor.

To resolve these issues, DIMA eliminates the separation between the low-swing memory and the high-swing processor, and instead budgets SNR by employing aggressively low-swing operations in both memory and processing jointly (Fig. 4(b)). This is enabled by the FR, which implicitly performs D/A conversion, and by the subsequent mixed-signal BLP and CBLP processing. Despite the degraded SNR, system-level robustness is maintained by following three principles: 1) delayed decision, 2) non-uniform bit protection, and 3) aggregation.
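The first and third principles can be contrasted in a toy simulation: a bit-wise early decision turns a small analog disturbance into a possible MSB flip, an error of 2^(B-1), whereas a single delayed decision after aggregation sees only a small, zero-mean perturbation. A sketch (all parameters are assumed, purely for illustration):

```python
import random

rng = random.Random(1)

def early_decision_value(bits, flip_p):
    """Conventional path: every bit is hard-thresholded by the SA; a single
    MSB flip changes the value by 2^(B-1)."""
    noisy = [b ^ (rng.random() < flip_p) for b in bits]
    return sum(b << i for i, b in enumerate(reversed(noisy)))

def delayed_decision(value, sigma, threshold):
    """DIMA path: analog noise enters once, before the single final decision."""
    return (value + rng.gauss(0, sigma)) >= threshold

# An MSB flip on an 8-bit word is an error of 128 -- catastrophic.
clean = early_decision_value([1, 0, 0, 0, 0, 0, 0, 0], flip_p=0.0)
print(clean)  # 128
# A delayed decision 5 sigma away from the threshold is almost never wrong.
correct = sum(delayed_decision(100, sigma=10, threshold=50) for _ in range(1000))
print(correct / 1000)
```

The same noise energy that would be catastrophic if committed bit-by-bit is harmless when the decision is deferred to after aggregation.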
DIMA – A Shannon-inspired Perspective
• accuracy via delayed decision, unequal protection, massive aggregation
644 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 53, NO. 2, FEBRUARY 2018

Fig. 2. Dataflow of typical inference algorithms. (a) Dataflow diagram. (b) Computations required in SVM, TM, k-NN, and MF algorithms.

Fig. 3. Memory access pattern and bitline swing (ΔVBL) in (a) DIMA and (b) conventional architecture, assuming B = 4 and L = 4. Bitcells marked in red are accessed simultaneously.

accommodate large-area sense amplifiers (SAs) as shown in Fig. 2(b), which limits the number of bits per access to Ncol/L in standard SRAM compared to Ncol·B in FR. Moreover, unlike the digital architecture, DIMA's Ncol BLPs

TABLE I: DIMA VERSUS CONVENTIONAL ARCHITECTURE (WITH Ncol × Nrow BCA)

Fig. 4. Simplified dataflow in (a) DIMA and (b) conventional architecture.

and the CBLP are pitch-matched to the BCA, avoiding the need to fully read out data (see Table I and Fig. 3 for a comparison summary).

Though DIMA needs far fewer precharge cycles to read the same number of bits, leading to both energy and throughput gains, it relaxes the fidelity/accuracy of its reads and computations. However, this loss is easily absorbed by the system-level robustness inherent in DIMA's functional dataflow. A comparison of DIMA's functional dataflow [Fig. 4(a)] with that of the conventional architecture [Fig. 4(b)] shows that: 1) the FR stage combined with the column-major format enables DIMA to assign significantly (exponentially) more weight (pulse width [21] or pulse amplitude [27]) to MSBs versus LSBs (significance-weighted bit protection); and 2) DIMA makes its first and only hard decision (thresholding) in the final RDL stage, thereby enabling the CBLP to enhance
[Figure: data → noisy channel → decoder. DIMA: delayed decision, unequal protection, massive aggregation in analog, final decision. Digital: early decision, equal protection, aggregation in digital.]
• a key rule in decision making: for accurate decisions, delay making hard decisions
• DIMA follows this rule; the digital arch. violates it → suffers from MSB errors
• DIMA's accuracy ≥ digital in the low-SNR regime – [Feb. JSSC'18 (UIUC)]
• accuracy gap increases with data dimensionality N
Delayed Decision
capacitance. This results in less process variation at a given BL swing, as shown in Fig. 4(b).
The fundamental trade-off between energy and SNR is well observed in Fig. 4(b), where noise is suppressed (increasing SNR) with higher BL swing at the cost of degraded energy efficiency. Therefore, it is critical to find the optimal BL swing (and VWL) which minimizes the energy consumption without degrading the application-level accuracy.
V. ROBUSTNESS OF DIMA
This section explains DIMA design principles based on SNR budgeting. ML algorithms have inherent error resiliency for the following reasons:
1) The thresholding operation into a fixed number of classes provides an algorithmic noise margin, as small errors do not change the classification results,
2) The aggregation of a large number of elements, widely used for dimensionality reduction in ML algorithms, makes the system insensitive to component noise,
3) The ability to learn error statistics (e.g., via retraining) effectively reduces the impact of noise.
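Resiliency mechanism (1) can be sketched with a toy hard-decision classifier; the class scores and noise direction below are made-up values for illustration. The argmax cannot flip while perturbations stay below half the decision margin.

```python
# Toy illustration of the algorithmic noise margin from thresholding:
# small perturbations of the class scores do not change the argmax
# as long as they are smaller than half the decision margin.
def classify(scores):
    """Hard decision: index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

scores = [0.20, 0.70, 0.10]                        # hypothetical class scores
margin = sorted(scores)[-1] - sorted(scores)[-2]   # top-1 vs. top-2 gap = 0.50

clean_class = classify(scores)
# Perturb every score by less than margin/2, in the worst direction
# (runner-up pushed up, winner pushed down): the decision cannot flip.
eps = 0.9 * margin / 2
noisy = [s + e for s, e in zip(scores, [eps, -eps, eps])]
noisy_class = classify(noisy)
```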
A. Delayed Decision
Due to the absence of SAs between memory and processor, DIMA does not require an early decision right after the BL discharge, where noise η1 is added; instead, the hard decision (thresholding)
Fig. 5: Simplified data-flow in (a) DIMA and (b) the conventional architecture.
happens only at the end of the classification process (Fig. 5(b)), namely the delayed decision. Consider the example of binary bit transmission (x ∈ {1, −1}) in Fig. 6(a), where the noise distributions η1 and η2 are modeled by N(0, σ²), the threshold level is 0, and equal priors P(X = 1) = P(X = −1) = 0.5 are assumed. The bit error probabilities pe, assuming identical independent noise sources, can be derived as follows:

pe = 2Q(1/σ)(1 − Q(1/σ))   (early decision)
pe = Q(1/(√2·σ))           (delayed decision)   (17)
Figure 6(b) shows the pe behavior with respect to the SNR based on (17). It clearly shows that the delayed decision is more robust in the low-SNR regime, as the early decision tends to amplify the noise contribution in that regime. In this sense, the delayed decision improves the robustness of DIMA, where the SNR is tightly budgeted.
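The two branches of (17) can be checked with a small Monte Carlo experiment; this is an illustrative sketch of the binary-transmission model above, not code from the paper.

```python
# Monte Carlo check of Eq. (17): early vs. delayed decision for a binary
# signal x in {+1, -1} passing through two additive noise stages
# n1, n2 ~ N(0, sigma^2).
import math
import random

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def simulate(sigma, trials=200_000, seed=1):
    rng = random.Random(seed)
    early_err = delayed_err = 0
    for _ in range(trials):
        x = 1.0
        n1, n2 = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        # Early decision: threshold after the first noise stage (the SA),
        # then threshold again after the second noise stage.
        x_hat = 1.0 if x + n1 > 0 else -1.0
        early_err += (1.0 if x_hat + n2 > 0 else -1.0) != x
        # Delayed decision: a single threshold after both noise stages.
        delayed_err += (1.0 if x + n1 + n2 > 0 else -1.0) != x
    return early_err / trials, delayed_err / trials

sigma = 1.0  # low-SNR regime
pe_early, pe_delayed = simulate(sigma)
pe_early_th = 2 * q_func(1 / sigma) * (1 - q_func(1 / sigma))  # Eq. (17), early
pe_delayed_th = q_func(1 / (math.sqrt(2) * sigma))             # Eq. (17), delayed
```

At σ = 1 the simulated rates land near the closed forms (≈ 0.267 early vs. ≈ 0.240 delayed), reproducing the low-SNR advantage of the delayed decision.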
B. Non-Uniform Bit Protection
The FR with PWM (Fig. 1(c)) allows us to assign the voltage swing unequally based on the significance of information (bit positions), namely non-uniform bit protection. When the total swing ΔVBL,T is budgeted for the read of a B-bit number, the swing assigned to the b-th bit position ΔVb for the conventional architecture and DIMA is given by

ΔVb,CONV = ΔVBL,T / B,   (18)
ΔVb,DIMA = (2^b / (2^B − 1)) · ΔVBL,T.   (19)

For example, non-uniform bit protection gives roughly 2.1× more swing to the MSB and 3.8× less swing to the LSB when B = 4, compared with the conventional equal allocation.
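The swing allocations in (18)-(19) can be verified numerically; a minimal sketch with a normalized swing budget (the 2.1× / 3.8× figures correspond to B = 4):

```python
# Per-bit bitline-swing allocation, Eqs. (18)-(19): the conventional
# architecture splits the total budget equally across B bits, while DIMA
# weights it by bit significance (2^b).
def swing_conventional(b, B, dv_total):
    return dv_total / B

def swing_dima(b, B, dv_total):
    return (2 ** b) / (2 ** B - 1) * dv_total

B, dv_total = 4, 1.0  # 4-bit word, normalized total swing budget
msb, lsb = B - 1, 0
msb_gain = swing_dima(msb, B, dv_total) / swing_conventional(msb, B, dv_total)
lsb_loss = swing_conventional(lsb, B, dv_total) / swing_dima(lsb, B, dv_total)
print(round(msb_gain, 2), round(lsb_loss, 2))  # → 2.13 3.75
# Both schemes spend the same total budget:
total = sum(swing_dima(b, B, dv_total) for b in range(B))
```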
[Figure: DIMA makes a single hard decision at the end; the digital architecture makes bit-wise hard decisions early.]
Massive Aggregation
Fig. 8: Measured normalized standard deviation (σ/μ) of BL swing with respect to the 4-b stored data, with BL swing per bit = 80 mV.
C. Aggregation
The charge-sharing process in the CBLP efficiently averages out noise contributions, reducing the standard deviation of the output by √N after aggregating N independent mean-zero random noise sources in Fig. 5(b). Specifically, DIMA's CBLP is effective in averaging out the noise in the SD computations, as the error statistics of the noise in the analog circuitry closely follow independent mean-zero behavior (e.g., spatial threshold variation).
For example, DIMA computes the SAD in the analog domain as follows:

v = (1/N) Σ_{i=1}^{N} v_i   (26)
where v_i denotes the voltage swing corresponding to |D_i − P_i| in Fig. 5(a). Taking the noise into account, (26) can be modified into

v̂ = (1/N) Σ_{i=1}^{N} (v_i + n_i) = v + n   (27)
where v̂ denotes the noisy SAD output due to n_i ~ N(0, σ²). Since the n_i are i.i.d., we note that

n = (1/N) Σ_{i=1}^{N} n_i ~ N(0, σ²/N).   (28)
This behavior was well observed in measurements of the silicon IC prototypes in [9]. The normalized standard deviation (σ/μ) of each BL was at most 12%, but the output of the CBLP block shows only 1.1% after charge-sharing 128 BLs. Moreover, DIMA provides the averaging opportunity for both η_a1 and η_a2 before the thresholding, due to the delayed decision, whereas the conventional digital architecture goes through the early decision before having any aggregation opportunity.
Additionally, the FR process also contains such an averaging effect, as the FR accumulates the amount of discharge from multiple transistors. Figure 8 shows that the measured normalized standard deviation (σ/μ) of the BL swing is larger when the BL is discharged by a single transistor, e.g., 4 = 0100(2) or 8 = 1000(2). On the other hand, the variation is suppressed by up to 65% when the BL is discharged by multiple transistors, due to the averaging during the FR.
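The √N averaging of (26)-(28) and the reported silicon numbers (12% per BL shrinking to about 1.1% across 128 BLs) can be reproduced with a short simulation; the Gaussian noise model here is illustrative.

```python
# Charge-sharing noise averaging, Eqs. (26)-(28): aggregating N i.i.d.
# mean-zero noise sources shrinks the output standard deviation by sqrt(N).
import random
import statistics

def aggregated_sigma(per_bl_sigma, N, trials=10_000, seed=7):
    rng = random.Random(seed)
    outs = [
        sum(rng.gauss(0.0, per_bl_sigma) for _ in range(N)) / N  # Eq. (27)
        for _ in range(trials)
    ]
    return statistics.pstdev(outs)

per_bl = 0.12                     # 12% normalized sigma per bitline
sigma_out = aggregated_sigma(per_bl, N=128)
predicted = per_bl / 128 ** 0.5   # Eq. (28): sigma/sqrt(N), about 1.1%
```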
VI. SYSTEM-LEVEL ACCURACY PREDICTION
In this section, the system-level performance is predicted in terms of the probability of detection p_det as a function of the major design parameters, based on the behavioral models. By deriving the probabilities of detection for the conventional system and DIMA, we show that DIMA can perform robust classification tasks. We analyze two different tasks: TM with the SAD kernel and SVM with the dot-product kernel.
A. Template Matching
In TM, the SAD values between an input query template and multiple candidates are measured to find the candidate with the minimum SAD. In the conventional architecture, the SAD is given by
s = (1/N) Σ_{i=1}^{N} x_i   (29)
where x_i = |D_i − P_i|. If we consider the circuit non-idealities, then (29) is modified into

ŝ = (1/N) Σ_{i=1}^{N} (x_i + e_i) = s + e   (30)

where ŝ represents the noisy SAD output. The circuit non-idealities result in the decimal errors e_i and e = (1/N) Σ_{i=1}^{N} e_i.
Suppose that there are two images D^(i) and D^(j), with corresponding SAD outputs s^(i) and s^(j), respectively. If s^(i) < s^(j), then the TM output will be D^(i), since its Manhattan distance to P is smaller than that of D^(j). The TM decision can be changed by the circuit non-idealities if ŝ^(i) > ŝ^(j). The mismatch probability that D^(j) is incorrectly chosen instead of D^(i) is given by
p^(i→j)_CONV = Q( α^(i,j)_TM · √( 3(2^B − 1) / (2(2^B + 1)) · N/p ) )   (31)

where α^(i,j)_TM denotes the normalized decision margin:

α^(i,j)_TM = |s^(j) − s^(i)| / (2^B − 1)   (32)

where 0 ≤ α_TM ≤ 1. We observe that the impact of the bit error probability on p^(i→j)_CONV is suppressed by a larger N (the detailed derivation can be found in Appendix B).
Without loss of generality, suppose that D^(0) = min_{i∈{0,...,M−1}} |P − D^(i)|. Then, the detection probability of the conventional architecture is given by

P_det,CONV = Π_{m=1}^{M−1} (1 − p^(0→m)_CONV)
           = Π_{m=1}^{M−1} (1 − Q( α^(0,m)_TM · √( 3(2^B − 1) / (2(2^B + 1)) · N/p ) )).   (33)
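Equations (31) and (33) can be evaluated numerically to show the role of the aggregation length N. The decision margins and the bit-level error parameter p below are hypothetical inputs, and the Q-function argument follows our reading of (31).

```python
# Numerical sketch of Eqs. (31)-(33): detection probability of conventional
# template matching vs. aggregation length N. The margins `alphas` and the
# error parameter `p` are hypothetical illustration inputs.
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def p_mismatch(alpha, B, N, p):
    """Eq. (31): probability a wrong candidate wins one pairwise test."""
    return q_func(alpha * math.sqrt(3 * (2 ** B - 1) / (2 * (2 ** B + 1)) * N / p))

def p_det(alphas, B, N, p):
    """Eq. (33): every one of the M-1 pairwise tests must favor the match."""
    out = 1.0
    for a in alphas:  # normalized decision margins, Eq. (32)
        out *= 1.0 - p_mismatch(a, B, N, p)
    return out

alphas = [0.02, 0.05, 0.10]  # hypothetical margins in [0, 1]
results = {N: p_det(alphas, B=8, N=N, p=0.05) for N in (16, 64, 256)}
```

As expected from (31), p_det rises monotonically with N, matching the observation that larger aggregation suppresses mismatch errors.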
[ISSCC'18 (UIUC)] σ/μ = 12% per BL → σ/μ = 1% after aggregation
• DIMA's analog-domain aggregation boosts SNR
– SNR boost factor = √N (Central Limit Theorem)
• Digital architectures miss out on this opportunity
Future Prospects
• analog-aware ISA
• Julia (compiler) software stack
• energy-accuracy optimization pass
• 22X EDP reduction for an 8b, 5-layer DNN
Scaling-up DIMA
Figure 2: The PROMISE architecture: (a) a single-bank architecture with NROW = 512, NCOL = 256, analog processing marked in red, and (b) a multi-bank architecture.
prevent the noise from analog operations accumulating over the iterations. This digitization also enables a multiple-bank architecture, where reliable data transfers between banks are required. TH implements non-linear operations such as sigmoid via piece-wise linear approximation [27], not only to compute the decision functions f() in Table I but also to aggregate intermediate computed values when the vector length N is larger than 128.
Lastly, X-REG is a digital block similar to a vector register file in a SIMD processor, holding eight 128-element vectors representing eight X values. CTRL is a controller that generates enable signals for the aforementioned components based on a given instruction, making CM function as a programmable mixed-signal accelerator.
Multiple-bank architecture: PROMISE can be extended to a multi-bank structure (Fig. 2(b)), which has multiple (up to eight in this work) PAGEs, each of which includes four banks. Thus, long (> 128) vectors can be distributed across multiple banks for parallel processing. PROMISE does not have a scalability limitation from the analog operations, as the partial results from each bank are always converted to digital values by the ADCs. The partial sums can then be aggregated across different banks via a cross-bank rail, similar to the H-Tree in [12]. The data transfer of the 8-bit ADC output from each bank to the others through the cross-bank rail
Figure 3: Analog pipeline in PROMISE with operating frequency = 1/TP; the red marked area is the analog domain.
takes only 0.5 pJ with an activity factor of 0.5, from post-layout simulations. This is negligible (< 1%), as an aREAD consumes up to 131 pJ/bank as listed in Table III. The control of multiple banks by the instruction is discussed in Section III-C.
Analog pipeline architecture: To achieve higher throughput than the original CM [9], PROMISE adopts analog pipelining, as shown in Fig. 3. This requires analog flip-flops (FFs) to hold the output of each stage. The first pipelined stage (aREAD) needs NCOL analog FFs, whereas the second pipelined stage, consisting of aSD and aVD, needs only one analog FF to store the aggregated scalar output. Thus, we propose to reuse a pre-existing capacitor, which is part of the analog multiplier [9] in the aSD block, as an analog FF for the aREAD stage to minimize the hardware overhead.
B. Challenges in Programmable Mixed-Signal Accelerator
The combination of algorithmic diversity and mixed-signal operations in PROMISE creates many challenges that we must consider when exploring ISA design choices. To support a broad range of ML algorithms, each stage needs to support diverse programmable operations. However, we identify challenges in supporting (too) many diverse programmable operations, which can significantly degrade both throughput and accuracy.
Higher impact of delay variation across operations: Analog operations often exhibit a wide range of delays depending on the operation type [9], while each stage must accommodate the worst-case delay of all the operations for that stage. Furthermore, as all the stages of a pipelined mixed-signal accelerator need to operate at the same clock period TP, TP needs to accommodate the worst-case delay across all the stages, i.e., TP = max(TS1_max, TS2_max, TS3_max, TS4_max), as illustrated in Fig. 4. That is, supporting more programmable operations causes longer idle times for some stages. As a result, we observe up to 2× throughput degradation when designing for the range of operations supported in this work. Furthermore, such long idle times negatively affect accuracy
at the algorithm level, because analog values are typically stored on (area-constrained) capacitors and thus degrade over time due to various leakage mechanisms. This problem is especially bad for the S1 stage (aREAD), as each BL is subject to the leakage contributions from all the bit-cells in the column (up to 0.6%/ns).
More limits on the sequence of operations: We also recognize other challenges specific to analog processing,
Figure 4: 4-stage processing in PROMISE with operational diversity per stage; TP = max(TS1_max, TS2_max, TS3_max, TS4_max).
which further limits the number of possible programmable operations. Specifically, a chain of analog processing stages imposes an intrinsic sequentiality of operations, where two consecutive stages need to be physically closely placed. This is done to avoid substantial degradation of the analog voltage from one stage to the next. Furthermore, a large capacitor ratio between consecutive stages (input-output 20:1, to maintain the voltage drop < 5%) is required to transfer the signal via the charge-transfer mechanism without an additional analog buffer.
C. PROMISE Instruction Set
We explore an ISA suitable for a programmable mixed-signal accelerator, considering the above constraints.
Instruction format: We propose a wide-word macro-instruction format, referred to as a Task. Akin to a Very Long Instruction Word (VLIW), a single Task consists of multiple operations, except that the operations are sequential rather than parallel as in VLIW architectures. As depicted in Fig. 5(a), the four Class fields specify four operations for the four pipelined stages of PROMISE, while the three other fields, OP_PARAM, RPT_NUM and MULTI_BANK, configure all or specific Class operations. More specific descriptions of these seven fields are as follows.
Operating parameter field: OP_PARAM comprises 33 bits and configures the operating parameters of the Class operations in a given Task, facilitating flexible programmability. As shown in Fig. 5(b), W_ADDR specifies a CM address for a Class-1 operation. X_ADDR1 and X_ADDR2 designate X-REG addresses for Class-1 and Class-2, respectively. SWING controls the BL swing ΔVBL; e.g., 111 allows 30 mV/LSB whereas 001 allows 5 mV/LSB. This parameter is a key knob to control the trade-off between energy and accuracy under software control; Section VI evaluates this accuracy-energy trade-off for further energy savings. Refer to Fig. 5(c) for descriptions of the remaining parameters and their assignments of bit fields.
Class fields: Class-1 is composed of a 3-bit opcode and defines six possible memory operations. READ, WRITE, or aREAD makes the CM perform a digital read, digital write, or analog read operation to a compute-memory address specified by OP_PARAM (W_ADDR). aADD or aSUB fuses an analog read and an element-wise analog addition or subtraction into a single operation, where the two vector operands come
(a) Task format: OP_PARAM | RPT_NUM | MULTI_BANK | Class-1 | Class-2 | Class-3 | Class-4

(b) OP_PARAM fields:

Contents | Bits | Description
SWING | [27:25] | ΔVBL swing code – 000: min (5 mV/LSB), 111: max (30 mV/LSB)
ACC_NUM | [24:23] | # of operands to be accumulated for the accumulate opcode in Class-4
W_ADDR | [22:14] | Bit-cell array address of W in Class-1
X_ADDR1 | [13:11] | Bit-cell array address of X in Class-1
X_ADDR2 | [10:8] | X-REG array address of X in Class-2
X_PRD | [7:6] | X_ADDR1 & 2 circulate from 0 to "X_PRD - 1"
DES | [5:4] | Class-4 output destination – 00: ACC input, 01: output buffer, 10: X-REG, 11: write data buffer
THRES_VAL | [3:0] | Thresholding reference value for the threshold opcode in Class-4

(c) [per-Class opcode encodings]

Figure 5: Instruction set of PROMISE: (a) instruction format, (b) operation parameters (OP_PARAM), and (c) operations in each Class.
from the compute-memory addresses specified by OP_PARAM (W_ADDR and X_ADDR1), respectively.
Class-2 consists of a 4-bit opcode and specifies a composition of one of six possible aSD operations with one of two possible aVD operations. Specifically, aSD, operating on a computed value from Class-1, supports three unary operations: compare, absolute, and square, and two binary operations: sign_mult and unsign_mult, where the other operand comes from an X-REG address specified by OP_PARAM (X_ADDR2). aVD specifies whether an aggregation should be performed or not after an aSD operation.
Class-3 and Class-4 comprise 1- and 3-bit opcodes and control whether an ADC should be performed or not, and specify one of seven possible TH operations,
DIMA accelerator (ISCA’18, UIUC) multi-bank DIMA (VLSI’18, Princeton)
instruction set
o 8T BC; 2.4Mb BCA; 64 banks; 1b compute; charge-based summing
o accuracy – SVHN (94%); CIFAR-10 (83%) with 9-layer ConvNets
o 658 TOPS/W
[2019 IEEE MICRO Top Picks Honorable Mention]
• larger BL caps, smaller BCs, slower devices, higher VT variations
DIMA in Flash & Emerging Memories
• high conductance, small TMR, high MTJ variations, write endurance
[Figures: Flash-based DIMA with a 128k × 192k SLC NAND flash array, WL drivers and pulse generation, current-integrator functional read (D/A), 16k BL-processor multipliers, a cross-BL processor (CBLP), and A/D conversion; MRAM-based DIMA bitcell array with ΔVBL functional read and ADC.]
Flash-based DIMA (ISCAS’18) MRAM-based DIMA (ISCAS’19)
[UIUC, Micron] [DARPA ERI: UIUC, Princeton, GLOBALFOUNDRIES, Raytheon]
Objective: Explore DIMA acceleration of customized neuro-inspired algorithms for Automatic Target Recognition
Sandia-UIUC Project: DIMA Acceleration of ATR Workloads
and a stride greater than one. In most cases, the pooling replacement was done immediately before the classical convolutional layer. The stride and size greater than one allow the layer to reduce the area dimension and to move up the feature hierarchy. Therefore, the convolutional layer is able to act as a pooling layer. An example of two layers in the EEDN model is shown in Fig. 3.
Each group also contains several network-in-network (NIN) layers. Each convolutional layer acts as a classifier for some type of feature, but alone they are limited to be linear classifiers. NIN layers contain kernels with an area of one, but a very large depth. They help to fine-tune the classifications of the traditional convolutional layer by building more complex classifiers and allowing more configurable parameters. They are sandwiched between traditional convolutional layers and pooling layers, but do not change the data dimensions.
The EEDN software [9] is a deep learning toolkit [6] developed by IBM to work with the TrueNorth processor. It uses the MatConvNet framework to train networks from MATLAB, then converts them to run on TrueNorth using the Corelet programming language. While TrueNorth supports a variety of parameters for the synaptic weights and behaviors, the EEDN toolkit only allows three deterministic weights: [-1, 0, 1]. Therefore, during the training process, the loss function included a factor that encourages the then-floating-point weights to settle where it is easy to convert them. When they are converted during testing, some accuracy loss usually occurs.
The EEDN software and NSCS simulator were used to estimate the accuracy of variations of the original network on TrueNorth. Once the image database was created, the model was trained from Matlab on a conventional GPU-accelerated server. Throughout training, the software produces an estimation of the performance on TrueNorth, along with validating on the validation dataset. At the end, the model was tested with a separate testing set using the NSCS simulator. The NSCS simulator, while not truly the TrueNorth hardware, did give us the best estimate of how well the network would do on the hardware itself. Finally, certain models were selected, then tested on the TrueNorth NS1e board, and measured for the speed of computation, accuracy, and the active energy per image classification.
B. EEDN Model for SAR Imagery Classification
We build our models based on the DCNN model layers
supported by the EEDN software. A full-size model is designed to use almost exactly one chip, 4,064 cores, during testing. It is composed of 15 convolutional layers; however, with the NIN layers discounted and the down-sampling convolutional layers acting as pooling layers, it is roughly comparable to a network with 4 convolutional layers and 4 pooling layers as defined in Caffe [5].
The full-size model used contained more feature planes than most previous classification networks for this data. The purpose of this is to allow some redundancy within the network, to reduce the loss when the precision of the synaptic weight is reduced from floating point values to the restricted set of [-1, 0, 1]. In some ways, that loss of precision can help to act similarly to a dropout layer. Having the extra planes helps the network
average across that precision loss. However, a balance needs to be struck between using too large a network (and encouraging overfitting) and having too small a network, which would lose too much vital information when converted. Comparing the training, test, and validation accuracy across different conditions assessed that balance. If the training accuracy is high, but the validation and testing accuracies are low, then the network was likely overfitting the data, and hence was too large. However, if the training, validation, and testing accuracies are relatively high prior to conversion, but testing on hardware after conversion went poorly, then the network is too sensitive to that loss of precision, and needed to be bigger.
There are several variations on the original model tested. In one set, the number of outputs (feature planes) of each layer is scaled evenly by factors of 1/8. In another, the preprocessing layer is left untouched, but the rest of the network was scaled evenly. This is to test the effect of the “extra” feature planes on the robustness of the network. While the model is scaled evenly, the number of active cores does not scale linearly with the number of output layers.
Also tested is the network, scaled the same by factors of 1/8, but with all of the network-in-network layers removed. This is to test the hypothesis that those extra layers were serving to help add complexity to the kernels, improving accuracy. It was suspected that removing these layers would greatly reduce the accuracy of the model.
IV. EXPERIMENTAL SETUPS AND RESULTS
A. The Moving and Stationary Target Acquisition and Recognition (MSTAR) Public Dataset
The dataset used for this experiment is composed of Synthetic Aperture Radar (SAR) images from the MSTAR public dataset. The images are taken by a side-looking radar antenna on a plane passing overhead. The magnitude and phase of the radio waves are collected, and the magnitude data can be converted into a grayscale JPEG image. Once converted, the target sits in the center of an approximately 128×128-pixel image. The data were primarily collected at two depression angles, 15 degrees and 17 degrees. In existing machine-learning research on this dataset, the 17-degree data are usually used for training and the 15-degree data for testing.
There are 10 target classes in the dataset, consisting mostly of military vehicles. Fig. 4 shows sample SAR images of the ten target classes. There are between one and three physical representations of each vehicle, and some …
[Fig. 4. Sample images and their class labels in the MSTAR public dataset.]
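The preprocessing described above, converting magnitude data to grayscale and taking a centered 128×128 chip, can be sketched as below. The function names and the simple linear intensity scaling are assumptions for illustration; the exact conversion used to produce the JPEG chips is not specified here.

```python
import numpy as np

def center_crop(img, size=128):
    """Crop a (H, W) magnitude image to a size x size window around the
    centered target, as in the MSTAR chips described above."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def magnitude_to_grayscale(mag):
    """Map SAR magnitude data to 8-bit grayscale via linear scaling
    (an assumed, simplified conversion)."""
    mag = mag.astype(np.float64)
    rng = mag.max() - mag.min()
    if rng == 0:
        return np.zeros(mag.shape, dtype=np.uint8)
    return (255 * (mag - mag.min()) / rng).astype(np.uint8)
```

With the standard protocol noted above, the 17-degree chips would then be routed to the training set and the 15-degree chips to the test set.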
§ MSTAR Dataset
§ Neuro-inspired Algorithms (MPM/DNN/CNN)
§ Deep In-memory Architecture (DIMA)
§ Silicon-validated Noise, Energy, Latency metrics
§ Cost-benefit Analysis
§ accuracy > 99%
§ fixed-point < 0.07% drop
§ 6.5X reduction in storage
Convolutional Neural Networks for ATR
MorganNet [Algorithms for SAR‘15] WagnerNet [TAES’16]
Model      | Data Augmentation   | Quantization Scheme | Number of Parameters | Storage Needed (KB) | Test Accuracy (%)
WagnerNet  | None                | FL                  | 410330               | 1602                | 97.22
WagnerNet  | ED(10,3) + ED(20,3) | FL                  | 410330               | 1602                | 99.21
QWagnerNet | ED(10,3) + ED(20,3) | FX-5-4              | 410330               | 250                 | 99.40
MorganNet  | None                | FL                  | 88162                | 344                 | 97.81
MorganNet  | ED(10,3) + ED(20,3) | FL                  | 88162                | 344                 | 99.13
QMorganNet | ED(10,3) + ED(20,3) | FX-5-4              | 88162                | 54                  | 99.06
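The storage figures above follow directly from parameter count times bits per weight: FL is 32-bit floating point, and FX-5-4 is interpreted here (an assumption consistent with the ~6.4× ratio in the table) as 5-bit fixed-point weights. A quick check:

```python
def storage_kb(num_params, bits_per_weight):
    """Weight storage in KB: parameters x bits -> bytes -> KB (1024 B)."""
    return num_params * bits_per_weight / 8 / 1024

# WagnerNet (410330 params) and MorganNet (88162 params); values match
# the table to within rounding.
for params in (410330, 88162):
    fl = storage_kb(params, 32)   # floating point
    fx = storage_kb(params, 5)    # FX-5-4, assuming 5-bit weights
    print(f"{params} params: FL ~{fl:.0f} KB, FX ~{fx:.0f} KB, ratio {fl / fx:.1f}x")
```

The 32/5 = 6.4× bit-width ratio is the source of the "6.5X reduction in storage" headline figure above.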
Algorithm Simulations
§ functional read noise is data-dependent: η ~ N(0, σ_η²) → validated in silicon
§ model complexity vs. noise immunity trade-off
§ energy vs. accuracy trade-off → σ_η reduces with higher WL voltage
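The additive Gaussian read-noise model above can be illustrated with a small Monte Carlo sketch. The dot-product size and the σ_η values are illustrative only, not silicon-measured numbers; the point is that raising WL voltage (lower σ_η) shrinks the output error at the cost of read energy.

```python
import numpy as np

rng = np.random.default_rng(0)

def dima_dot_product(w, x, sigma_eta):
    """Ideal dot product plus additive Gaussian functional-read noise
    eta ~ N(0, sigma_eta^2), the model stated above."""
    return w @ x + rng.normal(0.0, sigma_eta)

w = rng.choice([-1.0, 1.0], size=256)   # ternary-style weights
x = rng.normal(size=256)                # input activations
ideal = w @ x

# Higher WL voltage -> smaller sigma_eta -> smaller output error,
# at the cost of higher read energy.
for sigma in (4.0, 1.0, 0.25):
    err = np.std([dima_dot_product(w, x, sigma) - ideal for _ in range(2000)])
    print(f"sigma_eta={sigma}: empirical error std ~ {err:.2f}")
```

The empirical error standard deviation tracks σ_η, which is the energy-accuracy knob referenced in the bullets above.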
DIMA Inference Accuracy on the MSTAR Dataset
Next Steps
§ Develop energy-delay-accuracy models of DIMA components
§ Map DNNs/CNNs to DIMA
§ Study system-level trade-offs in realizing ATR in DIMA
§ Benchmark against a digital architecture baseline
The Future
[Figure: distributed sensors stream DATA to a computing platform (memory + processor, with DIMA) powered by a finite energy source, producing DECISIONS; metrics: decision latency, energy/decision, decision accuracy]
DIMA-enabled Autonomous Agents
fully autonomous functionality via in-situ information processing of rich multimodal sensory inputs under severely constrained volume, energy & computational resources
Acknowledgments
• Student Collaborators – Mingu Kang, Sujan Gonugondla, Ameya Patil, Yongjune Kim, Charbel Sakr, Hassan Dbouk, Kuk-Hwan Kim, ….
• Faculty Collaborators – SONIC faculty including Naveen Verma, Lav Varshney, Pavan Hanumolu, Boris Murmann, David Blaauw, Jan Rabaey, Subhasish Mitra, Philip Wong….
• Sponsors – National Science Foundation, the Defense Advanced Research Projects Agency, the Semiconductor Research Corporation, Texas Instruments, Micron, IBM, Raytheon Missile Systems, GLOBALFOUNDRIES, Intel Corporation, Sandia National Labs.
• Research Programs (Centers) – FCRP (GSRC), STARnet (SONIC), JUMP (C-BRIC), Intel (ISTC)