TRANSCRIPT
A4H Field Day, Sandia National Labs, April 18, 2019
Deep In-memory Architectures
Naresh Shanbhag, University of Illinois at Urbana-Champaign
Sandia Collaborator – Craig Vineyard
Naresh Shanbhag, University of Illinois at Urbana-Champaign
• machines can beat humans in specific cognitive tasks
• game of Go is complex → huge search space: ~250^150 (Go) vs. ~35^80 (Chess)
• AlphaGo machine: 1,202 CPUs + 176 GPUs
• HUGE energy cost: ~10,000× more than the human brain
The Energy Cost of Intelligence
The Economist, March 2016
Naresh Shanbhag, University of Illinois at Urbana-Champaign
fundamental question
how do we design intelligent machines that
operate at the limits of energy-delay-accuracy?
Communication Receiver (stochastic channel) / The Brain (stochastic neural fabric)
STOCHASTIC COMPONENTS + STATISTICAL MODELS of COMPUTATION
Operating Machines @ the Limits (Energy-Latency-Accuracy)
How to do this in a principled manner?
make errors (for efficiency) then correct (for accuracy)
• Proceedings of IEEE, Special Issue on Non von Neumann Computing, January 2019.
Shannon-Inspired Statistical Computing for the Nanoscale Era
By NARESH R. SHANBHAG, Fellow IEEE, NAVEEN VERMA, Member IEEE, YONGJUNE KIM, Member IEEE, AMEYA D. PATIL, Student Member IEEE, and LAV R. VARSHNEY, Senior Member IEEE
ABSTRACT | Modern day computing systems are based on the von Neumann architecture proposed in 1945 but face dual challenges of: 1) unique data-centric requirements of emerging applications and 2) increased nondeterminism of nanoscale technologies caused by process variations and failures. This paper presents a Shannon-inspired statistical model of computation (statistical computing) that addresses the statistical attributes of both emerging cognitive workloads and nanoscale fabrics within a common framework. Statistical computing is a principled approach to the design of non-von Neumann architectures. It emphasizes the use of information-based metrics; enables the determination of fundamental limits on energy, latency, and accuracy; guides the exploration of statistical design principles for low signal-to-noise ratio (SNR) circuit fabrics and architectures such as the deep in-memory architecture (DIMA) and the deep in-sensor architecture (DISA); and thereby provides a framework for the design of computing systems that approach the limits of energy efficiency, latency, and accuracy.
From its early origins, Shannon-inspired statistical computing has grown into a concrete design framework validated extensively via both theory and laboratory prototypes in both CMOS and beyond. The framework continues to grow at both of these levels, yielding new ways of connecting systems through architectures, circuits, and devices, for the semiconductor roadmap to march into the nanoscale era.

Manuscript received March 11, 2018; revised July 5, 2018; accepted September 1, 2018. This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by the Microelectronics Advanced Research Corporation (MARCO) and the Defense Advanced Research Projects Agency (DARPA). (Corresponding author: Naresh R. Shanbhag.) N. R. Shanbhag, Y. Kim, A. D. Patil, and L. R. Varshney are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]). N. Verma is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.

Digital Object Identifier 10.1109/JPROC.2018.2869867
KEYWORDS | Artificial intelligence; computing; information theory; machine learning; nanoscale devices; statistical computing
I. INTRODUCTION
The invention of the Turing machine [1] (1937) and its realization via the von Neumann architecture [2] (1945) helped spawn the modern information age. The discovery of the transistor by Bardeen, Brattain, and Shockley (1948), and subsequently, the integrated circuit by Kilby (1958), provided an efficient and cost-effective substrate to realize complex computing systems. Since then, computing systems, through various generations of microarchitecture and semiconductor technology [Moore's Law [3] (1965)], have fueled growth in nearly all industries and transformed health, energy, transportation, education, finance, entertainment, and leisure.
Computing systems, however, now face two major challenges. The first is the emergence of nondeterminism in nanoscale fabrics due to increased stochasticity (variations and failures) as well as diminishing energy-delay benefits via CMOS scaling. Stochasticity in nanoscale fabrics is contrary to the deterministic Boolean switch required by the traditional von Neumann architecture. While many beyond-CMOS devices have been invented, none surpass the silicon MOSFET under metrics for a deterministic switch [4], [5]. The second is the emergence of data-centric cognitive applications that emphasize exploration-, learning-, and association-based computations using statistical representations and processing of massive data volumes [6]-[8]. The data-centric nature of these applica-
Shannon-inspired Model of Computing
1. use information-based metrics, e.g., mutual information I(Y; Ŷ)
2. design low-SNR fabrics, e.g., deep in-memory architecture (DIMA)
3. develop statistical error-compensation (SEC) techniques
(noisy channel)
DIMA
[ISSCC’18]
Timeline
2014 DIMA concept paper (ICASSP) [UIUC]
2016 AdaBoost DIMA IC (VLSI) [Princeton]
2017 Multifunction & Random Forest DIMA ICs (ESSCIRC) [UIUC]
2018 In-memory Special Sessions/Forums (ISSCC, VLSI, ICCAD); SGD-SVM DIMA IC (ISSCC) [UIUC]; Scalable DIMA IC (VLSI) [Princeton]; In-memory ICs (ISSCC) [MIT, NTHU, NTHU-ASU]; DIMA US Patent [UIUC]; In-memory ICs (VLSI) [Columbia-ASU]
2019 In-memory ICs (ISSCC) [UMich, NTHU-ASU-GIT, SU (China)]; ISSCC Forum; CICC Forum; …
The Deep In-memory Architecture
[Shanbhag (UIUC) (ICASSP'14, 3×JSSC'18); Verma (Princeton) (VLSI'16, JSSC'17, VLSI'18)]
The Energy Cost of Data Movement
[Horowitz, ISSCC’14]
[Figure: Top-1 accuracy (%) vs. operations (G-Ops) for AlexNet, BN-AlexNet, BN-NIN, ENet, Inception-v3, Inception-v4, ResNet-152, VGG-16, and VGG-19; marker size = # of parameters (35M-155M)] [Canziani, Arxiv'16]
[Figure: computation energy vs. memory energy (45 nm)]
E_memory / E_compute ≈ ~100× (SRAM) → several 100× (DRAM) → ~1000× (Flash)
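The slide's ratio can be turned into a back-of-the-envelope model of where inference energy goes. The sketch below is illustrative only: `E_MAC`, the memory ratios, and the AlexNet-scale operation counts are assumed placeholder values, not measured numbers from the talk.

```python
# Illustrative model: total inference energy = compute + memory fetches,
# with a memory access costing ~100x (SRAM) to ~1000x (DRAM, assumed)
# relative to a multiply-accumulate (MAC).
E_MAC = 1.0                            # energy per MAC (pJ), assumed
RATIO = {"SRAM": 100, "DRAM": 1000}    # assumed E_memory / E_compute ratios

def inference_energy_pj(n_mac, n_fetch, mem="SRAM"):
    """Total energy (pJ) for n_mac MACs plus n_fetch memory accesses."""
    return n_mac * E_MAC + n_fetch * RATIO[mem] * E_MAC

# AlexNet-scale example (assumed): ~0.7 G-MACs, ~61 M weight fetches.
total = inference_energy_pj(0.7e9, 61e6, "SRAM")
compute = 0.7e9 * E_MAC
print(f"memory share of total energy = {100 * (total - compute) / total:.0f}%")
```

Even with every weight fetched only once from SRAM, data movement dominates, which is the motivation for computing inside the memory array itself.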
The DIMA Approach: read functions of data, never the raw data
• standard 6T SRAM BCA (preserves storage density)
• multi-row functional READ (reads multiple bits/row/precharge)
• bitline processing (BLP): SIMD analog processing
• cross bitline processing (CBLP): analog averaging enhances SNR
• reduced-complexity, bandwidth-efficient digital output → inference/decisions
[Architecture: bitcell array with precharge/column mux/Y-DEC and X-DEC; a row of BLPs feeds a cross bitline processor and residual digital unit; analog, mixed-signal low-SNR processing]
DIMA's 1st Stage - Functional READ
• matrix-vector product/read cycle
• column-major format
• multi-row access/read cycle
• modulated access pulses
Fig. 1: Deep in-memory architecture (DIMA) [7], [9], [11]: (a) single bank showing functional read (FR), bitline processing (BLP), cross BLP (CBLP) and sense amplifier (SA), where wordline (WL) drivers shaded in RED are turned on simultaneously, (b) column architecture and bitcell, and (c) idealized FR using pulse-width modulated (PWM) WL access pulses during a 4-bit word (W = 0000b) read-out. The WL pulses can be overlapped in time but are shown as being non-overlapped to enhance clarity.
A. DIMA Processing Stages

DIMA employs a standard SRAM BCA and read circuitry (at the bottom) to preserve conventional read functionality in normal read mode, as shown in Fig. 1(a), where write circuitry is omitted for simplicity. In-memory computing mode, on the other hand, processes both memory access and computation via embedded analog processors (at the top). For this mode, the B bits of the scalar weight W are pre-stored in a column-major format vs. the row-major format used in conventional read mode. A DIMA based on an Nrow × Ncol BCA has the following features:

• Multi-row functional read (FR): fetches a word W by reading a function of B rows per BL precharge (read cycle) from each column (Fig. 1(c)). Thus, Ncol words are read simultaneously per read cycle.
• BL processing (BLP): computes the scalar distance (SD) between W and X per column with an analog BLP block operating with low voltage swing based on a charge-transfer mechanism [9]. The Ncol BLP blocks operate in parallel in a single-instruction multiple-data (SIMD) manner.
• Cross BL processing (CBLP): aggregates the Ncol analog SD results from the BLP blocks by charge-sharing to obtain the vector distance (VD).
• Analog-to-digital converter (ADC) and residual digital logic (RDL): processes a thresholding/decision function f(·) and other miscellaneous functions.

DIMA is well-matched to the data-flow (Table I) intrinsic to commonly encountered ML algorithms. The benefits of DIMA are described next.
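The FR stage above can be sketched numerically: WL i is pulsed for a duration proportional to 2^i, so each stored bit discharges the BL in proportion to its binary weight, and the whole B-bit word is read out as a single analog value. A minimal idealized sketch (noise-free; `DV_LSB` is an assumed LSB discharge step):

```python
# Idealized multi-row functional read (FR): WL i is pulsed for T0 * 2^i,
# so bit w_i discharges the bitline by (2^i) * DV_LSB. The B bits of a
# column-major stored word thus appear as one analog BL discharge.
DV_LSB = 0.01  # assumed LSB bitline discharge step (volts)

def functional_read(bits):
    """bits = [w_{B-1}, ..., w_1, w_0], stored one per row of a column."""
    B = len(bits)
    return sum(w * (2 ** (B - 1 - i)) * DV_LSB for i, w in enumerate(bits))

# Reading W = 1011b (decimal 11) yields a BL discharge proportional to 11:
print(functional_read([1, 0, 1, 1]))
```

In the actual circuit the discharge is bounded by ΔV_BL,max; the sketch ignores such saturation and all noise/distortion terms.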
B. Energy and Throughput Benefits

The throughput of memory-intensive ML inference is often limited by memory access, since the complexity of the computation is low. In this case, the computation can be processed during the memory access of the next data in a pipelined fashion. The SRAM in a conventional digital system requires an L:1 column mux ratio (typically L = 4-32) to accommodate large-area sense amplifiers (SAs) as shown in Fig. 1(a), which limits the number of bits per access to Ncol/L in standard SRAM compared to Ncol·B in DIMA's FR. Therefore, DIMA needs LB times fewer precharge cycles to read the same number of bits, leading to throughput gains. As a result, the throughput improvement factor S of DIMA can be expressed as:

S = LB (T_CONV / T_DIMA)    (5)

where T_CONV is the delay of a single read access in the conventional system, and T_DIMA is the combined delay of a single FR and the following BLP and CBLP stages. T_DIMA is larger than T_CONV by a factor of 4-6 due to the access of multiple rows and the subsequent analog computations. Hence, S = 3.2-12.8 is easily achievable, e.g., for typical values of B = 4, L = 4, 8, 16, and T_DIMA = 5·T_CONV.

DIMA also achieves energy savings due to the fewer precharge cycles and low-swing computations.
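Equation (5) and the quoted range of S can be checked directly; a small sketch using the text's typical values (T_DIMA = 5·T_CONV is the assumption stated above):

```python
# Throughput improvement factor from eq. (5): S = L * B * (T_CONV / T_DIMA).
# DIMA reads Ncol*B bits per (slower) read cycle vs. Ncol/L bits per
# conventional read cycle.
def throughput_gain(L, B, t_ratio):
    """t_ratio = T_DIMA / T_CONV (typically 4-6)."""
    return L * B / t_ratio

for L in (4, 8, 16):
    print(f"L={L:2d}, B=4: S = {throughput_gain(L, 4, 5):.1f}")
```

This reproduces the S = 3.2-12.8 range quoted in the text.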
ΔV_BL(W) ∝ Σ_{i=0}^{B-1} 2^i · w_i   (per-column dot-product via binary-weighted BL discharge)
fundamental energy-delay vs. SNR trade-off
Bit-line Analog Processors
[Circuit detail: multi-row functional read drives BL/BLB via replica cells (ΔVBL ∝ D+P, ΔVBLB ∝ D+P); 256 columns (W0…W255, BL0…BL255) are precharged and processed in parallel; BLP outputs are charge-shared (ø1/ø2) into the cross BL processor (CBLP) to form the sum]
Kernel | BLP | CBLP
Manhattan distance | subtract-compare | aggregation
Euclidean distance | subtract-square | aggregation
Dot product | multiply | weighted aggregation
Hamming distance | XOR | aggregation
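The kernel list reads as a two-stage map-reduce: a per-column BLP op followed by CBLP aggregation across columns. A digital emulation sketch of that structure (the function names and pure-Python framing are mine, not from the talk):

```python
# Digital emulation of DIMA's two-stage compute: a per-column BLP op,
# then CBLP aggregation by charge-sharing (modeled here as a sum).
BLP_OPS = {
    "manhattan": lambda w, x: abs(w - x),    # subtract-compare
    "euclidean": lambda w, x: (w - x) ** 2,  # subtract-square
    "dot":       lambda w, x: w * x,         # multiply
    "hamming":   lambda w, x: w ^ x,         # XOR (on bits)
}

def dima_kernel(kernel, W, X):
    blp_out = [BLP_OPS[kernel](w, x) for w, x in zip(W, X)]  # SIMD BLP
    return sum(blp_out)                                      # CBLP aggregation

print(dima_kernel("dot", [1, 2, 3], [4, 5, 6]))        # 32
print(dima_kernel("manhattan", [1, 2, 3], [4, 5, 6]))  # 9
```

Swapping the per-column op while keeping the aggregation fixed is exactly what makes the hardware multi-functional across these kernels.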
charge-redistribution circuits
[Layout: 6T bitcell (2.11 µm × 0.915 µm); BL processor (BLP) subtractor (15.4 µm)]
tight pitch-matching constraints for BLPs
Pitch-matching in BL Processors: area overhead < 10% (512×256 BCA); BLP area overhead reduces with BCA size
6T SRAM DIMA Prototypes
8b compute in 65 nm CMOS (UIUC): 100× EDP reduction over von Neumann equivalent* @ iso-accuracy
[Die photos (1.2 mm × 1.2 mm): SRAM bitcell array, normal R/W interface, BLP-CBLP, CTRL, ADC, 64b bus, on-chip TRAINER, test block]
• AdaBoost [VLSI'16, JSSC'17] (Princeton)
• multi-functional & random forest [ESSCIRC'17, JSSC'18]
• on-chip learning [ISSCC'18, JSSC'18]
• DNN/CNN [VLSI'18] Verma (Princeton)
* 8b fixed-function digital architecture with identical SRAM size
DIMA - A Shannon-inspired Architecture
DIMA vs. von Neumann Architecture
von Neumann: BCA(W) surrounded by peripherals (L:1 col. mux); a read returns y = W
DIMA: BCA(W) with embedded analog BLP/CBLP; a read cycle returns y = Wᵀ·X + η (η: noise)
similar delay & energy per read cycle, but DIMA computes a matrix-vector multiply
• fundamental EDP vs. SNR trade-off (no free lunch)
• system accuracy derived from its Shannon-inspired roots (affordable lunch)
DIMA → accurate inferences with stochastic computes
von Neumann → accurate inferences with deterministic computes
DIMA's Compute SNR: V_T variations in the functional READ dominate
Figure 1: A comparison of: (a) DIMA showing functional read (FR), bitline processing (BLP), cross BLP (CBLP) and sense amplifier (SA), and (b) the conventional architecture. Wordline (WL) drivers shaded in RED are turned on simultaneously.
Figure 2: Data-flow of typical inference algorithms: (a) data-flow diagram, and (b) computations required in support vector machine (SVM), template matching (TM), k-nearest neighbor (k-NN), and matched filter (MF) algorithms.
the BCA size Nrow × Ncol, the BL precharge voltage VPRE, and the maximum BL swing ΔVBL,max are identical in both architectures.

Since the SRAM in the digital architecture requires an L:1 column multiplexer (typically L = 4, …, 16) to accommodate large-area sense amplifiers (SAs) as shown in Fig. 1(a), the number of bits per read cycle is limited to Ncol/L compared to Ncol·B in DIMA's FR. Therefore, DIMA needs LB times fewer read cycles to read the same number of bits. However, the read cycle time of DIMA is larger than that of the digital architecture since DIMA's read cycle includes both data read and compute functions via the FR, BLP, and CBLP stages. Ignoring the compute delay of the digital architecture, we find that the delay reduction factor in reading a fixed number of bits is given by:

ρ_d = LB (T_digital / T_DIMA) = LB / α    (1)

where T_digital and T_DIMA are the read cycle times of the digital architecture and DIMA, respectively, and α = T_DIMA/T_digital ≈ 3 or 6 (in 65 nm CMOS), depending on the BLP function being computed. Previous work [9] has shown that DIMA can realize B ≤ 6, with B = 4 being comfortably realized. Hence, ρ_d = 5×-to-21× is easily achievable with typical values of B = 4, L = 4, 8, 16, and α = 3.
The dominant sources of energy consumption in an SRAM array are the dynamic energy consumed to precharge the large BL capacitances CBL every read cycle, and leakage. The energy consumed in reading B bits in the digital architecture and DIMA can be expressed as:

E_digital = LB · C_BL · ΔV_BL,max · V_PRE + E_lk-digital    (2)

E_DIMA = γ · C_BL · ΔV_BL,max · V_PRE + E_lk-digital / ρ_d    (3)

where CBL is the BL capacitance, VPRE is the BL precharge voltage, ΔVBL is the BL (or BLB) voltage discharge caused by WL activation, E_lk-digital is the leakage energy of the digital architecture, and 1 ≤ γ < 2 is an empirical factor determined by whether DIMA's FR is a pure B-bit read or incorporates a 2B-bit computation. The leakage energy of DIMA is reduced by a factor of ρ_d since the array can be placed into standby mode after a duration of T_DIMA.

Since the first term in both (2) and (3) is the dominant component of the energy consumption during active mode (CBL is the largest node capacitance in either architecture), their ratio provides the energy reduction factor ρ_e as follows:

ρ_e = E_digital / E_DIMA = LB / γ    (4)

with ρ_e = 8×-to-32× easily achievable for typical values of B = 4, L = 4, 8, 16, and γ = 2.
Hence, from (1) and (4), the EDP reduction over a digital architecture enabled by DIMA is given by:

ρ_edp = ρ_e · ρ_d = (LB)² / (γα)    (5)

which ranges from 2.6×-to-2048×, of which the prototype ICs in [?] have achieved 100× in the laboratory. This clearly indicates that there is significant room to improve upon DIMA's EDP gains achieved thus far. We identify different methods to achieve this gain in later sections. If the digital architecture's compute energy were included in the above analysis, we would find that DIMA's low-swing analog computation is approximately 10× lower in energy than that of the digital architecture.

[Figure: energy per word (pJ) vs. ΔVBL (V) for the conventional architecture and DIMA with NROW = 256, 512, 1024.]

Fig. 3: Defining noise and distortion in DIMA.

The plots of the energy models (2)-(3) in Fig. 2 show that energy consumption increases linearly with ΔVBL and Nrow for both architectures due to the increased precharge energy. However, DIMA achieves aggressive energy efficiency by amortizing the precharge energy over B×Ncol bits compared to Ncol/L bits in the digital architecture. Figure 2 also shows that the silicon-measured energy consumption reported in [9] for Nrow = 512 and ΔVBL = 500 mV closely matches the values obtained from the energy models (2)-(3) (11% modeling error), further validating the energy-delay analysis in this section.
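The delay, energy, and EDP factors (1), (4), and (5) can be tabulated to reproduce the quoted ranges; a sketch with α and γ at the typical values used in the text:

```python
# Gain models: rho_d = LB/alpha (eq. 1), rho_e = LB/gamma (eq. 4),
# rho_edp = (LB)^2 / (gamma * alpha) (eq. 5).
def dima_gains(L, B, alpha=3.0, gamma=2.0):
    rho_d = L * B / alpha        # delay reduction factor
    rho_e = L * B / gamma        # energy reduction factor
    return rho_d, rho_e, rho_d * rho_e

for L in (4, 8, 16):
    rho_d, rho_e, rho_edp = dima_gains(L, B=4)
    print(f"L={L:2d}: rho_d={rho_d:5.1f}x  rho_e={rho_e:5.1f}x  rho_edp={rho_edp:6.1f}x")
```

With B = 4 and L = 4-16 this gives ρ_d ≈ 5.3-21.3, ρ_e = 8-32, and ρ_edp ≈ 43-683, consistent with the ranges above.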
C. Noise and Distortion Models

Though DIMA provides significant gains in EDP, this comes at the expense of the SNR of its analog computations [?]. To see this trade-off, consider the computation of an N-dimensional dot product shown below:

y = WᵀX = Σ_{i=1}^{N} w_i x_i    (6)

where W = [w_1, w_2, …, w_N]ᵀ and X = [x_1, x_2, …, x_N]ᵀ are two N-dimensional real-valued vectors of precision B_W and B_X, respectively. The dot product computation in (6) is pervasive and the most compute-intensive kernel in machine learning algorithms.

In DIMA, the dot product in (6) is modified to:

y = Σ_{i=1}^{N} [(w_i + η_f,i) x_i + η_b,i] + η_c    (7)

where ΔV_BL,i = w_i + η_f,i and ΔV_BLP,i = (w_i + η_f,i) x_i + η_b,i are the observed noisy outputs of the FR and BLP stages of
TABLE I: Noise and distortion in DIMA stages [9].

                        FR (η_f)     BLP (η_b)            CBLP (η_c)
                                     DP         SAD
% distortion (μ)        2.6 (1)      2.1 (1)    2.5 (1)   0.8 (3)
% noise (σ/μ)           5.3-22 (2)   2.8 (2)    3.2 (2)   0.2 (3)

(1) silicon-measured values
(2) Monte Carlo simulations with 0.4 V ≤ VWL ≤ 0.8 V
(3) estimated from the capacitor sizes in [9]

the i-th column, respectively, and ΔV_CBLP = η_c is the thermal noise at the output of the CBLP (see Fig. 1(b)). In practice, we find that σ²_f,i ≫ σ²_b,i ≫ σ²_c, as shown in Table I.
Distortion refers to the percentage difference between an ideal straight-line transfer function of the block and the mean of the measured values over a large number of cells. The distortion error magnitudes for FR and BLP were obtained from silicon measurements in [9]. A dedicated test column was used to obtain measured FR outputs, while the BLP output distortion was back-calculated from measured values of the CBLP output.

The noise magnitudes for both FR (Vt variation in the access transistors) and BLP (TBD) were estimated via Monte Carlo circuit simulations. The CBLP distortion (charge injection) and noise (thermal) values were calculated from knowledge of the capacitances (25 fF) used in [9].

Note that the noise and distortion error magnitudes for SAD are greater than those of DP because SAD [9] is computed by applying FR to both W and X, while DP is computed by applying FR only to W.

It can be seen that the FR stage generates the highest distortion and noise error magnitudes compared to BLP and CBLP. The random error in the FR stage is also the dominant random error source, as the BL is discharged by minimum-sized transistors in the bitcell operating in the near-threshold voltage regime with low VWL.
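The error model (7) can be exercised in simulation to see why DIMA tolerates a low-SNR FR stage: the column-wise errors largely average out in the CBLP aggregation. A Monte Carlo sketch (the Gaussian error model and σ values are illustrative, only loosely ordered as in Table I):

```python
import random

# Monte Carlo sketch of eq. (7): y = sum_i[(w_i + n_f)x_i + n_b] + n_c,
# with per-stage sigmas chosen (illustratively) so that FR >> BLP >> CBLP.
def noisy_dot(W, X, s_f=0.05, s_b=0.03, s_c=0.002, rng=random.Random(0)):
    y = sum((w + rng.gauss(0, s_f)) * x + rng.gauss(0, s_b)
            for w, x in zip(W, X))
    return y + rng.gauss(0, s_c)

N = 128
W = [0.5] * N
X = [0.5] * N
ideal = sum(w * x for w, x in zip(W, X))   # noise-free dot product = 32.0
trials = [noisy_dot(W, X) for _ in range(200)]
rel_err = sum(abs(t - ideal) for t in trials) / len(trials) / ideal
print(f"mean relative error ~ {rel_err:.3%}")
```

Per-column relative errors of a few percent shrink to around a percent at the aggregated output, which is what the aggregation principle exploits.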
In a digital architecture, the precision B_Y of y is usually set as:

B_Y = B_X + B_W + log₂(N)    (8)

where B_X and B_W are the precisions of the inputs x_i and weights w_i, respectively. Equation (8) reflects the typical bit-growth expected in digital arithmetic. In DIMA, by contrast, the BL discharge ΔV_BL represents the B bits of the weight w_i as a single analog quantity.
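The bit-growth rule (8) for a digital dot product is easy to evaluate; a one-line sketch assuming full-precision (lossless) accumulation:

```python
import math

# Eq. (8): output precision of an N-term dot product of B_X-bit inputs
# and B_W-bit weights, with full-precision accumulation.
def output_bits(b_x, b_w, n):
    return b_x + b_w + math.ceil(math.log2(n))

# e.g., 8b activations, 4b weights, N = 128 terms:
print(output_bits(8, 4, 128))  # 19
```

DIMA avoids carrying these 19 bits around digitally: the FR/BLP/CBLP chain keeps partial results analog until the single final decision.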
III. DIMA'S SHANNON-THEORETIC ATTRIBUTES

This section explains DIMA design principles based on SNR budgeting. ML algorithms have inherent error resiliency for the following reasons:

1) The thresholding operation into a fixed number of classes provides an algorithmic noise margin, as small errors do not change the classification result;
2) The aggregation of a large number of elements, widely used for dimensionality reduction in ML algorithms, makes the system insensitive to component noise;
3) The ability to learn error statistics (e.g., via retraining) effectively reduces the impact of noise.
Fig. 4: Simplified data-flow in: (a) DIMA and (b) the conventional architecture.
This inherent noise immunity can be exploited to achieve aggressive energy and throughput benefits in hardware implementations. Traditionally, supply voltage scaling has been widely employed to achieve energy efficiency, but at the cost of significantly degraded throughput (e.g., throughput degrades quadratically with voltage scaling [22]). Moreover, errors tend to occur at the MSBs in digital processors due to the carry-ripple behavior of arithmetic operations [23]. Another potential approach to exploit ML's error immunity is to reduce ΔVBL to save memory read energy. However, the conventional architecture with separate memory and processor does not allow much ΔVBL scaling. This is because ΔVBL needs to be converted to full swing by the SA before being processed in digital logic, which acts as a bit-wise early decision (Fig. 4(a)). If early decision errors occur at the MSBs due to ΔVBL scaling, the error magnitude can grow beyond the application's error immunity, leading to catastrophic failure of the inference system (e.g., peak signal-to-noise ratio (PSNR) < 10 dB with MSB errors [4], [5]). Aware of this issue, several low-power SRAM techniques [4], [5], [24] were proposed for inference applications with selective MSB protection, achieving limited energy savings in the memory, but not in the processor.

To resolve these issues, DIMA eliminates the separation between the low-swing memory and the high-swing processor, and instead budgets SNR by employing aggressively low-swing operations in both memory and processing jointly (Fig. 4(b)). This is enabled by the FR, which implicitly performs D/A conversion, and by the subsequent mixed-signal BLP and CBLP processing. Despite the degraded SNR, system-level robustness is maintained by following three principles: 1) delayed decision, 2) non-uniform bit protection, and 3) aggregation.
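The first and third principles can be contrasted in a toy simulation: a bit-wise early decision turns a small analog disturbance into a possible MSB flip, an error of 2^(B-1), whereas a single delayed decision after aggregation sees only a small, zero-mean perturbation. A sketch (all parameters are assumed, purely for illustration):

```python
import random

rng = random.Random(1)

def early_decision_value(bits, flip_p):
    """Conventional path: every bit is hard-thresholded by the SA; a single
    MSB flip changes the value by 2^(B-1)."""
    noisy = [b ^ (rng.random() < flip_p) for b in bits]
    return sum(b << i for i, b in enumerate(reversed(noisy)))

def delayed_decision(value, sigma, threshold):
    """DIMA path: analog noise enters once, before the single final decision."""
    return (value + rng.gauss(0, sigma)) >= threshold

# An MSB flip on an 8-bit word is an error of 128 -- catastrophic.
clean = early_decision_value([1, 0, 0, 0, 0, 0, 0, 0], flip_p=0.0)
print(clean)  # 128
# A delayed decision 5 sigma away from the threshold is almost never wrong.
correct = sum(delayed_decision(100, sigma=10, threshold=50) for _ in range(1000))
print(correct / 1000)
```

The same noise energy that would be catastrophic if committed bit-by-bit is harmless when the decision is deferred to after aggregation.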
DIMA – A Shannon-inspired Perspective
• accuracy via delayed decision, unequal protection, massive aggregation
644 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 53, NO. 2, FEBRUARY 2018

Fig. 2. Dataflow of typical inference algorithms. (a) Dataflow diagram. (b) Computations required in SVM, TM, k-NN, and MF algorithms.

Fig. 3. Memory access pattern and bitline swing (ΔVBL) in (a) DIMA and (b) conventional architecture, assuming B = 4 and L = 4. Bitcells marked in red are accessed simultaneously.

accommodate large-area sense amplifiers (SAs) as shown in Fig. 2(b), which limits the number of bits per access to Ncol/L in standard SRAM compared to Ncol·B in FR. Moreover, unlike the digital architecture, DIMA's Ncol BLPs

TABLE I: DIMA VERSUS CONVENTIONAL ARCHITECTURE (WITH Ncol × Nrow BCA)

Fig. 4. Simplified dataflow in (a) DIMA and (b) conventional architecture.

and the CBLP are pitch-matched to the BCA, avoiding the need to fully read out data (see Table I and Fig. 3 for a comparison summary).

Though DIMA needs far fewer precharge cycles to read the same number of bits, leading to both energy and throughput gains, it relaxes the fidelity/accuracy of its reads and computations. However, this loss is easily absorbed by the system-level robustness inherent in DIMA's functional dataflow. A comparison of DIMA's functional dataflow [Fig. 4(a)] with that of the conventional architecture [Fig. 4(b)] shows that: 1) the FR stage combined with the column-major format enables DIMA to assign significantly (exponentially) more weight (pulse width [21] or pulse amplitude [27]) to MSBs versus LSBs (significance-weighted bit protection); and 2) DIMA makes its first and only hard decision (thresholding) in the final RDL stage, thereby enabling the CBLP to enhance
[Figure: data → noisy channel → decoder. DIMA: delayed decision, unequal protection, massive aggregation in analog, final decision. Digital: early decision, equal protection, aggregation in digital.]
• a key rule in decision making: for accurate decisions, delay making hard decisions
• DIMA follows this rule; the digital arch. violates it → suffers from MSB errors
• DIMA's accuracy ≥ digital in the low-SNR regime – [Feb. JSSC'18 (UIUC)]
• accuracy gap increases with data dimensionality N
Delayed Decision
capacitance. This results in less process variation at a given BL swing, as shown in Fig. 4(b).
The fundamental trade-off between energy and SNR is well observed in Fig. 4(b), where noise is suppressed (increasing SNR) with higher BL swing at the cost of degraded energy efficiency. Therefore, it is critical to find the optimal BL swing (and VWL) which minimizes the energy consumption without degrading the application-level accuracy.
V. ROBUSTNESS OF DIMA
This section explains DIMA design principles based on SNR budgeting. ML algorithms have inherent error resiliency for the following reasons:
1) The thresholding operation into a fixed number of classes provides an algorithmic noise margin, as small errors do not change the classification results,
2) The aggregation of a large number of elements, widely used for dimensionality reduction in ML algorithms, makes the system insensitive to component noise,
3) The ability to learn error statistics (e.g., via retraining) effectively reduces the impact of noise.
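Resiliency mechanism (1) can be sketched with a toy hard-decision classifier; the class scores and noise direction below are made-up values for illustration. The argmax cannot flip while perturbations stay below half the decision margin.

```python
# Toy illustration of the algorithmic noise margin from thresholding:
# small perturbations of the class scores do not change the argmax
# as long as they are smaller than half the decision margin.
def classify(scores):
    """Hard decision: index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

scores = [0.20, 0.70, 0.10]                        # hypothetical class scores
margin = sorted(scores)[-1] - sorted(scores)[-2]   # top-1 vs. top-2 gap = 0.50

clean_class = classify(scores)
# Perturb every score by less than margin/2, in the worst direction
# (runner-up pushed up, winner pushed down): the decision cannot flip.
eps = 0.9 * margin / 2
noisy = [s + e for s, e in zip(scores, [eps, -eps, eps])]
noisy_class = classify(noisy)
```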
A. Delayed Decision
Due to the absence of SAs between memory and processor, DIMA does not require an early decision right after the BL discharge, where noise η1 is added; instead, the hard decision (thresholding)
Fig. 5: Simplified data-flow in (a) DIMA and (b) the conventional architecture.
happens only at the end of the classification process (Fig. 5(b)), namely the delayed decision. Consider the example of binary bit transmission (x ∈ {1, −1}) in Fig. 6(a), where the noise distributions η1 and η2 are modeled by N(0, σ²), the threshold level is 0, and equal priors P(X = 1) = P(X = −1) = 0.5 are assumed. The bit error probabilities pe, assuming identical independent noise sources, can be derived as follows:

pe = 2Q(1/σ)(1 − Q(1/σ))   (early decision)
pe = Q(1/(√2·σ))           (delayed decision)   (17)
Figure 6(b) shows the pe behavior with respect to the SNR based on (17). It clearly shows that the delayed decision is more robust in the low-SNR regime, as the early decision tends to amplify the noise contribution in that regime. In this sense, the delayed decision improves the robustness of DIMA, where the SNR is tightly budgeted.
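The two branches of (17) can be checked with a small Monte Carlo experiment; this is an illustrative sketch of the binary-transmission model above, not code from the paper.

```python
# Monte Carlo check of Eq. (17): early vs. delayed decision for a binary
# signal x in {+1, -1} passing through two additive noise stages
# n1, n2 ~ N(0, sigma^2).
import math
import random

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def simulate(sigma, trials=200_000, seed=1):
    rng = random.Random(seed)
    early_err = delayed_err = 0
    for _ in range(trials):
        x = 1.0
        n1, n2 = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        # Early decision: threshold after the first noise stage (the SA),
        # then threshold again after the second noise stage.
        x_hat = 1.0 if x + n1 > 0 else -1.0
        early_err += (1.0 if x_hat + n2 > 0 else -1.0) != x
        # Delayed decision: a single threshold after both noise stages.
        delayed_err += (1.0 if x + n1 + n2 > 0 else -1.0) != x
    return early_err / trials, delayed_err / trials

sigma = 1.0  # low-SNR regime
pe_early, pe_delayed = simulate(sigma)
pe_early_th = 2 * q_func(1 / sigma) * (1 - q_func(1 / sigma))  # Eq. (17), early
pe_delayed_th = q_func(1 / (math.sqrt(2) * sigma))             # Eq. (17), delayed
```

At σ = 1 the simulated rates land near the closed forms (≈ 0.267 early vs. ≈ 0.240 delayed), reproducing the low-SNR advantage of the delayed decision.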
B. Non-Uniform Bit Protection
The FR with PWM (Fig. 1(c)) allows us to assign the voltage swing unequally based on the significance of information (bit positions), namely non-uniform bit protection. When the total swing ΔVBL,T is budgeted for the read of a B-bit number, the swing assigned to the b-th bit position ΔVb for the conventional architecture and DIMA is given by

ΔVb,CONV = ΔVBL,T / B,   (18)
ΔVb,DIMA = (2^b / (2^B − 1)) · ΔVBL,T.   (19)

For example, non-uniform bit protection gives roughly 2.1× more swing to the MSB and 3.8× less swing to the LSB when B = 4, compared with the conventional equal allocation.
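The swing allocations in (18)-(19) can be verified numerically; a minimal sketch with a normalized swing budget (the 2.1× / 3.8× figures correspond to B = 4):

```python
# Per-bit bitline-swing allocation, Eqs. (18)-(19): the conventional
# architecture splits the total budget equally across B bits, while DIMA
# weights it by bit significance (2^b).
def swing_conventional(b, B, dv_total):
    return dv_total / B

def swing_dima(b, B, dv_total):
    return (2 ** b) / (2 ** B - 1) * dv_total

B, dv_total = 4, 1.0  # 4-bit word, normalized total swing budget
msb, lsb = B - 1, 0
msb_gain = swing_dima(msb, B, dv_total) / swing_conventional(msb, B, dv_total)
lsb_loss = swing_conventional(lsb, B, dv_total) / swing_dima(lsb, B, dv_total)
print(round(msb_gain, 2), round(lsb_loss, 2))  # → 2.13 3.75
# Both schemes spend the same total budget:
total = sum(swing_dima(b, B, dv_total) for b in range(B))
```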
[Figure: DIMA makes a single hard decision at the end; the digital architecture makes bit-wise hard decisions early.]
Massive Aggregation
Fig. 8: Measured normalized standard deviation (σ/μ) of BL swing with respect to the 4-b stored data, with BL swing per bit = 80 mV.
C. Aggregation
The charge-sharing process in the CBLP efficiently averages out noise contributions, reducing the standard deviation of the output by √N after aggregating N independent mean-zero random noise sources in Fig. 5(b). Specifically, DIMA's CBLP is effective in averaging out the noise in the SD computations, as the error statistics of the noise in the analog circuitry closely follow independent mean-zero behavior (e.g., spatial threshold variation).
For example, DIMA computes the SAD in the analog domain as follows:

v = (1/N) Σ_{i=1}^{N} v_i   (26)
where v_i denotes the voltage swing corresponding to |D_i − P_i| in Fig. 5(a). Taking the noise into account, (26) can be modified into

v̂ = (1/N) Σ_{i=1}^{N} (v_i + n_i) = v + n   (27)
where v̂ denotes the noisy SAD output due to n_i ~ N(0, σ²). Since the n_i are i.i.d., we note that

n = (1/N) Σ_{i=1}^{N} n_i ~ N(0, σ²/N).   (28)
This behavior was well observed in measurements of the silicon IC prototypes in [9]. The normalized standard deviation (σ/μ) of each BL was at most 12%, but the output of the CBLP block shows only 1.1% after charge-sharing 128 BLs. Moreover, DIMA provides the averaging opportunity for both η_a1 and η_a2 before the thresholding, due to the delayed decision, whereas the conventional digital architecture goes through the early decision before having any aggregation opportunity.
Additionally, the FR process also contains such an averaging effect, as the FR accumulates the amount of discharge from multiple transistors. Figure 8 shows that the measured normalized standard deviation (σ/μ) of the BL swing is larger when the BL is discharged by a single transistor, e.g., 4 = 0100(2) or 8 = 1000(2). On the other hand, the variation is suppressed by up to 65% when the BL is discharged by multiple transistors, due to the averaging during the FR.
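The √N averaging of (26)-(28) and the reported silicon numbers (12% per BL shrinking to about 1.1% across 128 BLs) can be reproduced with a short simulation; the Gaussian noise model here is illustrative.

```python
# Charge-sharing noise averaging, Eqs. (26)-(28): aggregating N i.i.d.
# mean-zero noise sources shrinks the output standard deviation by sqrt(N).
import random
import statistics

def aggregated_sigma(per_bl_sigma, N, trials=10_000, seed=7):
    rng = random.Random(seed)
    outs = [
        sum(rng.gauss(0.0, per_bl_sigma) for _ in range(N)) / N  # Eq. (27)
        for _ in range(trials)
    ]
    return statistics.pstdev(outs)

per_bl = 0.12                     # 12% normalized sigma per bitline
sigma_out = aggregated_sigma(per_bl, N=128)
predicted = per_bl / 128 ** 0.5   # Eq. (28): sigma/sqrt(N), about 1.1%
```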
VI. SYSTEM-LEVEL ACCURACY PREDICTION
In this section, the system-level performance is predicted in terms of the probability of detection p_det as a function of the major design parameters, based on the behavioral models. By deriving the probabilities of detection for the conventional system and DIMA, we show that DIMA can perform robust classification tasks. We analyze two different tasks: TM with the SAD kernel and SVM with the dot-product kernel.
A. Template Matching
In TM, the SAD values between an input query template and multiple candidates are measured to find the candidate with the minimum SAD. In the conventional architecture, the SAD is given by
s = (1/N) Σ_{i=1}^{N} x_i   (29)
where x_i = |D_i − P_i|. If we consider the circuit non-idealities, then (29) is modified into

ŝ = (1/N) Σ_{i=1}^{N} (x_i + e_i) = s + e   (30)

where ŝ represents the noisy SAD output. The circuit non-idealities result in the decimal errors e_i and e = (1/N) Σ_{i=1}^{N} e_i.
Suppose that there are two images D^(i) and D^(j), with corresponding SAD outputs s^(i) and s^(j), respectively. If s^(i) < s^(j), then the TM output will be D^(i), since its Manhattan distance to P is smaller than that of D^(j). The TM decision can be changed by the circuit non-idealities if ŝ^(i) > ŝ^(j). The mismatch probability that D^(j) is incorrectly chosen instead of D^(i) is given by
p^(i→j)_CONV = Q( α^(i,j)_TM · √( 3(2^B − 1) / (2(2^B + 1)) · N/p ) )   (31)

where α^(i,j)_TM denotes the normalized decision margin:

α^(i,j)_TM = |s^(j) − s^(i)| / (2^B − 1)   (32)

where 0 ≤ α_TM ≤ 1. We observe that the impact of the bit error probability on p^(i→j)_CONV is suppressed by a larger N (the detailed derivation can be found in Appendix B).
Without loss of generality, suppose that D^(0) = min_{i∈{0,...,M−1}} |P − D^(i)|. Then, the detection probability of the conventional architecture is given by

P_det,CONV = Π_{m=1}^{M−1} (1 − p^(0→m)_CONV)
           = Π_{m=1}^{M−1} (1 − Q( α^(0,m)_TM · √( 3(2^B − 1) / (2(2^B + 1)) · N/p ) )).   (33)
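Equations (31) and (33) can be evaluated numerically to show the role of the aggregation length N. The decision margins and the bit-level error parameter p below are hypothetical inputs, and the Q-function argument follows our reading of (31).

```python
# Numerical sketch of Eqs. (31)-(33): detection probability of conventional
# template matching vs. aggregation length N. The margins `alphas` and the
# error parameter `p` are hypothetical illustration inputs.
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def p_mismatch(alpha, B, N, p):
    """Eq. (31): probability a wrong candidate wins one pairwise test."""
    return q_func(alpha * math.sqrt(3 * (2 ** B - 1) / (2 * (2 ** B + 1)) * N / p))

def p_det(alphas, B, N, p):
    """Eq. (33): every one of the M-1 pairwise tests must favor the match."""
    out = 1.0
    for a in alphas:  # normalized decision margins, Eq. (32)
        out *= 1.0 - p_mismatch(a, B, N, p)
    return out

alphas = [0.02, 0.05, 0.10]  # hypothetical margins in [0, 1]
results = {N: p_det(alphas, B=8, N=N, p=0.05) for N in (16, 64, 256)}
```

As expected from (31), p_det rises monotonically with N, matching the observation that larger aggregation suppresses mismatch errors.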
[ISSCC'18 (UIUC)] σ/μ = 12% per BL → σ/μ = 1% after aggregation
• DIMA's analog-domain aggregation boosts SNR
– SNR boost factor = √N (Central Limit Theorem)
• Digital architectures miss out on this opportunity
Future Prospects
• analog-aware ISA
• Julia (compiler) software stack
• energy-accuracy optimization pass
• 22X EDP reduction for an 8b, 5-layer DNN
Scaling-up DIMA
Figure 2: The PROMISE architecture: (a) a single-bank architecture with NROW = 512, NCOL = 256, analog processing marked in red, and (b) a multi-bank architecture.
prevent the noise from analog operations accumulating over the iterations. This digitization also enables a multiple-bank architecture, where reliable data transfers between banks are required. TH implements non-linear operations such as sigmoid via piece-wise linear approximation [27], not only to compute the decision functions f() in Table I but also to aggregate intermediate computed values when the vector length N is larger than 128.
Lastly, X-REG is a digital block similar to a vector register file in a SIMD processor, holding eight 128-element vectors representing eight X values. CTRL is a controller that generates enable signals for the aforementioned components based on a given instruction, making CM function as a programmable mixed-signal accelerator.
Multiple-bank architecture: PROMISE can be extended to a multi-bank structure (Fig. 2(b)), which has multiple (up to eight in this work) PAGEs, each of which includes four banks. Thus, long (> 128) vectors can be distributed across multiple banks for parallel processing. PROMISE does not have a scalability limitation from the analog operations, as the partial results from each bank are always converted to digital values by the ADCs. The partial sums can then be aggregated across different banks via a cross-bank rail, similar to the H-Tree in [12]. The data transfer of the 8-bit ADC output from each bank to the others through the cross-bank rail
Figure 3: Analog pipeline in PROMISE with operating frequency = 1/TP; the red marked area is the analog domain.
takes only 0.5 pJ with an activity factor of 0.5, from post-layout simulations. This is negligible (< 1%), as an aREAD consumes up to 131 pJ/bank as listed in Table III. The control of multiple banks by the instruction is discussed in Section III-C.
Analog pipeline architecture: To achieve higher throughput than the original CM [9], PROMISE adopts analog pipelining, as shown in Fig. 3. This requires analog flip-flops (FFs) to hold the output of each stage. The first pipelined stage (aREAD) needs NCOL analog FFs, whereas the second pipelined stage, consisting of aSD and aVD, needs only one analog FF to store the aggregated scalar output. Thus, we propose to reuse a pre-existing capacitor, which is part of the analog multiplier [9] in the aSD block, as an analog FF for the aREAD stage to minimize the hardware overhead.
B. Challenges in Programmable Mixed-Signal Accelerator
The combination of algorithmic diversity and mixed-signal operations in PROMISE creates many challenges that we must consider when exploring ISA design choices. To support a broad range of ML algorithms, each stage needs to support diverse programmable operations. However, we identify challenges in supporting (too) many diverse programmable operations, which can significantly degrade both throughput and accuracy.
Higher impact of delay variation across operations: Analog operations often exhibit a wide range of delays depending on the operation type [9], while each stage must accommodate the worst-case delay of all the operations for that stage. Furthermore, as all the stages of a pipelined mixed-signal accelerator need to operate at the same clock period TP, TP needs to accommodate the worst-case delay across all the stages, i.e., TP = max(TS1_max, TS2_max, TS3_max, TS4_max), as illustrated in Fig. 4. That is, supporting more programmable operations causes longer idle times for some stages. As a result, we observe up to 2× throughput degradation when designing for the range of operations supported in this work. Furthermore, such long idle times negatively affect accuracy
at the algorithm level, because analog values are typically stored on (area-constrained) capacitors and thus degrade over time due to various leakage mechanisms. This problem is especially bad for the S1 stage (aREAD), as each BL is subject to the leakage contributions from all the bit-cells in the column (up to 0.6%/ns).
More limits on the sequence of operations: We also recognize other challenges specific to analog processing,
Figure 4: 4-stage processing in PROMISE with operational diversity per stage; TP = max(TS1_max, TS2_max, TS3_max, TS4_max).
which further limits the number of possible programmable operations. Specifically, a chain of analog processing stages imposes an intrinsic sequentiality of operations, where two consecutive stages need to be physically closely placed. This is done to avoid substantial degradation of the analog voltage from one stage to the next. Furthermore, a large capacitor ratio between consecutive stages (input-output 20:1, to maintain the voltage drop < 5%) is required to transfer the signal via the charge-transfer mechanism without an additional analog buffer.
C. PROMISE Instruction Set
We explore an ISA suitable for a programmable mixed-signal accelerator, considering the above constraints.
Instruction format: We propose a wide-word macro-instruction format, referred to as a Task. Akin to a Very Long Instruction Word (VLIW), a single Task consists of multiple operations, except that the operations are sequential rather than parallel as in VLIW architectures. As depicted in Fig. 5(a), the four Class fields specify four operations for the four pipelined stages of PROMISE, while the three other fields, OP_PARAM, RPT_NUM and MULTI_BANK, configure all or specific Class operations. More specific descriptions of these seven fields are as follows.
Operating parameter field: OP_PARAM comprises 33 bits and configures the operating parameters of the Class operations in a given Task, facilitating flexible programmability. As shown in Fig. 5(b), W_ADDR specifies a CM address for a Class-1 operation. X_ADDR1 and X_ADDR2 designate X-REG addresses for Class-1 and Class-2, respectively. SWING controls the BL swing ΔVBL; e.g., 111 allows 30 mV/LSB whereas 001 allows 5 mV/LSB. This parameter is a key knob to control the trade-off between energy and accuracy under software control; Section VI evaluates this accuracy-energy trade-off for further energy savings. Refer to Fig. 5(c) for descriptions of the remaining parameters and their assignments of bit fields.
Class fields: Class-1 is composed of a 3-bit opcode and defines six possible memory operations. READ, WRITE, or aREAD makes the CM perform a digital read, digital write, or analog read operation to a compute-memory address specified by OP_PARAM (W_ADDR). aADD or aSUB fuses an analog read and an element-wise analog addition or subtraction into a single operation, where the two vector operands come
(a) Task format: OP_PARAM | RPT_NUM | MULTI_BANK | Class-1 | Class-2 | Class-3 | Class-4

(b) OP_PARAM fields:

Contents | Bits | Description
SWING | [27:25] | ΔVBL swing code – 000: min (5 mV/LSB), 111: max (30 mV/LSB)
ACC_NUM | [24:23] | # of operands to be accumulated for the accumulate opcode in Class-4
W_ADDR | [22:14] | Bit-cell array address of W in Class-1
X_ADDR1 | [13:11] | Bit-cell array address of X in Class-1
X_ADDR2 | [10:8] | X-REG array address of X in Class-2
X_PRD | [7:6] | X_ADDR1 & 2 circulate from 0 to "X_PRD - 1"
DES | [5:4] | Class-4 output destination – 00: ACC input, 01: output buffer, 10: X-REG, 11: write data buffer
THRES_VAL | [3:0] | Thresholding reference value for the threshold opcode in Class-4

(c) [per-Class opcode encodings]

Figure 5: Instruction set of PROMISE: (a) instruction format, (b) operation parameters (OP_PARAM), and (c) operations in each Class.
from the compute-memory addresses specified by OP_PARAM (W_ADDR and X_ADDR1), respectively.
Class-2 consists of a 4-bit opcode and specifies a composition of one of six possible aSD operations with one of two possible aVD operations. Specifically, aSD, operating on a computed value from Class-1, supports three unary operations: compare, absolute, and square, and two binary operations: sign_mult and unsign_mult, where the other operand comes from an X-REG address specified by OP_PARAM (X_ADDR2). aVD specifies whether an aggregation should be performed or not after an aSD operation.
Class-3 and Class-4 comprise 1- and 3-bit opcodes and control whether an ADC should be performed or not, and specify one of seven possible TH operations,
DIMA accelerator (ISCA’18, UIUC) multi-bank DIMA (VLSI’18, Princeton)
instruction set
o 8T BC; 2.4Mb BCA; 64 banks; 1b compute; charge-based summing
o accuracy – SVHN (94%); CIFAR-10 (83%) with 9-layer ConvNets
o 658 TOPS/W
[2019 IEEE MICRO Top Picks Honorable Mention]
• larger BL caps, smaller BCs, slower devices, higher VT variations
DIMA in Flash & Emerging Memories
• high conductance, small TMR, high MTJ variations, write endurance
[Figures: Flash-based DIMA with a 128k × 192k SLC NAND flash array, WL drivers and pulse generation, current-integrator functional read (D/A), 16k BL-processor multipliers, a cross-BL processor (CBLP), and A/D conversion; MRAM-based DIMA bitcell array with ΔVBL functional read and ADC.]
Flash-based DIMA (ISCAS’18) MRAM-based DIMA (ISCAS’19)
[UIUC, Micron] [DARPA ERI: UIUC, Princeton, GLOBALFOUNDRIES, Raytheon]
Objective: Explore DIMA acceleration of customized neuro-inspired algorithms for Automatic Target Recognition
Sandia-UIUC Project: DIMA Acceleration of ATR Workloads
and a stride greater than one. In most cases, the pooling replacement was done immediately before the classical convolutional layer. The stride and size greater than one allow the layer to reduce the area dimension and to move up the feature hierarchy. Therefore, the convolutional layer is able to act as a pooling layer. An example of two layers in the EEDN model is shown in Fig. 3.
Each group also contains several network-in-network (NIN) layers. Each convolutional layer acts as a classifier for some type of feature, but alone they are limited to be linear classifiers. NIN layers contain kernels with an area of one, but a very large depth. They help to fine-tune the classifications of the traditional convolutional layer by building more complex classifiers and allowing more configurable parameters. They are sandwiched between traditional convolutional layers and pooling layers, but do not change the data dimensions.
The EEDN software [9] is a deep learning toolkit [6] developed by IBM to work with the TrueNorth processor. It uses the MatConvNet framework to train networks from MATLAB, then converts them to run on TrueNorth using the Corelet programming language. While TrueNorth supports a variety of parameters for the synaptic weights and behaviors, the EEDN toolkit only allows three deterministic weights: [-1, 0, 1]. Therefore, during the training process, the loss function included a factor that encourages the then-floating-point weights to settle where it is easy to convert them. When they are converted during testing, some accuracy loss usually occurs.
The EEDN software and NSCS simulator were used to estimate the accuracy of variations of the original network on TrueNorth. Once the image database was created, the model was trained from Matlab on a conventional GPU-accelerated server. Throughout training, the software produces an estimation of the performance on TrueNorth, along with validating on the validation dataset. At the end, the model was tested with a separate testing set using the NSCS simulator. The NSCS simulator, while not truly the TrueNorth hardware, did give us the best estimate of how well the network would do on the hardware itself. Finally, certain models were selected, then tested on the TrueNorth NS1e board, and measured for the speed of computation, accuracy, and the active energy per image classification.
B. EEDN Model for SAR Imagery Classification
We build our models based on the DCNN model layers
supported by the EEDN software. A full-size model is designed to use almost exactly one chip, 4,064 cores, during testing. It is composed of 15 convolutional layers; however, with the NIN layers discounted and the down-sampling convolutional layers acting as pooling layers, it is roughly comparable to a network with 4 convolutional layers and 4 pooling layers as defined in Caffe [5].
The full-size model used contained more feature planes than most previous classification networks for this data. The purpose of this is to allow some redundancy within the network, to reduce the loss when the precision of the synaptic weight is reduced from floating point values to the restricted set of [-1, 0, 1]. In some ways, that loss of precision can help to act similarly to a dropout layer. Having the extra planes helps the network
average across that precision loss. However, a balance needs to be struck between using too large a network (and encouraging overfitting) and having too small a network, which would lose too much vital information when converted. Comparing the training, test, and validation accuracy across different conditions assessed that balance. If the training accuracy is high, but the validation and testing accuracies are low, then the network was likely overfitting the data, and hence was too large. However, if the training, validation, and testing accuracies are relatively high prior to conversion, but testing on hardware after conversion went poorly, then the network is too sensitive to that loss of precision, and needed to be bigger.
There are several variations on the original model tested. In one set, the number of outputs (feature planes) of each layer is scaled evenly by factors of 1/8. In another, the preprocessing layer is left untouched, but the rest of the network was scaled evenly. This is to test the effect of the “extra” feature planes on the robustness of the network. While the model is scaled evenly, the number of active cores does not scale linearly with the number of output layers.
Also tested is the network, scaled the same by factors of 1/8, but with all of the network-in-network layers removed. This is to test the hypothesis that those extra layers were serving to help add complexity to the kernels, improving accuracy. It was suspected that removing these layers would greatly reduce the accuracy of the model.
IV. EXPERIMENTAL SETUPS AND RESULTS
A. The Moving and Stationary Target Acquisition and Recognition (MSTAR) Public Dataset
The dataset used for this experiment is composed of Synthetic Aperture Radar (SAR) images from the MSTAR public dataset. The images are taken by a side-looking radar antenna on a plane passing overhead. The magnitude and phase of the radio waves are collected, and the magnitude data can be converted into a grayscale JPEG image. Once converted, the target sits in the center of an approximately 128×128-pixel image. The data were primarily collected at two depression angles, 15 degrees and 17 degrees. In existing machine-learning research on this dataset, the 17-degree data are usually used for training and the 15-degree data for testing.
There are 10 target classes in the dataset, consisting mostly of military vehicles. Fig. 4 shows sample SAR images of the ten target classes. There are between one and three physical representations of each vehicle, and some …
[Fig. 4. Sample images and their class labels in the MSTAR public dataset.]
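The preprocessing described above, converting magnitude data to grayscale and taking a centered 128×128 chip, can be sketched as below. The function names and the simple linear intensity scaling are assumptions for illustration; the exact conversion used to produce the JPEG chips is not specified here.

```python
import numpy as np

def center_crop(img, size=128):
    """Crop a (H, W) magnitude image to a size x size window around the
    centered target, as in the MSTAR chips described above."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def magnitude_to_grayscale(mag):
    """Map SAR magnitude data to 8-bit grayscale via linear scaling
    (an assumed, simplified conversion)."""
    mag = mag.astype(np.float64)
    rng = mag.max() - mag.min()
    if rng == 0:
        return np.zeros(mag.shape, dtype=np.uint8)
    return (255 * (mag - mag.min()) / rng).astype(np.uint8)
```

With the standard protocol noted above, the 17-degree chips would then be routed to the training set and the 15-degree chips to the test set.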
§ MSTAR Dataset
§ Neuro-inspired Algorithms (MPM/DNN/CNN)
§ Deep In-memory Architecture (DIMA)
§ Silicon-validated Noise, Energy, Latency metrics
§ Cost-benefit Analysis
§ accuracy > 99%
§ fixed-point < 0.07% drop
§ 6.5X reduction in storage
Convolutional Neural Networks for ATR
MorganNet [Algorithms for SAR‘15] WagnerNet [TAES’16]
Model      | Data Augmentation   | Quantization Scheme | Number of Parameters | Storage Needed (KB) | Test Accuracy (%)
WagnerNet  | None                | FL                  | 410330               | 1602                | 97.22
WagnerNet  | ED(10,3) + ED(20,3) | FL                  | 410330               | 1602                | 99.21
QWagnerNet | ED(10,3) + ED(20,3) | FX-5-4              | 410330               | 250                 | 99.40
MorganNet  | None                | FL                  | 88162                | 344                 | 97.81
MorganNet  | ED(10,3) + ED(20,3) | FL                  | 88162                | 344                 | 99.13
QMorganNet | ED(10,3) + ED(20,3) | FX-5-4              | 88162                | 54                  | 99.06
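The storage figures above follow directly from parameter count times bits per weight: FL is 32-bit floating point, and FX-5-4 is interpreted here (an assumption consistent with the ~6.4× ratio in the table) as 5-bit fixed-point weights. A quick check:

```python
def storage_kb(num_params, bits_per_weight):
    """Weight storage in KB: parameters x bits -> bytes -> KB (1024 B)."""
    return num_params * bits_per_weight / 8 / 1024

# WagnerNet (410330 params) and MorganNet (88162 params); values match
# the table to within rounding.
for params in (410330, 88162):
    fl = storage_kb(params, 32)   # floating point
    fx = storage_kb(params, 5)    # FX-5-4, assuming 5-bit weights
    print(f"{params} params: FL ~{fl:.0f} KB, FX ~{fx:.0f} KB, ratio {fl / fx:.1f}x")
```

The 32/5 = 6.4× bit-width ratio is the source of the "6.5X reduction in storage" headline figure above.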
Algorithm Simulations
§ functional read noise is data-dependent: η ~ N(0, σ_η²) → validated in silicon
§ model complexity vs. noise immunity trade-off
§ energy vs. accuracy trade-off → σ_η reduces with higher WL voltage
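The additive Gaussian read-noise model above can be illustrated with a small Monte Carlo sketch. The dot-product size and the σ_η values are illustrative only, not silicon-measured numbers; the point is that raising WL voltage (lower σ_η) shrinks the output error at the cost of read energy.

```python
import numpy as np

rng = np.random.default_rng(0)

def dima_dot_product(w, x, sigma_eta):
    """Ideal dot product plus additive Gaussian functional-read noise
    eta ~ N(0, sigma_eta^2), the model stated above."""
    return w @ x + rng.normal(0.0, sigma_eta)

w = rng.choice([-1.0, 1.0], size=256)   # ternary-style weights
x = rng.normal(size=256)                # input activations
ideal = w @ x

# Higher WL voltage -> smaller sigma_eta -> smaller output error,
# at the cost of higher read energy.
for sigma in (4.0, 1.0, 0.25):
    err = np.std([dima_dot_product(w, x, sigma) - ideal for _ in range(2000)])
    print(f"sigma_eta={sigma}: empirical error std ~ {err:.2f}")
```

The empirical error standard deviation tracks σ_η, which is the energy-accuracy knob referenced in the bullets above.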
DIMA Inference Accuracy on the MSTAR Dataset
Next Steps
§ Develop energy-delay-accuracy models of DIMA components
§ Map DNNs/CNNs to DIMA
§ Study system-level trade-offs in realizing ATR in DIMA
§ Benchmark against a digital architecture baseline
The Future
[Figure: distributed sensors stream DATA to a computing platform (memory + processor, with DIMA) powered by a finite energy source, producing DECISIONS; metrics: decision latency, energy/decision, decision accuracy]
DIMA-enabled Autonomous Agents
fully autonomous functionality via in-situ information processing of rich multimodal sensory inputs under severely constrained volume, energy & computational resources
Acknowledgments
• Student Collaborators – Mingu Kang, Sujan Gonugondla, Ameya Patil, Yongjune Kim, Charbel Sakr, Hassan Dbouk, Kuk-Hwan Kim, ….
• Faculty Collaborators – SONIC faculty including Naveen Verma, Lav Varshney, Pavan Hanumolu, Boris Murmann, David Blaauw, Jan Rabaey, Subhasish Mitra, Philip Wong….
• Sponsors – National Science Foundation, the Defense Advanced Research Projects Agency, the Semiconductor Research Corporation, Texas Instruments, Micron, IBM, Raytheon Missile Systems, GLOBALFOUNDRIES, Intel Corporation, Sandia National Labs.
• Research Programs (Centers) – FCRP (GSRC), STARnet (SONIC), JUMP (C-BRIC), Intel (ISTC)