Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element
G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927-1512, E-mail: [email protected]
Pohang University of Science and Technology, Pohang, South Korea
Abstract
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron network with 164,885 synapses is trained on a subset (5,000 examples) of the MNIST database of handwritten digits using a backpropagation variant suitable for NVM+selector crossbar arrays, obtaining a training (generalization) accuracy of 82.2% (82.9%). Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing is performed with respect to NVM variability, yield, and the stochasticity, linearity and asymmetry of NVM-conductance response.
Introduction
Dense arrays of nonvolatile memory (NVM) and selector device-pairs (Fig. 1) can implement neuro-inspired non-Von Neumann computing [1,2], using pairs [2] of NVM devices as programmable (plastic) bipolar synapses. Work to date has emphasized the Spike-Timing-Dependent-Plasticity (STDP) algorithm [1,2], motivated by synaptic measurements in real brains, yet experimental NVM demonstrations have been limited in size (≤100 synapses).
Unlike STDP, backpropagation [3] is a widely-used, well-studied NN algorithm, offering benchmark-able performance on datasets such as handwritten digits (MNIST) [4]. In forward evaluation of a multilayer perceptron, each layer's inputs (xi) drive the next layer's neurons
Fig. 1 Neuro-inspired non-Von Neumann computing [1,2], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs.
Fig. 2 In forward evaluation of a multilayer perceptron, each layer's neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by pixels from successive MNIST images (cropped to 22×24); the 10 output neurons identify which digit was presented.
through weights wij and a nonlinearity f() (Fig. 2). Supervised learning occurs (Fig. 3) by back-propagating error terms δj to adjust each weight wij. A 3-layer network is capable of accuracies, on previously unseen test images (generalization), of up to 97% [4] (Fig. 4); even higher accuracy is possible by first pre-training the weights in each layer [5]. Like STDP, low-power neurons should be achievable by emphasizing brief spikes [7] and local-only clocking.
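The forward evaluation just described can be sketched in a few lines. Only the 528-250-125-10 topology is taken from Fig. 2; the tanh nonlinearity, the random stand-in weights, and the `forward` helper are our illustrative assumptions (the paper does not specify f()):

```python
import numpy as np

def f(x):
    # Neuron nonlinearity; assumed tanh purely for illustration.
    return np.tanh(x)

def forward(x, weights):
    """Forward evaluation: each layer's outputs x_i drive the next
    layer's neurons, x_j = f(sum_i x_i * w_ij)."""
    for W in weights:
        x = f(x @ W)
    return x

rng = np.random.default_rng(0)
# 528-250-125-10 topology from Fig. 2 (22x24 cropped MNIST pixels in,
# 10 digit classes out); random weights stand in for trained ones.
sizes = [528, 250, 125, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
out = forward(rng.random(528), weights)
digit = int(np.argmax(out))  # the most strongly driven output neuron
```

In the crossbar of Fig. 5, the matrix-vector product `x @ W` is what the analog read operation computes in parallel.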
Considerations for a crossbar implementation
By encoding synaptic weight in the conductance difference between paired NVMs, wij = G+ − G− [2], forward propagation simply compares total read signal on bitlines (Fig. 5). However, backpropagation [3] calls for weight updates Δwij ∝ xi·δj (Fig. 6), requiring upstream (i) and downstream (j) neurons to exchange information for each synapse. In a crossbar, learning becomes much more efficient when neurons modify weights in parallel, by firing pulses whose overlap at the various NVM devices implements training [1] (Fig. 7). Fig. 8 shows, using a simulation of the NN in Figs. 2-3, that this adaptation for NVM implementation has no effect on accuracy.
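The pulse-overlap idea can be sketched as follows. The 4-timeslot limit and the sqrt-of-learning-rate scaling on both the xi and δj sides come from Fig. 7; the exact quantization, the overlap count min(nx, nd) standing in for the product, and the assumption that activations xi are non-negative are our illustration:

```python
import numpy as np

def to_pulses(v, max_pulses=4):
    # Quantize a magnitude into an integer number of pulse timeslots
    # (Fig. 7 uses up to 4 timeslots, A-D).
    return np.minimum(np.round(np.abs(v)).astype(int), max_pulses)

def crossbar_update(x, delta, eta, max_pulses=4):
    """Crossbar-compatible update in the spirit of Fig. 7: upstream
    neurons fire pulses scaled by sqrt(eta)*x_i, downstream neurons by
    sqrt(eta)*|delta_j|, so the count of overlapping timeslots at
    synapse (i, j) approximates eta * x_i * delta_j.  Activations x_i
    are assumed non-negative, so the update sign follows delta_j."""
    nx = to_pulses(np.sqrt(eta) * x * max_pulses, max_pulses)
    nd = to_pulses(np.sqrt(eta) * np.abs(delta) * max_pulses, max_pulses)
    overlap = np.minimum(nx[:, None], nd[None, :])  # coincident pulses per device
    return overlap * np.sign(delta)[None, :]        # signed update, in pulse units
```

Pulses where the returned update is positive would target G+, and where negative, G−; no synapse-by-synapse communication between the i and j neurons is needed.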
However, the conductance response of any real NVM device exhibits imperfections that could still decidedly affect NN performance, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices at low or high conductance (Fig. 9). This paper explores the relative importance of each of these factors.
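A minimal model of such an imperfect device might look like the following; all parameter values (maximum conductance, step size, nonlinearity, noise) are placeholders, not the fitted values of Fig. 17:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def apply_set_pulse(G, Gmax=20.0, dG0=1.0, nonlin=0.15, sigma=0.3):
    """One partial-SET pulse applied to a model NVM conductance G (uS).
    Illustrative imperfections:
      - nonlinearity: the conductance step shrinks as G grows,
      - stochasticity: Gaussian spread sigma on each step,
      - bounded response: G is clipped to [0, Gmax]."""
    dG = dG0 * math.exp(-nonlin * G)   # smaller steps at high G
    dG += rng.normal(0.0, sigma)       # pulse-to-pulse variation
    return min(max(G + dG, 0.0), Gmax)
```

Setting sigma to zero isolates the nonlinearity: the step at G = 0 is larger than the step at G = 10, which is exactly the property that, per Fig. 11, prevents alternating weight updates from cancelling.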
Fig. 3 In supervised learning, error terms δj are back-propagated, adjusting each weight wij to minimize an energy function by gradient descent, reducing classification error between computed (xl) and desired (gl) output vectors.
Fig. 4 A 3-layer perceptron network can classify previously unseen (test) MNIST handwritten digits with up to 97% accuracy [4]. Training on a subset of the images sacrifices some generalization accuracy but speeds up training.
Fig. 5 By comparing total read signal between pairs of bitlines, summation of synaptic weights (encoded as conductance differences, wij = G+ − G−) is highly parallel.
Fig. 6 Back-propagation calls for each weight to be updated by Δwij = η·xi·δj, where η is the learning rate. Colormap shows log(occurrences), in the 1st layer, during NN training (blue curve, Fig. 4); white contours identify the quantized increase in the integer weight.
Fig. 7 In a crossbar, efficient learning requires neurons to update weights in parallel, firing pulses whose overlap at the various NVM devices implements training.
Fig. 8 Computer NN simulations show that a crossbar-compatible weight-update rule (Fig. 7) is just as effective as the conventional update rule (Fig. 6).
While bounding G values reduces NN training accuracy slightly (Fig. 10), unidirectionality and nonlinearity in the G-response strongly degrade accuracy. Figure insets (Fig. 10) map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G− (weight is vertical position). In this context (Fig. 11), a synapse with a highly asymmetric G-response moves only unidirectionally, from left to right. Once one G is saturated, subsequent training can only increase the other G value, reducing weight magnitude, deleting trained information, and degrading accuracy. Nonlinearity in G-response further encourages weights of low value (Fig. 11), which can lead to network freeze-out (no weight changes, Fig. 10 inset). One solution to the highly asymmetric response of PCM devices is occasional RESET [2], moving synapses back to the left edge of the G-diamond while preserving weight value (with iterative SETs, Fig. 12 inset).
Fig. 9 The conductance response of an NVM device exhibits imperfections, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices (at low or high G).
Fig. 10 Bounding G values reduces NN training accuracy slightly, but unidirectionality and nonlinearity in G-response strongly degrade accuracy. Figure insets map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G− (weight is vertical position) for a sampled subset of the weights.
Fig. 11 If G values can only be increased (asymmetric G-response), a synapse at point A (G+ saturated) can only increase G−, leading to a low weight value (B). If response at small G values differs from that at large G (nonlinear G-response), alternating weight updates can no longer cancel. As synapses tend to get herded into the same portion of the G-diamond (C → D), the decrease in average weight can lead to network freeze-out.
However, if this is not done frequently enough, weight stagnation will degrade NN accuracy (Fig. 12).
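The refresh operation can be sketched as follows; the 80%-of-Gmax trigger and the exact restoration of the weight are our idealizations (in hardware the iterative SET preserves the weight only to some accuracy):

```python
def refresh(Gp, Gm, Gmax=20.0, threshold=0.8):
    """Occasional-RESET refresh (idea of Fig. 12): once either device
    nears saturation, RESET both conductances and re-program only the
    one carrying the sign of the weight, preserving w = G+ - G- while
    moving the synapse back to the left edge of the G-diamond.
    Threshold and perfect re-programming are illustrative idealizations."""
    if max(Gp, Gm) < threshold * Gmax:
        return Gp, Gm            # far from saturation: leave untouched
    w = Gp - Gm                  # weight value to preserve
    # RESET both devices, then iterative SETs rebuild |w| on one of them
    return (w, 0.0) if w >= 0 else (0.0, -w)
```

After the refresh the synapse can again respond to updates of both signs, which is why accuracy in Fig. 12 degrades when refreshes are too infrequent.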
Experimental results
We implemented a 3-layer perceptron of 164,885 synapses (Figs. 2-3) on a 500 × 661 array of mushroom-cell [6], 1T1R PCM devices (180nm node, Fig. 13). While the update algorithm (Fig. 7) is fully compatible with a crossbar implementation, our hardware allows only sequential access to each PCM device (Fig. 14). For read, a sense amplifier measures G values and thus weights for the software-based neurons, mimicking column- and row-based integrations. Weights are increased (decreased) by identical partial-SET pulses (Fig. 7) to increase G+ (G−) (Fig. 15). The deviation
Fig. 12 Synapses with large conductance values (inset, right edge of G-diamond) can be refreshed (moved left) while preserving the weight (to some accuracy), by RESETs to both G followed by a partial SET of one. If such RESETs are too infrequent, weight evolution stagnates and NN accuracy degrades.
Fig. 13 Mushroom-cell [6], 1T1R PCM devices (180nm node) with 2 metal interconnect layers enable 512×1024 arrays.
Fig. 14 A 1-bit sense amplifier measures G values, passing the data to software-based neurons. Conductances are increased by identical 25ns partial-SET pulses to increase G+ (G−) (Fig. 7), or by RESETs to both G followed by an iterative SET procedure (Fig. 12).
[Fig. 15 flowchart: our implementation interleaves hardware (PCM array) and software (neuron) steps, mirroring an eventual full-hardware crossbar implementation:
1. Write random weights to PCM G values (randomize weights).
2. Present next MNIST example.
3. Forward propagation: measure I+ − I− ∝ Σ xi·wij and compute f(Σ xi·wij), repeating for each NN layer, left to right (in full hardware, the nonlinearity resides in the hardware neuron).
4. Backpropagation: compute δj corrections, repeating for each NN layer, right to left.
5. Compute pulses to apply (Fig. 7) to each synapse (G+ or G−); both upstream (xi) and downstream (δj) neurons apply pulses, programming the synapse.
6. After each mini-batch (5 examples), apply pulses to PCM G values; after 20 examples, 2 RESETs (+1 incremental SET) for synapses with large G values.]
Fig. 15 Although G values are measured sequentially, weight summation and weight update procedures in our software-based neurons closely mimic the column (and row) integrations and pulse-overlap programming needed for parallel operations across a crossbar array. However, since occasional RESET is triggered when both G+ and G− are large, serial device access is required to obtain and then re-program individual conductances.
from a true crossbar implementation occurs upon occasional RESET (Fig. 12), triggered when either G+ or G− is large, thus requiring both knowledge of and control over individual G values.
Fig. 16 shows measured accuracies for a hardware-synapse NN, with all weight operations taking place on PCM devices. To reduce test time, weight updates for each mini-batch of 5 MNIST examples were applied together. Fig. 17 plots measured G-response, stochasticity, variability, stuck-ON pixel rate, and RESET accuracy. By matching all parameters, including stochasticity (Fig. 18), to those measured during the experiment, a NN computer simulation can
Fig. 16 Training and test accuracy for a 3-layer perceptron of 164,885 hardware-synapses, with all weight operations taking place on a 500 × 661 array of mushroom-cell [6] PCM devices (Fig. 13). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment.
Fig. 17 50-point cumulative distributions of experimentally measured conductances for the 500 × 661 PCM array, showing variability and stuck-ON pixel rate. Insets show the measured RESET accuracy, and the rate and stochasticity of G-response, plotted as a colormap of ΔG-per-pulse vs. G.
Fig. 18 Fitted G-response vs. # of pulses (blue: average, red: ±1σ responses), obtained from our computer model (inset) for the rate and stochasticity of G-response (ΔG-per-pulse vs. G) matched to experiment (Fig. 17).
precisely reproduce the measured accuracy trends (Fig. 16).
Tolerancing and power considerations
We use this matched NN simulation to explore the importance
Fig. 19 Matched simulations show that an NVM-based NN is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent or inaccurate RESETs. Mini-batch size = 1 avoids the need to accumulate weight updates before applying them. However, the nonlinear and asymmetric G-response limits accuracy to ~85%, and requires learning-rate and neuron-response (f) to be precisely tuned.
Fig. 20 NN performance is improved if G-response is linear and symmetric (green curve) rather than nonlinear (red). However, asymmetry between the up- and down-going G-responses (blue), if not corrected in the weight-update rule (Fig. 7), can strongly degrade performance by favoring particular regions of the G-diamond (Figs. 10, 11).
of NVM imperfections. Fig. 19 shows final training (test) accuracy as a function of variations in NVM and NN parameters. NN performance is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent RESETs. A mini-batch of size 1 allows weight updates to be applied immediately. However, as mentioned earlier, nonlinearity and asymmetry in G-response limit the maximum possible accuracy (here, to ~85%), and require precise tuning of the learning rate and neuron-response (f). Too low a learning rate and no weights receive any updates; too high, and the imperfections in the NVM response generate chaos.
NN performance with NVM-based synapses offers high accuracy if G-response is linear and symmetric (Fig. 20, green curve) rather than nonlinear (red curve). Asymmetry in G-response (blue curve) strongly degrades performance. While the asymmetric G-response of PCM makes it necessary to occasionally stop training, measure all conductances, and apply RESETs and iterative SETs, energy usage can be reasonable if RESETs are infrequent (Fig. 21, inset), and learning rate is low (Fig. 21).
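Because every weight change is a physical pulse, training energy reduces to counting pulses. A back-of-envelope sketch, using the per-pulse energies assumed in Fig. 21 for highly-scaled PCM and deliberately ignoring the iterative-SET pulses that follow each RESET (as well as read energy), is:

```python
def training_energy_mj(n_set_pulses, n_resets, e_set_pj=3.0, e_reset_pj=30.0):
    """Back-of-envelope training energy in mJ: pulse counts times the
    per-pulse energies assumed in Fig. 21 for highly-scaled PCM (3 pJ
    partial-SET, 30 pJ RESET).  Iterative-SET and read energy omitted."""
    total_pj = n_set_pulses * e_set_pj + n_resets * e_reset_pj
    return total_pj * 1e-9  # pJ -> mJ

# Example with the pulse counts logged during our experiment (Fig. 17:
# ~31 million partial-SET pulses, ~708k RESETs):
estimate = training_energy_mj(31e6, 708e3)
```

Since higher learning rates make more synapses fire pulses on every example, this pulse-count view explains why Fig. 21 shows training energy growing with learning rate even for an ideal bi-directional NVM with no RESET at all.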
Conclusions
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron with 164,885 synapses was trained with back-
Fig. 21 Despite the higher power involved in RESET rather than partial-SET (30pJ and 3pJ for highly-scaled PCM [1]), total energy costs of training can be minimized if RESETs are sufficiently infrequent (inset). Low-energy training requires low learning rates, which minimizes the number of synaptic programming pulses. At higher learning rates, even a bi-directional, linear RRAM requiring no RESET and offering low power (1pJ per pulse) can lead to large training energy.
Table of conclusions
- Large 3-layer network with 2 PCM devices/synapse (Figs. 1-3, 5); back-propagation weight update rule compatible with crossbar array (Figs. 6-8).
- Moderately high accuracy (82%) achieved on MNIST handwritten digit recognition with two training epochs (Fig. 16).
- NVM models identified issues for training: conductance bounds, nonlinearity, and asymmetry must be considered (Figs. 9-10, 20).
- PCM response and asymmetry mitigated by RESET strategy, mapping of response, and choice of update pulse (Figs. 11, 12, 17).
- Model of PCM allows well-matched simulation of experiment (Figs. 16-18); variation of network parameters allows tolerancing (Fig. 19).
- NN is resilient to NVM variations (Figs. 19a-e) and RESET strategy (Figs. 19f-g), but sensitive to learning rate and neuron response function (Figs. 19i-j).
- Bidirectional NVM with no special RESET strategy requires a scheme for symmetric response to achieve good performance (Fig. 20).
- For PCM, keeping RESET frequency down and learning rate above the freeze-out threshold allows reasonable training energy (Fig. 21).
Fig. 22 NNs built with NVM-based synapses tend to be highly sensitive to gradient effects (nonlinearity and asymmetry in G-response) that steer all synaptic weights towards either high or low values, yet are highly resilient to random effects (NVM variability, yield, and stochasticity).
propagation on a subset (5,000 examples) of the MNIST database of handwritten digits to high accuracy (82.2% training, 82.9% on the test set). A weight-update rule compatible with NVM+selector crossbar arrays was developed; the G-diamond concept illustrates issues created by nonlinearity and asymmetry in NVM conductance response. Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing was performed (Fig. 22). NVM-based NNs are highly resilient to random effects (NVM variability, yield, and stochasticity), but highly sensitive to gradient effects that act to steer all synaptic weights. A learning rate just high enough to avoid network freeze-out is shown to be advantageous for both high accuracy and low training energy.
References
[1] B. Jackson et al., ACM J. Emerg. Tech. Comput. Syst., 9(2), 12 (2013).
[2] M. Suri et al., IEDM Tech. Digest, 4.4 (2011).
[3] D. Rumelhart et al., Parallel Distributed Processing, MIT Press (1986).
[4] Y. LeCun et al., Proc. IEEE, 86(11), 2278 (1998).
[5] G. Hinton et al., Science, 313(5786), 504 (2006).
[6] M. Breitwisch et al., VLSI Tech. Symp., T6B-3 (2007).
[7] B. Rajendran et al., IEEE Trans. Elect. Dev., 60(1), 246 (2013).