Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element
G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927-1512, E-mail: [email protected]
Pohang University of Science and Technology, Pohang, South Korea
Abstract
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron network with 164,885 synapses is trained on a subset (5,000 examples) of the MNIST database of handwritten digits using a backpropagation variant suitable for NVM+selector crossbar arrays, obtaining a training (generalization) accuracy of 82.2% (82.9%). Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing is performed with respect to NVM variability, yield, and the stochasticity, linearity and asymmetry of NVM-conductance response.
Introduction
Dense arrays of nonvolatile memory (NVM) and selector device-pairs (Fig. 1) can implement neuro-inspired non-Von Neumann computing [1,2], using pairs [2] of NVM devices as programmable (plastic) bipolar synapses. Work to date has emphasized the Spike-Timing-Dependent-Plasticity (STDP) algorithm [1,2], motivated by synaptic measurements in real brains, yet experimental NVM demonstrations have been limited in size (≤100 synapses).
Unlike STDP, backpropagation [3] is a widely-used, well-studied NN algorithm, offering benchmark-able performance on datasets such as handwritten digits (MNIST) [4]. In forward evaluation of a multilayer perceptron, each layer's inputs (xi) drive the next layer's neurons
Fig. 1 Neuro-inspired non-Von Neumann computing [1,2], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs.
Fig. 2 In forward evaluation of a multilayer perceptron, each layer's neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by pixels from successive MNIST images (cropped to 22×24); the 10 output neurons identify which digit was presented.
through weights wij and a nonlinearity f() (Fig. 2). Supervised learning occurs (Fig. 3) by back-propagating error terms δj to adjust each weight wij. A 3-layer network is capable of accuracies, on previously unseen test images (generalization), of up to 97% [4] (Fig. 4); even higher accuracy is possible by first pre-training the weights in each layer [5]. Like STDP, low-power neurons should be achievable by emphasizing brief spikes [7] and local-only clocking.
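The forward evaluation just described can be sketched in a few lines. Only the 528-250-125-10 topology is taken from Fig. 2; the tanh nonlinearity, the random stand-in weights, and the `forward` helper are our illustrative assumptions (the paper does not specify f()):

```python
import numpy as np

def f(x):
    # Neuron nonlinearity; assumed tanh purely for illustration.
    return np.tanh(x)

def forward(x, weights):
    """Forward evaluation: each layer's outputs x_i drive the next
    layer's neurons, x_j = f(sum_i x_i * w_ij)."""
    for W in weights:
        x = f(x @ W)
    return x

rng = np.random.default_rng(0)
# 528-250-125-10 topology from Fig. 2 (22x24 cropped MNIST pixels in,
# 10 digit classes out); random weights stand in for trained ones.
sizes = [528, 250, 125, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
out = forward(rng.random(528), weights)
digit = int(np.argmax(out))  # the most strongly driven output neuron
```

In the crossbar of Fig. 5, the matrix-vector product `x @ W` is what the analog read operation computes in parallel.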
Considerations for a crossbar implementation
By encoding synaptic weight in the conductance difference between paired NVMs, wij = G+ − G− [2], forward propagation simply compares total read signal on bitlines (Fig. 5). However, backpropagation [3] calls for weight updates Δwij ∝ xi·δj (Fig. 6), requiring upstream (i) and downstream (j) neurons to exchange information for each synapse. In a crossbar, learning becomes much more efficient when neurons modify weights in parallel, by firing pulses whose overlap at the various NVM devices implements training [1] (Fig. 7). Fig. 8 shows, using a simulation of the NN in Figs. 2-3, that this adaptation for NVM implementation has no effect on accuracy.
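The pulse-overlap idea can be sketched as follows. The 4-timeslot limit and the sqrt-of-learning-rate scaling on both the xi and δj sides come from Fig. 7; the exact quantization, the overlap count min(nx, nd) standing in for the product, and the assumption that activations xi are non-negative are our illustration:

```python
import numpy as np

def to_pulses(v, max_pulses=4):
    # Quantize a magnitude into an integer number of pulse timeslots
    # (Fig. 7 uses up to 4 timeslots, A-D).
    return np.minimum(np.round(np.abs(v)).astype(int), max_pulses)

def crossbar_update(x, delta, eta, max_pulses=4):
    """Crossbar-compatible update in the spirit of Fig. 7: upstream
    neurons fire pulses scaled by sqrt(eta)*x_i, downstream neurons by
    sqrt(eta)*|delta_j|, so the count of overlapping timeslots at
    synapse (i, j) approximates eta * x_i * delta_j.  Activations x_i
    are assumed non-negative, so the update sign follows delta_j."""
    nx = to_pulses(np.sqrt(eta) * x * max_pulses, max_pulses)
    nd = to_pulses(np.sqrt(eta) * np.abs(delta) * max_pulses, max_pulses)
    overlap = np.minimum(nx[:, None], nd[None, :])  # coincident pulses per device
    return overlap * np.sign(delta)[None, :]        # signed update, in pulse units
```

Pulses where the returned update is positive would target G+, and where negative, G−; no synapse-by-synapse communication between the i and j neurons is needed.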
However, the conductance response of any real NVM device exhibits imperfections that could still decidedly affect NN performance, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices at low or high conductance (Fig. 9). This paper explores the relative importance of each of these factors.
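A minimal model of such an imperfect device might look like the following; all parameter values (maximum conductance, step size, nonlinearity, noise) are placeholders, not the fitted values of Fig. 17:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def apply_set_pulse(G, Gmax=20.0, dG0=1.0, nonlin=0.15, sigma=0.3):
    """One partial-SET pulse applied to a model NVM conductance G (uS).
    Illustrative imperfections:
      - nonlinearity: the conductance step shrinks as G grows,
      - stochasticity: Gaussian spread sigma on each step,
      - bounded response: G is clipped to [0, Gmax]."""
    dG = dG0 * math.exp(-nonlin * G)   # smaller steps at high G
    dG += rng.normal(0.0, sigma)       # pulse-to-pulse variation
    return min(max(G + dG, 0.0), Gmax)
```

Setting sigma to zero isolates the nonlinearity: the step at G = 0 is larger than the step at G = 10, which is exactly the property that, per Fig. 11, prevents alternating weight updates from cancelling.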
Fig. 3 In supervised learning, error terms δj are back-propagated, adjusting each weight wij to minimize an energy function by gradient descent, reducing classification error between computed (xl) and desired (gl) output vectors.
Fig. 4 A 3-layer perceptron network can classify previously unseen (test) MNIST handwritten digits with up to 97% accuracy [4]. Training on a subset of the images sacrifices some generalization accuracy but speeds up training.
Fig. 5 By comparing total read signal between pairs of bitlines, summation of synaptic weights (encoded as conductance differences, wij = G+ − G−) is highly parallel.
Fig. 6 Back-propagation calls for each weight to be updated by Δwij = η·xi·δj, where η is the learning rate. Colormap shows log(occurrences), in the 1st layer, during NN training (blue curve, Fig. 4); white contours identify the quantized increase in the integer weight.
Fig. 7 In a crossbar, efficient learning requires neurons to update weights in parallel, firing pulses whose overlap at the various NVM devices implements training.
Fig. 8 Computer NN simulations show that a crossbar-compatible weight-update rule (Fig. 7) is just as effective as the conventional update rule (Fig. 6).
While bounding G values reduces NN training accuracy slightly (Fig. 10), unidirectionality and nonlinearity in the G-response strongly degrade accuracy. Figure insets (Fig. 10) map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G− (weight is vertical position). In this context (Fig. 11), a synapse with a highly asymmetric G-response moves only unidirectionally, from left to right. Once one G is saturated, subsequent training can only increase the other G value, reducing weight magnitude, deleting trained information, and degrading accuracy. Nonlinearity in G-response further encourages weights of low value (Fig. 11), which can lead to network freeze-out (no weight changes, Fig. 10 inset). One solution to the highly asymmetric response of PCM devices is occasional RESET [2], moving synapses back to the left edge of the G-diamond while preserving weight value (with iterative SETs, Fig. 12 inset).
Fig. 9 The conductance response of an NVM device exhibits imperfections, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices (at low or high G).
Fig. 10 Bounding G values reduces NN training accuracy slightly, but unidirectionality and nonlinearity in G-response strongly degrade accuracy. Figure insets map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G− (weight is vertical position) for a sampled subset of the weights.
Fig. 11 If G values can only be increased (asymmetric G-response), a synapse at point A (G+ saturated) can only increase G−, leading to a low weight value (B). If response at small G values differs from that at large G (nonlinear G-response), alternating weight updates can no longer cancel. As synapses tend to get herded into the same portion of the G-diamond (C → D), the decrease in average weight can lead to network freeze-out.
However, if this is not done frequently enough, weight stagnation will degrade NN accuracy (Fig. 12).
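The refresh operation can be sketched as follows; the 80%-of-Gmax trigger and the exact restoration of the weight are our idealizations (in hardware the iterative SET preserves the weight only to some accuracy):

```python
def refresh(Gp, Gm, Gmax=20.0, threshold=0.8):
    """Occasional-RESET refresh (idea of Fig. 12): once either device
    nears saturation, RESET both conductances and re-program only the
    one carrying the sign of the weight, preserving w = G+ - G- while
    moving the synapse back to the left edge of the G-diamond.
    Threshold and perfect re-programming are illustrative idealizations."""
    if max(Gp, Gm) < threshold * Gmax:
        return Gp, Gm            # far from saturation: leave untouched
    w = Gp - Gm                  # weight value to preserve
    # RESET both devices, then iterative SETs rebuild |w| on one of them
    return (w, 0.0) if w >= 0 else (0.0, -w)
```

After the refresh the synapse can again respond to updates of both signs, which is why accuracy in Fig. 12 degrades when refreshes are too infrequent.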
Experimental results
We implemented a 3-layer perceptron of 164,885 synapses (Figs. 2-3) on a 500 × 661 array of mushroom-cell [6], 1T1R PCM devices (180nm node, Fig. 13). While the update algorithm (Fig. 7) is fully compatible with a crossbar implementation, our hardware allows only sequential access to each PCM device (Fig. 14). For read, a sense amplifier measures G values and thus weights for the software-based neurons, mimicking column- and row-based integrations. Weights are increased (decreased) by identical partial-SET pulses (Fig. 7) to increase G+ (G−) (Fig. 15). The deviation
Fig. 12 Synapses with large conductance values (inset, right edge of G-diamond) can be refreshed (moved left) while preserving the weight (to some accuracy), by RESETs to both G followed by a partial SET of one. If such RESETs are too infrequent, weight evolution stagnates and NN accuracy degrades.
Fig. 13 Mushroom-cell [6], 1T1R PCM devices (180nm node) with 2 metal interconnect layers enable 512×1024 arrays.
Fig. 14 A 1-bit sense amplifier measures G values, passing the data to software-based neurons. Conductances are increased by identical 25ns partial-SET pulses to increase G+ (G−) (Fig. 7), or by RESETs to both G followed by an iterative SET procedure (Fig. 12).
[Fig. 15 flowchart: our implementation interleaves hardware (PCM array) and software (neuron) steps, mirroring an eventual full-hardware crossbar implementation:
1. Write random weights to PCM G values (randomize weights).
2. Present next MNIST example.
3. Forward propagation: measure I+ − I− ∝ Σ xi·wij and compute f(Σ xi·wij), repeating for each NN layer, left to right (in full hardware, the nonlinearity resides in the hardware neuron).
4. Backpropagation: compute δj corrections, repeating for each NN layer, right to left.
5. Compute pulses to apply (Fig. 7) to each synapse (G+ or G−); both upstream (xi) and downstream (δj) neurons apply pulses, programming the synapse.
6. After each mini-batch (5 examples), apply pulses to PCM G values; after 20 examples, 2 RESETs (+1 incremental SET) for synapses with large G values.]
Fig. 15 Although G values are measured sequentially, weight summation and weight update procedures in our software-based neurons closely mimic the column (and row) integrations and pulse-overlap programming needed for parallel operations across a crossbar array. However, since occasional RESET is triggered when both G+ and G− are large, serial device access is required to obtain and then re-program individual conductances.
from a true crossbar implementation occurs upon occasional RESET (Fig. 12), triggered when either G+ or G− is large, thus requiring both knowledge of and control over individual G values.
Fig. 16 shows measured accuracies for a hardware-synapse NN, with all weight operations taking place on PCM devices. To reduce test time, weight updates for each mini-batch of 5 MNIST examples were applied together. Fig. 17 plots measured G-response, stochasticity, variability, stuck-ON pixel rate, and RESET accuracy. By matching all parameters, including stochasticity (Fig. 18), to those measured during the experiment, a NN computer simulation can
Fig. 16 Training and test accuracy for a 3-layer perceptron of 164,885 hardware-synapses, with all weight operations taking place on a 500 × 661 array of mushroom-cell [6] PCM devices (Fig. 13). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment.
Fig. 17 50-point cumulative distributions of experimentally measured conductances for the 500 × 661 PCM array, showing variability and stuck-ON pixel rate. Insets show the measured RESET accuracy, and the rate and stochasticity of G-response, plotted as a colormap of ΔG-per-pulse vs. G.
Fig. 18 Fitted G-response vs. # of pulses (blue: average, red: ±1σ responses), obtained from our computer model (inset) for the rate and stochasticity of G-response (ΔG-per-pulse vs. G) matched to experiment (Fig. 17).
precisely reproduce the measured accuracy trends (Fig. 16).
Tolerancing and power considerations
We use this matched NN simulation to explore the importance
Fig. 19 Matched simulations show that an NVM-based NN is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent or inaccurate RESETs. Mini-batch size = 1 avoids the need to accumulate weight updates before applying them. However, the nonlinear and asymmetric G-response limits accuracy to ~85%, and requires learning-rate and neuron-response (f) to be precisely tuned.
Fig. 20 NN performance is improved if G-response is linear and symmetric (green curve) rather than nonlinear (red). However, asymmetry between the up- and down-going G-responses (blue), if not corrected in the weight-update rule (Fig. 7), can strongly degrade performance by favoring particular regions of the G-diamond (Figs. 10, 11).
of NVM imperfections. Fig. 19 shows final training (test) accuracy as a function of variations in NVM and NN parameters. NN performance is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent RESETs. A mini-batch of size 1 allows weight updates to be applied immediately. However, as mentioned earlier, nonlinearity and asymmetry in G-response limit the maximum possible accuracy (here, to ~85%), and require precise tuning of the learning rate and neuron-response (f). Too low a learning rate and no weights receive any updates; too high, and the imperfections in the NVM response generate chaos.
NN performance with NVM-based synapses offers high accuracy if G-response is linear and symmetric (Fig. 20, green curve) rather than nonlinear (red curve). Asymmetry in G-response (blue curve) strongly degrades performance. While the asymmetric G-response of PCM makes it necessary to occasionally stop training, measure all conductances, and apply RESETs and iterative SETs, energy usage can be reasonable if RESETs are infrequent (Fig. 21, inset), and learning rate is low (Fig. 21).
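Because every weight change is a physical pulse, training energy reduces to counting pulses. A back-of-envelope sketch, using the per-pulse energies assumed in Fig. 21 for highly-scaled PCM and deliberately ignoring the iterative-SET pulses that follow each RESET (as well as read energy), is:

```python
def training_energy_mj(n_set_pulses, n_resets, e_set_pj=3.0, e_reset_pj=30.0):
    """Back-of-envelope training energy in mJ: pulse counts times the
    per-pulse energies assumed in Fig. 21 for highly-scaled PCM (3 pJ
    partial-SET, 30 pJ RESET).  Iterative-SET and read energy omitted."""
    total_pj = n_set_pulses * e_set_pj + n_resets * e_reset_pj
    return total_pj * 1e-9  # pJ -> mJ

# Example with the pulse counts logged during our experiment (Fig. 17:
# ~31 million partial-SET pulses, ~708k RESETs):
estimate = training_energy_mj(31e6, 708e3)
```

Since higher learning rates make more synapses fire pulses on every example, this pulse-count view explains why Fig. 21 shows training energy growing with learning rate even for an ideal bi-directional NVM with no RESET at all.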
Conclusions
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron with 164,885 synapses was trained with back-
Fig. 21 Despite the higher power involved in RESET rather than partial-SET (30pJ and 3pJ for highly-scaled PCM [1]), total energy costs of training can be minimized if RESETs are sufficiently infrequent (inset). Low-energy training requires low learning rates, which minimizes the number of synaptic programming pulses. At higher learning rates, even a bi-directional, linear RRAM requiring no RESET and offering low power (1pJ per pulse) can lead to large training energy.
Table of conclusions
- Large 3-layer network with 2 PCM devices/synapse (Figs. 1-3, 5); back-propagation weight update rule compatible with crossbar array (Figs. 6-8).
- Moderately high accuracy (82%) achieved on MNIST handwritten digit recognition with two training epochs (Fig. 16).
- NVM models identified issues for training: conductance bounds, nonlinearity, and asymmetry must be considered (Figs. 9-10, 20).
- PCM response and asymmetry mitigated by RESET strategy, mapping of response, and choice of update pulse (Figs. 11, 12, 17).
- Model of PCM allows well-matched simulation of experiment (Figs. 16-18); variation of network parameters allows tolerancing (Fig. 19).
- NN is resilient to NVM variations (Figs. 19a-e) and RESET strategy (Figs. 19f-g), but sensitive to learning rate and neuron response function (Figs. 19i-j).
- Bidirectional NVM with no special RESET strategy requires a scheme for symmetric response to achieve good performance (Fig. 20).
- For PCM, keeping RESET frequency down and learning rate above the freeze-out threshold allows reasonable training energy (Fig. 21).
Fig. 22 NNs built with NVM-based synapses tend to be highly sensitive to gradient effects (nonlinearity and asymmetry in G-response) that steer all synaptic weights towards either high or low values, yet are highly resilient to random effects (NVM variability, yield, and stochasticity).
propagation on a subset (5,000 examples) of the MNIST database of handwritten digits to high accuracy (82.2% training, 82.9% on the test set). A weight-update rule compatible with NVM+selector crossbar arrays was developed; the G-diamond concept illustrates issues created by nonlinearity and asymmetry in NVM conductance response. Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing was performed (Fig. 22). NVM-based NNs are highly resilient to random effects (NVM variability, yield, and stochasticity), but highly sensitive to gradient effects that act to steer all synaptic weights. A learning rate just high enough to avoid network freeze-out is shown to be advantageous for both high accuracy and low training energy.
References
[1] B. Jackson et al., ACM J. Emerg. Tech. Comput. Syst., 9(2), 12 (2013).
[2] M. Suri et al., IEDM Tech. Digest, 4.4 (2011).
[3] D. Rumelhart et al., Parallel Distributed Processing, MIT Press (1986).
[4] Y. LeCun et al., Proc. IEEE, 86(11), 2278 (1998).
[5] G. Hinton et al., Science, 313(5786), 504 (2006).
[6] M. Breitwisch et al., VLSI Tech. Symp., T6B-3 (2007).
[7] B. Rajendran et al., IEEE Trans. Elect. Dev., 60(1), 246 (2013).