
Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element

G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang

IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927-1512, E-mail: [email protected]
Pohang University of Science and Technology, Pohang, South Korea

Abstract
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron network with 164,885 synapses is trained on a subset (5000 examples) of the MNIST database of handwritten digits using a backpropagation variant suitable for NVM+selector crossbar arrays, obtaining a training (generalization) accuracy of 82.2% (82.9%). Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing is performed with respect to NVM variability, yield, and the stochasticity, linearity and asymmetry of NVM-conductance response.

Introduction
Dense arrays of nonvolatile memory (NVM) and selector device-pairs (Fig. 1) can implement neuro-inspired non-Von Neumann computing [1,2], using pairs [2] of NVM devices as programmable (plastic) bipolar synapses. Work to date has emphasized the Spike-Timing-Dependent-Plasticity (STDP) algorithm [1,2], motivated by synaptic measurements in real brains, yet experimental NVM demonstrations have been limited in size (100 synapses).

Fig. 1 Neuro-inspired non-Von Neumann computing [1,2], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs.

Fig. 2 In forward evaluation of a multilayer perceptron, each layer's neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by pixels from successive MNIST images (cropped to 22x24); the 10 output neurons identify which digit was presented. [Network: 528 input, 250 and 125 hidden, 10 output neurons; xj = f(Σi xi·wij).]

Unlike STDP, backpropagation [3] is a widely-used, well-studied NN, offering benchmark-able performance on datasets such as handwritten digits (MNIST) [4]. In forward evaluation of a multilayer perceptron, each layer's inputs (xi) drive the next layer's neurons through weights wij and a nonlinearity f() (Fig. 2). Supervised learning occurs (Fig. 3) by back-propagating error terms δj to adjust each weight wij. A 3-layer network is capable of accuracies, on previously unseen 'test' images (generalization), of 97% [4] (Fig. 4); even higher accuracy is possible by first "pre-training" the weights in each layer [5]. Like STDP, low-power neurons should be achievable by emphasizing brief spikes [7] and local-only clocking.
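As a concrete illustration of the forward evaluation and back-propagation just described, the sketch below implements the 528-250-125-10 perceptron of Figs. 2 and 3 in NumPy. The tanh nonlinearity, the weight-initialization range, and all identifiers are illustrative assumptions; only the layer sizes and the update rule Δwij = η·xi·δj come from the paper.

```python
import numpy as np

layer_sizes = [528, 250, 125, 10]                 # layer sizes from Figs. 2 and 3
rng = np.random.default_rng(0)
W = [rng.uniform(-0.1, 0.1, (m, n))               # small random initial weights (assumed range)
     for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Forward evaluation: x_j = f(sum_i x_i * w_ij), with f = tanh (an assumed choice)."""
    acts = [x]
    for w in W:
        acts.append(np.tanh(acts[-1] @ w))
    return acts                                   # activations of every layer

def backprop_deltas(acts, g):
    """Back-propagate error terms delta_j from the output layer toward the input."""
    d = [(acts[-1] - g) * (1.0 - acts[-1] ** 2)]  # output error times f'(.)
    for w, x in zip(reversed(W[1:]), reversed(acts[1:-1])):
        d.insert(0, (d[0] @ w.T) * (1.0 - x ** 2))
    return d                                      # one delta vector per non-input layer

def train_step(x, g, eta=0.01):
    """Conventional update rule (Fig. 6): each weight changes by eta * x_i * delta_j."""
    acts = forward(x)
    deltas = backprop_deltas(acts, g)
    for k, w in enumerate(W):
        w -= eta * np.outer(acts[k], deltas[k])   # gradient-descent step
    return acts[-1]                               # network output for this example
```

Here x would hold the 528 cropped pixel values of one MNIST example and g the 10-element target vector marking the correct digit.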

Considerations for a crossbar implementation
By encoding synaptic weight in the conductance difference between paired NVMs, wij = G+ - G- [2], forward propagation simply compares total read signal on bitlines (Fig. 5). However, backpropagation [3] calls for weight updates Δwij ∝ xi·δj (Fig. 6), requiring upstream (i) and downstream (j) neurons to exchange information for each synapse. In a crossbar, learning becomes much more efficient when neurons modify weights in parallel, by firing pulses whose overlap at the various NVM devices implements training [1] (Fig. 7). Fig. 8 shows, using a simulation of the NN in Figs. 2,3, that this adaptation for NVM implementation has no effect on accuracy.
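The following sketch illustrates, under a simplified software model, the two crossbar-friendly operations described above: a forward read that compares summed bitline signals of the paired G+ and G- columns (Fig. 5), and a parallel update in which upstream and downstream neurons each fire a small number of pulses (scaled by the square root of the learning rate, as in Fig. 7) so that their overlap programs the appropriate device. The 4-slot quantization, the fixed conductance step, and the clipping ranges are assumptions, not the hardware's exact procedure.

```python
import numpy as np

def read_layer(x, G_plus, G_minus):
    """Forward read (Fig. 5): each bitline integrates sum_i x_i * G_ij; the signal seen
    by downstream neuron j is the difference of its paired (+) and (-) bitlines."""
    return x @ G_plus - x @ G_minus               # proportional to sum_i x_i * (G+ - G-)

def crossbar_update(x, delta, G_plus, G_minus, eta=0.001,
                    n_slots=4, dG=0.5e-6, G_max=20e-6):
    """Pulse-overlap update (Fig. 7): upstream neurons fire 0..n_slots pulses set by
    sqrt(eta)*|x_i|, downstream neurons by sqrt(eta)*|delta_j|; a device is programmed
    once per timeslot in which both of its neurons fire."""
    up = np.rint(np.clip(np.sqrt(eta) * np.abs(x), 0, 1) * n_slots)        # pulses per row
    down = np.rint(np.clip(np.sqrt(eta) * np.abs(delta), 0, 1) * n_slots)  # pulses per column
    overlap = np.minimum.outer(up, down)          # pulses seen by each synapse
    sign = -np.outer(np.sign(x), np.sign(delta))  # sign of the desired weight change
    G_plus += np.where(sign > 0, overlap * dG, 0.0)    # positive updates potentiate G+
    G_minus += np.where(sign < 0, overlap * dG, 0.0)   # negative updates potentiate G-
    np.clip(G_plus, 0.0, G_max, out=G_plus)
    np.clip(G_minus, 0.0, G_max, out=G_minus)
```

In this sketch each pulse simply adds a fixed conductance step dG; the behavioral model sketched after the next paragraph replaces that with a nonlinear, stochastic, bounded response.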

However, the conductance response of any real NVM device exhibits imperfections that could still decidedly affect NN performance, including nonlinearity, stochasticity, varying maxima, asymmetry between increasing/decreasing responses, and non-responsive devices at low or high conductance (Fig. 9). This paper explores the relative importance of each of these factors.
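These imperfections can be captured in a small behavioral model of the conductance array, applied once per partial-SET pulse, which is the spirit in which the matched simulator described later treats each device. The saturating step, the parameter values, and the names below are assumptions chosen only to exhibit nonlinearity, stochasticity, varying maxima, and dead or stuck-ON devices.

```python
import numpy as np

rng = np.random.default_rng(1)

class NVMArray:
    """Toy behavioral model of a conductance array with the imperfections of Fig. 9."""

    def __init__(self, shape, g_max_mean=20e-6, g_max_spread=0.2,
                 dead_frac=0.01, stuck_frac=0.01, noise=0.3):
        self.G = np.zeros(shape)
        # varying maxima: device-to-device spread of the saturation conductance
        self.G_max = g_max_mean * (1 + g_max_spread * rng.standard_normal(shape))
        self.dead = rng.random(shape) < dead_frac      # never respond (stay near 0)
        self.stuck = rng.random(shape) < stuck_frac    # stuck-ON at their maximum
        self.noise = noise                             # relative stochasticity per pulse
        self.G[self.stuck] = self.G_max[self.stuck]

    def pulse(self, mask):
        """One partial-SET pulse applied to the devices selected by boolean `mask`.
        Nonlinearity: the step shrinks as G approaches G_max (saturating response).
        Stochasticity: each step is scaled by a random factor."""
        step = 0.08 * (self.G_max - self.G)
        step *= 1 + self.noise * rng.standard_normal(self.G.shape)
        step = np.where(self.dead | self.stuck, 0.0, step)
        self.G = np.clip(self.G + np.where(mask, step, 0.0), 0.0, self.G_max)
```

Asymmetry enters because this model, like a PCM partial-SET, only increases G; lowering a weight means potentiating the paired device, or invoking the RESET refresh discussed below.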

Fig. 3 In supervised learning, error terms δj are back-propagated, adjusting each weight wij to minimize an "energy" function by gradient descent, reducing classification error between computed (xl) and desired output vectors (gl). [Δwij = η·xi·δj, where η is the learning rate.]

Fig. 4 A 3-layer perceptron network can classify previously unseen ('test') MNIST handwritten digits with up to 97% accuracy [4]. Training on a subset of the images sacrifices some generalization accuracy but speeds up training. [Simulated test accuracy vs. training epoch, trained on all 60,000 MNIST images vs. only the first 5,000; during each epoch, each training image is presented once.]

Fig. 5 By comparing total read signal between pairs of bitlines, summation of synaptic weights (encoded as conductance differences, wij = G+ - G-) is highly parallel.

Fig. 6 Back-propagation calls for each weight to be updated by Δwij = η·xi·δj, where η is the learning rate. Colormap shows log(occurrences) of (xi, δj) pairs in the 1st layer during NN training (blue curve, Fig. 4); white contours identify the quantized increase in the integer weight.

Fig. 7 In a crossbar, efficient learning requires neurons to update weights in parallel, firing pulses whose overlap at the various NVM devices implements training. [Upstream neurons fire 0-4 pulses in timeslots A-D according to xi, downstream neurons according to δj; the square root of the learning rate is used to scale both xi and δj.]

Fig. 8 Computer NN simulations show that a crossbar-compatible weight-update rule (Fig. 7) is just as effective as the conventional update rule (Fig. 6).

While bounding G values reduces NN training accuracy slightly (Fig. 10), unidirectionality and nonlinearity in the G-response strongly degrade accuracy. Figure insets (Fig. 10) map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position). In this context (Fig. 11), a synapse with a highly asymmetric G-response moves only unidirectionally, from left-to-right. Once one G is saturated, subsequent training can only increase the other G value, reducing weight magnitude, deleting trained information, and degrading accuracy. Nonlinearity in G-response further encourages weights of low value (Fig. 11), which can lead to network freeze-out (no weight changes, Fig. 10 inset). One solution to the highly asymmetric response of PCM devices is occasional RESET [2], moving synapses back to the left edge of the G-diamond while preserving weight value (with iterative SETs, Fig. 12 inset).

Fig. 9 The conductance response of an NVM device exhibits imperfections, including nonlinearity (G-per-pulse at low G differs from G-per-pulse at high G), stochasticity (erroneous weight updates are possible), varying Gmax, asymmetry between increasing and decreasing responses (e.g., abrupt RESET in PCM), and non-responsive devices (dead, or stuck-ON) at low or high G.

Fig. 10 Bounding G values reduces NN training accuracy slightly, but unidirectionality and nonlinearity in G-response strongly degrade accuracy. Figure insets map NVM-pair synapse states on a diamond-shaped plot of G+ vs. G- (weight is vertical position) for a sampled subset of the weights. [Simulated accuracy vs. training epoch for linear unbounded; linear, bounded, bi-directional; linear, bounded, uni-directional; and nonlinear, bounded, uni-directional G-responses, starting from a random initial weight distribution.]

Fig. 11 If G values can only be increased (asymmetric G-response), a synapse at point A (G+ saturated) can only increase G-, leading to a low weight value (B). If the response at small G values differs from that at large G (nonlinear G-response), alternating weight updates can no longer cancel. As synapses tend to get herded into the same portion of the G-diamond (C→D), the decrease in average weight can lead to network freeze-out.

However, if this is not done frequently enough, weight stagnation will degrade NN accuracy (Fig. 12).
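The occasional-RESET refresh of Figs. 11-12 can be sketched for a single conductance pair as follows: once either device nears saturation, both are RESET and the surviving weight is rebuilt with an iterative sequence of partial-SET pulses on one device. The threshold, tolerance, and helper names (refresh_pair, set_pulse) are hypothetical choices for illustration.

```python
def refresh_pair(g_plus, g_minus, set_pulse, g_threshold=15e-6, tol=0.5e-6, max_pulses=20):
    """Weight-preserving refresh (Fig. 12) for one synapse.

    g_plus, g_minus : current conductances of the pair (in siemens)
    set_pulse(g)    : returns the conductance after one partial-SET pulse (device model)
    """
    if max(g_plus, g_minus) < g_threshold:
        return g_plus, g_minus                   # still far from the diamond's right edge
    target = g_plus - g_minus                    # weight to preserve (w = G+ - G-)
    g_plus, g_minus = 0.0, 0.0                   # RESET both devices
    # iterative SET: pulse whichever device must carry the surviving weight
    for _ in range(max_pulses):
        w = g_plus - g_minus
        if abs(w - target) < tol:
            break
        if target > w:
            g_plus = set_pulse(g_plus)
        else:
            g_minus = set_pulse(g_minus)
    return g_plus, g_minus

# usage with an assumed simple device model: each pulse adds ~1 uS, saturating at 20 uS
example = refresh_pair(18e-6, 5e-6, set_pulse=lambda g: min(g + 1e-6, 20e-6))
```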

Experimental results
We implemented a 3-layer perceptron of 164,885 synapses (Figs. 2,3) on a 500 x 661 array of mushroom-cell [6], 1T1R PCM devices (180nm node, Fig. 13). While the update algorithm (Fig. 7) is fully compatible with a crossbar implementation, our hardware allows only sequential access to each PCM device (Fig. 14). For read, a sense amplifier measures G values and thus weights for the software-based neurons, mimicking column- and row-based integrations. Weights are increased (decreased) by identical partial-SET pulses (Fig. 7) to increase G+ (G-) (Fig. 15). The deviation from a true crossbar implementation occurs upon occasional RESET (Fig. 12), triggered when either G+ or G- is large, thus requiring both knowledge of and control over individual G values.
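Since the hardware allows only sequential access, the software neurons first harvest all conductances and then work with ordinary weight matrices; a minimal sketch of that read step is below. The device-addressing callback read_G and the adjacent-column pairing of G+ and G- are layout assumptions for illustration, not details from the paper.

```python
import numpy as np

def read_weights(read_G, rows, cols):
    """Sequentially read every PCM device and form weights wij = G+ - G- in software.

    read_G(r, c) returns the measured conductance of the device at (row r, column c);
    pairing G+ and G- on adjacent columns assumes an even column count.
    """
    G = np.array([[read_G(r, c) for c in range(cols)] for r in range(rows)])
    return G[:, 0::2] - G[:, 1::2]   # adjacent columns paired into (G+, G-) weights
```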

Fig. 12 Synapses with large conductance values (inset, right edge of the G-diamond) can be refreshed (moved left) while preserving the weight (to some accuracy), by RESETs to both G followed by a partial SET of one. If such RESETs are too infrequent, weight evolution stagnates and NN accuracy degrades. [Simulated training- and test-set accuracy vs. # of MNIST examples between RESETs.]

Fig. 13 Mushroom-cell [6], 1T1R PCM devices (180nm node) with 2 metal interconnect layers enable 512x1024 arrays.

Fig. 14 A 1-bit sense amplifier measures G values, passing the data to software-based neurons. Conductances are increased by identical 25ns partial-SET pulses to increase G+ (G-) (Fig. 7), or by RESETs to both G followed by an iterative SET procedure (Fig. 12).

Fig. 15 Although G values are measured sequentially, weight summation and weight update procedures in our software-based neurons closely mimic the column (and row) integrations and pulse-overlap programming needed for parallel operations across a crossbar array. However, since occasional RESET is triggered when both G+ and G- are large, serial device access is required to obtain and then re-program individual conductances. [Flowchart comparing our hardware/software implementation with an eventual full-hardware crossbar: forward propagation measures I+ - I- ∝ Σ xi·wij for each layer; backpropagation computes δj corrections and the pulses to apply to each synapse (G+ or G-); pulses are applied to the PCM after each mini-batch of 5 examples, and every 20 examples 2 RESETs (+1 incremental SET) are applied to synapses with large G values.]

Fig. 16 shows measured accuracies for a hardware-synapse NN, with all weight operations taking place on PCM devices. To reduce test time, weight updates for each mini-batch of 5 MNIST examples were applied together. Fig. 17 plots measured G-response, stochasticity, variability, stuck-ON pixel rate, and RESET accuracy. By matching all parameters, including stochasticity (Fig. 18), to those measured during the experiment, a NN computer simulation can precisely reproduce the measured accuracy trends (Fig. 16).
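A sketch of the mini-batch bookkeeping used in the experiment is shown below: desired weight changes are accumulated in software over 5 examples and only then translated into programming pulses for the PCM array. The conversion of an accumulated update into a pulse count (w_per_pulse) and the interface name program_pulses are assumptions; the hardware's actual pulse selection follows Fig. 7.

```python
import numpy as np

class MiniBatchProgrammer:
    """Accumulate software weight updates, then program the PCM array in bursts."""

    def __init__(self, shape, batch_size=5, w_per_pulse=1.0):
        self.acc = np.zeros(shape)        # accumulated desired weight change
        self.batch_size = batch_size
        self.w_per_pulse = w_per_pulse    # assumed weight change produced by one pulse
        self.seen = 0

    def accumulate(self, x, delta, eta):
        """Add the update from one example (same sign convention as the earlier sketch)."""
        self.acc -= eta * np.outer(x, delta)
        self.seen += 1
        return self.seen % self.batch_size == 0   # True when it is time to program

    def flush(self, program_pulses):
        """Convert accumulated updates to pulse counts and send them to hardware.
        program_pulses(n_plus, n_minus) applies n partial-SET pulses to G+ / G-."""
        n = np.rint(np.abs(self.acc) / self.w_per_pulse).astype(int)
        program_pulses(np.where(self.acc > 0, n, 0), np.where(self.acc < 0, n, 0))
        self.acc[:] = 0.0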

Fig. 16 Training and test accuracy for a 3-layer perceptron of 164,885 hardware synapses, with all weight operations taking place on a 500 x 661 array of mushroom-cell [6] PCM devices (Fig. 13). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment. [500 x 661 PCM = (2 PCM/synapse x 164,885 synapses) + 730 unused PCM; inset: map of final PCM conductances.]

Fig. 17 50-point cumulative distributions of experimentally measured conductances for the 500 x 661 PCM array (330,500 PCM devices), showing variability and stuck-ON pixel rate (5% of devices stuck-ON). Insets show the measured RESET accuracy (weight after vs. before RESET, from all 708k RESETs), and the rate and stochasticity of G-response (from all 31 million partial-SET pulses), plotted as a colormap of G-per-pulse vs. G.

Fig. 18 Fitted G-response vs. # of pulses (blue: average, red: ±1σ responses), obtained from our computer model (inset) for the rate and stochasticity of G-response (G-per-pulse vs. G) matched to experiment (Fig. 17).

Tolerancing and power considerations
We use this matched NN simulation to explore the importance of NVM imperfections.

Fig. 19 Matched simulations show that an NVM-based NN is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent or inaccurate RESETs. Mini-batch size = 1 avoids the need to accumulate weight updates before applying them. However, the nonlinear and asymmetric G-response limits accuracy to ~85%, and requires the learning rate and neuron response (f) to be precisely tuned. [Training- and test-set accuracy after 20 epochs vs.: a) relative stochasticity; b) relative G-transition slope; c) relative max-G variability; d) percentage of dead conductances; e) percentage of stuck-ON conductances; f) # of MNIST examples between RESETs; g) RESET accuracy [%]; h) mini-batch size; i) learning rate; j) relative slope of neuron response f.]

Fig. 20 NN performance is improved if G-response is linear and symmetric (green curve) rather than nonlinear (red). However, asymmetry between the up- and down-going G-responses (blue), if not corrected in the weight-update rule (Fig. 7), can strongly degrade performance by favoring particular regions of the G-diamond (Figs. 10, 11).

Fig. 19 shows final training (test) accuracy as a function of variations in NVM and NN parameters. NN performance is highly robust to stochasticity, variable maxima, the presence of non-responsive devices, and infrequent RESETs. A mini-batch of size 1 allows weight updates to be applied immediately. However, as mentioned earlier, nonlinearity and asymmetry in G-response limit the maximum possible accuracy (here, to ~85%), and require precise tuning of the learning rate and neuron response (f). Too low a learning rate and no weights receive any updates; too high, and the imperfections in the NVM response generate chaos.
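The tolerancing of Fig. 19 amounts to sweeping one device or network parameter at a time around the matched baseline and recording the final accuracies. A schematic of such a sweep is below; train_and_test stands in for the full matched simulation and is supplied by the caller, and the parameter names and ranges are illustrative rather than the exact axes of Fig. 19.

```python
def tolerancing_sweep(train_and_test, baseline, sweeps):
    """Vary one parameter at a time around the matched baseline (Fig. 19).

    train_and_test(params) -> (train_acc, test_acc) is the matched NN simulation,
    supplied by the caller; it is not reproduced here.
    """
    results = {}
    for name, values in sweeps.items():
        for value in values:
            params = dict(baseline, **{name: value})
            results[(name, value)] = train_and_test(params)
    return results

# illustrative axes, loosely following the panels of Fig. 19
baseline = {"stochasticity": 1.0, "gmax_variability": 1.0, "dead_fraction": 0.01,
            "stuck_on_fraction": 0.01, "examples_between_resets": 100,
            "mini_batch_size": 5, "learning_rate": 1.0, "neuron_slope": 1.0}
sweeps = {"stochasticity": [0.1, 1, 10, 100],
          "learning_rate": [0.1, 0.3, 1, 3, 10],
          "examples_between_resets": [10, 100, 1000, 10000]}
```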

NN performance with NVM-based synapses offers high accuracy if G-response is linear and symmetric (Fig. 20, green curve) rather than nonlinear (red curve). Asymmetry in G-response (blue curve) strongly degrades performance. While the asymmetric G-response of PCM makes it necessary to occasionally stop training, measure all conductances, and apply RESETs and iterative SETs, energy usage can be reasonable if RESETs are infrequent (Fig. 21, inset) and the learning rate is low (Fig. 21).
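The energy comparison of Fig. 21 reduces to counting programming pulses: with ~3pJ per partial-SET and ~30pJ per RESET for highly-scaled PCM [1], training energy depends on how many SET pulses the learning rate produces and how often the RESET refresh runs. A minimal accounting sketch, using the pulse counts quoted in Fig. 17 purely as an example, is below.

```python
def training_energy_mJ(n_set_pulses, n_reset_pulses,
                       e_set_pJ=3.0, e_reset_pJ=30.0):
    """Estimate total programming energy in mJ from pulse counts.
    3 pJ / 30 pJ per SET / RESET are the highly-scaled-PCM values quoted in Fig. 21 [1]."""
    return (n_set_pulses * e_set_pJ + n_reset_pulses * e_reset_pJ) * 1e-9  # pJ -> mJ

# Example with the counts reported for the experiment (Fig. 17): ~31 million
# partial-SET pulses and ~708k RESETs over training.
print(training_energy_mJ(31e6, 708e3))   # ~0.11 mJ under these assumed per-pulse energies
```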

Conclusions
Using 2 phase-change memory (PCM) devices per synapse, a 3-layer perceptron with 164,885 synapses was trained with back-propagation on a subset (5000 examples) of the MNIST database of handwritten digits to an accuracy of 82.2% (82.9% on the test set).

Fig. 21 Despite the higher power involved in RESET rather than partial-SET (~30pJ and ~3pJ for highly-scaled PCM [1]), total energy costs of training can be minimized if RESETs are sufficiently infrequent (inset). Low-energy training requires low learning rates, which minimizes the number of synaptic programming pulses. At higher learning rates, even a bi-directional, linear RRAM requiring no RESET and offering low power (~1pJ per pulse) can lead to large training energy.

Table of conclusions (Fig. 22):
Large 3-layer network with 2 PCM devices/synapse (Figs. 1-3, 5); back-propagation weight-update rule compatible with crossbar arrays (Figs. 6-8) -> Moderately high accuracy (82%) achieved on MNIST handwritten digit recognition with two training epochs (Fig. 16).
NVM models identified issues for training: conductance bounds, nonlinearity, and asymmetry must be considered (Figs. 9-10, 20) -> PCM response and asymmetry mitigated by RESET strategy, mapping of response, and choice of update pulse (Figs. 11, 12, 17).
Model of PCM allows well-matched simulation of the experiment (Figs. 16-18); variation of network parameters allows tolerancing (Fig. 19) -> NN is resilient to NVM variations (Figs. 19a-e) and RESET strategy (Figs. 19f-g), but sensitive to learning rate and neuron response function (Figs. 19i-j).
Bidirectional NVM with no special RESET strategy and good performance requires a scheme for symmetric response (Fig. 20) -> For PCM, keeping RESET frequency down and learning rate above the freeze-out threshold allows reasonable training energy (Fig. 21).

Fig. 22 NNs built with NVM-based synapses tend to be highly sensitive to gradient effects (nonlinearity and asymmetry in G-response) that steer all synaptic weights towards either high or low values, yet are highly resilient to random effects (NVM variability, yield, and stochasticity).

A weight-update rule compatible with NVM+selector crossbar arrays was developed; the G-diamond concept illustrates issues created by nonlinearity and asymmetry in NVM conductance response. Using a neural network (NN) simulator matched to the experimental demonstrator, extensive tolerancing was performed (Fig. 22). NVM-based NNs are highly resilient to random effects (NVM variability, yield, and stochasticity), but highly sensitive to gradient effects that act to steer all synaptic weights. A learning rate just high enough to avoid network freeze-out is shown to be advantageous for both high accuracy and low training energy.

References
[1] B. Jackson et al., ACM J. Emerg. Tech. Comput. Syst., 9(2), 12 (2013).
[2] M. Suri et al., IEDM Tech. Digest, 4.4 (2011).
[3] D. Rumelhart et al., Parallel Distributed Processing, MIT Press (1986).
[4] Y. LeCun et al., Proc. IEEE, 86(11), 2278 (1998).
[5] G. Hinton et al., Science, 313(5786), 504 (2006).
[6] M. Breitwisch et al., VLSI Tech. Symp., T6B-3 (2007).
[7] B. Rajendran et al., IEEE Trans. Elect. Dev., 60(1), 246 (2013).
