
Page 1: On-line Learning Architectures


On-line Learning Architectures

Tarek M. Taha

Electrical and Computer Engineering Department University of Dayton

March 2017

Page 2: On-line Learning Architectures


Device SPICE Model

C. Yakopcic, T. M. Taha, G. Subramanyam, and R. E. Pino, "Memristor SPICE Model and Crossbar Simulation Based on Devices with Nanosecond Switching Time," IEEE International Joint Conference on Neural Networks (IJCNN), August 2013. [BEST PAPER AWARD]

[Figure: I-V characteristics of memristor devices from the Univ. of Michigan, Boise State, and HP Labs, overlaying the SPICE model output on actual device measurements; axes are voltage (V) vs. current (mA or µA).]
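The device behavior behind these fits can be sketched compactly: many memristor SPICE models, including the cited one, combine a sinh-shaped I-V with a voltage-thresholded state variable. Below is a minimal illustrative version of that idea, not the published model; every parameter value is invented for demonstration.

```python
import numpy as np

# Illustrative parameters (NOT fitted to any real device): A1/A2 scale
# the conductance, B sets the I-V nonlinearity, VP/VN are the positive
# and negative switching thresholds, ALPHA the state-change rate.
A1, A2, B = 0.17, 0.17, 1.2
VP, VN = 1.0, 0.6
ALPHA = 1e3

def current(v, x):
    """sinh-shaped I-V, scaled by the internal state x in [0, 1]."""
    return (A1 if v >= 0 else A2) * x * np.sinh(B * v)

def step_state(v, x, dt):
    """Move x toward 1 above +VP and toward 0 below -VN; between the
    thresholds the state is frozen (non-volatile behavior)."""
    if v > VP:
        x += ALPHA * (v - VP) * (1 - x) * dt
    elif v < -VN:
        x += ALPHA * (v + VN) * x * dt
    return min(max(x, 0.0), 1.0)

# Sweep a sinusoidal voltage and record the resulting current trace.
t = np.linspace(0, 1, 2001)
v = 1.5 * np.sin(2 * np.pi * t)
x, hysteresis = 0.1, []
for vi in v:
    hysteresis.append(current(vi, x))
    x = step_state(vi, x, t[1] - t[0])
```

Plotting `hysteresis` against `v` yields the pinched hysteresis loop characteristic of measured memristor curves, since the state differs between the rising and falling halves of the sweep.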

Page 3: On-line Learning Architectures


Memristor Backpropagation Circuit

Online training via backpropagation circuits: the same memristor crossbar implements both the forward and the backward pass.

[Figure: Forward-pass and backward-pass configurations of a two-layer memristor backpropagation circuit. Crossbar outputs feed training units that subtract targets from outputs to form the error terms δL2,1 and δL2,2, which are driven back through the crossbar to produce the layer-1 errors δL1,1 … δL1,6.]
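In software terms, the circuit computes a matrix-vector product on the forward pass and the transposed product on the backward pass using the same stored weights. A minimal NumPy sketch of that weight reuse follows; the 6-3-2 layer sizes, sigmoid activations, and learning rate are illustrative assumptions, not taken from the circuit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per crossbar: 6 inputs -> 3 hidden -> 2 outputs.
W1 = rng.normal(0.0, 0.5, (3, 6))
W2 = rng.normal(0.0, 0.5, (2, 3))

def train_step(x, target, lr=0.5):
    global W1, W2
    # Forward pass: read each crossbar as a matrix-vector product.
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # Backward pass: the SAME W2, driven from the output side, forms
    # W2.T @ delta2 -- the transposed product the crossbar provides.
    delta2 = (y - target) * y * (1 - y)
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # Weight updates are outer products, applied as conductance changes.
    W2 -= lr * np.outer(delta2, h)
    W1 -= lr * np.outer(delta1, x)
    return float(np.sum((y - target) ** 2))

x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
target = np.array([1.0, 0.0])
losses = [train_step(x, target) for _ in range(200)]
```

The squared error recorded in `losses` falls over the 200 steps, mirroring the per-sample on-line training the circuit performs.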

Page 4: On-line Learning Architectures


Socrates: Online Training Architecture

Multicore processing system that implements backpropagation-based online training.

Two versions: 1) a memristor core and 2) a fully digital core.

Uses 3D-stacked eDRAM to store training data.

[Figure: Grid of neural cores (NC) connected by routers (R) beneath a 3D eDRAM stack; each core is either a memristor learning core or a digital learning core.]

Page 5: On-line Learning Architectures


FPGA Implementation

NC = Neural Core, R = Router

[Figure: 4x4 grid of neural cores (NC) and routers (R); each router is a switch fabric with an input port, an output port, and an 8-bit routing bus. Also shown: actual images into and out of the FPGA-based multicore neural processor.]

We verified the system functionality by implementing the digital system on an FPGA.

Page 6: On-line Learning Architectures


Training Efficiency

Accuracy (%)   GPU    Digital   Memristor
MNIST          99     97        97
COIL-20        100    98        97
COIL-100       99     99        99

Training compared to a GTX980Ti GPU:

Energy efficiency: Digital 5900x, Memristor 70000x

Speedup: Digital 14x, Memristor 7x (batch processing was used on the GPU but not on the specialized processors)

Page 7: On-line Learning Architectures


Inference Efficiency

Inference compared to a GTX980Ti GPU:

Energy efficiency: Digital 7000x, Memristor 358000x

Speedup: Digital 29x, Memristor 60x

The memristor learning core is about 35x smaller in area: Digital 0.52 mm2, Memristor 0.014 mm2.

Page 8: On-line Learning Architectures


Large Crossbar Simulation

[Figure: Potential across each memristor in a 300x300 crossbar as computed by SPICE and by the fast MATLAB approximation; the two outputs match closely. Inset: the simulated read circuit, with 1 V applied to the crossbar rows and op-amp columns converting the currents through feedback resistors Rf.]

Studying training on memristor crossbars needs to be done in SPICE to account for circuit parasitics. However, simulating one cycle of a large crossbar can take over 24 hours in SPICE; since training requires thousands of iterations, full training runs are nearly impossible to simulate this way.

We developed a fast (~30 ms) algorithm that approximates the SPICE output with over 99% accuracy, enabling us to examine learning on memristor crossbars.
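One structural fact a fast approximation can exploit: with every column pinned to a virtual ground by its op-amp, the crossbar's node equations decouple row by row. The sketch below is our own illustrative reconstruction of that idea, not the published algorithm; it models row-wire resistance only and assumes ideal column virtual grounds.

```python
import numpy as np

def ideal_currents(V, G):
    """Parasitic-free crossbar read: I_j = sum_i V_i * G_ij."""
    return V @ G

def crossbar_currents(V, G, Rw=1.0):
    """Column output currents with a row-wire resistance Rw between
    adjacent cells. With each column held at a virtual ground, every
    row becomes an independent resistive ladder, so the solve is m
    small tridiagonal systems rather than one huge SPICE netlist."""
    m, n = G.shape
    gw = 1.0 / Rw
    I = np.zeros(n)
    for i in range(m):
        A = np.zeros((n, n))
        b = np.zeros(n)
        for k in range(n):
            # Node k connects to the previous node (or the row driver)
            # through Rw, to the next node through Rw, and to the
            # column virtual ground through its memristor G[i, k].
            A[k, k] = G[i, k] + gw + (gw if k < n - 1 else 0.0)
            if k < n - 1:
                A[k, k + 1] = A[k + 1, k] = -gw
        b[0] = gw * V[i]           # the row driver feeds node 0 via Rw
        u = np.linalg.solve(A, b)  # node voltages along row i
        I += G[i] * u              # current each cell injects into its column
    return I

rng = np.random.default_rng(1)
m = n = 32                                # small stand-in for 300x300
G = rng.uniform(1e-4, 1e-3, (m, n))       # memristor conductances (S)
V = np.ones(m)                            # 1 V read voltage on every row
I_ideal = ideal_currents(V, G)
I_fast = crossbar_currents(V, G, Rw=1.0)  # 1 ohm per wire segment
```

With realistic wire resistance the computed currents fall slightly below the ideal values, which is exactly the parasitic effect that makes SPICE-level verification necessary in the first place.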

Page 9: On-line Learning Architectures


Memristor for On-Line Learning

On-line learning requires “highly tunable” or “slow” memristors. This does not affect inference speed.

[Figure: A memory memristor switches fully between Ron and Roff with a single 10 ns pulse, while an on-line learning memristor traverses the Ron-Roff range gradually (e.g., 10 ns pulses over a 500 ns total switching time), leaving many intermediate states for weight tuning.]
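The contrast between the two device types can be made concrete with a toy pulse-programming model; conductance is normalized here, and the 50-pulse switching figure for the "learning" device is an illustrative assumption.

```python
# Normalized conductance: 0.0 = fully OFF (Roff), 1.0 = fully ON (Ron).

def apply_pulses(n_pulses, pulses_to_switch):
    """Conductance after n identical SET pulses, assuming (for
    illustration) that each pulse moves the device a fixed fraction
    of the OFF-to-ON range."""
    return min(n_pulses / pulses_to_switch, 1.0)

# A "memory" memristor switches completely in a single pulse, so only
# two levels are reachable. A hypothetical "learning" memristor that
# needs ~50 pulses to switch exposes many intermediate conductance
# levels -- the fine-grained weight updates that training needs.
memory_levels = {apply_pulses(n, 1) for n in range(100)}
learning_levels = {apply_pulses(n, 50) for n in range(100)}
```

The memory-style device yields only two distinct levels, while the slow device yields 51, which is why on-line learning favors "slow" memristors even though inference speed is unaffected.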

Page 10: On-line Learning Architectures


LiNbO3 Memristor for Online Learning

The device is quite slow, with switching times on the order of tens of milliseconds. This makes it useful for backpropagation training.

[Figure: Measured I-V characteristic of the LiNbO3 memristor over a ±3 V sweep (current in mA).]

Approx. Switching Time   Switching Voltage (V)   Minimum Pulse Width Required
10 ns                    6                       1 ps
10 ns                    4.5                     10 ps
100 ns                   4.5                     100 ps
1 µs                     4.5                     1 ns
10 µs                    4.5                     10 ns

Switching speed of devices and the minimum programming pulse needed in neuromorphic training circuits.

[Figure: Device current vs. programming pulse number, showing gradual, incremental conductance change over repeated pulses (left: ~2000 pulses; right: ~100 pulses).]

Page 11: On-line Learning Architectures


Unsupervised Clustering

[Figure: Before-training and after-training scatter plots for three datasets: Wisconsin breast cancer (Benign vs. Malignant), Iris (Setosa, Versicolor, Virginica), and Wine (Class 1-3). After training, the classes separate into distinct clusters. Inset: a 6-input autoencoder with a 3-neuron hidden layer (Layer L1) producing reconstructions x1'-x6' (Layer L2).]

We extended the backpropagation circuit to implement autoencoder-based unsupervised learning in memristor crossbars. These plots show the circuit learning to cluster data in an unsupervised manner.
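The autoencoder variant reuses the backpropagation machinery with the input itself as the training target. A minimal software sketch of that idea follows; the 6-3-6 sizes mirror the diagram, while the activations, learning rate, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder/decoder crossbars mirroring the 6-3-6 diagram on the slide.
W1 = rng.normal(0.0, 0.5, (3, 6))
W2 = rng.normal(0.0, 0.5, (6, 3))

def reconstruct(x):
    return sigmoid(W2 @ sigmoid(W1 @ x))

def recon_err(D):
    """Mean squared reconstruction error over a dataset."""
    return float(np.mean([np.sum((x - reconstruct(x)) ** 2) for x in D]))

def train_step(x, lr=0.5):
    global W1, W2
    h = sigmoid(W1 @ x)
    x_hat = sigmoid(W2 @ h)
    d2 = (x_hat - x) * x_hat * (1 - x_hat)  # target is the input itself
    d1 = (W2.T @ d2) * h * (1 - h)
    W2 -= lr * np.outer(d2, h)
    W1 -= lr * np.outer(d1, x)

data = rng.integers(0, 2, (4, 6)).astype(float)  # four toy patterns
err_before = recon_err(data)
for _ in range(1000):
    for x in data:
        train_step(x)
err_after = recon_err(data)
```

After training, `err_after` is well below `err_before`: patterns the network has learned reconstruct with small error, which is the property the clustering plots and the anomaly detector on the next slide both rely on.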

Page 12: On-line Learning Architectures


Application: Cybersecurity

[Figure: Distance between input and reconstruction, plotted per packet for normal and anomalous traffic, and the resulting detection rate (%) vs. false detection (%) curve.]

The autoencoder-based unsupervised training circuits were used to learn normal network packets.

When the memristor circuits were presented with anomalous network packets, these were detected with 96.6% accuracy and a 4% false-positive rate.
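The detection rule itself is just a threshold on the reconstruction distance. The sketch below uses synthetic stand-in distances; the distributions and threshold are invented for illustration, and the 96.6%/4% figures above come from the actual experiments, not from this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for the reconstruction distances a trained autoencoder
# produces: normal traffic reconstructs well (small distance); traffic
# unlike anything seen during training does not. Both distributions
# are invented for illustration.
normal_dist = np.clip(rng.normal(0.15, 0.05, 3000), 0, None)
anomalous_dist = np.clip(rng.normal(0.70, 0.20, 2500), 0, None)

def detect(threshold):
    """Flag packets whose reconstruction distance exceeds the
    threshold; return (detection rate, false-positive rate)."""
    detection_rate = float(np.mean(anomalous_dist > threshold))
    false_positive_rate = float(np.mean(normal_dist > threshold))
    return detection_rate, false_positive_rate

tpr, fpr = detect(0.30)
```

Sweeping the threshold trades detection rate against false positives, which is exactly what the detection-rate vs. false-detection curve on this slide plots.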

Rule-based malicious packet detection (similar to SNORT) could also be implemented in a very compact circuit, e.g. matching the rule:

Regex: user\s[^\n]{10}
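The rule above can be exercised with any standard regex engine; for instance, in Python (the sample payloads are invented for illustration):

```python
import re

# The signature fragment from the slide: "user", one whitespace
# character, then exactly ten non-newline characters.
rule = re.compile(r"user\s[^\n]{10}")

# Hypothetical payloads, invented for illustration.
payloads = [
    "user anonymous\r\n",   # "anonymous\r" is 10 non-newline chars
    "user 0123456789",      # exactly 10 characters after the space
    "user abc",             # too short: no match
]
matches = [bool(rule.search(p)) for p in payloads]   # [True, True, False]
```

Note that `\s` also matches `\r` and `\n`, and `[^\n]` admits `\r`, so the rule can match across a carriage return but never past a line feed.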

Page 13: On-line Learning Architectures


Application: Autonomous Agent for UAVs

Cognitively Enhanced Complex Event Processing (CECEP) architecture, geared toward autonomous decision making in a variety of applications, including autonomous UAVs.

Exploring implementation on:
• IBM TrueNorth processor
• Memristor circuits
• High-performance clusters: CPUs and GPUs

[Figure: CECEP block diagram with I/O adapters and event output streams, mapped onto the candidate compute substrates, including memristor systems and the IBM TrueNorth processor.]

Page 14: On-line Learning Architectures


Parallel Cognitive Systems Laboratory
http://taha-lab.org

Research Engineers:
• Raqib Hasan, PhD
• Chris Yakopcic, PhD
• Wei Song, PhD
• Tanvir Atahary, PhD

Doctoral Students:
• Zahangir Alom
• Rasitha Fernando
• Ted Josue
• Will Mitchell
• Yangjie Qi
• Nayim Rahman
• Stefan Westberg
• Ayesha Zaman
• Shuo Zhang

Master’s Students:
• Omar Faruq
• Tom Mealy

Sponsors: