Machine Learning: Neural Networks (slides from Domingos, Pardo, others)

Page 1:

Machine Learning

Neural Networks

(slides from Domingos, Pardo, others)

Page 2:

Reading

• For this week,

– Chapter 4: Neural Networks (Mitchell, 1997)

– See Canvas

• For subsequent weeks:

– Scaling Learning Algorithms toward AI

– Learning Deep Architectures for AI

Page 3:

Human Brain

Page 4:

Neurons

Page 5:

Input-Output Transformation

[Figure: input spikes produce excitatory post-synaptic potentials; the neuron's output is a spike (a brief pulse)]

Page 6:

Human Learning

• Number of neurons: ~10^11

• Connections per neuron: ~10^3 to 10^5

• Neuron switching time: ~0.001 second

• Scene recognition time: ~0.1 second

100 inference steps doesn't seem like much

Page 7:

Machine Learning Abstraction

Page 8:

Artificial Neural Networks

• Typically, machine learning ANNs are very artificial, ignoring:
– Time
– Space
– Biological learning processes

• More realistic neural models exist
– Hodgkin & Huxley (1952) won a Nobel prize for theirs (in 1963)

• Nonetheless, very artificial ANNs have been useful in many ML applications

Page 9:

Perceptrons

• The “first wave” in neural networks

• Big in the 1960s

– McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962)

Page 10:

Perceptrons

• Problem def:

– Let f be a target function from X = <x1, x2, …>, where each x_i ∈ {0, 1}, to y ∈ {0, 1}

– Given training data {(X1, y1), (X2, y2), …}

• Learn h(X), an approximation of f(X)

Page 11:

A single perceptron

Output: o = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0

[Figure: a single perceptron with inputs x_1 … x_5, weights w_1 … w_5, and a bias input x_0 = 1 (always) with weight w_0]

Page 12:

Logical Operators

Each unit computes: output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0

AND: w_0 = -0.8, w_1 = 0.5, w_2 = 0.5

OR: w_0 = -0.3, w_1 = 0.5, w_2 = 0.5

NOT: w_0 = 0.1, w_1 = -1.0
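To make these weight settings concrete, here is a minimal sketch (not from the slides; the helper name `perceptron` is ours) that evaluates the threshold unit with the AND, OR, and NOT weights above:

```python
def perceptron(weights, inputs):
    """Threshold unit: 1 if w0*1 + sum_i w_i*x_i > 0, else 0 (weights[0] is the bias weight)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else 0

AND_W = [-0.8, 0.5, 0.5]
OR_W = [-0.3, 0.5, 0.5]
NOT_W = [0.1, -1.0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", perceptron(AND_W, [x1, x2]), "OR:", perceptron(OR_W, [x1, x2]))
for x1 in (0, 1):
    print(x1, "NOT:", perceptron(NOT_W, [x1]))
```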

Page 13:

Learning Weights

• Perceptron Training Rule

• Gradient Descent

• (other approaches: Genetic Algorithms)

[Figure: a threshold unit (output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0) with inputs x_0, x_1, x_2 and unknown weights to be learned]

Page 14:

Perceptron Training Rule

• Weights modified for each training example

• Update Rule:

w_i \leftarrow w_i + \Delta w_i

\Delta w_i = \eta (t - o) x_i

where \eta is the learning rate, t the target value, o the perceptron output, and x_i the input value
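A minimal sketch of the update rule above (our own illustrative code, not the slides'), assuming a 0/1 threshold output and a fixed number of passes over the training data:

```python
def train_perceptron(examples, n_inputs, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, applied per example."""
    w = [0.0] * (n_inputs + 1)           # w[0] is the bias weight (bias input x0 = 1, always)
    for _ in range(epochs):
        for x, t in examples:            # x: list of inputs, t: target output (0 or 1)
            total = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if total > 0 else 0
            w[0] += eta * (t - o)        # bias input is 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# Learn NOT (the worked example on the next slide): (input, target) pairs
print(train_perceptron([([0], 1), ([1], 0)], n_inputs=1))
```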

Page 15:

Perceptron Training for NOT

Update rule: w_i \leftarrow w_i + \Delta w_i, with \Delta w_i = \eta (t - o) x_i

Unit: output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0 (bias input x_0 and input x_1, with weights w_0 and w_1)

Initialize: w_0 = 0, w_1 = 0

[Table: Start / Work / End columns for the weights w_1 and w_0]

Page 16:

What weights make XOR?

• No combination of weights works

• Perceptrons can only represent linearly separable functions

[Figure: a single threshold unit (output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0) with inputs x_0, x_1, x_2 and unknown weights; no choice of weights computes XOR]

Page 17:

Linear Separability

[Figure: OR in the (x_1, x_2) plane; a single line separates the positive and negative examples]

Page 18:

Linear Separability

[Figure: AND in the (x_1, x_2) plane; again linearly separable]

Page 19:

Linear Separability

[Figure: XOR in the (x_1, x_2) plane; no single line separates the two classes]

Page 20:

Perceptron Training Rule

• Converges to the correct classification IF

– Cases are linearly separable

– Learning rate is slow enough

– Proved by Minsky and Papert in 1969

Their analysis of perceptrons' limits killed widespread interest until the 1980s

Page 21:

XOR

[Figure: XOR built from a two-layer network of threshold units (each: output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0); the slide labels edge weights of 0.6, 1, and -0.6, with bias weights of 0]

Page 22:

What’s wrong with perceptrons?

• You can always plug multiple perceptrons together to calculate any function.

• BUT… who decides what the weights are?

– Assigning error to the upstream ("parent") inputs becomes a problem…

Page 23:

Perceptrons use a step function

output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0

[Figure: a perceptron with inputs x_0, x_1, x_2 and unknown weights; the perceptron threshold is a step function of the weighted sum]

• Small changes in inputs -> either no change or large change in output.

Page 24:

Solution: Differentiable Function

output = \sum_{i=0}^{n} w_i x_i

[Figure: the same unit with inputs x_0, x_1, x_2, but with a simple linear output instead of a threshold]

• Varying any input a little creates a perceptible change in the output

• We can now characterize how the error changes with each w_i, even in the multi-layer case

Page 25:

Measuring error for linear units

• Output Function

• Error Measure:

o(\vec{x}) = \vec{w} \cdot \vec{x}

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the training data, t_d is the target value for example d, and o_d is the linear unit's output on example d

Page 26:

Gradient Descent

Gradient: \nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule: \Delta \vec{w} = -\eta \nabla E(\vec{w}), i.e., \Delta w_i = -\eta \frac{\partial E}{\partial w_i}

Page 27:

Gradient Descent Rule

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{i,d})

Update Rule:

w_i \leftarrow w_i + \eta \sum_{d \in D} (t_d - o_d) x_{i,d}
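As a concrete illustration of this update rule (a sketch under our own naming, not code from the slides), batch gradient descent for a single linear unit:

```python
def gradient_descent_linear(examples, n_inputs, eta=0.05, epochs=200):
    """Batch gradient descent for a linear unit o = w . x (with bias input x0 = 1)."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)                      # accumulate eta * (t_d - o_d) * x_{i,d} over D
        for x, t in examples:
            xb = [1.0] + list(x)                    # prepend bias input
            o = sum(wi * xi for wi, xi in zip(w, xb))
            for i, xi in enumerate(xb):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]   # one update per pass over all of D
    return w

# Least-squares fit of a linear unit to the OR targets (a regression, not a classifier)
print(gradient_descent_linear([([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)], n_inputs=2))
```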

Page 28:

Gradient Descent for Multiple Layers

[Figure: a multi-layer network of linear units (each computing \sum_{i=0}^{n} w_i x_i) with inputs x_0, x_1, x_2, attempting XOR]

For any weight w_{ij} in the network, we can compute the gradient \frac{\partial E}{\partial w_{ij}}.

Page 29:

Gradient Descent vs. Perceptrons

• Perceptron Rule & Threshold Units
– Learner converges on an answer ONLY IF data is linearly separable
– Can't assign proper error to parent nodes

• Gradient Descent
– (Locally) minimizes error even if examples are not linearly separable
– Works for multi-layer networks
• But… linear units only make linear decision surfaces (can't learn XOR even with many layers)
– And the step function isn't differentiable…

Page 30:

A compromise function

• Perceptron: output = 1 if \sum_{i=0}^{n} w_i x_i > 0, else 0

• Linear: output = net = \sum_{i=0}^{n} w_i x_i

• Sigmoid (Logistic): output = \sigma(net) = \frac{1}{1 + e^{-net}}
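The three choices side by side, as a small illustrative sketch (not from the slides):

```python
import math

def step(net):
    """Perceptron: threshold the weighted sum (not differentiable)."""
    return 1 if net > 0 else 0

def linear(net):
    """Linear unit: output the weighted sum itself (differentiable, but only linear surfaces)."""
    return net

def sigmoid(net):
    """Sigmoid (logistic): smooth, differentiable squashing of the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

net = 0.3   # an example weighted sum, sum_i w_i * x_i
print(step(net), linear(net), sigmoid(net))
```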

Page 31:

The sigmoid (logistic) unit

• Has differentiable function

– Allows gradient descent

• Can be used to learn non-linear functions

\sigma(net) = \frac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}

[Figure: a sigmoid unit with inputs x_1, x_2 and unlabeled weights]

Page 32:

Logistic function

[Figure: a single logistic unit. Inputs (independent variables): Age = 34, Gender = 1, Stage = 4. Coefficients: 0.5, 0.8, 0.4. Output (prediction): 0.6, the "probability of being alive". The unit computes \frac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}]

Page 33:

Neural Network Model

[Figure: a neural network model. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. These feed a hidden layer of sigmoid units through one set of weights, and the hidden layer feeds the output through another set; the output (dependent variable / prediction) is 0.6, the "probability of being alive"]

Page 34:

Getting an answer from a NN

[Figure: the same network, tracing the weighted inputs through one hidden unit on the way to the output 0.6]

Page 35:

Getting an answer from a NN

[Figure: the same network, tracing the weighted inputs through a second hidden unit]

Page 36:

Getting an answer from a NN

[Figure: the same network, combining the hidden-unit outputs to produce the final prediction 0.6]

Page 37:

Minimizing the Error

[Figure: error surface as a function of a weight w. Gradient descent moves from the initial error at w_initial to the final error at w_trained; a negative derivative produces a positive change in w, and the search can settle in a local minimum]

Page 38:

Differentiability is key!

• Sigmoid is easy to differentiate

• For gradient descent on multiple layers, a little dynamic programming can help:

– Compute errors at each output node

– Use these to compute errors at each hidden node

– Use these to compute weight gradient

\frac{d\,\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))

Page 39:

The Backpropagation Algorithm

For each training example <x, t>:

1. Input the instance x to the network and compute the output o_u of every unit u in the network.

2. For each output unit k, calculate its error term \delta_k:
\delta_k = o_k (1 - o_k)(t_k - o_k)

3. For each hidden unit h, calculate its error term \delta_h:
\delta_h = o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k

4. Update each network weight w_{ji}:
w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, where \Delta w_{ji} = \eta \, \delta_j \, x_{ji}
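A compact sketch of one backpropagation update for a network with a single sigmoid hidden layer, following the four steps above. This is illustrative code (the function names and the tiny XOR demo are ours), not an implementation from the slides:

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One update. x, t: input and target lists. Each weight row starts with a bias weight."""
    xb = [1.0] + x                                             # bias input x0 = 1
    # 1. Forward pass: compute the output of every unit
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W_out]
    # 2. Output-unit error terms: delta_k = o_k (1 - o_k)(t_k - o_k)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # 3. Hidden-unit error terms: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
    delta_h = [h[j] * (1 - h[j]) *
               sum(W_out[k][j + 1] * delta_o[k] for k in range(len(o)))
               for j in range(len(h))]
    # 4. Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(W_out):
        for i in range(len(row)):
            row[i] += eta * delta_o[k] * hb[i]
    for j, row in enumerate(W_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * xb[i]
    return o                                                   # outputs before this update

# Tiny demo: learn XOR with 2 hidden units (may need more epochs or another
# seed if gradient descent lands in a local minimum)
random.seed(0)
W_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for _ in range(5000):
    for x, t in data:
        backprop_step(x, t, W_hidden, W_out)
for x, t in data:
    print(x, round(backprop_step(x, t, W_hidden, W_out)[0], 2))
```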

Page 40:

Learning Weights

[Figure: the same network as before (inputs Age = 34, Gender = 1, Stage = 4; a sigmoid hidden layer; output 0.6, the "probability of being alive"); the labeled weights are what backpropagation learns]

Page 41:

The fine print

• Don’t implement back-propagation

– Use a package

– Second-order or variable step-size optimization techniques exist

• Feature normalization

– Typical to normalize inputs to lie in [0,1]

• (and outputs must be normalized)

• Problems with NN training:

– Slow training times (though, getting better)

– Local minima

Page 42:

Minimizing the Error

[Figure: the same error-surface picture as before; gradient descent can settle in a local minimum rather than the global one]

Page 43:

Expressive Power of ANNs

• Universal Function Approximator:

– Given enough hidden units, can approximate any continuous function f

• Need 2+ hidden units to learn XOR

• Why not use millions of hidden units?

– Efficiency (training is slow)

– Overfitting

Page 44:

Overfitting

[Figure: the real distribution vs. an overfitted model]

Page 45:

Combating Overfitting in Neural Nets

• Many techniques

• Two popular ones:

– Early Stopping (most popular)

• Use “a lot” of hidden units

• Just don’t over-train

– Cross-validation

• Test different architectures to choose “right” number of hidden units

Page 46:

Early Stopping

[Figure: error vs. training epochs for the training set (b, error_b) and a held-out validation set (a, error_a). Training error keeps falling while validation error reaches a minimum and then rises as the model overfits; the stopping criterion is min(error_a)]
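A sketch of the early-stopping loop (assuming hypothetical `train_one_epoch` and `validation_error` helpers supplied by the caller; the names and defaults are ours):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when validation error (error_a) has not improved for `patience` epochs,
    and return the weights from the epoch where it was lowest."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass of gradient descent on the training set
        err = validation_error(model)       # error on the held-out validation set
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:      # validation error stopped improving: stop training
                break
    return best_model
```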

Page 47:

Learning Rate?

• A “knob” you twist empirically

– Important

• One popular option: wait for validation-set accuracy to stop improving (decrease or stabilize), then halve the learning rate

Page 48:

Modern Neural Networks (Deep Nets)

• Local minima in large networks are less of an issue

• Early stopping is useful, but so is initializing at zero and training until training error is almost zero
– Count on stochastic gradient descent to perform "implicit regularization"

• Also: Dropout

• Many layers are now common
– And specific structure: convolution, max pooling

Page 49:

Summary of Neural Networks

When are Neural Networks useful?

– Instances represented by attribute-value pairs

• Particularly when attributes are real valued

– The target function is

• Discrete-valued

• Real-valued

• Vector-valued

– Training examples may contain errors

– Fast evaluation times are necessary

When not?

– Fast training times are necessary

– Understandability of the function is required

Page 50:

Summary of Neural Networks

Non-linear regression technique that is trained with gradient descent.

Question: How important is the biological metaphor?

Page 51:

Advanced Topics in Neural Nets

• Batch Move vs. stochastic

• Auto-Encoders

• Deep Belief Nets (briefly)

• Neural Networks on Silicon

• Neural Network language models

Page 52:

Stochastic vs. Batch Mode

Stochastic Gradient Descent

Page 53:

Incremental vs. Batch Mode

• In Batch Mode we minimize: E_D(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

• Same as computing: \Delta \vec{w}_D = \sum_{d \in D} \Delta \vec{w}_d

• Then setting: \vec{w} \leftarrow \vec{w} + \Delta \vec{w}_D
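The contrast in code form, for a linear unit (an illustrative sketch using the same per-example step as the earlier slides):

```python
def delta_w(w, x, t, eta):
    """Per-example step for a linear unit: eta * (t - o) * x, where x already includes the bias."""
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [eta * (t - o) * xi for xi in x]

def batch_epoch(w, data, eta=0.05):
    """Batch mode: sum the per-example steps over all of D, then update the weights once."""
    total = [0.0] * len(w)
    for x, t in data:
        total = [a + b for a, b in zip(total, delta_w(w, x, t, eta))]
    return [wi + di for wi, di in zip(w, total)]

def stochastic_epoch(w, data, eta=0.05):
    """Stochastic (incremental) mode: update the weights after every single example."""
    for x, t in data:
        w = [wi + di for wi, di in zip(w, delta_w(w, x, t, eta))]
    return w

w = [0.0, 0.0, 0.0]
data = [([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 0.0], 0.0)]   # each x starts with the bias input 1.0
print(batch_epoch(w, data), stochastic_epoch(w, data))
```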

Page 54:

Advanced Topics in Neural Nets

• Batch Move vs. incremental

• Auto-Encoders

• Deep Belief Nets (briefly)

• Neural Networks on Silicon

• Neural Network language models

Page 55:

Hidden Layer Representations

• Input->Hidden Layer mapping:

– representation of input vectors tailored to the task

• Can also be exploited for dimensionality reduction

– Form of unsupervised learning in which we output a “more compact” representation of input vectors

– <x1, …,xn> -> <x’1, …,x’m> where m < n

– Useful for visualization, problem simplification, data compression, etc.
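One way to realize this idea is an auto-encoder trained to reproduce its input through a narrow hidden layer; the hidden activations are then the compact representation. The following is a sketch under our own assumptions (tanh hidden layer, linear output, squared-error loss), not the slides' example:

```python
import numpy as np

def train_autoencoder(X, m, eta=0.01, epochs=2000, seed=0):
    """Learn <x1..xn> -> <x'1..x'm> with m < n by training the net to reconstruct X.
    Encoder: h = tanh(X W1). Decoder: X_hat = h W2. Returns the two weight matrices."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n, m))   # input -> hidden (the learned representation)
    W2 = rng.normal(scale=0.1, size=(m, n))   # hidden -> reconstruction
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - X                                 # reconstruction error
        grad_W2 = H.T @ err
        grad_W1 = X.T @ ((err @ W2.T) * (1 - H ** 2))    # backprop through tanh
        W1 -= eta * grad_W1
        W2 -= eta * grad_W2
    return W1, W2

X = np.eye(8)                        # 8 one-hot inputs: compress 8 dimensions down to 3
W1, W2 = train_autoencoder(X, m=3)
print(np.round(np.tanh(X @ W1), 2))  # the 3-dimensional hidden codes for each input
```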

Page 56:

Dimensionality Reduction

Model: [figure]   Function to learn: [figure]

Page 57:

Dimensionality Reduction: Example

Page 58:

Dimensionality Reduction: Example

Page 59:

Dimensionality Reduction: Example

Page 60:

Dimensionality Reduction: Example

Page 61:

Advanced Topics in Neural Nets

• Batch Move vs. incremental

• Auto-encoders

• Deep Belief Nets (briefly)

• Neural Networks on Silicon

• Neural Network language models

Page 62:

Restricted Boltzmann Machine

Page 63:

Page 64:

Auto-encoders vs. RBMs?

• Similar

• An auto-encoder's (AE) goal is to reconstruct its input in two steps: input -> hidden -> output

• An RBM defines a probability distribution P(x) over inputs

– The goal is to assign high likelihood to the observed training examples

– Determining the likelihood of a given x actually requires summing over all possible settings of the hidden nodes, rather than just computing a single activation as in an AE

– Take EECS 395/495 Probabilistic Graphical Models to learn more

Page 65:

Deep Belief Nets

Page 66:

Advanced Topics in Neural Nets

• Batch Move vs. incremental

• Auto-Encoders

• Deep Belief Nets (briefly)

• Neural Networks on Silicon

• Neural Network language models

Page 67:

Neural Networks on Silicon

• Currently:

[Figure: we run a simulation of continuous device physics (neural networks) on top of a digital computational model (thresholding), which itself runs on continuous device physics (voltage). Why not skip this middle, digital step?]

Page 68:

Example: Silicon Retina

• Simulates the function of a biological retina

• Single-transistor synapses adapt to luminance and temporal contrast

• Modeling the retina directly on the chip requires 100x less power!

Page 69:

Example: Silicon Retina

• Synapses modeled with single transistors

Page 70:

Luminance Adaptation

Page 71:

Comparison with Mammal Data

• Real: [figure]

• Artificial: [figure]

Page 72:

• Graphics and results taken from:

Page 73:

General NN learning in silicon?

• People seem more excited about / satisfied with GPUs

Page 74:

Advanced Topics in Neural Nets

• Batch Move vs. incremental

• Hidden Layer Representations

• Hopfield Nets

• Neural Networks on Silicon

• Neural Network language models

Page 75:

Neural Network Language Models

• Statistical Language Modeling:

– Predict probability of next word in sequence

I was headed to Madrid, ____

P(____ = "Spain") = 0.5, P(____ = "but") = 0.2, etc.

• Used in speech recognition, machine translation, (recently) information extraction

Page 76:

Formally

• Estimate: P(w_j \mid w_{j-1}, w_{j-2}, \ldots, w_{j-n+1}) \approx P(w_j \mid h_j)

where h_j denotes the history (the preceding words)

Page 77:

Page 78:

Optimizations

• Key idea – learn simultaneously:

– vector representations of each word (here 120-dimensional)

– a predictor of the next word, based on the previous words' vectors (see the sketch after this list)

• Short-lists

– Much complexity in hidden->output layer

• Number of possible next words is large

– Only predict a subset of words

• Use a standard probabilistic model for the rest

– Can also bin words into fixed classes
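A rough sketch of the forward pass such a model performs: look up the context words' vectors, combine them through a hidden layer, and produce a softmax distribution over the next word. The shapes and names here are ours for illustration (the paper used 120-dimensional vectors and a much larger vocabulary):

```python
import numpy as np

def nnlm_forward(context_ids, E, W_h, W_o):
    """P(next word | history): embed the previous words, pass through a hidden layer, softmax.
    E: |V| x d table of word vectors (learned jointly with the predictor during training)."""
    x = np.concatenate([E[i] for i in context_ids])   # concatenate the context word vectors
    h = np.tanh(W_h @ x)                              # hidden representation of the history h_j
    scores = W_o @ h                                  # one score per vocabulary word
    p = np.exp(scores - scores.max())
    return p / p.sum()                                # softmax distribution over the next word

# Tiny illustration: vocabulary of 5 words, 3-dimensional vectors, 2-word history
rng = np.random.default_rng(0)
V, d, n_context = 5, 3, 2
E = rng.normal(size=(V, d))
W_h = rng.normal(size=(4, n_context * d))
W_o = rng.normal(size=(V, 4))
print(nnlm_forward([1, 3], E, W_h, W_o))              # P(w_j = each of the 5 words | history)
```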

Page 79:

Design Decisions (1)

• Number of hidden units

Page 80:

Design Decisions (2)

• Word representation (# of dimensions)

• They chose 120 – more recent work uses up to 1000s

Page 81:

Comparison vs. state of the art

• Circa 2005

Schwenk, Holger, and Jean-Luc Gauvain. "Training neural network language models on very large corpora." Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.

Page 82:

More recent results

Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).