
Neural networks and optimization

Nicolas Le Roux

Criteo

18/05/15


1 Introduction

2 Deep networks

3 Optimization

4 Convolutional neural networks


1 Introduction

Goal: classification and regression

Medical imaging: cancer or not? → classification
Autonomous driving: optimal wheel position → regression
Kinect: where are the limbs? → regression
OCR: what are the characters? → classification

It also works for structured outputs.


Object recognition


2 Deep networks

Linear classifier

Dataset: (X^(i), Y^(i)) pairs, i = 1, ..., N.
X^(i) ∈ R^n, Y^(i) ∈ {−1, 1}.

Goal: find w and b such that sign(w^T X^(i) + b) = Y^(i).


Perceptron algorithm (Rosenblatt, 1957)

w_0 = 0, b_0 = 0
Ŷ^(i) = sign(w^T X^(i) + b)
w_{t+1} ← w_t + Σ_i (Y^(i) − Ŷ^(i)) X^(i)
b_{t+1} ← b_t + Σ_i (Y^(i) − Ŷ^(i))

[Demo: linearly separable data]
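For concreteness, here is a minimal numpy sketch of the batch update above (the function name, epoch cap and stopping check are mine, not from the slides):

import numpy as np

def perceptron(X, Y, n_epochs=100):
    # X: (N, n) data matrix, Y: (N,) labels in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        Y_hat = np.sign(X @ w + b)       # current predictions
        err = Y - Y_hat                  # zero on correct points, non-zero on mistakes
        if not err.any():                # no mistakes left (only happens if separable)
            break
        w += X.T @ err                   # w_{t+1} = w_t + sum_i (Y^(i) - Y_hat^(i)) X^(i)
        b += err.sum()                   # b_{t+1} = b_t + sum_i (Y^(i) - Y_hat^(i))
    return w, b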

Some data are not separable

The Perceptron algorithm is NOT convergent for non-linearly separable data.

Non-convergence of the perceptron algorithm

[Demo: perceptron on non-linearly separable data]

We need an algorithm which works on both separable and non-separable data.

Classification error

A good classifier minimizes

ℓ(w) = Σ_i ℓ_i = Σ_i ℓ(Ŷ^(i), Y^(i)) = Σ_i 1[sign(w^T X^(i) + b) ≠ Y^(i)]

Cost function

Classification error is not smooth.

Sigmoid is smooth but not convex.


Convex cost functions

Classification error is not smooth.
Sigmoid is smooth but not convex.
Logistic loss is a convex upper bound.
Hinge loss (SVMs) behaves very much like the logistic loss.
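To make the comparison concrete, here is a small sketch of these losses as functions of the label y ∈ {−1, 1} and the score f = w^T x + b (the helper names are mine):

import numpy as np

def zero_one_loss(y, f):
    # the classification error: 1 when the sign of the score disagrees with the label
    return (np.sign(f) != y).astype(float)

def logistic_loss(y, f):
    # smooth, convex surrogate used by logistic regression
    return np.log1p(np.exp(-y * f))

def hinge_loss(y, f):
    # SVM loss: also convex, zero once the margin y*f exceeds 1
    return np.maximum(0.0, 1.0 - y * f)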


Interlude

[Figure: a convex function]

[Figure: a non-convex function]

[Figure: a not-convex-but-still-OK function]

End of interlude

Solving separable AND non-separable problems

[Demo: logistic regression on non-linearly separable data]
[Demo: logistic regression on linearly separable data]

Non-linear classification

[Demo: non-linearly separable data, polynomial features]

Features X_1, X_2 → linear classifier
Features X_1, X_2, X_1X_2, X_1², ... → non-linear classifier

Choosing the features

To make it work, I created lots of extra features:
(X_1^i X_2^j) for i, j ∈ [0, 4] (p = 18)

Would it work with fewer features?
Test with (X_1^i X_2^j) for i, j ∈ [1, 3]

[Demo: non-linearly separable data, polynomial features of degree 2]
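A minimal sketch of this kind of hand-built polynomial feature map (the function name and the exact set of monomials kept are my choices for illustration):

import numpy as np

def poly_features(X, degree=4):
    # X: (N, 2) array whose columns are X_1 and X_2
    # returns the monomials X_1^i * X_2^j for i, j in [0, degree], dropping the constant
    X1, X2 = X[:, 0], X[:, 1]
    feats = [X1**i * X2**j
             for i in range(degree + 1)
             for j in range(degree + 1)
             if (i, j) != (0, 0)]
    return np.stack(feats, axis=1)

# a linear classifier trained on poly_features(X) has a non-linear boundary in (X_1, X_2)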


A graphical view of the classifiers

f(X) = w_1 X_1 + w_2 X_2 + b

f(X) = w_1 X_1 + w_2 X_2 + w_3 X_1² + w_4 X_2² + w_5 X_1 X_2 + ...

Non-linear features

A linear classifier on a non-linear transformation is non-linear.
A non-linear classifier relies on non-linear features.
Which ones do we choose?

Example: H_j = X_1^{p_j} X_2^{q_j}
SVM: H_j = K(X, X^(j)) with K some kernel function

Do they have to be predefined?
A neural network will learn the H_j's.


A 3-step algorithm for a neural network

1 Pick an example x
2 Transform it into x̃ = Vx with some matrix V
3 Compute w^T x̃ + b


A 3-step algorithm for a neural network

1 Pick an example x
2 Transform it into x̃ = Vx with some matrix V
3 Compute w^T x̃ + b = w^T Vx + b = w̃^T x + b (so this is still a linear classifier in x)

A 3-step algorithm for a neural network

1 Pick an example x
2 Transform it into x̃ = Vx with some matrix V
3 Apply a nonlinear function g to all the elements of x̃
4 Compute w^T x̃ + b = w^T g(Vx) + b ≠ w̃^T x + b
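A quick numerical check of this point, with made-up sizes and a rectified linear unit as the nonlinearity g:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)              # an example x in R^5
V = rng.normal(size=(3, 5))         # the matrix V
w = rng.normal(size=3)
b = 0.1

g = lambda z: np.maximum(z, 0.0)    # rectified linear transfer function

without_g = w @ (V @ x) + b         # step 3 without a nonlinearity
collapsed = (V.T @ w) @ x + b       # the equivalent single linear classifier, w~ = V^T w
with_g = w @ g(V @ x) + b           # with g, no such collapse in general

print(np.isclose(without_g, collapsed))   # True: still a linear model in x
print(np.isclose(with_g, collapsed))      # generally False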

Neural networks

Usually, we use

H_j = g(v_j^T X)

H_j: hidden unit
v_j: input weights
g: transfer function


Transfer function

f(X) = Σ_j w_j H_j(X) + b = Σ_j w_j g(v_j^T X) + b

g is the transfer function.
g used to be the sigmoid or the tanh.
If g is the sigmoid, each hidden unit is a soft classifier.

Transfer function

f(X) = Σ_j w_j H_j(X) + b = Σ_j w_j g(v_j^T X) + b

g is the transfer function.
g is now often the positive part, g(z) = max(0, z).
This is called a rectified linear unit (ReLU).

Neural networks

f(X) = Σ_j w_j H_j(X) + b = Σ_j w_j g(v_j^T X) + b

Each hidden unit is a (soft) linear classifier.
We linearly combine these classifiers to get the output.
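Putting the pieces together, a minimal sketch of the forward pass f(X) = Σ_j w_j g(v_j^T X) + b (the names and the choice of tanh as the default transfer function are mine):

import numpy as np

def forward(X, V, w, b, g=np.tanh):
    # X: (n,) input, V: (k, n) matrix whose rows are the input weights v_j,
    # w: (k,) output weights, b: scalar bias, g: transfer function
    H = g(V @ X)        # hidden units H_j = g(v_j^T X)
    return w @ H + b    # f(X) = sum_j w_j H_j(X) + b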

Example on the non-separable problem

[Demo: non-linearly separable data, neural net 3]

Training a neural network

Dataset: (X^(i), Y^(i)) pairs, i = 1, ..., N.

Goal: find w and b such that sign(w^T X^(i) + b) = Y^(i)

Goal: find w and b to minimize Σ_i log(1 + exp(−Y^(i) (w^T X^(i) + b)))

Goal: find v_1, ..., v_k, w and b to minimize Σ_i log(1 + exp(−Y^(i) Σ_j w_j g(v_j^T X^(i))))

Cost function

s = cost function (logistic loss, hinge loss, ...)

ℓ(v, w, b, X^(i), Y^(i)) = s(Ŷ^(i), Y^(i))
                         = s(Σ_j w_j H_j(X^(i)), Y^(i))
                         = s(Σ_j w_j g(v_j^T X^(i)), Y^(i))

We can minimize it using gradient descent.

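As an illustration, a minimal gradient-descent sketch for this cost with s = logistic loss and g = tanh (the layer size, learning rate and initialization are arbitrary choices, not from the slides):

import numpy as np

def train(X, Y, k=10, lr=0.1, n_steps=1000, seed=0):
    # X: (N, n) inputs, Y: (N,) labels in {-1, +1}; k hidden units, tanh transfer function
    rng = np.random.default_rng(seed)
    N, n = X.shape
    V = rng.normal(scale=0.1, size=(k, n))
    w = rng.normal(scale=0.1, size=k)
    b = 0.0
    for _ in range(n_steps):
        H = np.tanh(X @ V.T)                 # (N, k) hidden units
        f = H @ w + b                        # (N,) scores
        s = -Y / (1.0 + np.exp(Y * f))       # derivative of log(1 + exp(-Y f)) w.r.t. f
        grad_w = H.T @ s / N
        grad_b = s.mean()
        grad_V = ((s[:, None] * w) * (1.0 - H**2)).T @ X / N   # chain rule through tanh
        w -= lr * grad_w
        b -= lr * grad_b
        V -= lr * grad_V
    return V, w, b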

Finding a good transformation of x

H_j(x) = g(v_j^T x)

Is it a good H_j? What can we say about it?
It is theoretically enough (universal approximation).
That does not mean you should use it.


Going deeper

H_j(x̃) = g(v_j^T x̃)
x̃_j = g(ṽ_j^T x)

We can transform x as well.


Going from 2 to 3 layers

We can have as many layers as we like

But it makes the optimization problem harder


3 Optimization

The need for fast learning

Neural networks may need many examples (several million or more).
We need to be able to use them quickly.

Batch methods

L(θ) = (1/N) Σ_i ℓ(θ, X^(i), Y^(i))

θ_{t+1} ← θ_t − (α_t/N) Σ_i ∂ℓ(θ, X^(i), Y^(i)) / ∂θ

To compute one update of the parameters, we need to go through all the data.
This can be very expensive.
What if we have an infinite amount of data?

Potential solutions

1 Discard data.
  - Seems stupid
  - Yet many people do it
2 Use approximate methods.
  - Update = average of the updates for all datapoints.
  - Are these updates really different?
  - If not, how can we learn faster?


Stochastic gradient descent

L(θ) = (1/N) Σ_i ℓ(θ, X^(i), Y^(i))

Batch update:
θ_{t+1} ← θ_t − (α_t/N) Σ_i ∂ℓ(θ, X^(i), Y^(i)) / ∂θ

Stochastic update (one example i_t per step):
θ_{t+1} ← θ_t − α_t ∂ℓ(θ, X^(i_t), Y^(i_t)) / ∂θ

What do we lose when updating the parameters to satisfy just one example?
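The two update rules side by side, as a sketch (grad_fn is a hypothetical function returning ∂ℓ(θ, X^(i), Y^(i))/∂θ for one example):

import numpy as np

def batch_step(theta, grad_fn, X, Y, lr):
    # one batch update: average the gradient over all N examples before moving
    g = np.mean([grad_fn(theta, x, y) for x, y in zip(X, Y)], axis=0)
    return theta - lr * g

def sgd_step(theta, grad_fn, X, Y, lr, rng=np.random.default_rng(0)):
    # one stochastic update: move using the gradient of a single random example
    i = rng.integers(len(X))
    return theta - lr * grad_fn(theta, X[i], Y[i])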

Disagreement

[Plot: ‖µ‖²/σ² during optimization (log scale)]

As optimization progresses, disagreement increases.
Early on, one can pick one example at a time.
What about later?

Stochastic vs. batch methods

Goal = best of both worlds: a linear rate with O(1) iteration cost.

[Plot: log(excess cost) vs. time for stochastic and deterministic methods]

For non-convex problems, stochastic works best.


Understanding the loss function of deep networks

Recent studies say most local minima are close to the optimum.


Regularizing deep networks

Deep networks have many parameters.
How do we avoid overfitting?
  - Regularizing the parameters
  - Limiting the number of units
  - Dropout


Dropout

Overfitting happens when each unit gets too specialized.
The core idea is to prevent each unit from relying too much on the others.
How do we do this?


Dropout - an illustration

[Figure taken from "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"]

Dropout - a conclusion

By removing units at random, we force the others to be less specialized.
At test time, we use all the units.
This is sometimes presented as an ensemble method.

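A minimal sketch of the mechanism, using the common "inverted dropout" variant where activations are rescaled at training time so that nothing changes at test time (the names and keep probability are illustrative):

import numpy as np

def dropout(H, keep_prob=0.5, train=True, rng=np.random.default_rng(0)):
    # H: hidden-unit activations
    if not train:
        return H                              # at test time, use all the units
    mask = rng.random(H.shape) < keep_prob    # drop each unit independently at random
    return H * mask / keep_prob               # rescale so the expected activation is unchanged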

Neural networks - Summary

A linear classifier in a feature space can model non-linear boundaries.
Finding a good feature space is essential.
One can design the feature map by hand.
One can learn the feature map, using fewer features than if it were done by hand.
Learning the feature map is potentially HARD (non-convexity).

Recipe for training a neural network

Use rectified linear units
Use dropout
Use stochastic gradient descent
Use 2000 units on each layer
Try different numbers of layers

4 Convolutional neural networks

The v_j's for images

f(X) = Σ_j w_j H_j(X) + b = Σ_j w_j g(v_j^T X) + b

If X is an image, v_j is an image too.
v_j acts as a filter (presence or absence of a pattern).
What does v_j look like?

The v_j's for images - Examples

[Figure: examples of learned filters]

Filters are mostly local.

Basic idea of convolutional neural networks

Filters are mostly local.
Instead of using image-wide filters, use small ones over patches.
Repeat for every patch to get a response image.
Subsample the response image to get local invariance.
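A minimal sketch of the "small filter over every patch" idea, written as a plain "valid" cross-correlation (loop-based for clarity, not speed; names are mine):

import numpy as np

def filter_image(img, filt):
    # img: (H, W) grayscale image, filt: (fh, fw) small filter
    fh, fw = filt.shape
    H, W = img.shape
    out = np.empty((H - fh + 1, W - fw + 1))  # the response image
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # response = filter applied to the patch at (i, j)
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out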

Filtering - Filters 1, 2 and 3

[Figures: original image, filter, output image for three different filters]

Pooling - Filters 1, 2 and 3

[Figures: original image, output image, subsampled image for three different filters]

How to do 2x subsampling-pooling:
Output image = O, subsampled image = S.
S_ij = max_k O_k over a window around (2i, 2j).
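A minimal sketch of this 2x subsampling-pooling rule (non-overlapping 2x2 windows; the exact window placement is an assumption):

import numpy as np

def max_pool_2x(O):
    # O: (H, W) response image; S_ij = max of O over the 2x2 window around (2i, 2j)
    H, W = O.shape
    S = np.empty((H // 2, W // 2))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = O[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return S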

A convolutional layer

Transforming the data with a layer

[Figure: original datapoint → new datapoint]

A convolutional network


Face detection


NORB dataset

50 toys belonging to 5 categories
  - animal, human figure, airplane, truck, car
10 instances per category
  - 5 instances used for training, 5 instances for testing
Raw dataset
  - 972 stereo pairs of each toy; 48,600 image pairs total

NORB dataset - 2

For each instance:
18 azimuths (0 to 350 degrees, every 20 degrees)
9 elevations (30 to 70 degrees from horizontal, every 5 degrees)
6 illuminations (on/off combinations of 4 lights)
2 cameras (stereo), 7.5 cm apart, 40 cm from the object

NORB dataset - 3


Textured and cluttered versions


90,857 free parameters, 3,901,162 connections.
The entire network is trained end-to-end (all the layers are trained simultaneously).

Normalized-Uniform set

Method                                        Error
Linear Classifier on raw stereo images        30.2%
K-Nearest-Neighbors on raw stereo images      18.4%
K-Nearest-Neighbors on PCA-95                 16.6%
Pairwise SVM on 96x96 stereo images           11.6%
Pairwise SVM on 95 Principal Components       13.3%
Convolutional Net on 96x96 stereo images       5.8%

Jittered-Cluttered Dataset

291,600 stereo pairs for training, 58,320 for testing
Objects are jittered
  - position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ...
Input dimension: 98x98x2 (approx. 18,000)

Jittered-Cluttered Dataset - Results

Method                                    Error
SVM with Gaussian kernel                  43.3%
Convolutional Net with binocular input     7.8%
Convolutional Net + SVM on top             5.9%
Convolutional Net with monocular input    20.8%
Smaller mono net (DEMO)                   26.0%

Dataset available from http://www.cs.nyu.edu/~yann

NORB recognition - 1


NORB recognition - 2


ConvNet summary

With complex problems, it is hard to design features by hand.
Neural networks circumvent this problem.
They can be hard to train (again...).
Convolutional neural networks use knowledge about locality in images.
They are much easier to train than standard networks.
And they are FAST (again...).

What has not been covered

In some cases, we have lots of data, but without the labels: unsupervised learning.
There are techniques to use these data to get better performance.
E.g.: Task-Driven Dictionary Learning, Mairal et al.

Additional techniques currently used

Local response normalization

Data augmentation

Batch gradient normalization


Adversarial examples

ConvNets achieve stellar performance in object recognition

Do they really understand what's going on in an image?


What I did not cover

Algorithms for pretraining (RBMs, autoencoders)

Generative models (DBMs)

Recurrent neural networks for machine translation


Conclusions

Neural networks are complex non-linear models

Optimizing them is hard

It is as much about theory as about engineering

When properly tuned, they can perform wonders


Bibliography

Deep Learning book, http://www-labs.iro.umontreal.ca/~bengioy/DLbook/
Dropout: A Simple Way to Prevent Neural Networks from Overfitting by Srivastava et al.
Intriguing properties of neural networks by Szegedy et al.
ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky, Sutskever and Hinton
Gradient-based learning applied to document recognition by LeCun et al.
Qualitatively characterizing neural network optimization problems by Goodfellow and Vinyals
Sequence to Sequence Learning with Neural Networks by Sutskever, Vinyals and Le
Caffe, http://caffe.berkeleyvision.org/
Hugo Larochelle's MOOC, https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH
