
Page 1:

Deep Learning, differentiable programming, and software 2.0
(or white is the new black?)
Mounir Boukadoum, UQAM, Dept. of Computer Science

Ruslan Salakhutdinov, Soumith Chintala, Chris Olah, Ian Goodfellow, Andrej Karpathy, Ilya Sutskever, Alex Krizhevsky

Page 2:

Young fields [often] start in a very ad‐hoc manner. Later, the mature field is understood very differently … It seems quite likely that deep learning is in this ad‐hoc state.

Chris Olah, Google Brain. https://colah.github.io/posts/2015-09-NN-Types-FP/

Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. … In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code.

Andrej Karpathy, OpenAI. https://medium.com/@karpathy/software-2-0-a64152b37c35

Page 3:

Deep learning has enabled spectacular achievements in solving complex problems of perception and prediction, but…

[With Deep Neural Networks,] machine learning has become alchemy.

Ali Rahimi, Google (talk at NIPS 2017). https://www.youtube.com/watch?v=Qi1Yry33TQE

Success has come mostly from trial and error; is there a unifying theory behind current knowledge and practices?

Page 4:

So, is there white-box deep learning? There are at least three ways to approach the issue:

• Neuroscience: reproduction of human intelligence (biological analogies)
• Probabilities: inference from available data (latent-variable manipulation)
• Data representations: transformations in manifolds? (differential calculus)

Currently, deep learning is mostly the third approach, using trial and error; could there be a white box model behind the black box appearance?

https://colah.github.io/posts/2015-09-NN-Types-FP/

Page 5:

Artificial neural network (ANN) 101


Loose metaphor of biological neural networks: interconnected neurons with similar computation types => a computational graph

Neuron -> node with I/O edges; Synapse -> weighted connection

Page 6:

A special type of graph

[Figure: a 2-bit adder built with NAND gates and its ANN equivalent]

But in ANNs:
• The task is automatically learned from the data
  – The neural weights and the type(s) of neural outputs set the function
• There is generalization capacity and resilience to imprecision and fragmentary inputs

http://neuralnetworksanddeeplearning.com/chap1.html
Ackerman and Freer, arXiv:1703.09406

Many types of computational graphs exist

Page 7:

Two fundamental topologies

Feedforward architectures are good for static problems, recurrent ones for dynamic/contextual problems (currently studied as “unfolded” feedforward architectures)

[Figure: non-recurrent vs. recurrent neural network topologies (+ BSB, BAM, etc.)]

Page 8:

Three ways to set the neural weights (learning)

All based on the available data:
• Supervised learning: the data are labeled
• Unsupervised learning: the data are not labeled; labelling is done based on patterns/similarities (categorisation)
• Reinforcement learning: the data are not labeled; labelling is done based on the generated output value (expectation versus outcome)

Page 9:

Generic two-step operation

1. Training (learning), done in advance
• By programming (C++, Python, Lua, Java, etc.)
• Using a NN simulator (Matlab, SNNS, etc.)
Cross-validation is frequently used for consistent results!

2. Using the trained network

[Figure: training feeds the patterns to learn into a training algorithm that produces the neural weights; in use, a pattern to classify goes through the ANN, configured with those weights, to produce the corresponding output]

Page 10:

Multi-layer perceptron

Seminal architecture of deep learning
• 1-2 hidden layers: shallow; more than two layers: deep

Essentially a projection operator: given a pattern at the input, it provides a corresponding pattern at the output

Dynamic/contextual problems are handled by recurrent networks that are unfolded

Page 11:

MLP learning process

Builds a persistent and hierarchical representation of the data information
• Hidden layers progressively learn deeper intermediate representations

Lee, Largman, Pham & Ng, NIPS 2009; Lee, Grosse, Ranganath & Ng, ICML 2009

[Figure: features learned at successive layers (Layer 1 to Layer 3); parts combine to form objects, up to high-level representations]

Page 12:

MLP learning details

Supervised; tries to minimize the average difference between a labeled training set and its neural representation
• Minimization of the average squared error, expressed as a function of the neural weights: E = f(w)

The intuitive approach, solving ∇E = 0 directly, doesn’t work (it requires knowing the data statistics!), so a metaheuristic is used instead, with the assumption that E = f(w) is differentiable (stochastic gradient descent)
• Many variants exist

In any case, the process requires differentiable error functions!

Page 13:

Stochastic gradient descent

ΔE ≈ ∇E · Δw for small Δw. Therefore, if w is evolved in the direction opposite to ∇E for each learning trial, i.e., Δw = −η ∇E with η > 0, then ΔE ≈ −η ‖∇E‖² ≤ 0
=> E decreases monotonically!

Only the input and final output of the network are known at each training trial; those of the hidden layers must also be determined
=> Error backpropagation algorithm (based on the chain rule for derivatives)
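To make the update rule concrete, here is a minimal sketch in plain Python/NumPy (not from the slides; the toy quadratic loss, the data and the learning rate eta are illustrative assumptions):

import numpy as np

# Toy example: minimize E(w) = average squared error by stochastic gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # input patterns
y = X @ np.array([1.0, -2.0, 0.5])       # targets produced by a known linear map
w = np.zeros(3)                          # weights to learn
eta = 0.1                                # learning rate (assumed value)

for epoch in range(50):
    for i in rng.permutation(len(X)):    # one pattern per learning trial
        err = X[i] @ w - y[i]            # prediction error for this pattern
        grad = err * X[i]                # gradient of the squared error w.r.t. w
        w -= eta * grad                  # step opposite to the gradient: delta_w = -eta * grad

print(w)                                 # approaches [1.0, -2.0, 0.5]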

Page 14:

The more layers, the deeper the learning (or so it seems)

2011: 25.8% error with a shallow net
2012: 16.4% error with 8 layers
2014: 7.3% error with 19 layers
2015: 3.57% error with 152 layers

Double the human performance, but black-box operation!

Kaiming He, Xiangyu Zhang, Shaoqing Ren & Jian Sun, "Deep Residual Learning for Image Recognition", arXiv 2015.

[Chart: ImageNet Large Scale Visual Recognition Competition (ILSVRC) top-5 error (%): 28.2 (ILSVRC'10, shallow), 25.8 (ILSVRC'11, shallow), 16.4 (ILSVRC'12, AlexNet, 8 layers), 11.7 (ILSVRC'13, 8 layers), 7.3 (ILSVRC'14, VGG, 19 layers), 6.7 (ILSVRC'14, GoogLeNet, 22 layers), 3.57 (ILSVRC'15, ResNet, 152 layers). ImageNet: 1000 objects, 1.2 million images.]

Page 15:

In sum… Deep learning is essentially many-layer MLPs trained by error backpropagation, with (mostly) no side effects

At least three technologies and extensions:
• Autoencoders and deep belief networks (unsupervised learning)
• Convolutional MLPs (supervised learning)
• Generative adversarial networks (supervised learning)
• Extensions (e.g., unfolded recurrent architectures)

No white-box model yet!

[Photos: Yoshua Bengio (Montréal), Geoffrey Hinton (Toronto), Yann LeCun (New York)]

Page 16:

Black box = fad?

We [must] think through artificial intelligence from foundational principles rather than from the empirics of past data

Martin Reeves & Mihnea Moldoveanu, Scientific American, Sep. 2017

Starting May 25, [2018,] the European Union will require algorithms to explain their output, making deep learning illegal.

Reported by Pedro Domingos, U. Washington Seattle, Jan. 2018

What if there is a white box waiting to be uncovered, after all? 

Page 17:

Back to representations…

Each layer processes the output of its predecessor to create a new data representation (function composition!)

If all the nodes are differentiable, task training by error backpropagation is feasible! 

Could this be the start of a white box ANN formalism?
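As a toy sketch of "layers as function composition" (illustrative only; the sigmoid activation and the layer sizes are assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b):
    # One layer is one differentiable map x -> sigma(W x + b)
    return lambda x: sigmoid(W @ x + b)

rng = np.random.default_rng(1)
f1 = layer(rng.normal(size=(4, 3)), np.zeros(4))   # first re-representation of the input
f2 = layer(rng.normal(size=(2, 4)), np.zeros(2))   # second re-representation

# The network is just the composition f2(f1(x)):
# each layer processes the output of its predecessor.
network = lambda x: f2(f1(x))
print(network(np.ones(3)))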

Page 18:

The functional programming connection

Three main ANN characteristics:
• Function composition → output based on embedded transformations
• End-to-end differentiability → optimization
• Weight-tying → sub-network reusability

Can it be that deep learning is just functional programming with reusable blocks, configured by error backpropagation training?

How so?

Page 19:

Transfer learning

Ng et al., Proc. ICML 2009, pp. 609-616

Page 20:

Current ANN models

Rectangle = vector; arrow = function. (a) fixed-sized input to fixed-sized output (e.g., image classification); (b) Sequence output (e.g., image captioning); (c) Sequence input (e.g., sentiment analysis); (d) sequence to sequence (e.g., translation); (e) sync’ed sequence to sequence (e.g., video frame tagging). Green layer length is arbitrary, being the result of unfolding a recurrent architecture.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

[Diagram: input (bottom), hidden/state (middle), and output (top) vectors for configurations a) through e)]

Page 21:

How about memory?

Long Short-Term Memory (LSTM) adds a neural structure that enables storing, retrieving or erasing the neural state based on context rather than sequentially. The Gated Recurrent Unit (GRU) is a close relative.
But the LSTM and GRU access mechanisms are not differentiable!

[Figure: LSTM cell, a special neuron with 1 input, 3 controls and 1 output: a memory cell with an input gate, a forget gate and an output gate; the control signals typically come from perceptrons]
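For orientation, a bare-bones NumPy sketch of one LSTM step (a standard textbook formulation with biases omitted for brevity, not the specific diagram above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wi, Wf, Wo, Wc):
    # The three gates decide what to write, what to keep and what to expose.
    z = np.concatenate([h_prev, x])
    i = sigmoid(Wi @ z)                    # input gate (input control)
    f = sigmoid(Wf @ z)                    # forget gate (forget control)
    o = sigmoid(Wo @ z)                    # output gate (output control)
    c = f * c_prev + i * np.tanh(Wc @ z)   # memory cell update
    h = o * np.tanh(c)                     # new output / hidden state
    return h, c

rng = np.random.default_rng(5)
d_in, d_cell = 3, 4
Wi, Wf, Wo, Wc = (rng.normal(size=(d_cell, d_cell + d_in)) for _ in range(4))
h, c = np.zeros(d_cell), np.zeros(d_cell)
h, c = lstm_step(rng.normal(size=d_in), h, c, Wi, Wf, Wo, Wc)
print(h, c)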

Page 22:

Making memory access differentiable

Necessary for learning where to write and read. Not obvious, as memory addresses are fundamentally discrete. How about reading and writing everywhere, just to different extents?
• Approach taken in Neural Turing Machines and several other recent models

https://distill.pub/2016/augmented-rnns/

Page 23:

Making memory differentiable

The idea is to link the memory states to an attention mechanism. Given a memory context c_j and a sequence of memory items h_i, i = 1..n:
• A "distance" a_ij = f(h_i, c_j) can be defined for each pair (h_i, c_j)
  (f can be implemented with a basic feed-forward network, making it part of the overall ANN)
• The relative weight (attention) of each h_i with respect to c_j is then
  α_ij = exp(a_ij) / ∑_{k=1..n} exp(a_kj)
  and a composite attention over all the h_i with respect to c_j can be defined as
  c = ∑_{i=1..n} α_ij h_i

c_j is no longer associated with a single item h_i, and the steps to distribute attention across the whole memory are all differentiable!
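A minimal NumPy sketch of this soft-attention read (the scoring function f is stubbed out as a bilinear form with random parameters; all names and sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8                       # n memory items of dimension d
H = rng.normal(size=(n, d))       # memory items h_i
c = rng.normal(size=d)            # current context c_j
W = rng.normal(size=(d, d))       # parameters of the "distance" a_ij = h_i^T W c_j

scores = H @ W @ c                                 # a_ij for i = 1..n
alpha = np.exp(scores) / np.exp(scores).sum()      # softmax: relative attention weights
read = alpha @ H                                   # composite read: sum_i alpha_ij * h_i

# Every step above is differentiable, so gradients can flow back into W (and H).
print(alpha, read)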

Page 24:

ANNs as functional graphs

MLPs, CNNs and RNNs are all expressible as graphs where the nodes perform the layer computations and the arcs are the layer interconnections. Given differentiable nodes, end-to-end graph training by error backpropagation is possible.

Two major gains in doing so:
• General-purpose computation systems that are automatically configurable for desired outcomes!
• White-box modeling through functional similarities and abstractions

http://colah.github.io/posts/2015-09-NN-Types-FP/
https://pseudoprofound.wordpress.com/2016/08/03/differentiable-programming/

Page 25:

Functional similarities

Weight-tying (multiple reuse of the same neuron, as in CNNs and RNNs) resembles function abstraction. Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip).

Page 26:

Encoding Recurrent Neural Networks are folds (Haskell: foldl a)

Generating Recurrent Neural Networks are unfolds (Haskell: unfoldr a s)

http://colah.github.io/posts/2015-09-NN-Types-FP/

Page 27:

General Recurrent Neural Networks are accumulating maps (Haskell: mapAccumR a s)
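To make the correspondence concrete in Python rather than Haskell (an illustrative sketch; the tanh cell and the sizes are assumptions), an RNN really is a fold or an accumulating map of one step function over the input sequence:

import numpy as np
from functools import reduce
from itertools import accumulate

rng = np.random.default_rng(3)
Wx = rng.normal(size=(4, 3))
Wh = rng.normal(size=(4, 4))

def step(h, x):
    # One recurrent cell: new state from the previous state h and the current input x.
    return np.tanh(Wx @ x + Wh @ h)

xs = [rng.normal(size=3) for _ in range(6)]          # input sequence
h0 = np.zeros(4)

h_final = reduce(step, xs, h0)                       # encoding RNN = fold: keep only the final state
states = list(accumulate(xs, step, initial=h0))[1:]  # general RNN = accumulating map: keep every state

print(h_final.shape, len(states))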

Page 28:

Convolutional Neural Networks are a close relative of map: a windowed map = convolutional layer (Haskell: zipWith a xs (tail xs))

[Figure: two-dimensional convolutional network]

http://colah.github.io/posts/2015-09-NN-Types-FP/

Page 29:

Recursive Neural Networks ("TreeNets") are catamorphisms, a generalization of folds (Haskell: cata a)

http://colah.github.io/posts/2015-09-NN-Types-FP/

Page 30:

Examples of building block combinations

• English-to-French translation by combining an encoding RNN and a generating RNN, to essentially perform a fold followed by an unfold (Sutskever et al., 2014)
• Image captioning with a convolutional network and a generating RNN: the CNN does the feature detection and the RNN unfolds the resulting vector into a description sentence (Vinyals et al., 2014)

Page 31:

Functional names of common layers

Deep Learning Name    Functional Name
Learned Vector        Constant
Embedding Layer       List Indexing
Encoding RNN          Fold
Generating RNN        Unfold
General RNN           Accumulating Map
Bidirectional RNN     Zipped Left/Right Accumulating Maps
Conv Layer            "Window Map"
TreeNet               Catamorphism
Inverse TreeNet       Anamorphism

http://colah.github.io/posts/2015-09-NN-Types-FP/

Page 32:

Creating differentiable functional graphs

Make algorithmic elements continuous and differentiable
[Figure: Neural Turing Machine on the copy task (Graves et al., 2014)]

Create/implement a functional language where all primitives are differentiable and expressible in neural form (save basic arithmetic operations), so that we have: y = f(x) = σ(Wx + b)

Structural models already exist (Neural Turing Machine; stack-augmented RNN; stack, queue, deque); what is missing is the neural programming language

Adapted from http://www.cs.nuim.ie/~gunes/files/Baydin-MSR-Slides-20160201.pdf

Page 33:

Basic differentiable structures based on y = f(x) = σ(Wx + b)

Functional expressions (no mutable data inside) => declarative languages (Lisp, Haskell, Erlang, etc.)

Function composition and its neural counterpart:

  function h(x)
    return f(g(x))          h(x) = σ(W1(σ(W2x + b1)) + b2)
  endfunction

Needed language constructs, e.g., a conditional and its neural counterpart:

  function f(x, a)
    if x > 1.0
      return a + 1          +(x, y) = σ(Wx + W'y + b)
    else
      return a              f(x, a) = if(x, 1.0, +(a, 1.0), a)
    endif
  endfunction

Differentiable if, implemented with a TreeNet neural network

https://pseudoprofound.wordpress.com/2016/08/03/differentiable-programming/
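As a much simpler illustration of the idea of a differentiable conditional (a sigmoid-gated blend, not the TreeNet construction mentioned above; the sharpness parameter k is an assumption):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_if(x, threshold, then_val, else_val, k=10.0):
    # Differentiable stand-in for "if x > threshold: then_val else: else_val".
    # The gate sigmoid(k * (x - threshold)) moves smoothly from 0 to 1, so the
    # output (and its gradient with respect to x) is defined everywhere.
    gate = sigmoid(k * (x - threshold))
    return gate * then_val + (1.0 - gate) * else_val

a = 2.0
print(soft_if(1.5, 1.0, a + 1, a))   # x > 1.0: close to a + 1 = 3
print(soft_if(0.5, 1.0, a + 1, a))   # x <= 1.0: close to a = 2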

Page 34:

Functional language constructs

• Primitive functions f: T -> S to carry out the basic σ(Wx + b) building blocks, with W and b learned from the data
• A mechanism to create composite functions from primitive functions, e.g., mlp(x) = f(g(x))
• Higher-order functions that take functions as inputs, generate functions as outputs, or both
• Memory constructs (lists? monads?)

=> λ-calculus!

λ-calculus syntax: all expressions are of the form

  e ::= x        // variable
      | λx.e1    // function definition
      | e1 e2    // function application
      | (e1)     // disambiguation

http://colah.github.io/posts/2015-09-NN-Types-FP/

Page 35:

Examples of higher-order functions

map(Fun, List)
• Applies Fun to each element of List, returning a list of results that may be of a different type

filter(Pred, List)
• Returns the sublist of List containing the elements that satisfy the predicate Pred

foldl(Fun, Acc, List)
• Calls Fun on successive elements of List, starting with the accumulator Acc and returning a value of the same type as Acc

Etc.
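For concreteness, the same three higher-order functions written in Python (an illustrative sketch):

from functools import reduce

xs = [1, 2, 3, 4]

# map: apply a function to each element (the result type may differ from the input type)
print(list(map(str, xs)))                       # ['1', '2', '3', '4']

# filter: keep the elements that satisfy a predicate
print(list(filter(lambda x: x % 2 == 0, xs)))   # [2, 4]

# foldl: combine successive elements into an accumulator (functools.reduce in Python)
print(reduce(lambda acc, x: acc + x, xs, 0))    # 10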

Page 36:

More higher-order functions

all(Pred, List) any(Pred, List) takewhile(Pred, List)

dropwhile(Pred, List) flatten(DeepList) flatmap(Fun, List)

foreach(Fun, List) partition(Pred,List) zip(List1,List2)

unzip(List) …


Page 37:

Software is dead, long live software?

Current software is imperative (a sequence of instructions, each one imparting a behaviour to a point in program space)
• But for most real-world problems, it is easier to state the desired behaviour (e.g., via input-output examples) than to write executable code

V2.0 would be declarative: the “programmer” specifies the outcome and a composition of neural building blocks is searched for to provide it
• Deep learning searches in continuous manifolds (for dimensionality reduction and to make gradient descent possible)

Software work should shift from writing programs, maintaining repositories and doing run-time analysis to collecting, analyzing and preparing data for a neural network

Page 38:

From classical to differentiable machines

Classical program: a sequence of executable instructions to perform a specified task

Differentiable program: a sequence of problem-domain declarations on how to perform a specified task
• Functional blocks for white-box operation
• Differentiable nodes for auto-configuration by error backpropagation learning

Page 39:

How about existing frameworks? Currently two types of computational graphs:

Symbolic
• Typical representatives: Theano, TensorFlow, CGT
• Fine-grained
• Graph analysis and optimizations

Modular
• Typical representatives: Torch, Caffe
• Coarse-grained
• Manually designed modules

Similarities
• Model definition using a (constrained) symbolic language
• Automatic handling of backpropagation in the final model (no need to code derivatives along)

(Kenneth Tran, “Evaluation of Deep Learning Toolkits”, https://github.com/zer0n/deepframeworks)

Page 40:

You are limited to symbolic graph building with the mini-language: for example, instead of writing a plain Python loop (for y = A^k), you build the corresponding symbolic graph; but there is no direct functional building as such.

http://deeplearning.net/software/theano/library/scan.html
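A sketch of the contrast, adapted from the Theano scan documentation linked above (treat the exact API details as indicative rather than authoritative):

# Pure Python: y = A**k (elementwise) as an explicit loop.
def power(A, k):
    result = [1.0] * len(A)
    for _ in range(k):
        result = [r * a for r, a in zip(result, A)]
    return result

# Theano: the same loop is described as a symbolic graph with scan,
# then compiled into an executable function.
import theano
import theano.tensor as T

k = T.iscalar("k")
A = T.vector("A")
result, updates = theano.scan(fn=lambda prior, A: prior * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)
power_fn = theano.function(inputs=[A, k], outputs=result[-1], updates=updates)
print(power([2.0, 3.0], 3), power_fn([2.0, 3.0], 3))   # [8.0, 27.0] both ways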

Page 41:

Current efforts
• Neural programmers (somewhat similar to genetic programming)
• Neural Programmer-Interpreters (with by-example supervision)
• Neural Turing Machines
• DiffSharp (higher-order differentiation)
• Autograd (automatic differentiation of NumPy and Python code)
• DNNGraph (Haskell model to Caffe and Torch scripts)
• Etc.

All in the last couple of years, but is gradient descent really necessary? How about copying biology?

Page 42:

A biologically-inspired neural building block

Page 43:

A loop-based neural architecture

Gisiger & Boukadoum, Neural Networks, 2018

Page 44:

Delayed-response task (DRT)

Tests the ability to respond to stimuli based on short-term memory. Three major steps, repeated over a number of trials:
• Cue: sensory information to retain (e.g., image, dot on a screen, auditory stimulus)
• Delay: the cue is withdrawn for an arbitrary delay
• Response: cue-related action (e.g., identify a cue image in a set, or point to the location where the dot initially appeared)

Although seemingly simple, the task requires complex mental processing:
1. Sensing the cue information, say a visual representation (VR);
2. Committing the cue information to short-term memory;
3. Protecting it from interference by external and internal distractions;
4. Using the information stored in working memory to produce the correct motor response (PM);
5. Discarding this information at the end of the trial in preparation for the next one (Reset).

Page 45:

Implementing DRT with a loop-based network


Page 46:

LSTM perspective

Page 47:

Page 48:

Many obstacles remain
• Need for more parallel processing and better energy efficiency, both at the hardware and software level
• Need for training with less data
• Need to lift the restriction to algebraically expressible data (vectors, matrices, tensors…)
• Gradient descent learning is convex optimization; non-convex techniques have not been studied due to apparent NP-hardness
• Serious side effects!

Page 49:

Noise effects (and hacker opportunities!)

http://arxiv.org/pdf/1312.6199v4.pdf
https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture
https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196

Page 50:

The algorithm is deceptively simple:
1. Feed in the photo to hack
2. Get the neural network's prediction and see how far off it is from the target answer
3. Tweak the photo using back-propagation to make the prediction closer to the target answer
4. Repeat steps 1-3 with the same photo until the network gives us the answer we want

Adding an imperceptibly small vector with the same sign as the gradient of the cost function with respect to the input can drastically change the image classification: x → x + ε · sign(∇x J(θ, x, y)). https://arxiv.org/abs/1412.6572

[Figure: “panda” (57.7% confidence) + 0.007 · sign(∇x J(θ, x, y)) (itself classified as “nematode”, 8.2% confidence) = image classified as “gibbon” with 99.3% confidence]

https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196

Sometimes, it doesn't work!
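A minimal NumPy sketch of the fast-gradient-sign idea on a toy logistic-regression "network" (the model, the data and epsilon are illustrative assumptions, not the ImageNet setup of the figure):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
w, b = rng.normal(size=16), 0.0        # a toy linear classifier standing in for the network
x = rng.normal(size=16)                # the "image" to perturb
y = 1.0                                # its true label

p = sigmoid(w @ x + b)                 # network confidence before the attack
grad_x = (p - y) * w                   # gradient of the cross-entropy cost J w.r.t. the INPUT x

eps = 0.007
x_adv = x + eps * np.sign(grad_x)      # fast gradient sign perturbation

# With only 16 dimensions the shift is tiny; with image-sized inputs, thousands of
# such small per-pixel nudges add up and can flip the class entirely.
print("before:", p, " after:", sigmoid(w @ x_adv + b))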

Page 51:

Overfitting

https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/

Given the pattern 1 → 1, 2 → 3, 3 → 5, 4 → 7, what does 5 map to?
A 4th-degree polynomial with coefficients 9055.5, -90555, 316942.5, -452773, 217331 reproduces the four training points exactly, but answers 5 → 217341!

The consequences can be disastrous!
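A quick NumPy check of that polynomial (a sketch; the alternating signs of the coefficients are reconstructed from the fit to the listed points):

import numpy as np

# Coefficients of the overfit 4th-degree polynomial, highest degree first.
coeffs = [9055.5, -90555.0, 316942.5, -452773.0, 217331.0]

for x in range(1, 6):
    print(x, "->", np.polyval(coeffs, x))
# 1 -> 1, 2 -> 3, 3 -> 5, 4 -> 7 ... but 5 -> 217341, not the 9 the pattern suggests.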

Page 52:

Data order

• Capture of invariant “spatial motifs” is possible

22 1A a@a 1 aa a1.a 123 aa1

33 2B b@b 2 bb b2.b 234 bb2

44 3C c@c 3 cc c3.c 345 cc3

55 4D d@d 4 dd d4.d 456 dd4

66 5E e@e 5 ee e5.e 567 ee5

77 6F f@f 6 ff f6.f 678 ff6

88 7G g@g 7 gg g7.g 789 gg7

99 8H h@h 8 hh h8.h 890 hh8

111 9I i@i 9 ii i9.i 901 ii9

• Capture of invariant “spatial motifs” is doubtful if the row or column order is arbitrary

Page 53:

In summary…
• Efforts are under way to make white the new black
• Until then, deep learning remains a black box, and neural network parameter tuning an art
• Currently, the choice is between 80-90% accurate, non-DL models that we understand, or 99% accurate DL models that we don't!

Page 54: