Deep Learning for Data Mining
University of Rome "La Sapienza", Dept. of Computer, Control and Management Engineering A. Ruberti
Valsamis Ntouskos, ALCOR Lab
Outline
• Introduction - Motivation
• Theoretical aspects
• Convolutional Neural Networks
• Recurrent Neural Networks
• Generative models
• Further directions
Deep Learning with CNNs
Compositional Models
Learned End-to-End
Hierarchy of representations
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word
(features progress from concrete to abstract through learning)
Slides from Caffe framework tutorial @ CVPR2015
Deep Learning with CNNs
Compositional Models
Learned End-to-End
Back-propagation jointly learns
all of the model parameters to
optimize the output for the task.
Slides from Caffe framework tutorial @ CVPR2015
Motivation of CNNs
Inputs are usually treated as generic feature vectors
In some cases inputs have special structure:
• Audio
• Images
• Videos
Signals: numerical representations of physical quantities
Deep learning can be applied directly to signals by using suitable operators
Motivation
Audio
1D data - (variable-length) vectors of samples, e.g.:
. . . 0.0468 0.0468 0.0468 0.0390 0.0390 0.0390 0.0546 0.0625 0.0625 0.0390 0.0312 0.0468 0.0625 . . .
Motivation
Images
2D data - matrices
Video
A sequence of images sampled through time - 3D data
What is a CNN?
Some theory
Convolution
• Image filtering is based on convolution with special kernels
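For a 2D image I and kernel K, the discrete convolution reads (standard definition; note that most deep-learning frameworks actually implement the cross-correlation variant and still call it "convolution"):

    (I * K)(i, j) = \sum_m \sum_n I(i - m, j - n) \, K(m, n)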
Some theory
Pooling
• Introduces subsampling
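A minimal sketch of 2x2 max pooling in PyTorch (illustrative only; shapes follow the usual NCHW convention):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 8, 8)          # a batch of one 3-channel 8x8 "image"
    y = F.max_pool2d(x, kernel_size=2)   # keep the max of each 2x2 window
    print(y.shape)                       # torch.Size([1, 3, 4, 4]): subsampled by 2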
Some theory
Activation
Standard way to model a neuron:
f(x) = tanh(x) or f(x) = (1 + e^{-x})^{-1}
• saturating nonlinearities make training very slow
Non-saturating nonlinearity (ReLU): f(x) = max(0, x)
• quick to train
Some theory
Regularization
Dropout
• applied on the fully-connected layers
• during training, each node is dropped with probability α
• at test time all nodes are kept and their outputs are scaled by the retention probability 1 − α (see the sketch below)
Image from Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
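A minimal PyTorch sketch (illustrative; note that nn.Dropout implements "inverted" dropout, scaling kept activations by 1/(1 − α) during training so that no rescaling is needed at test time):

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.5)   # drop each activation with probability 0.5
    x = torch.ones(1, 8)

    drop.train()               # training mode: random units zeroed, the rest scaled by 2
    print(drop(x))             # e.g. tensor([[2., 0., 2., 2., 0., 0., 2., 2.]])

    drop.eval()                # evaluation mode: dropout becomes the identity
    print(drop(x))             # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])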
Some theory
Every convolutional layer of a CNN transforms the 3D input
volume to a 3D output volume of neuron activations.
(figure: a regular 3-layer neural network, shown for comparison)
Material from Fei-Fei’s group
Some theory
• Each neuron is connected to a local region of the input volume spatially, but to all channels in depth (see the sketch below)
• The neurons still compute a dot product of their weights with the input, followed by a non-linearity
Material from Fei-Fei’s group
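A small PyTorch sketch of this local connectivity (sizes are hypothetical, chosen for illustration):

    import torch
    import torch.nn as nn

    # 16 filters, each looking at a 5x5 spatial region across all 3 input channels
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

    x = torch.randn(1, 3, 32, 32)   # input volume: 3 x 32 x 32
    y = torch.relu(conv(x))         # dot product per local patch, then a non-linearity
    print(y.shape)                  # torch.Size([1, 16, 32, 32]): output volume 16 x 32 x 32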
Algorithms
• Each* neuron/layer is differentiable!
• Just apply backpropagation (chain rule)
• Use standard gradient-based optimization algorithms (SGD, AdaGrad, …); see the sketch below
• The devil lies in the details though …
▪ Choosing hyperparameters / loss function
▪ Exploding/vanishing gradients – batch normalization
▪ Overfitting – regularization
▪ Cost of performing experiments
▪ Convergence
▪ …
*what about max-pooling?
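A minimal supervised training loop in PyTorch (illustrative only; the model, data and hyperparameters are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(64, 10)            # dummy batch of 64 samples
    y = torch.randint(0, 2, (64,))     # dummy labels

    for step in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)    # forward pass
        loss.backward()                # backpropagation (chain rule)
        optimizer.step()               # gradient-based update (here plain SGD)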
Kernels and Feature Maps
Material from Fei-Fei’s group
Case study: image classification
Slides from Caffe framework tutorial @ CVPR2015
Case study: image classification
Slides from Caffe framework tutorial @ CVPR2015
Brief history of CNNs
Foundational work was done in the middle of the 20th century:
• 1940s-1960s: Cybernetics [McCulloch and Pitts 1943,
Hebb 1949, Rosenblatt 1958]
• 1980s-mid 1990s: Connectionism [Rumelhart 1986,
Hinton 1989]
• 1990s: modern convolutional networks [LeCun et al. 1998], LSTM [Hochreiter & Schmidhuber 1997], MNIST and other large datasets
Brief history of CNNs
Hubel & Wiesel [1960s]: simple & complex cells architecture
Fukushima's Neocognitron [1970s]
Yann LeCun's early CNNs [1980s]
Brief history of CNNs
Convolutional Networks: 1989
LeNet: a layered model composed of convolution and subsampling operations, followed by a holistic representation and ultimately a classifier, applied to handwritten digits. [LeNet]
Recent success
• Parallel Computation (GPU)
• Larger training sets
• International Competitions
• Theoretical advancements
– Dropout
– ReLUs
– Batch Normalization
Recent success
Better hardware – GPUs
• CUDA (Jetson TX1, TK1)
• Android lib, demo
• OpenCL branch
Recent success
Larger training sets
ImageNet
• over 15M labeled high-resolution images
• roughly 22K categories
• collected from the web and labeled via Amazon Mechanical Turk
Recent success
Competitions
ILSVRC
• annual large-scale image classification competition
• 1.2M images in 1K categories
• classification: make 5 guesses about the image label
(example of fine-grained classes: Entlebucher vs. Appenzeller mountain dogs)
CNNs in Computer Vision
• Image classification
Evolution of CNNs for image classification
Convolutional Nets: 2012
AlexNet: a layered model composed of convolution, subsampling, and further operations, followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet]
Evolution of CNNs for image classification
Convolutional Nets: 2014
ILSVRC14 winners: ~6.6% top-5 error
- GoogLeNet: a composition of multi-scale, dimension-reduced ("Inception") modules
+ depth
+ data
+ dimensionality reduction
Evolution of CNNs for image classification
Convolutional Nets: 2014
ILSVRC14 winners: ~6.6% top-5 error
- VGG-16: 13 convolutional layers of 3x3 filters interleaved with max pooling, followed by 3 fully-connected layers (16 weight layers in total; see the block sketch below)
+ depth
+ data
+ dimensionality reduction
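A sketch of a single VGG-style block in PyTorch (channel sizes are hypothetical; the full network stacks several such blocks before the fully-connected layers):

    import torch
    import torch.nn as nn

    # two 3x3 convolutions followed by 2x2 max pooling, as in the early VGG blocks
    block = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),   # halves the spatial resolution
    )

    x = torch.randn(1, 64, 56, 56)
    print(block(x).shape)                        # torch.Size([1, 128, 28, 28])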
Evolution of CNNs for image classification
Convolutional Nets: 2015
ResNet
ILSVRC15 winner: ~3.6% top-5 error
Intuition: it is easier to push a residual toward zero than to learn an identity mapping from scratch (see the sketch below)
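A minimal residual block sketch in PyTorch (simplified: real ResNet blocks also include batch normalization):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            residual = self.conv2(torch.relu(self.conv1(x)))
            return torch.relu(x + residual)   # identity shortcut: layers learn only the residual

If the best update is "do nothing", the block only has to drive the residual to zero.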
Reasonable questions
• Is this just for a particular dataset? – No!
Slides from ICCV 2015 Math of Deep Learning tutorial
Reasonable questions
• Is this just for a particular task? – No!
Object localization [R-CNN, HyperColumns, OverFeat, etc.]
Pose estimation [Tompson et al., CVPR'15]
Slides from ICCV 2015 Math of Deep Learning tutorial
Reasonable questions
• Is this just for a particular task? – No!
Semantic segmentation [Pinheiro, Collobert, Dollár, ICCV'15]
Slides from ICCV 2015 Math of Deep Learning tutorial
Fine Tuning
Take a pre-trained model and fine-tune it to new tasks [DeCAF] [Zeiler-Fergus] [OverFeat] (recipe sketched below)
Example: the Kaggle "Dogs vs. Cats" competition – top 10 in 10 minutes (© kaggle.com)
(diagram: lots of data (ImageNet) → pre-trained network → fine-tuned to your task, e.g. style recognition)
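A typical fine-tuning recipe sketched in PyTorch (assumes a recent torchvision; the 2-class head is a placeholder for the new task):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1")   # backbone pre-trained on ImageNet

    for param in model.parameters():
        param.requires_grad = False                    # freeze the pre-trained features

    model.fc = nn.Linear(model.fc.in_features, 2)      # new head, e.g. for dogs vs. cats

    # train as usual: only the new head's parameters receive gradient updates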
RNNs
Dealing with sequences
• data points are related
• order is important!
(C)NNs disregard this information
RNNs
Recurrent Neural Networks (RNNs) address this problem by introducing cells with loops
• allow information to persist
• the value of the output h_t depends both on the input x_t and on the previous output h_{t−1}
Material from Colah’s blog
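For reference, the vanilla RNN cell computes (standard formulation, with learned weights W_x, W_h and bias b):

    h_t = \tanh(W_x x_t + W_h h_{t-1} + b)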
RNNs
Another way to see RNNs is to unfold the loop
• each copy of the cell processes one element of the sequence
• the output h_t is computed based also on h_{t−1}
RNNs
RNN structure
Material from Colah’s blog
RNNs
RNNs (and variants) are very effective in various domains:
• Speech recognition
• Translation
• Image/Video captioning
• Video summarization
• Activity recognition
• ...
RNNs
a) "The clouds are in the _."
13/12/2018Deep Learning for Data Mining
ProblemRNNs cannot capture long-term dependencies• so-called vanishing/exploding gradients
problem
b) "I grew up in France... I speak fluently _."
LSTMs
Long Short-Term Memory networks (LSTMs)
• RNNs with a special cell structure
• designed to capture long-term dependencies
Hochreiter & Schmidhuber (1997) Long short-term memory
LSTMs
Main idea: separate the cell output h_t from the cell state C_t
• the state is modified through a series of structures called gates
LSTMs – step by step
1st gate: forget mechanism
• drop elements of C_{t−1} based on the values of h_{t−1} and the input x_t
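In the standard notation (as in Hochreiter & Schmidhuber and Colah's blog), the forget gate is a sigmoid layer:

    f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Each entry of f_t lies in (0, 1) and multiplies the corresponding entry of C_{t−1}: 0 means "forget", 1 means "keep".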
LSTMs – step by step
2nd and 3rd gate: update mechanism
i. decide which elements of C_{t−1} to update
ii. compute the candidate update vector C̃_t
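In the same notation, the input gate and the candidate values are:

    i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
    \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)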
LSTMs – step by step
2nd and 3rd gate: update mechanism
iii. update the elements selected via i_t with the values computed in C̃_t
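Combining the forget and update mechanisms, the new cell state is:

    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t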
LSTMs – step by step
Last step: compute output
• based on C_t, h_{t−1} and x_t
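In the standard formulation:

    o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
    h_t = o_t \odot \tanh(C_t)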
LSTM variants 1/3
Add peephole connections
All gate layers "see" the cell state C_{t−1}
Gers & Schmidhuber (2000) Recurrent nets that time and count
LSTM variants 2/3
Coupled input and forget gates:
• only forget something if you are going to input something else
LSTM variants 3/3
Gated Recurrent Units (GRUs)
• no separate cell state: the gates operate directly on the output h_t
• just two gates instead of three
Cho et al. (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation; Yao et al. (2015) Depth-gated recurrent neural networks
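For reference, the standard GRU equations (z_t is the update gate, r_t the reset gate):

    z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
    r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
    \tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t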
LSTMs
Further improvements:
• Further modifications of gates/structure
• Bi-directional LSTMs
• Attention mechanisms
• ...
LSTMs – typical use cases
One-to-one – classical NNs (no RNN)
One-to-many – sequence output (e.g. image captioning)
Many-to-one – sequence input (e.g. sentiment analysis)
Many-to-many – sequence to sequence (e.g. machine translation)
Many-to-many synced – e.g. frame per frame video analysis
LSTMs – use case examples
Visual sequence tasks [Jeff Donahue et al., CVPR'15]
• based on Long Short-Term Memory (LSTM) layers
Using LSTMs
Some videos:
- Facebook AI
- Music classification
- Action recognition (SecondHands project)
Beyond Regression/Classification
Generative models
• Variational Auto-Encoders (VAEs)
– focus on learning latent space structure
• Generative Adversarial Networks (GANs)
– focus on learning a distribution
– no latent space (in general)
GANs
Goal
Sample from the input data distribution 𝒳
Idea
Invert a (Convolutional) Neural Network
GANs
Encoder (conventional CNN)
• processes an image and produces a vector/code
GANs
Decoder
• receives a code and produces an image
• uses "deconvolutional" (transposed convolution) layers
Material from Frans’ blog
GANs
Problem
How to train the decoder to produce meaningful data?
Idea
Adversarial Training
GANs
A GAN is a combination of two networks:
1. a generator network (decoder)
2. a discriminator network (critic)
GANs
Roles
• Generator: produces realistic samples of 𝒳
• Discriminator: identifies if a sample actually comes from the (unknown) 𝒳 or not
Training
make the networks compete with each other
• the generator tries to fool the discriminator into believing that its samples are 'real'
• the discriminator tries to distinguish 'real' from 'fake' samples as well as possible
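Formally, this competition is the minimax game of Goodfellow et al. (2014):

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]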
GANs
Example on CIFAR dataset (32x32 images)
Are these real images or generated?
GANs
Example on CIFAR dataset (32x32 images)
Results at 300, 900 and 5700 iterations
GANs
Example: Celebrity faces (1024x1024 images)
VAEs
Goal
• modify data in specific directions
• identify meaningful directions in latent space
Examples
- ‘distort’ faces (change expression, add glasses)
- produce digits from different hand-writing styles
- distort 3D meshes
- ...
VAEs
What is an auto-encoder?
- a combination of an encoder and a decoder
- trained based on reconstruction loss
- provides low-dimensional representation
Material from Kurita’s blog
VAEs
Goal
Feed vectors and get realistic samples of 𝒳
Problem with plain AEs: we don't know which codes correspond to valid samples
VAEs
Main idea:
Encoder produces a distribution instead of a vector
Decoder operates on samples from this distribution
VAEs
Problems
1. How to produce a distribution?
2. How to prevent degeneration?
Solutions
1. parametric distributions – typically Gaussian
– produce a mean μ and a variance Σ
2. add a loss term based on the Kullback-Leibler divergence
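Together these give the standard VAE objective (the negative evidence lower bound), with prior p(z) = N(0, I):

    \mathcal{L}(x) = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + \mathrm{KL}\left(q(z|x) \,\|\, p(z)\right)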
VAEs
One more problem:
- Sampling operation is not differentiable
Solution:
- re-parametrization (sketched below)
Images from Keita Kurita’s blog
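The re-parametrization trick in a minimal PyTorch sketch (mu and log_var are assumed to be the encoder's outputs):

    import torch

    def reparametrize(mu, log_var):
        # sample z ~ N(mu, sigma^2) in a differentiable way
        std = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
        eps = torch.randn_like(std)      # noise drawn outside the computation graph
        return mu + eps * std            # gradients flow through mu and std, not eps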
VAEs
Example: faces (Hou et al., 2016)
VAEs
Example: 3D Mesh deformation (Tan et al., 2018)
Further directions
Few-shot and zero-shot learning
• meta-learner architectures
• Prototypical networks
• examples: one-shot learning: Mullet; zero-shot learning: Zebra (from horse)
Learning on manifolds
• learning functions defined on surfaces
• learning functions defined on graphs
Learning with constraints
• Frank-Wolfe networks
Resources
Frameworks:
• TensorFlow (Google) | C/C++, Python, Java, Go
• Torch/PyTorch (Facebook) | Lua/Python
• Caffe/Caffe 2 (UC Berkeley) | C/C++, Python, Matlab
• Theano (U Montreal) | Python
• CNTK (Microsoft) | Python, C++, C#/.Net, Java
• MxNet (DMLC) | Python, C++, R, Perl, …
• Darknet (Redmon J.) | C
• …
Resources
High-level libraries:
• Keras | Backends: TensorFlow (TF), Theano
Models:
• Depends on the framework, e.g.
– https://github.com/BVLC/caffe/wiki/Model-Zoo (Caffe)
– https://github.com/tensorflow/models/tree/master/research (TF)
Interactive Interfaces:
• DIGITS (NVIDIA) | Caffe, TF, Torch
• TensorBoard (TF)
Tools:
• http://ethereon.github.io/netscope (for networks defined in protobuf)
Thank you!