applications and insights - university of...

Convolutional Neural Network

Applications and Insights

Christof Angermueller and Alex Kendall

Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks."

Application 1: Classification

http://www.youtube.com/watch?v=qrzQ_AB1DZk

http://cs.stanford.edu/people/karpathy/deepvideo/

Visual Classification

Attention-grabbing image classification performance

Clarifai classification demo

http://www.clarifai.com/

Large Scale Classification

Classification advances driven by:

● Large datasets such as ImageNet, Places with millions of images

● Annual ImageNet Challenge (ILSVRC)

http://www.image-net.org/

http://places.csail.mit.edu/

Depth over width

A function which is invariant to the many nuisance variables (pose, occlusion, lighting, clutter) is very complex and nonlinear

These functions are more efficiently represented with depth rather than width

● sequential mapping to connected spaces

● deeper layers reuse computation

(On the Number of Linear Regions of Deep Neural Networks)

Deep architectures consistently outperform shallow representations with comparable networks (Return of the Devil in the Details: Delving Deep into Convolutional Networks)

http://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.pdf

http://arxiv.org/pdf/1405.3531v4.pdf

Very deep architectures

1989: LeNet, 5 layers

2006: Autoencoders, 7 layers

2012: AlexNet, 9 layers

2014: GoogLeNet, 22 layers and current ILSVRC winner ('Going Deeper with Convolutions')


What constrains depth?

● GPU memory - more efficient architectures

○ dimensionality reduction kernels

● Over-fitting

○ data-augmentation

○ dropout

● Back-propagated gradient magnitude decay

○ multi-loss training with auxiliary classifiers
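Multi-loss training can be sketched as a weighted sum of the main classifier loss and the losses of auxiliary classifiers attached to intermediate layers. This is a minimal illustration in plain Python, not the training code of any particular network. The function names are hypothetical, and `aux_weight=0.3` is an assumption taken from the value reported for GoogLeNet.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multi_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Main loss plus down-weighted losses from auxiliary classifiers.

    aux_weight is an assumed hyperparameter, not a value from the slides.
    """
    main = softmax_cross_entropy(main_logits, label)
    aux = sum(softmax_cross_entropy(a, label) for a in aux_logits_list)
    return main + aux_weight * aux
```

Because the auxiliary losses are computed closer to the input, their gradients reach the early layers without passing through the full depth of the network, which is how they counteract gradient decay.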

Leverage Data Hierarchy

Strong hierarchy in data:

● Image recognition: Pixel → edge → texton → motif → part → object

● Text: Character → word → word group → clause → sentence → story

● Speech: Sample → spectral band → sound → phone → word

Strong hierarchy in biological architectures:

Thorpe, Simon, Denis Fize, and Catherine Marlot. "Speed of processing in the human visual system." Nature 381.6582 (1996): 520-522.

Understanding deep representations

First-layer filters respond to edges, blobs and other low-level features. Interestingly, when the network is trained across two GPUs, a distinction forms between sharp monochrome features (analogous to rods) and colour blobs (analogous to cones) (ImageNet Classification with Deep Convolutional Neural Networks)

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

Hierarchy and Multi-Scales

A neuron's receptive field increases in size with depth

● Initial layer features are more discriminative

● Deeper layers are more invariant and capture semantics

Different and complementary features exist at different spatial scales

Depth multiscale: Hypercolumns represent features over the entire depth of abstraction

Spatial multiscale: GoogLeNet uses multi-scale filters in inception modules
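The growth of the receptive field with depth can be made concrete with a small calculation. This sketch assumes a plain chain of convolution or pooling layers without dilation. The function name and the example stack are hypothetical and not taken from any network in the slides.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # with no layers, a unit sees exactly one input pixel
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack: two 3x3 convolutions, a 2x2 stride-2 pooling, one 3x3 convolution.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
sizes = [receptive_field(stack[:depth]) for depth in range(1, len(stack) + 1)]
print(sizes)  # [3, 5, 6, 10]
```

Each strided layer multiplies the step between neighbouring units, so layers placed after it widen the receptive field faster than the same layers placed before it.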

http://arxiv.org/pdf/1411.5752.pdf

Deconvolution

We can visualise the convolutional filters to find deficiencies in architecture. As you go deeper the filters represent more semantic concepts, similar to V1-V4 of the visual pathway in humans (Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks.")


Summary of Classification Insights

1. Large, augmented datasets

2. Maximise depth (while avoiding overfitting and vanishing gradients)

3. Use multiscale and multi depth information

Application 2: Instantiation Variable Regression

http://www.youtube.com/watch?v=u0MVbL_RyPU

Multi-Dimensional Regression

Instead of training a softmax classifier, a Euclidean loss function can be used to train a regression output

For example to regress camera location, x, and orientation, q, we can use the loss function
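A minimal sketch of one such loss, assuming a Euclidean position error plus a Euclidean error on the unit-normalised orientation quaternion. The weighting factor `beta` and the function names are assumptions made for illustration. The exact form used by the authors is not given in the text.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pose_loss(x_pred, q_pred, x_true, q_true, beta=1.0):
    """Position error plus beta times orientation error.

    beta is an assumed hyperparameter that balances the two terms.
    The target quaternion is normalised to unit length before comparison.
    """
    q_norm = math.sqrt(sum(qi * qi for qi in q_true))
    q_unit = [qi / q_norm for qi in q_true]
    return euclidean(x_pred, x_true) + beta * euclidean(q_pred, q_unit)
```

Position and orientation are measured in different units, so some weighting between the two terms is usually needed. A single scalar is the simplest choice.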

Multi-Dimensional Regression

Despite convnets being large piecewise linear functions, they can still continuously regress pose and instantiation variables by mapping to a linear space

● Human pose (DeepPose: Human Pose Estimation via Deep Neural Networks)

● Alex’s unpublished work in camera pose localisation


Saliency maps

We can compute the gradient of the output w.r.t. the input pixels

Visualising these (back-propagated) gradients is called a saliency map

It shows the areas of the image, and the features, which are most important

Back-propagated gradients are a generalisation of deconvolution (Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps.")
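The idea can be illustrated without a deep learning framework by approximating the gradient of a score with respect to each pixel. This sketch swaps in central finite differences for back-propagation, which gives the same quantity for a smooth score but is far slower. The toy linear scorer and all names are hypothetical.

```python
def saliency(score_fn, pixels, eps=1e-4):
    """Approximate |d score / d pixel| for every pixel by central differences."""
    result = []
    for i in range(len(pixels)):
        up = list(pixels)
        up[i] += eps
        down = list(pixels)
        down[i] -= eps
        grad = (score_fn(up) - score_fn(down)) / (2 * eps)
        result.append(abs(grad))
    return result


# Toy scorer standing in for a network's class score: a fixed linear function.
weights = [0.0, 2.0, -3.0, 0.5]


def toy_score(pixels):
    return sum(w * p for w, p in zip(weights, pixels))


sal = saliency(toy_score, [0.1, 0.2, 0.3, 0.4])
```

For the linear scorer the saliency of each pixel is the magnitude of its weight, so the third pixel is the most important and the first has no influence at all.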




Summary of Regression Insights

1. Convnet transforms data to a space linear in a number of instantiation parameters

2. Context is extremely important to understand the data

Other Applications

1. Image caption generation

2. Text recognition

3. Reinforcement learning

Image Caption Generation

Object detection combined with a multimodal Recurrent Neural Network architecture that uses the detected regions to learn to generate descriptions of image regions

● Karpathy et al., 'Deep visual-semantic alignments for generating image descriptions'

● Vinyals et al., 'Show and Tell'

http://cs.stanford.edu/people/karpathy/cvpr2015.pdf

http://arxiv.org/abs/1411.4555

Text Recognition (OCR)

Superpixels have been used to generate region proposals for convnets in many applications, e.g. OCR (Reading Text in the Wild with Convolutional Neural Networks; Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks)


http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42241.pdf

Reinforcement Learning

Spatial and temporal input is passed through a convolutional neural network that outputs joystick commands for a video game (Mnih et al., 'Human-Level Control through Deep Reinforcement Learning')

The same architecture was trained on 49 Atari games, with separate weights trained for each game to maximise score

http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html



Reinforcement Learning

http://www.youtube.com/watch?v=iqXKQf2BOSE

Final Insights

● Feature vectors from convolutional neural networks contain rich representations of images

● Invariant to nuisance variables and linear in a number of instantiation parameters

● Improvement of convnets over SIFT features is approx. equal to the improvement of SIFT over simple RGB patches

Conclusion

● Convnets are pushing state-of-the-art in understanding data with spatial structure

● Produce powerful and transferable representations

However,

● Can be hard to train and regularise

● Very hard to get labelled data to train

● Deep representations tend to lose spatial accuracy
