Convolutional Neural Network
Applications and Insights
Christof Angermueller and Alex Kendall
Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks."
Application 1: Classification
Visual Classification
Attention-grabbing image classification performance
Clarifai classification demo
Large Scale Classification
Classification advances driven by:
● Large datasets such as ImageNet, Places with millions of images
● Annual ImageNet Challenge (ILSVRC)
Depth over width
A function which is invariant to the many nuisance variables (pose, occlusion, lighting, clutter) is very complex and nonlinear.
These functions are more efficiently represented with depth rather than width:
● sequential mapping to connected spaces
● deeper layers reuse computation
(On the Number of Linear Regions of Deep Neural Networks)
Deep architectures consistently outperform shallow representations in otherwise comparable networks (Return of the Devil in the Details: Delving Deep into Convolutional Networks)
Very deep architectures
1989: LeNet, 5 layers
2006: Autoencoders, 7 layers
2012: AlexNet, 9 layers
2014: GoogLeNet, 22 layers and current ILSVRC winner (‘Going Deeper with Convolutions’)
What constrains depth?
● GPU memory - more efficient architectures
○ dimensionality reduction kernels
● Over-fitting
○ data-augmentation
○ dropout
● Back-propagated gradient magnitude decay
○ multi-loss training with auxiliary classifiers
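Multi-loss training can be sketched as adding down-weighted classification losses from intermediate layers to the main loss, so that gradients reach the early layers more directly. The snippet below is a minimal NumPy illustration, not the implementation from any of the cited papers; the function names and the default `aux_weight=0.3` are assumptions chosen for the example.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a single example, computed from raw class scores."""
    shifted = logits - np.max(logits)  # stabilise the exponentials
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def multi_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Main loss plus down-weighted losses from auxiliary classifiers.

    aux_weight is an illustrative value (an assumption of this sketch).
    """
    main = softmax_cross_entropy(main_logits, label)
    aux = sum(softmax_cross_entropy(a, label) for a in aux_logits_list)
    return main + aux_weight * aux

# Usage: one main head and two auxiliary heads, four classes each.
main_head = np.array([2.0, 0.5, -1.0, 0.0])
aux_heads = [np.array([1.0, 0.2, 0.0, -0.5]), np.array([0.3, 0.1, 0.0, 0.0])]
total = multi_loss(main_head, aux_heads, label=0)
```

The auxiliary terms would typically be used only during training; at test time only the main classifier's output is needed.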
Leverage Data Hierarchy
Strong hierarchy in data:
● Image recognition: Pixel → edge → texton → motif → part → object
● Text: Character → word → word group → clause → sentence → story
● Speech: Sample → spectral band → sound → phone → word
Strong hierarchy in biological architectures:
Thorpe, Simon, Denis Fize, and Catherine Marlot. "Speed of processing in the human visual system." Nature 381.6582 (1996): 520-522.
Understanding deep representations
First-layer filters respond to edges, blobs and other low-level features. Interestingly, when trained on dual GPUs, a distinction forms between sharp monochrome features (rods) and colour blobs (cones) (ImageNet Classification with Deep Convolutional Neural Networks)
Hierarchy and Multi-Scales
A neuron’s receptive field increases in size with depth:
● Initial layer features are more discriminative
● Deeper layers are more invariant and capture semantics
Different and complementary features exist at different spatial scales:
● Depth multiscale: hypercolumns represent features over the entire depth of abstraction
● Spatial multiscale: GoogLeNet uses multi-scale filters in inception modules
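The growth of the receptive field with depth can be made concrete with a little arithmetic. The sketch below assumes a plain stack of convolution or pooling layers, each described by a hypothetical `(kernel_size, stride)` pair, and ignores padding and dilation.

```python
def receptive_field(layers):
    """Return the receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    size = 1   # a single input pixel sees itself
    jump = 1   # distance in input pixels between adjacent units of a layer
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump  # each extra tap widens the field by one jump
        jump *= stride               # striding spreads later units further apart
        sizes.append(size)
    return sizes

# Three stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
```

Inserting a stride-2 layer makes the field grow faster in every layer after it, which is one reason deeper units cover enough of the image to respond to whole parts or objects.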
Deconvolution
We can visualise the convolutional filters to find deficiencies in the architecture. Deeper filters represent more semantic concepts, similar to V1-V4 of the human visual pathway (Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks.")
Summary of Classification Insights
1. Large, augmented datasets
2. Maximise depth (while avoiding overfitting and vanishing gradients)
3. Use multiscale and multi-depth information
Application 2: Instantiation Variable Regression
Multi-Dimensional Regression
Instead of training a softmax classifier, a Euclidean loss function can be used to train a regression output
For example, to regress camera location, x, and orientation, q, a Euclidean loss over both quantities can be used
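One form such a loss could take is a weighted sum of the Euclidean position error and the Euclidean orientation error. The sketch below is written under stated assumptions and is not taken from the slides: it treats q as a unit quaternion, and the weight `beta` and the function name `pose_loss` are illustrative.

```python
import numpy as np

def pose_loss(x_pred, q_pred, x_true, q_true, beta=1.0):
    """Euclidean loss on camera position plus weighted loss on orientation.

    Assumes q is a quaternion; beta is an illustrative weighting factor
    that balances the two terms, which live on different scales.
    """
    q_true = q_true / np.linalg.norm(q_true)        # compare against a unit quaternion
    position_error = np.linalg.norm(x_pred - x_true)
    orientation_error = np.linalg.norm(q_pred - q_true)
    return position_error + beta * orientation_error

# Usage: position off by 5 units, orientation predicted exactly.
loss = pose_loss(
    x_pred=np.array([3.0, 4.0, 0.0]),
    q_pred=np.array([1.0, 0.0, 0.0, 0.0]),
    x_true=np.zeros(3),
    q_true=np.array([1.0, 0.0, 0.0, 0.0]),
)
print(loss)  # 5.0
```

Unlike a softmax loss, both terms are continuous in the network output, so the network is pushed towards the exact value rather than towards one of a fixed set of classes.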
Despite being large piecewise linear functions, convnets can still continuously regress pose and instantiation variables by mapping the data to a linear space
● Human pose (Deep Pose: Human pose recognition.)
● Alex’s unpublished work in camera pose localisation
Saliency maps
We can view the gradient of the output with respect to the input pixels. A visualisation of these (back-propagated) gradients is called a saliency map. It shows the areas of the image, and the features, that are most important. Back-propagated gradients are a generalisation of deconvolution (Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps.")
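As a rough illustration of the idea, the sketch below back-propagates one class score to the input pixels by hand. It assumes a toy two-layer ReLU network, `scores = w2 @ relu(w1 @ x)`, in place of a real convnet; the weights and the function name `saliency` are hypothetical.

```python
import numpy as np

def saliency(image, w1, w2, target):
    """Absolute gradient of the target class score w.r.t. each input pixel.

    Toy model (an assumption of this sketch): scores = w2 @ relu(w1 @ x),
    where x is the flattened image.
    """
    x = image.ravel()
    hidden_pre = w1 @ x                             # forward pass up to the ReLU
    relu_mask = (hidden_pre > 0).astype(x.dtype)    # ReLU passes gradient only where active
    grad_hidden = w2[target] * relu_mask            # d score / d hidden pre-activation
    grad_x = w1.T @ grad_hidden                     # d score / d pixel
    return np.abs(grad_x).reshape(image.shape)

# Usage with random weights: a 4x4 "image", 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
image = rng.normal(size=(4, 4))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(3, 8))
saliency_map = saliency(image, w1, w2, target=1)    # same shape as the image
```

Pixels with large values are those where a small change would move the class score the most. In a real convnet the same gradient would normally come from the framework's automatic differentiation.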
Summary of Regression Insights
1. Convnet transforms data to a space linear in a number of instantiation parameters
2. Context is extremely important to understand the data
Other Applications
1. Image caption generation
2. Text recognition
3. Reinforcement learning
Image Caption Generation
Object detection is combined with a multimodal recurrent neural network architecture that learns to generate descriptions of the detected image regions
● Karpathy et al., ‘Deep visual-semantic alignments for generating image descriptions’
● Vinyals et al., ‘Show and Tell’
Text Recognition (OCR)
Region proposals generated from superpixels have been fed to convnets in many applications, e.g. OCR (Reading Text in the Wild with Convolutional Neural Networks; Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks)
Reinforcement Learning
Spatial and temporal input is passed through a convolutional neural network to output joystick commands for a video game (Mnih et al., ‘Human-Level Control through Deep Reinforcement Learning’)
The same architecture was trained on 49 Atari games, with separate weights trained for each game to maximise score
Final Insights
● Feature vectors from convolutional neural networks contain rich representations of images
● Invariant to nuisance variables and linear in a number of instantiation parameters
● Improvement of convnets over SIFT features is approx. equal to the improvement of SIFT over simple RGB patches
Conclusion
● Convnets are pushing the state of the art in understanding data with spatial structure
● Produce powerful and transferable representations
However,
● Can be hard to train and regularise
● Very hard to get labelled data to train
● Deep representations tend to lose spatial accuracy