aws re:invent 2016: deep learning at cloud scale: improving video discoverability by scaling up...

November 2016

MAC205

Deep Learning at Cloud Scale: Improving Video Discoverability by Scaling Up Caffe on AWS

Andres Rodriguez, PhD, Solutions Architect, Intel Corporation

Juan Carlos Riveiro, CEO, Vilynx

Content Outline

• Deep learning overview and usages

• Worked example for fine-tuning a NN

• Some theory behind deep learning

• Vilynx – video discoverability

Deep Learning

• A branch of machine learning

• Data is passed through multiple non-linear transformations

• Goal: Learn the parameters of the transformation that minimize a cost function

Why Now?

Bigger Data

• Image: 1000 KB / picture

• Audio: 5000 KB / song

• Video: 5,000,000 KB / movie

Better Hardware

• Transistor density doubles every 18 months

• Cost / GB in 1995: $1000.00

• Cost / GB in 2015: $0.03

Smarter Algorithms

• Advances in algorithm innovation, including neural networks, leading to better accuracy in training models

Types of Deep Learning

• Supervised learning

• Data -> Labels

• Unsupervised learning

• No labels; Clustering; Reducing dimensionality

• Reinforcement learning

• Reward actions (e.g., robotics)

http://ode.engin.umich.edu/presentations/idetc2014/img/image_feature_learning_clear.png

[Figure: Training. Forward propagation produces an output (person 0.10, cat 0.15, dog 0.20, …, bike 0.05) that is compared with the expected label (person 0, cat 1, dog 0, …, bike 0) to compute a penalty (error or cost).]

[Figure: Training and inference. The same forward-propagation diagram, with the inference path marked alongside training.]
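As a minimal sketch of how the penalty in the training diagram might be computed, the following compares a network's output against a one-hot expected label. The slides do not say which cost function is used; cross-entropy, the four-class score vector, and the function name `cross_entropy` are assumptions made for illustration.

```python
import math

def cross_entropy(output, expected):
    """Penalty for predicted probabilities `output` against a one-hot `expected` label."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(output, expected))

classes = ["person", "cat", "dog", "bike"]

# Hypothetical output probabilities that sum to 1 (the slide elides some classes).
output = [0.10, 0.60, 0.20, 0.10]
expected = [0, 1, 0, 0]  # one-hot label: the image is a cat

penalty = cross_entropy(output, expected)
predicted = classes[output.index(max(output))]
print(f"penalty = {penalty:.3f}, predicted class = {predicted}")
```

During training the penalty drives the weight updates; at inference time only the forward pass and the `predicted` class are needed.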

Deep Learning Use Cases

• Fraud / face detection

• Gaming, check processing

• Computer server monitoring

• Financial forecasting and prediction

• Network intrusion detection

• Recommender systems

• Personal assistant

• Automatic speech recognition

• Natural language processing

• Image & video recognition/tagging

• Targeted ads

Cloud Service Providers

Financial Services

Healthcare

Automotive

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver excellent deep learning environment

Develop deep networks across frameworks

Maximum performance on Intel architecture

Intel® Math Kernel Library (Intel® MKL)

Elastic Compute Cloud (EC2)

C4 Instances

• “Highest performing processors and the lowest price/compute performance in EC2” [1]

• Vilynx

• Deep learning for video content extraction

• Supports various companies: CBS, TBS, etc.

• Pikazo app

• Transforms photos into artistic render

[1] https://aws.amazon.com/ec2/instance-types/

https://www.stlmag.com/news/st-louis-app-pikazo-will-turn-your-profile-picture/

Elastic Compute Cloud (EC2)

C4 Instances

c4.8xlarge On-Demand:

• $1.675/hr

GoogLeNet inference:

• batch size 32

• 237 images/sec = 4.2 ms/image

• 1 million images costs

Spot prices are cheaper
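The per-image latency and the cost of processing one million images follow from the two figures on this slide. The sketch below works through that arithmetic. It assumes the instance runs at the quoted throughput for the whole job and that cost scales linearly with time (no billing granularity, startup time, or data transfer). The function names are illustrative.

```python
PRICE_PER_HOUR = 1.675  # c4.8xlarge On-Demand price in USD/hr (from the slide)
IMAGES_PER_SEC = 237    # GoogLeNet inference throughput at batch size 32 (from the slide)

def latency_ms(images_per_sec):
    """Average time per image in milliseconds."""
    return 1000.0 / images_per_sec

def cost_usd(num_images, images_per_sec=IMAGES_PER_SEC, price_per_hour=PRICE_PER_HOUR):
    """Estimated instance cost for running inference on num_images images."""
    hours = num_images / images_per_sec / 3600.0
    return hours * price_per_hour

print(f"{latency_ms(IMAGES_PER_SEC):.1f} ms/image")
print(f"${cost_usd(1_000_000):.2f} per million images")
```

Under these assumptions the estimate comes to a little under two dollars per million images at the On-Demand price.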

OS: Linux version 3.13.0-86-generic (buildd@lgw01-51) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #131-Ubuntu SMP Thu May 12 23:33:13 UTC 2016. MxNet Tip of tree: commit de41c736422d730e7cfad72dd6afc229ce08cf90, Tue Nov 1 11:43:04 2016 +0800. MKL 2017 Gold update 1

[Figure: c4.8xlarge MXNet Inference, No MKL versus MKL, for AlexNet, GoogLeNet v1, ResNet-50, and GoogLeNet v3. Data labels that survive extraction: 6.1, 2.4, 1.2, 0.8, 79.7, 73.9.]

Intel® Math Kernel Library 2017 (Intel® MKL 2017)

• Optimized for EC2 instances with Intel® Xeon® CPUs

• Optimized for common deep learning operations

• GEMM (useful in RNNs and fully connected layers)

• Convolutions

• Pooling

• ReLU

• Batch normalization

Recurrent NN, Convolutional NN
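To illustrate why GEMM matters for fully connected layers, here is a minimal pure-Python sketch in which a fully connected layer is a matrix multiply plus a bias, followed by ReLU. It does not show how Intel MKL implements these operations; the function names and the tiny example matrices are assumptions chosen for readability.

```python
def gemm(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    k, n = len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]

def relu(m):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in m]

def fully_connected(x, weights, bias):
    """x: batch x in_features, weights: in_features x out_features, bias: out_features."""
    y = gemm(x, weights)
    return [[v + b for v, b in zip(row, bias)] for row in y]

# Hypothetical batch of two 3-feature inputs and a 3 -> 2 layer.
batch = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
bias = [0.5, -0.5]

activations = relu(fully_connected(batch, weights, bias))
print(activations)
```

An optimized library performs the same computation with blocking and vectorization, which is where the speedup comes from.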

Naïve Convolution

https://en.wikipedia.org/wiki/Convolutional_neural_network
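A naive convolution can be sketched as four nested loops over output positions and kernel positions. The version below is a single-channel, stride-1, no-padding example written for clarity, not speed. Like most deep learning frameworks it computes a cross-correlation (the kernel is not flipped). The function name and the sample data are illustrative assumptions.

```python
def naive_conv2d(image, kernel):
    """Valid, stride-1 2D convolution of a single-channel image (lists of rows)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):              # output row
        for x in range(ow):          # output column
            acc = 0.0
            for ky in range(kh):     # kernel row
                for kx in range(kw): # kernel column
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]

feature_map = naive_conv2d(image, kernel)
print(feature_map)
```

A real layer adds loops over input channels, output channels, and the batch, so cache behavior quickly dominates the runtime.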

Cache Friendly Convolution

arxiv.org/pdf/1602.06709v1.pdf

Gradient Descent

• Cost at the initial weights, summed over the $N$ training samples: $J(\mathbf{w}^{(0)}) = \sum_{i=1}^{N} \mathrm{cost}(\mathbf{w}^{(0)}, \mathbf{x}_i)$

• Gradient of the cost at $\mathbf{w}^{(0)}$: $\frac{dJ(\mathbf{w}^{(0)})}{d\mathbf{w}}$

• Update, with learning rate $\alpha$: $\mathbf{w}^{(1)} = \mathbf{w}^{(0)} - \alpha \frac{dJ(\mathbf{w}^{(0)})}{d\mathbf{w}}$

• The learning rate can be too small, too large, or good enough

• Repeat from the new weights:

$\mathbf{w}^{(2)} = \mathbf{w}^{(1)} - \alpha \frac{dJ(\mathbf{w}^{(1)})}{d\mathbf{w}}$

$\mathbf{w}^{(3)} = \mathbf{w}^{(2)} - \alpha \frac{dJ(\mathbf{w}^{(2)})}{d\mathbf{w}}$

$\mathbf{w}^{(4)} = \mathbf{w}^{(3)} - \alpha \frac{dJ(\mathbf{w}^{(3)})}{d\mathbf{w}}$

[Plot: cost $J$ versus $\mathbf{w}$, marking the successive iterates $\mathbf{w}^{(0)}$ through $\mathbf{w}^{(4)}$.]
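The update rule on the gradient descent slides can be tried on a toy problem. This sketch assumes a one-parameter least-squares cost, hypothetical data lying on y = 2x, and three hand-picked learning rates. None of these come from the talk; they are chosen only to show the "too small", "too large", and "good enough" cases.

```python
def cost(w, data):
    """J(w) = sum over samples of (w * x - y)^2."""
    return sum((w * x - y) ** 2 for x, y in data)

def gradient(w, data):
    """dJ/dw for the cost above."""
    return sum(2.0 * x * (w * x - y) for x, y in data)

def gradient_descent(w0, data, alpha, steps):
    """Apply w <- w - alpha * dJ/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * gradient(w, data)
    return w

# Hypothetical data on the line y = 2x, so the cost is minimized at w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

results = {}
for label, alpha in [("too small", 0.0001), ("too large", 0.1), ("good enough", 0.03)]:
    results[label] = gradient_descent(0.0, data, alpha, steps=20)
    print(f"alpha={alpha} ({label}): w={results[label]:.4g}, J={cost(results[label], data):.4g}")
```

With the small rate the weight barely moves in 20 steps, with the large rate each step overshoots and the iterates diverge, and with the intermediate rate the weight settles near 2.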

Transfer learning via fine-tuning

• First few layers are usually very similar within a domain

• Last layers are task specific

• Take a trained model and fine-tune it for a particular task

http://vision.stanford.edu/Datasets/collage_s.png

https://www.kaggle.com/c/dogs-vs-cats

http://adas.cvc.uab.es/task-cv2016/papers/0026.pdf

Fine-tuning steps

• Install Intel-Optimized Caffe (or your favorite framework)

• https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture

• Download a pre-trained model

• http://dl.caffe.berkeleyvision.org/bvlc_reference_caffenet.caffemodel

• Modify the training model (next slide)

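Fine-tuning: ILSVRC -> DogsVsCats

# Original training model (ILSVRC); only the fields that change are shown
layer {
  type: "Data"
  data_param {
    source: "ilsvrc12_train_lmdb"
  }
}
layer {
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
  }
}

# Fine-tuning model (DogsVsCats)
layer {
  type: "Data"
  data_param {
    source: "dogs_cats_train_lmdb"
  }
}
layer {
  # Caffe copies pre-trained weights by layer name, so give this layer a new name
  type: "InnerProduct"
  inner_product_param {
    num_output: 2
  }
}

>> # From the command line
>> caffe train -solver solver.prototxt -weights trainedModel.caffemodel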

Fine-tuning guidelines

• Freeze all but the last layer (or more if new dataset is very different)

• lr_mult=0 in local learning rates

• Earlier layer weights won't change very much

• Drop the initial learning rate (in the solver.prototxt) by 10x

Replace the 1000-unit layer with a 2-unit layer

Train the (4096 + 1) x 2 weights
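The weight count above can be checked with a short calculation. The sketch assumes, as the slide's "4096+1" suggests, that the final layer takes 4096 inputs plus one bias term per output unit. The helper name is illustrative.

```python
def inner_product_params(num_inputs, num_outputs):
    """Weights plus one bias per output unit: (num_inputs + 1) * num_outputs."""
    return (num_inputs + 1) * num_outputs

original = inner_product_params(4096, 1000)  # 1000-class ILSVRC output layer
fine_tuned = inner_product_params(4096, 2)   # 2-class dogs-vs-cats output layer

print(f"original last layer:   {original:,} parameters")
print(f"fine-tuned last layer: {fine_tuned:,} parameters")
```

With every other layer frozen, only these few thousand parameters are updated, which is why fine-tuning is so much cheaper than training from scratch.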

http://www.mdpi.com/remotesensing/remotesensing-07-14680/article_deploy/html/images/remotesensing-07-14680-g002-1024.png

• Fine-tune trained model for dog vs cats

Hello. We're Vilynx, the video personalization company

Juan Carlos Riveiro: CEO and Cofounder

We select the relevant content targeted to individual needs, solving the content discovery problem.

How? By building the biggest dataset for video deep learning: auto-tagging selected video scenes in real time and leveraging web and social media to continuously update the tags.

Benefit? Increased views, time spent watching videos, and in-video search.

Markets: Media, Smart Phones, Drones, Security, Robots, Smart Cities.

Outstanding Tech Team: Experienced and Very Successful

Juan Carlos Riveiro, CEO

More than 100 patents in Signal Processing, Data statistics/algorithms and Machine Learning.

Founder and CEO of Gigle Networks (Acquired by Broadcom), CTO & VP of R&D at DS2 (Acquired by Marvell).

Elisenda Bou, CTO

PhD from UPC and MIT and expert on Machine Learning and Complex SW Architectures. Worked on adaptive satellite control systems and recipient of the 2013 Google Faculty Research Awards.

José Cordero Rama

MS for Deep Learning at UPC/BSC

Data Scientist at King, Bdigital and Gen-Med

Joan Capdevila, PhD

MS and PhD for Machine Learning at Georgia Tech and UPC/BSC

Data Scientist at AIS and Accenture

Jordi Pont-Tuset, PhD

PostDoc on Machine Learning at ETH Zurich

PhD on Image Segmentation at UPC

Disney Research

Asier Aduriz

Computer Science and Telecom Engineering degree at UPC (Top 1% of class)

Engineer at CERN.

Dèlia Fernàndez

MS on Deep Learning at Columbia University

Signal Processing Researcher at Northeastern University

Data Scientist at InnoTech

David Varas, PhD

PhD for Video Object Tracking at UPC

Adjunct Professor on Computer Vision & Statistical Signal Processing at UPC

Vilynx: Indexing Visual Knowledge

8 cameras/car

Smart Cities Connecting Everything

VR/AR Changing Everything

A camera at every corner in London

Drones everywhere (Amazon)

+1000 hours of video uploaded every minute to the internet

How is all this visual content going to be indexed?

Just like the internet before Google

The Vilynx Knowledge Graph

The average vocabulary of a 5-year-old is 5000 words

• 4800 words/concepts

• 1.8 tags per video

• 8M videos

The average vocabulary of an adult is 30,000 words

• 2M words/concepts

• 50 tags per video

• 10M videos

First Market driven by Video Content Producers

Media companies need content personalization to drive audience through multiple channels

Some Customer Examples:

http://www.cbs.com/shows/the-late-show-with-stephen-colbert/

https://www.americasgreatestmakers.com/

http://www.vanitatis.elconfidencial.com/

Vilynx Products

Inputs: Videos; Audience Data; Contextual Data (Social Networks, YouTube, Web)

Outputs: Key 5 sec clips; Intelligent Auto Tagging

Applications: Video Thumbnails; Social Sharing; Recommendations; Video Search; Ad Market

• Better video discovery

• Native Ad integration

• Programmatic Ad matching

• More video views and longer engagement

• VOD & Live Events

• Drive branding

• Amplification with keyword recommendation

• Drive click-through rates

• Better user experience

Vilynx | Workflow

Machine Learning or Deep Learning

98% accuracy in finding the relevant parts of the video

CTR increase of 50% to 500% (customer validated)

1. We ingest customer videos and the contextual information around them.

2. We then take cues from around the Web and social networks.

3. This combined input is fed to the most advanced convolutional deep neural network in the industry.

4. The outputs are video previews optimized to engage your audience and rich metadata that can further drive your video content.

A training data set of video moments that includes:

10M (and growing) tagged 5 sec video moments; ImageNet for video has only 4000 moments

2M contextual tags (and growing)

Continuously updated training set of new tags by crawling of social media/the web

Real-time unsupervised training of the network to autonomously learn and identify new patterns

Advancing Deep Learning Networks:

Move from simple classification to indexing all visual content

Demo Results

• Fine-tune dogs vs cats classifier results

Call to action

• Use Intel Optimized Frameworks for workloads

• https://github.com/intel/caffe

• https://github.com/dmlc/mxnet

• https://github.com/intel/theano

• https://github.com/intel/torch

• other frameworks coming soon…

• Deep learning tutorial

• https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture

• Distributed training of deep networks on AWS

• https://software.intel.com/en-us/articles/distributed-training-of-deep-networks-on-amazon-web-services-aws

Legal Notices & Disclaimers

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Thank you!

(huge) contributions from:

Joseph Spisak, Elisenda Bou, Hendrik Van der Meer, Zhenlin Luo, Ravi Panchumarthy, Ryan Saffores, Niv Sundaram, and many more.

Remember to complete your evaluations!
