scaling to meet the growing needs of artificial...

66
Scaling to Meet the Growing Needs of Artificial Intelligence (AI) Pradeep K. Dubey Intel Fellow, Intel Labs, Intel Corporation

Upload: others

Post on 28-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Scaling to Meet the Growing Needs of Artificial Intelligence (AI)

Pradeep K. Dubey

Intel Fellow, Intel Labs, Intel Corporation

Page 2: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat
Page 3: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Artificial Intelligence

Page 4: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Artificial Intelligence

Machine Learning

A B

C

𝑨 ∨ 𝑩 = ¬𝑨 ∧ ¬𝑩

xx

x

x

x

o

oo

o oo

Page 5: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Artificial Intelligence

Machine Learning

A B

C

𝑨 ∨ 𝑩 = ¬𝑨 ∧ ¬𝑩

xx

x

x

x

o

oo

o oo

Neural Networks

Page 6: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Agenda

• Why do we need to scale machine learning?

• What makes it hard to scale and how we are addressing it

• Real-world applications

• Hardware roadmap, software tools and frameworks update

3

Page 7: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Scoring or Inferencing

4

Person

Deep Neural Network Model

Hidden Layers

Page 8: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Scoring or Inferencing

4

PersonForward Propagation

Deep Neural Network Model

Hidden Layers

Page 9: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Training

5

* Shihao Ji, S. V. N. Viswanathan, Nadathur Satish, Michael Anderson, and Pradeep Dubey. Blackout: Speeding up Recurrent NeuralNetwork Language Models with very large vocabularies. http://arxiv.org/pdf/1511.06909v5.pdf. ICLR 2016

Page 10: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Training

5

Forward Propagation

Cat Person

Ground TruthNetwork Output

* Shihao Ji, S. V. N. Viswanathan, Nadathur Satish, Michael Anderson, and Pradeep Dubey. Blackout: Speeding up Recurrent NeuralNetwork Language Models with very large vocabularies. http://arxiv.org/pdf/1511.06909v5.pdf. ICLR 2016

Page 11: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Training

5

Forward Propagation

Backward Propagation

Cat Person

Ground TruthNetwork Output

* Shihao Ji, S. V. N. Viswanathan, Nadathur Satish, Michael Anderson, and Pradeep Dubey. Blackout: Speeding up Recurrent NeuralNetwork Language Models with very large vocabularies. http://arxiv.org/pdf/1511.06909v5.pdf. ICLR 2016

Page 12: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Training

5

Forward Propagation

Backward Propagation

Cat Person

Ground TruthNetwork Output

* Shihao Ji, S. V. N. Viswanathan, Nadathur Satish, Michael Anderson, and Pradeep Dubey. Blackout: Speeding up Recurrent NeuralNetwork Language Models with very large vocabularies. http://arxiv.org/pdf/1511.06909v5.pdf. ICLR 2016

Page 13: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Deep Learning: Training

5

Forward Propagation

Backward Propagation

Cat Person

Ground TruthNetwork Output

Complex Networks with billions of parameters can take days to train on a modern processor*

Hence, the need to reduce time-to-train using a cluster of processing nodes

* Shihao Ji, S. V. N. Viswanathan, Nadathur Satish, Michael Anderson, and Pradeep Dubey. Blackout: Speeding up Recurrent NeuralNetwork Language Models with very large vocabularies. http://arxiv.org/pdf/1511.06909v5.pdf. ICLR 2016

Page 14: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Machine Learning Continuum: Connected Factory

6

Server Class ComputeReal Time, Balancing Power/Performance/$

Scoring/Training? Scoring/Training Scoring/TrainingScoringData Collection

Predix Data center

Predix On-PremPrivate Cloud

Converged Controller / Small RT-Cloud / “Predix

Appliance”

Machine ControllerSensors

Mobile Device

Page 15: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Machine Learning Continuum: Self-Driving Cars

Data Center

7

Captured Sensor Data Analytics

In-Vehicle

Sensor Processing

Autonomous Driving Functions

Vehicle Simulation & Validation

Sensor Capture

Data

Vehicle Endpoint Management

Page 16: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Agenda

• Why do we need to scale machine learning

• What makes it hard to scale and how we are addressing it

• Real-world experience of an industry leader

• Hardware roadmap, software tools and frameworks update

• Summary

8

Page 17: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Compute Kernels

9

Page 18: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Compute Kernels

Fully connected layer

K M

WI O

9

Page 19: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Compute Kernels

Fully connected layer

K M

WI O

9

N = 2

Page 20: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Compute Kernels

Fully connected layer

K M

WI O

9

N = 2

M =

4

K =

3

K = 3N = 2

M =

4

N = 2

𝑾 ∈ 𝑹𝑴𝒙𝑲

Weights or model

𝑰 ∈ 𝑹𝑲𝒙𝑵

Input𝑶 ∈ 𝑹𝑴𝒙𝑵

Outputor activations

Page 21: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Compute Kernels

Fully connected layer

K M

WI O

Forward propagation: (M x K) * (K x N)

Backward propagation: (M x K)T * (M x N)

Weight update: (M x N) * (K x N)T

9

N = 2

M =

4

K =

3

K = 3N = 2

M =

4

N = 2

𝑾 ∈ 𝑹𝑴𝒙𝑲

Weights or model

𝑰 ∈ 𝑹𝑲𝒙𝑵

Input𝑶 ∈ 𝑹𝑴𝒙𝑵

Outputor activations

Page 22: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Parallelism Options

10

WWeights or model

IInput data

OOutput or activations

Page 23: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Parallelism Options

10

Data

WWeights or model

IInput data

OOutput or activations

Page 24: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Parallelism Options

10

DataModel

WWeights or model

IInput data

OOutput or activations

Page 25: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Parallelism Options

10

DataModelHybrid

WWeights or model

IInput data

OOutput or activations

Page 26: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Data/Model Parallelism at Scale

• General rule of thumb

- Use data parallelism when activations > weights

- Use model parallelism when weights > activations

• Implications of data and model parallelism

- Data parallelism at scale makes activations << weights

- Model parallelism at scale makes weights << activations

- Compute efficiency goes down due to skewed matrices

- Communication time dominates at scale

11

Page 27: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

• Hybrid parallelism to improve compute efficiency

- Partition across activations and weights to minimize skewed matrices

Input Weight activation

Weights

Activations Inputs

PartialActivation

Addressing the scaling challenge

Page 28: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Weight transfer

Activationtransfer

Node Group 1

• Hybrid parallelism to improve compute efficiency

- Partition across activations and weights to minimize skewed matrices

Input

• Node groups to improve communication efficiency

- Avoid global transfer of activations and weights via node groups

Activations transfer within a group

Weight transfer across groups

Weight activation

Weights

Activations Inputs

PartialActivation

Node Group 0

Addressing the scaling challenge

Page 29: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Reduce Scatter

Reduce the activations from layer N-1 and scatter at layer N

Common MPI collectives in DL• Reduce Scatter• AllGather• AllReduce• AlltoAll

Weights

Activations Inputs

PartialActivation

Layer N-1 Layer N

Communication Patterns in Deep Learning

Page 30: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

FORWARD PROPAGATION

BACK PROPAGATION

LAYER

1

LAYER

2

LAYER

N

All

red

uce

Activations (required immediately in next layer)

Updated weights (required during forward propagation of the corresponding layer)

Alltoall

All

red

uce

All

red

uce

ReduceScatter

AllgatherAlltoall Allreduce

Communication Patterns in Deep Learning … contd.

Page 31: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

FORWARD PROPAGATION

BACK PROPAGATION

LAYER

1

LAYER

2

LAYER

N

All

red

uce

Activations (required immediately in next layer)

Updated weights (required during forward propagation of the corresponding layer)

1. Optimized Collectives

Alltoall

All

red

uce

All

red

uce

ReduceScatter

AllgatherAlltoall Allreduce

Communication Patterns in Deep Learning … contd.

Page 32: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

FORWARD PROPAGATION

BACK PROPAGATION

LAYER

1

LAYER

2

LAYER

N

All

red

uce

Activations (required immediately in next layer)

Updated weights (required during forward propagation of the corresponding layer)

1. Optimized Collectives2. Compute

Communication Overlap

Alltoall

All

red

uce

All

red

uce

ReduceScatter

AllgatherAlltoall Allreduce

Communication Patterns in Deep Learning … contd.

Page 33: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

FORWARD PROPAGATION

BACK PROPAGATION

LAYER

1

LAYER

2

LAYER

N

All

red

uce

Activations (required immediately in next layer)

Updated weights (required during forward propagation of the corresponding layer)

1. Optimized Collectives2. Compute

Communication Overlap3. Smart Message and Task

Scheduling

Alltoall

All

red

uce

All

red

uce

ReduceScatter

AllgatherAlltoall Allreduce

Communication Patterns in Deep Learning … contd.

Page 34: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Scaling Deep Learning Communication Primitives

Deep learning specific optimizations result in3X speedup for the Allreduce collective (average for the message profile of16KB – 16MB floats)

15

4x

3x

2x

1x

0x

Speedup

Number of nodes

MPI ALLReduce performance on Intel® Xeon Phi™ Knights Landing

Average speedup over typical message sizes

2 4 8 16 32

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenterConfiguration: Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 96 GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port, Red Hat* Enterprise Linux 6.7, Intel® ICC version 16.0.2, Intel® MPI Library 5.1.3 for Linux.

Page 35: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Benefits of Multinode Optimizations

16

Vanilla MPI + Optimized Communication

Vanilla MPI + Optimized Communication

Higher is better Higher % of compute (green) is better

Number of nodes

0

Scaling efficiency

100

80

60

40

20

Overfeat-FAST: Scaling Efficiency

1 2 4 8 16 32

Number of nodes

0%

Percentage of Total

Time

100%

80%

60%

40%

20%

Overfeat-FAST: Compute Communication Breakdown

2 4 8 16 32 2 4 8 16 32

Communication Compute

Scaling efficiency without multi-node optimizations drops 1.3-2.1X for large node counts

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenterConfiguration: Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 96 GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port, Red Hat* Enterprise Linux 6.7, Intel® ICC version 16.0.2, Intel® MPI Library 5.1.3 for Linux, Intel® Optimized DNN Framework.

Page 36: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

1.0x2.0x

3.9x

7.3x

15.3x

26.7x

1 2 4 8 16 32

Scaling Training Time of a common Neural Network

Topology: AlexNet*

17

Speed up -Higher is better

Dataset: Large image database

Number of nodes

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.

Configurations: Up to 27x faster training on 32-node as compared to single-node based on AlexNet* topology workload (batch size = 1024) running one node Intel Xeon Phi processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM) in Intel® Server System LADMP2312KXXX41, 96 GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux* 6.7 (Santiago), Intel® ICC version 16.0.2, Intel® MPI Library 5.1.3 for Linux, running Intel® Optimized DNN Framework compared to identically configured 32-node system with Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors. Image database in 25TB Luster file system accessed using same Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe x16 connectors.. Contact your Intel representative for more information on how to obtain the binary. For information on workload, see https://papers.nips.cc/paper/4824-Large image database-classification-with-deep-convolutional-neural-networks.pdf.

* https://github.com/soumith/convnet-benchmarks/blob/master/caffe/imagenet_winners/alexnet.prototxt

With convergence and I/O overhead included

Page 37: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Scaling efficiently popular neural network topologies

Dataset: Large image database

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance . *Other names and brands may be property of othersConfiguration: Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 96 GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port, Red Hat* Enterprise Linux 6.7, Intel® ICC version 16.0.2, Intel® MPI Library 5.1.3 for Linux, Intel® Optimized DNN Framework.18

Without convergence and I/O overhead

Page 38: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

The Virtuous Cycle of Compute

19

Training

Scoring

Page 39: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

The Virtuous Cycle of Compute

19

Training

Scoring

Page 40: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Delivering Trained Model to Edge Device

Accuracy

Throughput

Throughput(Images/sec)

Compression

20

Page 41: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Delivering Trained Model to Target Device †

Accuracy

Throughput(Images/sec)

Compression

10x1.1x

Sparsifying FC layers (e.g., Deep Compression*) -> mobile

21

1x (no accuracy loss)

† http://arxiv.org/abs/1608.01409* http://arxiv.org/abs/1510.00149Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Configurations: Intel® Xeon® E5-2697 v4, with 2 socket, 18 cores per socket, 45MB LLC, running at 2.3GHz to train Cafffe/AlexNet. For information on workload, see https://papers.nips.cc/paper/4824-Large image database-classification-with-deep-convolutional-neural-networks.pdf.

Page 42: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Delivering Trained Model to Target Device †

Balanced sparsifying of Conv and FC layers automotives

1.7x

Accuracy

Throughput(Images/sec)

Compression

5.7x

22

1x (no accuracy loss)

† http://arxiv.org/abs/1608.01409Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Configurations: Intel® Xeon® E5-2697 v4, with 2 socket, 18 cores per socket, 45MB LLC, running at 2.3GHz to train Cafffe/AlexNet. For information on workload, see https://papers.nips.cc/paper/4824-Large image database-classification-with-deep-convolutional-neural-networks.pdf.

Page 43: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Delivering Trained Model to Target Device †

9.0x2.6x

Accuracy

Throughput(Images/sec)

Compression

23

0.95x (5% accuracy loss)

† http://arxiv.org/abs/1608.01409Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter. Configurations: Intel® Xeon® E5-2697 v4, with 2 socket, 18 cores per socket, 45MB LLC, running at 2.3GHz to train Cafffe/AlexNet. For information on workload, see https://papers.nips.cc/paper/4824-Large image database-classification-with-deep-convolutional-neural-networks.pdf.

Page 44: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Agenda

• Why do we need to scale machine learning

• What makes it hard to scale and how we are addressing it

• Real-world applications

- Introducing Dr. Amir Khosrowshahi

• Hardware roadmap, software tools and frameworks update

• Summary

24

Page 45: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Introducing Dr. Amir Khosrowshahi

CTO, Nervana Systems

25

Page 46: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

About Nervana

A platform for machine intelligence

• Enable deep learning at scale

• Optimized from algorithms to silicon

26

x

Page 47: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Healthcare

Application Areas

27

Online Services Automotive Energy

Agriculture Finance

Page 48: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Nervana model

Natural Language

Processing

Image Classification

Speech Recognition

Object Localization

Video Indexing

DL

Deep Learning as a Core Technology

28

“Google Brain” model

Photos Maps

Machine Translation

Voice Search

Self-Driving

Car

Ad Targeting

DL

Page 49: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Cloud

Nervana Cloud

Data

Images Video Text

Speech Tabular Time Series

Import Build

Train Deploy

Page 50: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Neon: Nervana Python* Deep Learning Library

• User-friendly, extensible, abstracts parallelism

• Support for many deep learning models

• Interface to Nervana cloud

• Supports multiple backends

• Integrates with large cloud datastores

• Core routines written in assembler

Page 51: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Agenda

• Why do we need to scale machine learning

• What makes it hard to scale and how we are addressing it

• Real-world experience of an industry leader

• Hardware roadmap, software tools and frameworks update

• Summary

31

Page 52: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel® Xeon Phi™ Processor Family for Performance

• Up to ~6 SGEMM TFLOPs1

per socket

• 1.38x2 better scaling efficiency resulting in lower time to train for multi-node

• Eliminates add-in card PCIe* offload bottleneck and utilization constraints

• Integrated Intel® Omni-Path fabric (dual-port; 50 GB/s) increases price-performance and reduces communication latency for deep learning networks

• Binary-compatible with Intel® Xeon® processors

• Open standards, libraries and frameworks

Breakthrough Highly-Parallel Performance

Removes Barriers through Integration

Better Programmability

Enables shorter time to train

1.. Up to 6 SP TFLOPS based on the Intel Xeon Phi processor peak theoretical single-precision performance (FLOPS = cores x clock frequency x single-precision floating-point operations per second per cycle).2. Up to 38% better scaling efficiency at 32-nodes claim based on GoogLeNet deep learning image classification training topology using a large image database comparing one node Intel Xeon Phi processor 7250 (16 GB, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, DDR4 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux 6.7, Intel® Optimized DNN Framework with 87% efficiency to unknown hosts running 32 each NVIDIA Tesla* K20 GPUs with a 62% efficiency (Source: http://arxiv.org/pdf/1511.00175v2.pdf showing FireCaffe* with 32 each NVIDIA Tesla* K20s (Titan Supercomputer*) running GoogLeNet* at 20x speedup over Caffe* with 1 each K20).

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter

*Other brands and names are the property of others.32

Page 53: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel® Xeon Phi™ Processor Family for Performance

• Up to ~6 SGEMM TFLOPs1

per socket

• 1.38x2 better scaling efficiency resulting in lower time to train for multi-node

• Eliminates add-in card PCIe* offload bottleneck and utilization constraints

• Integrated Intel® Omni-Path fabric (dual-port; 50 GB/s) increases price-performance and reduces communication latency for deep learning networks

• Binary-compatible with Intel® Xeon® processors

• Open standards, libraries and frameworks

Breakthrough Highly-Parallel Performance

Removes Barriers through Integration

Better Programmability

Enables shorter time to train

1.. Up to 6 SP TFLOPS based on the Intel Xeon Phi processor peak theoretical single-precision performance (FLOPS = cores x clock frequency x single-precision floating-point operations per second per cycle).2. Up to 38% better scaling efficiency at 32-nodes claim based on GoogLeNet deep learning image classification training topology using a large image database comparing one node Intel Xeon Phi processor 7250 (16 GB, 1.4 GHz, 68 Cores) in Intel® Server System LADMP2312KXXX41, DDR4 96GB DDR4-2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux 6.7, Intel® Optimized DNN Framework with 87% efficiency to unknown hosts running 32 each NVIDIA Tesla* K20 GPUs with a 62% efficiency (Source: http://arxiv.org/pdf/1511.00175v2.pdf showing FireCaffe* with 32 each NVIDIA Tesla* K20s (Titan Supercomputer*) running GoogLeNet* at 20x speedup over Caffe* with 1 each K20).

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter

*Other brands and names are the property of others.32

For an exciting new Intel® Xeon Phi™ roadmap update for machine learning/AI:Please attend Intel EVP Diane Bryant's Keynote tomorrow, Aug 17, 9am

Page 54: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

33

Better performance in Deep Neural Network workloads with Intel® Math Kernel Library (Intel® MKL)

Page 55: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

33

Better performance in Deep Neural Network workloads with Intel® Math Kernel Library (Intel® MKL)

0

5

10

15

20

25

30

Intel® Xeon® E5-2699 v4 Intel® Xeon® E5-2699 v4 Intel® Xeon® E5-2699 v4 Intel® Xeon Phi 7250

Out-of-the-box +Intel MKL 11.3.3 +Intel MKL 2017 +Intel MKL 2017

Pe

rfo

rma

nce

sp

ee

du

p

Caffe*/AlexNet single node training performance

2.1x

2x

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance . *Other names and brands may be property of othersConfigurations: • 2 socket system with Intel® Xeon Processor E5-2699 v4 (22 Cores, 2.2 GHz,), 128 GB memory, Red Hat* Enterprise Linux 6.7, BVLC Caffe, Intel Optimized Caffe framework, Intel® MKL 11.3.3, Intel® MKL 2017• Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Caffe framework, Intel® MKL 2017All numbers measured without taking data manipulation into account.

5.8x

24x

TM

Page 56: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

34

Better performance in Deep Neural Network workloads with Intel® Math Kernel Library (Intel® MKL)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance . *Other names and brands may be property of othersConfigurations: • 2 socket system with Intel® Xeon® Processor E5-2699 v4 (22 Cores, 2.2 GHz,), 128 GB memory, Red Hat* Enterprise Linux 6.7, BVLC Caffe, Intel Optimized Caffe framework, Intel® MKL 11.3.3, Intel® MKL 2017• Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Caffe framework, Intel® MKL 2017All numbers measured without taking data manipulation into account.

7.5x

31x

TM

*

Page 57: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel Deep Learning Software Stack and Timeline

35

*Other names and brands may be claimed as property of others.

Xeon Xeon PhiIntel MKL is SW building block to extract max Intel HW performance and provide common interface to all Intel accelerators.

Intel® Math Kernel Library

(Intel® MKL)

Page 58: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel Deep Learning Software Stack and Timeline

35

*Other names and brands may be claimed as property of others.

Xeon Xeon Phi FPGAIntel MKL is SW building block to extract max Intel HW performance and provide common interface to all Intel accelerators.

Intel MKL-DNN is an open source IA optimized DNN APIs, combined with Intel® MKL and build tools designed for scalable, high-velocity integration with ML/DL frameworks.

Includes:

• Open Source implementations of new DNN functionality included in MKL 2017 Beta, new algorithms ahead of MKL releases

• IA optimizations contributed by community

Targeted release: Q3 2016Intel® Math

Kernel Library(Intel® MKL)

Intel® MKL-DNN

Page 59: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel Deep Learning Software Stack and Timeline

35

*Other names and brands may be claimed as property of others.

Xeon Xeon Phi FPGAIntel MKL is SW building block to extract max Intel HW performance and provide common interface to all Intel accelerators.

Popular Deep Learning frameworks

Intel MKL-DNN is an open source IA optimized DNN APIs, combined with Intel® MKL and build tools designed for scalable, high-velocity integration with ML/DL frameworks.

Includes:

• Open Source implementations of new DNN functionality included in MKL 2017 Beta, new algorithms ahead of MKL releases

• IA optimizations contributed by community

Targeted release: Q3 2016Intel® Math

Kernel Library(Intel® MKL)

Intel® MKL-DNN

Deep Learning Frameworks

Caffe

• Multi-Node scaling for Knights Landing: Caffe* by EoY and 1H’17 in other frameworks

Page 60: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Intel Deep Learning Software Stack and Timeline

35

*Other names and brands may be claimed as property of others.

Xeon Xeon Phi FPGAIntel MKL is SW building block to extract max Intel HW performance and provide common interface to all Intel accelerators.

Popular Deep Learning frameworks

Intel MKL-DNN is an open source IA optimized DNN APIs, combined with Intel® MKL and build tools designed for scalable, high-velocity integration with ML/DL frameworks.

Includes:

• Open Source implementations of new DNN functionality included in MKL 2017 Beta, new algorithms ahead of MKL releases

• IA optimizations contributed by community

Targeted release: Q3 2016

Tools to accelerate design, training and deployment of deep learning solutions Intel Deep Learning Tools

Targeted release: Q3’2016

Intel® Math Kernel Library

(Intel® MKL)

Intel® MKL-DNN

Deep Learning Frameworks

Caffe

• Multi-Node scaling for Knights Landing: Caffe* by EoY and 1H’17 in other frameworks

• Intel Deep Learning Tools with support for model compression by end of 2016

Page 61: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Call to action

• Machine learning is the key enabler for a new virtuous cycle of compute triggered by explosion of digital data and ubiquitous connectivity

• It can vastly expand the reach of computing for applications like self-driving, agriculture, health and manufacturing

• Help machine learning unlock the true potential of AI

• Consider the full, end-to-end pipeline when you think about your AI needs.

• Try Intel MKL, Intel optimized frameworks & Intel Xeon Phi

Page 62: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Summary

• Machine learning must scale out to bring down the training time of weeks/days to days/hours

• Machine learning compute infrastructure must be both performant & productive for developers, and leverage the efficiency of cloud

• Scaling distributed machine learning is challenging as it pushes the limits of available data/model parallelism and internode communication

• Intel’s new deep learning tools -- with the upcoming integration of Nervana cloud stack -- are designed to hide/reduce the complexity of strong scaling time-to-train and model deployment tradeoffs on resource-constrained edge devices without compromising the performance need

37

Page 63: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Related Tech Sessions

For more information on machine learning at Intel: intel.com/machinelearning

A PDF of this presentation is available is available from our Technical Session Catalog: www.intel.com/idfsessionsSF. This URL is also printed on the top of Session Agenda Pages in the Pocket Guide.

ANATS01: Deep Learning Frameworks and Optimization Paths on Intel® ArchitectureBy Andres Rodriguez, Panchumarthy, Ravi, and Tom “Elvis” Jones, Amazon

ANATS03: Enabling an End to End Architecture for Autonomous VehiclesBy Jack Weast

ANATS05: How to Parallelize Neural Networks (xNNs) for Intel® Xeon Phi™By Nadathur R. Satish

Page 64: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Legal Notices and DisclaimersIntel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, Xeon, Xeon Phi, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

© 2015 Intel Corporation.

Page 65: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Page 66: Scaling to Meet the Growing Needs of Artificial ...pittnuts.com/wp-content/uploads/2016/08/ANATI01-SF... · Deep Learning: Training 5 Forward Propagation Backward Propagation Cat

Risk FactorsThe above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation. Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 4/14/15