
Page 1

Jianhui Li
Principal Engineer, Intel Architecture, Graphics and Software
Intel
Sep 2019

Page 2

Pillars of Deep Learning

Data

DL Chips

Algorithm

Page 3

Software Impact on ResNet-50 Performance

[Chart] ResNet-50 inference throughput on a 2S Intel® Xeon® Scalable Processor (Skylake): baseline at the Skylake launch (July 2017); 50x vs. baseline with Intel® Optimizations for Caffe (July 2017); 285x vs. baseline (February 2019).

Performance results are based on testing as of February 2019 and may not reflect all publicly available security updates. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See slide 28 for detailed configuration: 50x inference throughput improvement with Intel® Optimizations for Caffe ResNet-50 on Intel® Xeon® Platinum 8180 Processor in Feb 2019 compared to performance at launch in July 2017. Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804. For more complete information about compiler optimizations, see our Optimization Notice.

Demo link

Page 4

Page 5

Deep Learning Trends

Deep Learning Steps: Training, Inference

Data Precision: FP32, INT8, BFloat16

Topologies:

• Computer Vision: ResNet-50, SqueezeNet, MobileNet

• Natural Language Processing: GNMT, BERT

• Recommendation Systems: NCF, Wide & Deep

• Reinforcement Learning: MiniGo

Frameworks

Diverse and rapidly evolving

Page 6

Computer Vision

[Chart] Operational intensity of compute vs. data access for computer-vision inference (BS = 1), on a 10-100 scale: compute is high intensity; data access is low intensity.

Page 7

Natural Language Processing

[Chart] Operational intensity of compute vs. data access for natural-language-processing inference (BS = 1), on a 10-100 scale: compute is low intensity; data access is high intensity.
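These intensity charts can be made concrete with a rough FLOPs-per-byte estimate. The sketch below uses illustrative layer shapes of my own choosing (not taken from the slides): a mid-network 3x3 convolution reuses each weight and activation many times, while a batch-1 fully connected layer streams its entire weight matrix for a single use.

# Back-of-the-envelope operational intensity (FLOPs per byte), assuming FP32
# tensors and that each tensor moves between memory and compute exactly once.

def conv_intensity(h, w, cin, cout, k):
    flops = 2 * h * w * cin * cout * k * k            # multiply-accumulates
    bytes_moved = 4 * (h * w * cin                    # input activations
                       + k * k * cin * cout           # weights
                       + h * w * cout)                # output activations
    return flops / bytes_moved

def fc_intensity(cin, cout, batch=1):
    flops = 2 * batch * cin * cout
    bytes_moved = 4 * (batch * cin + cin * cout + batch * cout)
    return flops / bytes_moved

# Conv layers land near the top of the 10-100 range; batch-1 FC layers fall far below it.
print(f"3x3 conv, 56x56, 64->64 channels: {conv_intensity(56, 56, 64, 64, 3):.1f} FLOPs/byte")
print(f"FC 1024->1024 at batch 1:         {fc_intensity(1024, 1024):.2f} FLOPs/byte")

Under these assumptions the convolution comes out around 130 FLOPs/byte while the batch-1 FC layer is below 1 FLOP/byte: exactly the compute-bound vs. data-access-bound split these workload charts are drawing.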

Page 8

Recommendation System

[Chart] Operational intensity of compute vs. data access for recommendation-system inference (BS = 100), on a 10-100 scale: low compute, predominantly low intensity.

Page 9

Reinforcement Learning

[Chart] Operational intensity of compute vs. data access for reinforcement-learning inference (BS = 64), on a 10-100 scale, contrasting high-intensity and low-intensity operations.

Page 10

Page 11

Intel Optimized Deep Learning Frameworks and Toolkit

OpenVINO/DLDT

SEE ALSO: Machine Learning Libraries for Python (Scikit-learn, Pandas, NumPy), R (Cart, randomForest, e1071), Distributed (MLlib on Spark, Mahout)

Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN): computational primitives
Intel® Machine Learning Scaling Library (Intel® MLSL): communication primitives
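The bottom of this stack splits into computational primitives (MKL-DNN: convolutions, matrix multiplies) and communication primitives (MLSL: collectives for scaling training across nodes). As a rough sketch of the central communication primitive, gradient allreduce, here is a generic-MPI version using mpi4py; it illustrates the operation MLSL accelerates, not the MLSL API itself:

import numpy as np
from mpi4py import MPI

# Run with e.g.: mpiexec -n 4 python allreduce_sketch.py
comm = MPI.COMM_WORLD

# Each worker computes gradients on its own mini-batch shard (random stand-in here).
local_grad = np.random.randn(1024).astype(np.float32)

# Allreduce sums gradients across all workers; dividing by the worker count
# yields the average gradient every worker applies, keeping replicas in sync.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= comm.Get_size()

if comm.Get_rank() == 0:
    print("averaged gradient norm:", np.linalg.norm(global_grad))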

Page 12

Get the Best AI Performance

Workload optimizations:

• Parallelism across nodes/cores/IPs: exploit parallelism with OpenMP and MPI; heterogeneous compute; load balancing; multi-node optimization

• Vector/matrix acceleration: vectorization, tensorization

• Memory/cache locality: data layout, blocking, streaming, NUMA (see the blocking sketch after this list)

Optimizations for Libraries and Frameworks
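To make the memory/cache-locality column concrete, here is a toy NumPy sketch of cache blocking for a matrix multiply: the computation proceeds tile by tile, so each tile of A, B, and C stays cache-resident while it is reused. Sizes and block shape are illustrative; real libraries such as MKL-DNN block for registers and multiple cache levels simultaneously.

import numpy as np

def blocked_matmul(a, b, bs=64):
    """Toy cache-blocked matrix multiply: work on bs x bs tiles so each
    tile of A, B, and C stays resident in cache while it is reused."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2 and n % bs == 0 and k % bs == 0 and m % bs == 0
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # One tile-sized multiply-accumulate; np.dot vectorizes it.
                c[i:i+bs, j:j+bs] += a[i:i+bs, p:p+bs] @ b[p:p+bs, j:j+bs]
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-3)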

Page 13

VPU Microarchitecture

• Accelerator designed for computer vision

• Low power, high throughput, heterogeneous multi-core

• Neural Compute Engine

• VLIW processor cores

Page 14

VPU Optimizations

[Diagram] VPU block diagram: on-chip memory with DMA (load/store); Neural Compute Engine (NCE): Conv/FC/ReLU/Pooling; SHAVE DSP: activations (Sum/Softmax/Tanh); RISC CPU. Tensorization: weight x input tensor = output tensor.
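The "weight x input tensor = output tensor" boxes can be read as convolution lowered to a matrix multiply, which is the essence of tensorization: reshaping the work so a matrix engine like the NCE can execute it in bulk. A minimal im2col sketch of that lowering (my own illustration, not the NCE's actual dataflow):

import numpy as np

def conv2d_as_matmul(x, w):
    """Lower a stride-1, no-padding 2-D convolution to one matrix multiply.
    x: (Cin, H, W) input tensor; w: (Cout, Cin, K, K) weights."""
    cin, h, wd = x.shape
    cout, _, k, _ = w.shape
    oh, ow = h - k + 1, wd - k + 1
    # im2col: each output position becomes one row of a patch matrix.
    patches = np.empty((oh * ow, cin * k * k), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patches[i * ow + j] = x[:, i:i+k, j:j+k].ravel()
    # One big matmul: (OH*OW, Cin*K*K) @ (Cin*K*K, Cout).
    out = patches @ w.reshape(cout, -1).T
    return out.T.reshape(cout, oh, ow)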

Page 15

CPU Microarchitecture

1. General computing

2. Vector and matrix compute for deep learning tensor operations

3. Low-latency and large caches support cross-op data reuse

4. Larger memory capacity for big models and multi-model instances

Cascade Lake: Intel® AVX-512 VNNI instructions for FP32 and INT8. Cooper Lake: Intel® AVX-512 instructions for BFloat16.

[Diagram] Many-core Xeon die: a mesh of cores, each with AVX-512 execution units and a slice of L2/L3 cache.
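For the VNNI instructions mentioned above, here is a NumPy emulation of what one 32-bit lane of the AVX-512 VNNI vpdpbusd instruction computes: four unsigned-int8 x signed-int8 products accumulated into a signed-int32 in a single step. This is my sketch of the instruction's semantics, not Intel code:

import numpy as np

def vpdpbusd_lane(acc, a_u8, b_s8):
    """Emulate one int32 lane of AVX-512 VNNI vpdpbusd:
    acc (int32) += sum over 4 byte pairs of u8(a) * s8(b)."""
    prods = a_u8.astype(np.int32) * b_s8.astype(np.int32)
    return acc + prods.sum(dtype=np.int32)

acc = np.int32(0)
a = np.array([255, 1, 2, 3], dtype=np.uint8)   # unsigned activations
b = np.array([-1, 4, 5, 6], dtype=np.int8)     # signed weights
print(vpdpbusd_lane(acc, a, b))                # -255 + 4 + 10 + 18 = -223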

Page 16

Optimizing Intel® MKL-DNN

Optimizations: Intel® AVX-512 vectorization, data reuse, parallelization
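One concrete data-reuse device in MKL-DNN is its blocked memory format (for example nChw16c), which stores channels in groups of 16 so that 16 consecutive FP32 values fill one 512-bit AVX-512 register. A NumPy sketch of the layout transform, assuming the channel count is a multiple of 16:

import numpy as np

def nchw_to_nChw16c(x):
    """Reorder NCHW to the MKL-DNN blocked layout nChw16c:
    (N, C, H, W) -> (N, C//16, H, W, 16), channels innermost in blocks of 16,
    so 16 consecutive FP32 channel values map to one AVX-512 register."""
    n, c, h, w = x.shape
    assert c % 16 == 0
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2).copy()

x = np.random.rand(1, 32, 8, 8).astype(np.float32)
print(nchw_to_nChw16c(x).shape)  # (1, 2, 8, 8, 16)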

Page 17

Inference with INT8

• Powered by Intel® DL Boost (VNNI) and the MKL-DNN library

• Improves a wide range of computer vision models by 3-4x with <0.5% accuracy loss

• Fine-grained channel-wise quantization of weights is critical for depth-wise/group convolution (see the sketch below)
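A minimal sketch of the channel-wise idea: give every output channel its own quantization scale, which matters for depth-wise/group convolutions because per-channel weight ranges differ widely. Symmetric int8 scaling is assumed here; the production flow uses the calibration tool cited on slide 29.

import numpy as np

def quantize_weights_per_channel(w):
    """Symmetric per-output-channel int8 quantization.
    w: (Cout, Cin, K, K) FP32 weights -> (int8 weights, per-channel scales)."""
    cout = w.shape[0]
    absmax = np.abs(w.reshape(cout, -1)).max(axis=1)   # one range per channel
    scales = np.maximum(absmax, 1e-8) / 127.0
    q = np.clip(np.round(w / scales[:, None, None, None]), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(64, 1, 3, 3).astype(np.float32)    # depth-wise-style weights
q, s = quantize_weights_per_channel(w)
dequant = q.astype(np.float32) * s[:, None, None, None]
print(np.abs(w - dequant).max())                       # small per-channel error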

[Chart] Relative throughput (INT8 vs. FP32) and accuracy per model; higher is better.

Performance results are based on testing as of Sep 2019 and may not reflect all publicly available security updates. See configuration disclosure at slide 29 for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.



Page 18

Multi-instance Inference

• Online deep learning serving predicts on each incoming sample immediately

• Multi-instance inference runs multiple copies of the same DL model simultaneously

• Multi-instance inference minimizes the latency spent buffering data samples (see the launch sketch below)
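A sketch of how such multi-instance serving is commonly launched on a 28-core socket: one process per core group, each pinned with numactl and given a matching OpenMP thread budget, in the spirit of the KMP_AFFINITY/OMP_NUM_THREADS settings in the slide 28-29 configurations. The inference_app.py script and the instance count are placeholders.

import subprocess, os

CORES_PER_SOCKET = 28
INSTANCES = 4                                  # e.g. 4 instances x 7 cores
cores_per_inst = CORES_PER_SOCKET // INSTANCES

procs = []
for i in range(INSTANCES):
    first = i * cores_per_inst
    last = first + cores_per_inst - 1
    env = dict(os.environ,
               OMP_NUM_THREADS=str(cores_per_inst),
               KMP_AFFINITY="granularity=fine,compact,1,0",
               KMP_BLOCKTIME="1")
    # Pin each instance to its own physical cores and local NUMA memory.
    procs.append(subprocess.Popen(
        ["numactl", f"--physcpubind={first}-{last}", "--localalloc",
         "python", "inference_app.py", "--instance-id", str(i)],
        env=env))

for p in procs:
    p.wait()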

[Charts] ResNeXt-101 and ResNet-50 multi-instance throughput vs. buffering time. Configurations on one socket: 1 instance x 28 samples, 2 x 14, 4 x 7, 7 x 4, 14 x 2, 28 x 1. Each chart plots relative throughput (higher is better) against estimated relative latency (lower is better).

Performance results are based on testing as of Sep 2019 and may not reflect all publicly available security updates. See configuration disclosure at slide 29 for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.


Page 19

Full-stack Optimization for Deep Learning

[Charts] Relative speedup over baseline; higher is better:

• BERT SQuAD inference latency speedup: MXNet 1.5.0 baseline vs. MXNet 1.5.0 optimized (framework optimization + library optimization)

• MiniGo training time speedup: TensorFlow v1.14 baseline vs. TensorFlow v1.14 optimized (framework optimization)

• DLRM training throughput speedup: PyTorch v1.1 baseline vs. PyTorch v1.1 optimized (application-level optimization + framework optimization + library optimization)

Performance results are based on testing as of Sep 2019 and may not reflect all publicly available security updates. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.


See configuration disclosure on DLRM measurement at slide 29 for details.
For more information on BERT, see https://medium.com/apache-mxnet/optimization-for-bert-inference-performance-on-cpu-3bb2413d376c
For more information on MiniGo, see https://software.intel.com/en-us/blogs/2019/07/10/intel-cpu-excels-in-mlperf-reinforcement-learning-training

Page 20

ResNet-50 Deep Learning Inference on Xeon®

[Chart] ResNet-50 deep learning inference throughput (images/sec); higher is better: Intel® Xeon® Platinum 9282 (70 INT8 TFLOPS): 7,878; NVIDIA* Tesla V100* (125 FP16 TFLOPS): 7,844.

For more information on the ResNet-50 performance, visit https://software.intel.com/en-us/articles/intel-cpu-outperforms-nvidia-gpu-on-resnet-50-deep-learning-inference

Performance results are based on testing as of May 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Page 21

DL Training Multi-node Scaling

[Chart] Relative training throughput (higher is better) for Inception-v3 and ResNet-50 (ImageNet-2012) on Intel® Xeon® Platinum 8124M CPU @ 3.00GHz (Amazon EC2 C5.18xlarge), at 1, 32, 64, and 128 nodes: 1x at 1 node, with roughly 91-95% scaling efficiency at 32-128 nodes.

Performance results are based on testing as of February 2019 and may not reflect all publicly available security updates. See configuration disclosure at slide 29 for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

For detailed configuration, see slide 29

Page 22

Page 23

https://software.intel.com/en-us/articles/ai-helps-with-skin-cancer-screening

Client: Doctor Hazel*, a skin cancer screening service.

Challenge: Skin cancer has reached epidemic proportions in much of the world. A simple test is needed to perform initial screening on a wide scale to encourage individuals to seek treatment when necessary.

Solution: Doctor Hazel, a skin cancer screening service powered by artificial intelligence (AI) that operates in real time, and relies on an extensive library of images to distinguish between skin cancer and benign lesions, making it easier for people to seek professional medical advice.

Result: A real-time solution for skin cancer screening

“Intel provides both hardware and software needs in artificial intelligence, from training to deployment. As a startup, it's relatively inexpensive to build up the prototype. The Intel® Movidius™ Neural Compute Stick costs about USD 79, and it allows AI to run in real time. We used the Intel® Movidius™ Software Development Kit (SDK), which proved extremely useful for this project."

Peter Ma, Intel® Software Innovator, co-founder of Doctor Hazel

Intel® Distribution of OpenVINO™ toolkit

Page 24


“We chose Intel® Xeon® Scalable processors for AI due to their speed and cost benefits over Nvidia* GPUs. The time spent shuffling data back and forth to the GPUs negated their performance gains, and after working with Intel to optimize our code, I can now cut back my new server purchases and significantly increase the productivity of my existing servers.”

Ariel Pisetzky, VP of Information Technology, Taboola

Result: 2.5x speedup in throughput (recommendations/sec) over baseline testing on Intel® Xeon® Platinum 8180 processors

Customer: Taboola*, the world’s largest content recommendation engine, serving over 360 billion recommendations to over 1 billion unique visitors monthly.

Challenge: Taboola was evaluating the NVIDIA* P4 GPU for TensorFlow inference. They preferred to keep their workload on Xeon®, which integrates tightly with their web serving, but the Intel-optimized TensorFlow for their workload (TensorFlow Serving) performed many times slower than the standard non-MKL TensorFlow from Google.

Solution: Intel and Taboola engineers collaborated to optimize TensorFlow Serving performance for Taboola's model, delivering a significant speedup over baseline. Taboola chose Intel® Xeon® Scalable processors over the P4 due to better integration with their web services, and has now deployed Intel-optimized TensorFlow Serving on newly purchased Intel Xeon Scalable servers.

Performance results are based on testing as of August 6, 2019 and may not reflect all publicly available security updates. See configuration disclosure at slide 28 for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Page 25

https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom

Client: JD.com*, the second-largest online retailer in China, with approximately 25 million registered users.

Challenge: Building deep learning applications such as image similarity search on a GPU cluster was costly and complex. Technical issues included high latency when downloading image data from Apache HBase* and complicated data pre-processing in the GPU environment.

Solution: Switched from a GPU to a CPU cluster, using Apache Spark* with BigDL running on Intel® Xeon® processors. Intel delivered an image detection and extraction pipeline; BigDL used it to build deep learning models for image recognition and feature extraction.

Result: “The high scalability, high performance, and ease of use of BigDL-based solutions make it easy for JD to deal with the massive and ever-growing number of images. As a result, JD is in the process of upgrading the GPU solution to BigDL on the Intel® Xeon® processor solution...”

“We found BigDL using Intel® Xeon® processors as the best platform for production deployment of our SSD (single-shot multi-box detector) solution on our Spark cluster.”

Dennis Weng, VP of JD.com, Head of JD Big Data Platform Division

Page 26

Pillars of Deep Learning

Data

DL SW + DL Chips

Algorithm

Page 27

Notices and Disclaimers

• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

• For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

• The cost reduction scenarios described are intended to enable you to get a better understanding of how the purchase of a given Intel based product, combined with a number of situation-specific variables, might affect future costs and savings. Circumstances will vary and there may be unaccounted-for costs related to the use and deployment of a given product. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs or cost reduction.

• Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

• No computer system can be absolutely secure.

• Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

• Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.

• © 2019 Intel Corporation. Intel, Xeon, Optane, 3D XPoint, DL Boost, AVX, the Intel logo, and Xeon logos are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

• Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

• INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Page 28

Configurations for AI SW Improvement Journey

• Baseline measured in July 2017: Tested by Intel as of 2/21/2019. 2 socket Intel® Xeon® Platinum 8180 Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32 GB/ 2666 MHz), BIOS: SE5C620.86B.00.01.0015.110720180833 (ucode: 0x200004d), CentOS Linux-7.3.1611-Core, 3.10.0-862.11.6.el7.x86_64, ICC 17.0.2 20170213, GCC 5.4.0 20160609, BVLC caffe: 1.0.0 (commit hash: 99bd99795dcdf0b1d3086a8d67ab1782a8a08383), ResNet-50: https://github.com/intel/caffe/blob/master/models/default_resnet_50/train_val.prototxt, FP32, BS=64, synthetic Data

• 50x inference throughput improvement in July 2017: Tested by Intel as of July 11th 2017: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC). Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, synthetic dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (ResNet-50). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“. Compared against the baseline configuration in the first bullet above (Tested by Intel as of 2/21/2019).

• 285x inference throughput improvement in Feb 2019: Tested by Intel as of 2/21/2019. 2 socket Intel® Xeon® Platinum 8180 Processor, 28 cores HT On Turbo ON Total Memory 192 GB (12 slots/ 16GB/ 2666 MHz), BIOS: SE5C620.86B.00.01.0015.110720180833 (ucode: 0x200004d), CentOS 7.5, 3.10.0-693.el7.x86_64, Intel® SSD DC S4500 SERIES SSDSC2KB480G7 2.5’’ 6Gb/s SATA SSD 480G, Deep Learning Framework: Intel® Optimization for Caffe version: 1.1.3 (commit hash: 7010334f159da247db3fe3a9d96a3116ca06b09a), ICC version 18.0.1, MKL DNN version: v0.17 (commit hash: 830a10059a018cd2634d94195140cf2d8790a75a), model: https://github.com/intel/caffe/blob/master/models/intel_optimized_models/benchmark/resnet_50/deploy.prototxt, BS=64, synthetic Data, 2 instance/2 socket, Datatype: INT8. Compared against the baseline configuration in the first bullet above (Tested by Intel as of 2/21/2019).

• 2.5X throughput improvement for Taboola customer workload: Tested by Intel as of August 6, 2019. System Configuration: Intel® Xeon® Platinum 8180 CPU @ 2.50GHz; 2 Sockets, 56 cores/socket, Hyper-threading ON, Turbo boost OFF, CPU Scaling governor “performance”; RAM: Samsung 192 GB DDR4@2666MHz (16GB DIMMS x 12), BIOS: Intel SE5C620.86B.0X.01.0007.062120172125, Hard Disk: INTEL SSDSC2BX01 1.5TB, OS: CentOS Linux release 7.5.1804 (Core) (3.10.0-862.9.1.el7.x86_64), Baseline TensorFlow-Serving: TensorFlow-Serving r1.9, Intel Optimized TensorFlow-Serving: TensorFlow-Serving r1.9 + Intel® MKL-DNN + Optimizations, MKLML.

Page 29


Configurations for AI SW optimizations on IA

• Int8 inference performance with Intel® Optimizations for Caffe on computer vision models: Tested by Intel as of 04/28/2019. 2 socket Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz, 28 cores HT On Turbo ON, Total Memory 192GB (12 slots/ 16GB/ 2933MHz), BIOS: SE5C620.86B.02.01.0008.031920191559, CentOS Linux release 7.5.1804, kernel 3.10.0-862.el7.x86_64, SSD 1x sda INTEL SSDSC2BB480G7 SSD 480GB, 6x INTEL SSDSC2KG038T8 SSD 18TB, Deep Learning Framework: Intel® Optimization for Caffe version: 1.1.5 (commit hash: 1a77a6665386b8a60e603c5dc33ec5aea1ef48a4), ICC version 18.0.1, MKL DNN version: v0.17 (commit hash: 830a10059a018cd2634d94195140cf2d8790a75a), calibrator tool: https://github.com/intel/caffe/blob/master/scripts/calibrator.py. Topology links: DenseNet121 https://github.com/shicai/DenseNet-Caffe/blob/master/DenseNet_121.prototxt, DenseNet169 https://github.com/shicai/DenseNet-Caffe/blob/master/DenseNet_169.prototxt, Faster-RCNN https://github.com/intel/caffe/blob/master/models/intel_optimized_models/faster-rcnn/pascal_voc/VGG16/faster_rcnn_end2end/faster_rcnn_int8_full_conv.prototxt, FCN https://github.com/developmentseed/caffe-fcn/blob/master/fcn-8s/deploy.prototxt, Inception-V3 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/inceptionv3_int8_acc.prototxt, Inception-ResNet-V2 https://github.com/soeaver/caffe-model/blob/master/cls/inception/deploy_inception-resnet-v2.prototxt, MobileNet-V1 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/mobilenet_v1_int8_full_conv_acc.prototxt, MobileNet-V2 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/mobilenet_v2_int8_full_conv_acc.prototxt, SSD-MobileNetV1 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/ssd_mobilenet_int8_acc.prototxt, ResNet101-V1 https://github.com/KaimingHe/deep-residual-networks/blob/master/prototxt/ResNet-101-deploy.prototxt, ResNet101-V2 https://github.com/soeaver/caffe-model/blob/master/cls/resnet-v2/deploy_resnet101-v2.prototxt, ResNet152-V1 https://github.com/KaimingHe/deep-residual-networks/blob/master/prototxt/ResNet-152-deploy.prototxt, ResNet50-V1 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/resnet50_v1/resnet50_int8_acc.prototxt, ResNet50-V1.5 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/resnet50_fb_int8_acc_good.prototxt, ResNext50-32x4d https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/resnext50_int8_full_conv_acc.prototxt, R-FCN https://github.com/intel/caffe/blob/master/models/intel_optimized_models/rfcn/pascal_voc/ResNet-101/rfcn_end2end/rfcn_int8_full_conv.prototxt, SqueezeNet-V0 https://github.com/DeepScale/SqueezeNet/blob/master/SqueezeNet_v1.0/deploy.prototxt, SqueezeNet-V1 https://github.com/DeepScale/SqueezeNet/blob/master/SqueezeNet_v1.1/deploy.prototxt, SSD-VGG16 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/ssd_vgg16_int8.prototxt, VGG-16 https://github.com/intel/caffe/blob/master/models/default_vgg_16/train_val.prototxt, VGG-19 https://github.com/intel/caffe/blob/master/models/default_vgg_19/train_val.prototxt, Yolo-V2 https://github.com/intel/caffe/blob/master/models/intel_optimized_models/int8/yolov2_int8_full_conv.prototxt

• Multi-instance inference performance with Intel® Optimizations for Pytorch/Caffe2 ResNet-50 and ResNext-101: Tested by Intel as of 8/30/2019. 2 socket Intel® Xeon® Platinum 8280 Processor, 28 cores HT On Turbo ON Total Memory 192GB (12 slots/ 16GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0271.120720180605 (ucode: 0x4000013), CentOS 7.6 LTS, kernel 3.10.0-957.el7.x86_64, SSD 480G Intel 4610 series, Deep Learning Framework: Intel® Optimization for Pytorch Caffe2 (commit 60c4e74e49770ca7204b812006926ac0778d30df), GCC version 4.8.5, MKL DNN version: v0.19 (commit hash: 41bee20d7eb4a67feeeeb8d597b3598994eb1959), ideep (commit: fc6b17848ff9e8b9eea57f21e2cc3c0c61a4a15c), model: https://github.com/intel/optimized-models/tree/v1.0.7/pytorch, BS=1 per core, synthetic Data, 1-28 instances/1 socket, Datatype: FP

• DLRM performance with Intel® Optimizations for Pytorch: Tested by Intel as of 8/21/2019. 2 socket Intel® Xeon® Platinum 8180 Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32 GB/ 2666 MHz), BIOS: SE5C620.86B.00.01.0015.110720180833 (ucode: 0x200004d), CentOS Linux-7.3.1611-Core, 3.10.0-862.11.6.el7.x86_64, GCC 7.3, Pytorch: commit-hash: e86d99ae8803fcad6036eeeb85d3a7c893f65400, DLRM: https://github.com/facebookresearch/dlrm/blob/master/bench/dlrm_s_benchmark.sh, FP32, BS=2048, Random Data input. Optimization with 3 Pytorch PRs: https://github.com/pytorch/pytorch/pull/23055, https://github.com/pytorch/pytorch/pull/23057, https://github.com/pytorch/pytorch/pull/24385, with env setting KMP_BLOCKTIME=1

• Multi-node Scaling performance with Intel® Optimizations for Caffe: Tested by Intel as of 8/21/2018. Amazon EC2 C5.18xlarge instance, node # 1/32/64/128, 2 socket Intel® Xeon® Platinum 8124M CPU @ 3.0G, 18 cores HT On Turbo ON. Network config: Amazon Elastic Network Adapter (ENA), 25 Gbps of aggregate network bandwidth, all instances placed in the same placement group. Intel® Optimization for Caffe version 1.1.1, Intel® Optimization for Caffe ResNet-50 and Inception-v3 versions available from https://github.com/intel/caffe/tree/master/models/intel_optimized_models, ResNet-50 Batch size: 128 x # of nodes, Inception-v3 Batch size: 64 x # of nodes, Dataset: Imagenet, ILSVRC 2012, ResNet-50: JPEG resized 256x256, Inception-v3: JPEG resized 320x320, MKLDNN (commit: 464c268e544bae26f9b85a2acb9122c766a4c396), MKL mklml_lnx_2018.0.1.20171227, MLSL l_mlsl_2018.0.003, Compiler gcc/g++: 4.8.5, icc/icpc: 17.0.5