© 2019 Mellanox Technologies | Confidential 1 APNET 2019 Gil Bloch Pushing the Limits of AI with In-Network Computing

Post on 12-Aug-2020

Page 1: Pushing the Limits of AI with In-Network Computing · conferences.sigcomm.org/events/apnet2019/slides/... · 2019-08-26

APNET 2019, Gil Bloch

Page 2: Mellanox Accelerates Leading HPC and AI Systems

World’s Top 3 Supercomputers

1. Summit CORAL System: World’s Fastest HPC / AI System, 9.2K InfiniBand Nodes
2. Sierra CORAL System: #2 USA Supercomputer, 8.6K InfiniBand Nodes
3. Wuxi Supercomputing Center: Fastest Supercomputer in China, 41K InfiniBand Nodes

Page 3: Data is Growing Faster Than Ever

Autonomous vehicle generates 4000 GB per day

▪ SONAR: ~10-100 KB per second
▪ CAMERA: ~20-40 MB per second
▪ GPS: ~50 KB per second
▪ RADAR: ~10-100 KB per second
▪ Light Detection & Ranging (LIDAR): ~10-70 MB per second

▪ Data will grow by a factor of 10 over the next decade to 163 zettabytes in 2025 (source: IDC)
▪ Faster data processing requires faster interconnect speeds
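As a rough sanity check on the figures above, the per-sensor rates can be summed and scaled to one day. The sketch below assumes one sensor of each type, continuous 24-hour operation, and decimal units (1 GB = 10^9 bytes). None of these assumptions is stated on the slide, and all names are illustrative.

```python
# Per-sensor data rates from the slide, as (low, high) in bytes per second.
# Assumptions (not from the slide): one sensor of each type, decimal units.
KB, MB, GB = 1e3, 1e6, 1e9
SENSOR_RATES = {
    "sonar":  (10 * KB, 100 * KB),
    "camera": (20 * MB, 40 * MB),
    "gps":    (50 * KB, 50 * KB),
    "radar":  (10 * KB, 100 * KB),
    "lidar":  (10 * MB, 70 * MB),
}
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: sensors run around the clock


def daily_volume_gb(rates):
    """Return (low, high) total data volume per day in GB."""
    low = sum(lo for lo, _ in rates.values()) * SECONDS_PER_DAY / GB
    high = sum(hi for _, hi in rates.values()) * SECONDS_PER_DAY / GB
    return low, high


def implied_rate_mb_per_s(gb_per_day):
    """Sustained rate in MB/s needed to produce gb_per_day."""
    return gb_per_day * GB / SECONDS_PER_DAY / MB


if __name__ == "__main__":
    low, high = daily_volume_gb(SENSOR_RATES)
    print(f"summed sensor range: {low:.0f} to {high:.0f} GB/day")
    print(f"4000 GB/day implies about {implied_rate_mb_per_s(4000):.1f} MB/s sustained")
```

Under these assumptions the summed sensors produce roughly 2,600 to 9,500 GB per day, so the 4000 GB figure sits inside that range and corresponds to a sustained rate of about 46 MB per second.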

Page 4: Neural Networks Complexity Growth

Complexity = GOPS X Bandwidth

▪ Image Recognition (2012-2016): AlexNet, GoogleNet, ResNet, Inception-V2, Inception-V4; 350X growth
▪ Speech Recognition (2014-2017): DeepSpeech, DeepSpeech-2, DeepSpeech-3; 30X growth

Page 5: Mellanox Unleashes the Power of Artificial Intelligence

Enabling World-Leading Artificial Intelligence Solutions

▪ More Data
▪ Better Models
▪ Faster Interconnect

[Diagram labels: GPUs, CPUs, FPGAs, ASIC, Storage]

Page 6: The Need for Intelligent and Faster Interconnect

▪ CPU-Centric (Onload), Onload Network: Must Wait for the Data, Creates Performance Bottlenecks
▪ Data-Centric (Offload), In-Network Computing: Analyze Data as it Moves! Higher Performance and Scale

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

Page 7: An Application Example – Pizza Processing

CPU 1 – Pizza Generation
CPU 2 – Pizza Consumption

▪ Order Pizza
  ▪ Call (or use Pizza application)
▪ CPU 1 – Prepare Pizza
  ▪ Tomato sauce, cheese, pepperoni…
▪ CPU 1 – Put in the oven
  ▪ And now we wait…
▪ CPU 1 – Pack and send
  ▪ Network (Pizza Delivery)

CPU-Centric (Onload), Onload Network: Must Wait for the Data, Creates Performance Bottlenecks

Page 8: What if…

Page 9: Data Centric Architecture to Overcome Latency Bottlenecks

▪ CPU-Centric (Onload): Communications Latencies of 30-40us
▪ Data-Centric (Offload): Communications Latencies of 3-4us

Intelligent Interconnect Paves the Road to Exascale Performance

Page 10: In-Network Computing to Enable Data-Centric Data Centers

▪ GPUDirect RDMA
▪ Scalable Hierarchical Aggregation and Reduction Protocol
▪ NVMe Over Fabrics

Page 11: Accelerating All Levels of HPC/AI Frameworks

Application Framework
▪ Data Analysis
▪ Configurable Logic

Communication Framework
▪ SHARP – Data Aggregation
▪ MPI Tag Matching
▪ MPI Rendezvous
▪ SNAP - Software Defined Virtual Devices

Network Framework
▪ Network Transport Offload
▪ RDMA and GPU-Direct
▪ SHIELD (Self-Healing Network)
▪ Adaptive Routing and Congestion Control

Connectivity Framework
▪ Multi-Host
▪ Enhanced Topologies
▪ Dragonfly+

Page 12: The Need for Speed

Page 13: Matching Inter and Intra Node Bandwidth

Page 14: Mellanox Accelerates TensorFlow 1.5

100G is a Must For Large Scale Models

6.5X Faster Training with 100G

[Chart labels: 2.5X, 6.5X]

Page 15: RDMA and GPUDirect

Page 16: 10X Higher Performance with GPUDirect™ RDMA

▪ Accelerates HPC and Deep Learning performance
▪ Lowest communication latency for GPUs

Page 17: Mellanox Accelerates NVIDIA NCCL 2.0

50% Performance Improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA

Page 18: Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Page 19: Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

▪ Reliable Scalable General Purpose Primitive
▪ Applicable to Multiple Use-cases in ML/HPC
▪ Scalable High Performance Collective Offload

[Diagram: five Hosts attached to a tree of three Switches. Data flows up from the Hosts, is aggregated in the Switches, and the Aggregated Result flows back down to the Hosts.]
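The aggregation tree on this slide can be illustrated with a small single-process simulation. This is only a sketch of the idea of reducing data inside a switch hierarchy, not SHARP itself: the two-level topology, the host and switch names, and the function names are all hypothetical, and real switches operate on network packets rather than Python lists.

```python
from typing import Dict, List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum equal-length vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def tree_allreduce(host_data: Dict[str, Vector],
                   leaf_switches: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Simulate a two-level aggregation tree (leaf switches plus one root).

    Each leaf switch sums the vectors of its attached hosts, the root
    switch sums the leaf partial results, and the final result is sent
    back down to every host.
    """
    # Upward pass: each leaf switch forwards one partial sum,
    # not one vector per host.
    partials = [elementwise_sum([host_data[h] for h in hosts])
                for hosts in leaf_switches.values()]
    # Root switch combines the partial sums.
    result = elementwise_sum(partials)
    # Downward pass: every host receives a copy of the aggregated result.
    return {host: list(result) for host in host_data}


if __name__ == "__main__":
    # Hypothetical layout: five hosts, two leaf switches, one root switch.
    data = {"h0": [1.0, 2.0], "h1": [3.0, 4.0], "h2": [5.0, 6.0],
            "h3": [7.0, 8.0], "h4": [9.0, 10.0]}
    topology = {"sw0": ["h0", "h1", "h2"], "sw1": ["h3", "h4"]}
    print(tree_allreduce(data, topology)["h0"])
```

The point of the design is that each level forwards a single partial result upward, so the hosts never have to gather every peer's data to compute the reduction.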

Page 20: SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% Reduction in Latency
Providing Scalable Flat Latency
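One possible reading of "scalable flat latency" is that, in a tree of aggregating switches, the number of levels a reduction has to climb grows only logarithmically with the host count. The sketch below counts those levels for a uniform tree. The fan-in of 16 is a hypothetical value chosen for illustration; it is not the configuration behind the slide's measurements and says nothing about the measured latencies themselves.

```python
def tree_levels(num_hosts: int, fan_in: int) -> int:
    """Number of switch levels needed so a single root covers num_hosts."""
    if num_hosts < 1 or fan_in < 2:
        raise ValueError("need num_hosts >= 1 and fan_in >= 2")
    levels, reach = 1, fan_in
    # Each added level multiplies the number of reachable hosts by fan_in.
    while reach < num_hosts:
        reach *= fan_in
        levels += 1
    return levels


if __name__ == "__main__":
    FAN_IN = 16  # hypothetical number of children per switch
    for hosts in (16, 128, 1500):
        print(hosts, "hosts ->", tree_levels(hosts, FAN_IN), "switch levels")
```

With this assumed fan-in, going from 128 to 1500 hosts adds one level to the tree, which is why the depth of the reduction changes slowly as the system grows.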

Page 21: SHARP AllReduce Performance Advantages

1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology

SHARP Enables Highest Performance

Page 22: SHARP Performance – Application (OSU)

▪ Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
▪ The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

Source: Prof. DK Panda, Ohio State University

Page 23: SHARP Accelerates AI Performance

The CPU in a parameter server becomes the bottleneck

SHARP:
▪ Performs the Gradient Averaging
▪ Replaces all physical parameter servers
▪ Accelerates AI Performance
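Gradient averaging itself is a simple reduction: sum the per-worker gradients and divide by the number of workers. The sketch below shows only that arithmetic in a single process, with hypothetical gradient values and function names. In the setting described on this slide the summation would be carried out in the network instead of on a parameter server's CPU.

```python
from typing import List

Vector = List[float]


def average_gradients(worker_grads: List[Vector]) -> Vector:
    """Elementwise mean of per-worker gradients (sum, then divide by N)."""
    n = len(worker_grads)
    return [sum(values) / n for values in zip(*worker_grads)]


def sgd_step(params: Vector, grad: Vector, lr: float) -> Vector:
    """Plain SGD update: params - lr * grad."""
    return [p - lr * g for p, g in zip(params, grad)]


if __name__ == "__main__":
    # Hypothetical gradients computed by four workers on different batches.
    grads = [[0.5, -1.0], [0.0, -2.0], [1.0, 0.0], [0.5, -1.0]]
    avg = average_gradients(grads)
    # Every worker applies the same averaged gradient, so replicas stay in sync.
    params = [1.0, 1.0]
    print(sgd_step(params, avg, lr=0.5))
```

Because every worker receives the same averaged gradient, all model replicas apply an identical update without any one server having to collect and redistribute the gradients.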

Page 24: InfiniBand SHARP Advantage for Deep Learning

▪ Increases System Performance
▪ Better Scalability
▪ Reduces amount of data traversing the network

[Chart labels: 16%, 11%]

Scalable Performance for Distributed AI

System Configuration: Intel E5-2650V4, 12 cores @ 2.2GHz, 30M L2 cache, 9.6GT QPI, 256GB RAM: 16 x 16 GB DDR4, NVIDIA P100 GPUs, ConnectX-6 HCA, IB Quantum Switch (EDR speed), RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
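The claim that in-network aggregation reduces the amount of data traversing the network can be illustrated with a simple byte count on the links between leaf switches and the next level up. The sketch below is a simplified model under stated assumptions: a two-level tree, every host contributing one gradient of the same size, and, in the non-aggregating case, a reducer located beyond the leaf switches so that every vector crosses an uplink. The host counts and gradient size are hypothetical and are not taken from the benchmark configuration.

```python
from typing import List


def uplink_bytes(hosts_per_leaf: List[int], grad_bytes: int,
                 aggregate_in_switch: bool) -> int:
    """Total bytes sent from leaf switches toward the root in one reduction.

    Without in-switch aggregation, a leaf forwards every host's vector
    unchanged; with it, a leaf forwards a single partial result.
    """
    if aggregate_in_switch:
        return len(hosts_per_leaf) * grad_bytes
    return sum(hosts_per_leaf) * grad_bytes


if __name__ == "__main__":
    leaves = [16, 16]      # hypothetical: two leaf switches, 16 hosts each
    grad = 100 * 10**6     # hypothetical gradient size: 100 MB
    plain = uplink_bytes(leaves, grad, aggregate_in_switch=False)
    in_network = uplink_bytes(leaves, grad, aggregate_in_switch=True)
    print(plain // 10**6, "MB vs", in_network // 10**6, "MB on leaf uplinks")
```

In this model the uplink traffic scales with the number of leaf switches instead of the number of hosts, which is the effect the third bullet describes.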

Page 25: Thank You