
Page 1:

How to Achieve High-Performance, Scalable and Distributed Training on Modern HPC Systems?

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPCAC-AI Stanford Conference (April '20)

Follow us on https://twitter.com/mvapich

Page 2:

Understanding the Deep Learning Resurgence

Adapted from: http://www.deeplearningbook.org/contents/intro.html

• Deep Learning (DL) is a sub-set of Machine Learning (ML)

– Perhaps, the most revolutionary subset!

– Feature extraction vs. hand-crafted features

• Deep Learning – a renewed interest and a lot of hype!

– Key success: Deep Neural Networks (DNNs)

– Everything was there since the late 80s except the “computability of DNNs”

Figure: Deep Learning is a subset of Machine Learning, which in turn is a subset of AI – examples: Logistic Regression (ML); MLPs and DNNs (DL).

Page 3:

• Modern and efficient hardware enabled

– Computability of DNNs – impossible in the past!

– GPUs – at the core of DNN training

– CPUs – catching up fast

• Availability of Datasets

– MNIST, CIFAR10, ImageNet, and more…

• Excellent Accuracy for many application areas

– Vision, Machine Translation, and several others...

Deep Learning in the Many-core Era

Courtesy: A. Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”, CoRR, 2016.

Chart: minutes to train AlexNet – from 2x GTX 580 to a DGX-2, roughly a 500X reduction in training time over 5 years.

Page 4:

• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)

Increasing Usage of HPC, Big Data and Deep Learning

Convergence of HPC, Big Data, and Deep Learning!

Increasing Need to Run these applications on the Cloud!!

Page 5:

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD

• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Figure: building blocks of modern HPC clusters – multi-core processors; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); high-performance interconnects such as InfiniBand (<1 usec latency, 200 Gbps bandwidth); SSD, NVMe-SSD, NVRAM. Example systems: K Computer, Sunway TaihuLight, Summit, Sierra.

Page 6:

• Deep Learning has two major tasks:
1. Training of the Deep Neural Network
2. Inference (or deployment) that uses a trained DNN

• DNN Training
– Training is a compute/communication-intensive process – can take days to weeks
– Faster training is necessary!

• Faster training can be achieved by
– Using newer and faster hardware – but there is a limit!
– Can we use more GPUs or nodes?
• The need for Parallel and Distributed Training

Key Phases of Deep Learning

Page 7:

• Scale-up: Intra-node Communication

– Many improvements, like:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA 9 Cooperative Groups

• Scale-out: Inter-node Communication

– DL Frameworks – most are optimized for single-node only

– Distributed (Parallel) Training is an emerging trend

• OSU-Caffe – MPI-based

• Microsoft CNTK – MPI/NCCL2

• Google TensorFlow – gRPC-based/MPI/NCCL2

• Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)

Scale-up and Scale-out

Figure: libraries positioned by scale-up performance vs. scale-out performance (cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, MPI), with the desired region combining high performance on both axes.

Page 8:

Holistic Evaluation is Important!!

• My framework is faster than your framework!

• This needs to be understood in a holistic way.

• Performance depends on the entire execution environment (the full stack)

• Isolated view of performance is not helpful

A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.


Page 9:

How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

Broad Challenge: Exploiting HPC for Deep Learning

Page 10:

1. What are the fundamental issues in designing DL frameworks?

– Memory Requirements

– Computation Requirements

– Communication Overhead

2. Why do we need to support distributed training?

– To overcome the limits of single-node training

– To better utilize hundreds of existing HPC Clusters

Research Challenges to Exploit HPC Technologies

Figure: Deep Learning and Machine Learning frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK) with their major computation and communication phases (forward, backward, gradient aggregation, model propagation) [challenges 1 and 2], layered over communication runtimes that support distributed training on HPC platforms (CPU, GPU, InfiniBand).

Page 11:

3. What are the new design challenges brought forward by DL frameworks for Communication runtimes?

– Large-Message Collective Communication and Reductions

– GPU Buffers (CUDA-Awareness)

4. Can a Co-design approach help in achieving Scale-up and Scale-out efficiently?

– Co-Design the support at Runtime level and Exploit it at the DL Framework level

– What performance benefits can be observed?

– What needs to be fixed at the communication runtime layer?

Research Challenges to Exploit HPC Technologies (Cont’d)

Figure: the same framework/runtime stack, highlighting CUDA-awareness, large-message collectives, and point-to-point operations in the communication runtimes (MPI/NCCL/Gloo/MLSL) [challenge 3], plus co-design opportunities across the layers [challenge 4].

Page 12:

• MPI-driven Deep Learning
– CPU-based Deep Learning

– GPU-based Deep Learning

• Co-designing Deep Learning Stacks with High-Performance MPI

• Out-of-core DNN training

• Exploiting Hybrid (Data and Model) Parallelism

• Use-Case: AI-Driven Digital Pathology

Multiple Approaches taken up by OSU

Page 13:

Data Parallel Deep Learning and MPI Collectives

Figure: data-parallel training loop across four GPUs – parameters (packed_comm_buff) are distributed with MPI_Bcast from GPU 0 (1. data propagation); each GPU runs forward/backward passes over layers L1..Ln (2. forward/backward pass); gradients (packed_reduce_buff) are combined with MPI_Reduce to GPU 0 (3. gradient aggregation), followed by ApplyUpdates, all inside the training Loop {}.

• Major MPI collectives involved in designing distributed frameworks:

• MPI_Bcast – required for DNN parameter exchange

• MPI_Reduce – needed for gradient accumulation from multiple solvers

• MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
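
To make the collective pattern above concrete, here is a minimal, hedged sketch of one data-parallel iteration using mpi4py and NumPy. It is not the S-Caffe/OSU-Caffe implementation; the toy "model", the fake gradient, and the buffer names are stand-ins for illustration only.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Toy "model": a flat parameter vector (stand-in for packed_comm_buff).
    params = np.zeros(1024, dtype=np.float32)
    if rank == 0:
        params[:] = np.random.rand(1024)      # rank 0 holds the initial model

    # 1. Data propagation: broadcast parameters from rank 0 to all workers.
    comm.Bcast(params, root=0)

    for step in range(10):
        # 2. Forward/backward pass on the local mini-batch (fake gradient here).
        local_grad = np.random.rand(1024).astype(np.float32)

        # 3. Gradient aggregation: one Allreduce replaces Reduce + Bcast.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

        # Apply updates identically on every rank (simple SGD step).
        params -= 0.01 * (global_grad / comm.Get_size())

Launched with, e.g., mpirun -np 4 python train_sketch.py, the same code runs unchanged on top of an MVAPICH2-backed mpi4py.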

Page 14:

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,075 organizations in 89 countries

– More than 732,000 (> 0.7 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (November ‘19 ranking)

• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 14th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, OpenHPC, and Spack)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system

Page 15:

Chart: cumulative number of MVAPICH2 downloads from Sep-04 through Sep-19, approaching 800,000, annotated with release milestones (MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc3, MV2-Virt 2.2, MV2 2.3.3, OSU INAM 0.9.5, MV2-Azure 2.3.2, MV2-AWS 2.3).

MVAPICH2 Release Timeline and Downloads

Page 16:

Architecture of MVAPICH2 Software Family (HPC and DL)

High-Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High-Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
– Point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
– Transport protocols: RC, SRD, UD, DC
– Modern features: UMR, ODP, SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM
– Modern features: Optane*, NVLink, CAPI* (* upcoming)

Page 17:

• gRPC
– Officially available and supported
– Open source – can be enhanced by others
– Accelerated gRPC (add RDMA to gRPC)

• gRPC+X
– Use gRPC for bootstrap and rendezvous
– Actual communication is in "X"
– X = MPI, Verbs, GPUDirect RDMA (GDR), etc.

• No-gRPC
– Baidu – the first to use MPI collectives for TF
– Horovod – uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added)

Data Parallel Training with TensorFlow

*Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation ”, CCGrid ’19.
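
For the No-gRPC (Horovod) path above, this is a minimal, hedged sketch of data-parallel Keras training with Horovod over MPI or NCCL; the ResNet-50 model and the random placeholder data are illustrative, not the benchmark setup used in the paper.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                   # one Horovod/MPI rank per GPU

    # Pin each rank to a single local GPU (skipped on CPU-only nodes).
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)

    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged with allreduce (MPI or NCCL underneath).
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    callbacks = [
        # Broadcast initial parameters from rank 0 so all workers start identical.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    # Placeholder random data; a real run would shard ImageNet across ranks.
    x = tf.random.uniform((32, 224, 224, 3))
    y = tf.random.uniform((32,), maxval=1000, dtype=tf.int32)
    model.fit(x, y, epochs=1, callbacks=callbacks, verbose=(hvd.rank() == 0))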

Page 18:

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

ML/DL applications (TensorFlow, PyTorch, MXNet) run on top of Horovod, which uses MVAPICH2-X for CPU-based training and MVAPICH2-GDR for GPU-based training.

Page 19:

MVAPICH2 Software Family (CPU-Based Deep Learning)

High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications

Page 20:

Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning

Chart: CNTK AlexNet training execution time (s) vs. number of processes (28, 56, 112, 224; ppn = 28, default batch size, 50 iterations) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM – up to 20% and 9% improvements.

• CPU-based training of AlexNet neural network using ImageNet ILSVRC2012 dataset

• Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce

• The proposed designs show good scalability with increasing system size

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

Available since MVAPICH2-X 2.3rc1 release

Page 21:

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)

• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI

• MVAPICH2 and IntelMPI give similar performance for DNN training

• Report a peak of 260,000 images/sec on 2,048 nodes

• On 2,048 nodes, ResNet-50 can be trained in about 7 minutes! (a quick arithmetic check follows the citation below)

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).
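
A quick back-of-the-envelope check of the 7-minute figure above, assuming the ImageNet-1k training set (1,281,167 images), 90 epochs, and the reported peak of 260,000 images/sec; the inputs are the slide's numbers, the arithmetic is only illustrative.

    images_per_epoch = 1_281_167        # ImageNet-1k training images
    epochs = 90
    throughput = 260_000                # reported peak images/sec on 2,048 nodes

    total_seconds = images_per_epoch * epochs / throughput
    print(f"{total_seconds:.0f} s ~= {total_seconds / 60:.1f} minutes")
    # -> roughly 443 s, i.e. about 7.4 minutes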

Page 22:

Scaling DL Frameworks using MVAPICH2-X on TACC Frontera

• On a single node, TensorFlow (TF) is 8% better than MXNet

• TF (tf_cnn_benchmark) is 2.13x better than PyTorch

• TensorFlow is 1.7x better than MXNet

• TF (Keras) gives better performance compared to PyTorch and MXNet.

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

Page 23:

MVAPICH2 Software Family (GPU-Based Deep Learning)

High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications

Page 24:

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(the GPU-to-GPU data movement is handled inside MVAPICH2)

High Performance and High Productivity
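
The same send/recv pattern can be exercised from Python: a minimal sketch assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) underneath mpi4py (3.1+) and CuPy for the device buffers; the buffer names mirror the slide and are illustrative only.

    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = 4 * 1024 * 1024                       # 4M floats, resident on the GPU

    if rank == 0:
        s_devbuf = cp.random.rand(size, dtype=cp.float32)    # GPU send buffer
        comm.Send(s_devbuf, dest=1, tag=77)      # MPI reads directly from device memory
    elif rank == 1:
        r_devbuf = cp.empty(size, dtype=cp.float32)           # GPU receive buffer
        comm.Recv(r_devbuf, source=0, tag=77)    # data lands directly in device memory
        print("received first element:", float(r_devbuf[0]))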

Page 25:

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Page 26:

Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2 (no GDR) vs. MV2-GDR 2.3 – the optimized MVAPICH2-GDR design achieves 1.85 us latency (11X better) and roughly 9X-10X higher bandwidth/bi-bandwidth at small message sizes.

Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA

Page 27:

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

Charts: Allreduce latency (us) vs. message size for MVAPICH2-GDR 2.3.3 and NCCL 2.5 – up to ~2.5X better for large messages and ~4.7X better in the 8 B-128 KB range.

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020, Accepted to be Presented.
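
For context, a gradient-style Allreduce over GPU buffers can be written the same way as the host-buffer case; a minimal sketch assuming a CUDA-aware MPI (e.g., MVAPICH2-GDR) under mpi4py with CuPy, not a reproduction of the benchmark above.

    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD

    # Per-rank "gradients" living in GPU memory (size is illustrative).
    local_grad = cp.random.rand(8 * 1024 * 1024, dtype=cp.float32)
    global_grad = cp.empty_like(local_grad)

    cp.cuda.Stream.null.synchronize()            # make sure the buffer is ready
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # reduction over device buffers
    global_grad /= comm.Get_size()               # average the gradients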

Page 28:

MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale

• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

Charts: on 1,536 GPUs, MVAPICH2-GDR 2.3.3 shows up to 1.7X better bandwidth (32M-256M messages) and up to 1.6X better latency (4 B-16 KB messages) than NCCL 2.5; for a 128 MB message, bandwidth from 24 to 1,536 GPUs is up to 1.7X better than SpectrumMPI 10.3, OpenMPI 4.0.1, and NCCL 2.5.

Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020. Accepted to be Presented.

Page 29:

Scalable TensorFlow using Horovod and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)

Charts: images per second and scaling efficiency (%) vs. number of GPUs (1-16) for NCCL 2.5 and MVAPICH2-GDR 2.3.3 – MVAPICH2-GDR is up to 9% higher.

Scaling Efficiency = (Actual throughput / Ideal throughput at scale) x 100%

Platform: NVIDIA DGX-2 system, CUDA 9.2

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, " ICS-2020. Accepted to be Presented.
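
The scaling-efficiency definition above translates directly to code; a small illustrative helper (the throughput numbers in the example are placeholders, not measured values):

    def scaling_efficiency(throughput_at_n, baseline_throughput, n_gpus):
        """Actual throughput divided by ideal (linear) throughput, in percent."""
        ideal = baseline_throughput * n_gpus
        return 100.0 * throughput_at_n / ideal

    # Example: if 1 GPU sustains 400 images/s and 16 GPUs sustain 6,000 images/s,
    # efficiency = 100 * 6000 / (400 * 16) ~= 93.8%
    print(scaling_efficiency(6000, 400, 16))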

Page 30:

Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
• 1,281,167 (~1.2 million) training images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 = 324 seconds ≈ 5.4 minutes!

Chart: thousands of images per second vs. number of GPUs (1-1,536) for NCCL 2.5 and MVAPICH2-GDR 2.3.3.

Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

*We observed errors for NCCL2 beyond 96 GPUs

MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!

ImageNet-1k has 1.2 million images

Page 31:

• LLNL/Lassen POWER9 CPU with 4 V100 GPUs/node

• PyTorch is becoming a very important DL framework

• Scaling PyTorch models with Horovod is simple (a minimal sketch follows below)

• MVAPICH2-GDR provides better performance and scalability compared to NCCL2

Scaling PyTorch on LLNL/Lassen using MVAPICH2-GDR

Chart: images/sec (higher is better) vs. number of GPUs (1-128) for MVAPICH2-GDR and NCCL2.

Platform: The Lassen Supercomputer (#10 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
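
A minimal, hedged sketch of the Horovod + PyTorch pattern referenced above; the tiny model and random data are placeholders, and MVAPICH2-GDR or NCCL2 would sit underneath Horovod's allreduce.

    import torch
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())      # one GPU per Horovod rank

    model = torch.nn.Linear(1024, 10).cuda()     # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Average gradients across ranks during optimizer.step().
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())

    # Start all ranks from identical parameters and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for step in range(10):
        x = torch.randn(32, 1024).cuda()         # placeholder mini-batch
        y = torch.randint(0, 10, (32,)).cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()                         # allreduce happens here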

Page 32:

• RTX 5000 GPUs are NVIDIA GPUs targeted at data centers
• Different from the GTX series:
– Supports GPUDirect RDMA (GDR)
– Supports GDRCOPY
• MVAPICH2-GDR offers good performance and reasonable scaling
• Scaling is not as good as on Lassen because:
– Nodes are connected with IB FDR
– No NVLink between GPUs (only PCIe)

Early Exploration of MXNet using MVAPICH2-GDR on TACC Frontera RTX GPU Nodes

Chart: images/second (higher is better) vs. number of GPUs (1-64) with MVAPICH2-GDR.

Platform: TACC Frontera-RTX – 4 NVIDIA RTX 5000 GPUs per node, CUDA 10.1

Page 33:

• TACC Longhorn is similar to the LLNL/Lassen system (POWER9 with 4 V100 GPUs/node)
• Some restrictions to keep in mind:
– TensorFlow 2.1 on POWER9 is only built with CUDA 10.2
• Early results are promising:
– Good scaling up to 256 GPUs

Early Exploration of TensorFlow 2.1 on TACC/Longhorn using MVAPICH2-GDR

Chart: images/sec (higher is better) vs. number of GPUs (1-256) with MVAPICH2-GDR.

Platform: The Longhorn System at TACC – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

Page 34:

• MPI-driven Deep Learning
– CPU-based Deep Learning

– GPU-based Deep Learning

• Co-designing Deep Learning Stacks with High-Performance MPI

• Out-of-core DNN training

• Exploiting Hybrid (Data and Model) Parallelism

• Use-Case: AI-Driven Digital Pathology

Multiple Approaches taken up by OSU

Page 35:

• What are the fundamental bottlenecks in the existing DL frameworks?

• How to design a Scalable and Distributed Framework?

– Achieve both Scale-up and Scale-out efficiently

• What are the new requirements and expectations for Communication Runtimes?

• Can a Co-design approach help in achieving better training performance?

– Can a DL framework like Caffe be co-designed with an MPI runtime?

• What is the impact of the co-design?

– What performance benefits can be observed? At what levels?

S-Caffe: Co-designing HPC and DL

A. A. Awan, K. Hamidouche, J.M. Hashmi, DK Panda, “S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters”, ACM PPoPP ’17

Page 36:

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses:
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based parallel training
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
– Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

OSU-Caffe: Scalable Deep Learning

Chart: training time (seconds) vs. number of GPUs (8-128) for GoogLeNet (ImageNet) with Caffe, OSU-Caffe (batch size 1024), and OSU-Caffe (batch size 2048); one configuration is annotated as an invalid use case.

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/

Page 37:

Out-of-core Training via CUDA Unified Memory

• Large DNNs cannot be trained on GPUs due to memory limitations!
– ResNet-50: current frameworks only allow small batch sizes
– Next-generation models like Neural Machine Translation (NMT), AmoebaNet, etc. are ridiculously large and consist of billions of parameters
– Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
– The proposed OC-Caffe (Out-of-Core Caffe) exploits the potential of CUDA Unified Memory (a minimal managed-memory sketch follows the citation below)
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7

A. Awan et al., OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18
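
The managed-memory idea can be illustrated outside Caffe as well; a hedged sketch that switches CuPy to CUDA Unified Memory so allocations may exceed physical GPU memory. This only illustrates the mechanism OC-DNN builds on, not the OC-Caffe design itself, and the tensor size is made up.

    import cupy as cp

    # Route all CuPy allocations through cudaMallocManaged (CUDA Unified Memory).
    cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

    # This tensor may be larger than the GPU's physical memory; pages migrate
    # between host and device on demand (Pascal/Volta hardware page faulting).
    big = cp.ones((4096, 4096, 256), dtype=cp.float32)    # ~16 GB, illustrative size
    big *= 2.0                                             # GPU kernels touch the pages
    print(big.nbytes / 2**30, "GiB allocated via unified memory")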

Page 38:

• Data parallelism – only for models that fit in memory
• Out-of-core models:
– Deeper model -> better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Key challenges (a toy model-parallel sketch follows the citation below):
– Model partitioning is difficult for application programmers
– Finding the right partition (grain) size is hard
– Cut at which layer, and why?
– Developing a practical system for model parallelism:
• Redesign the DL framework or create additional layers?
• Is existing communication middleware enough, or are extensions needed?

HyPar-Flow: Hybrid and Model Parallelism for CPUs

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (Accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
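
To make the partitioning question above concrete, here is a toy, forward-only model-parallel sketch with mpi4py: two ranks each own half of the layers and pass activations point-to-point. It is a hand-written illustration of the idea, not the HyPar-Flow API, and the layer shapes are made up.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # rank 0 owns layers 1-2, rank 1 owns layers 3-4

    def dense(x, w):
        return np.maximum(x @ w, 0.0)            # toy ReLU layer

    rng = np.random.default_rng(rank)
    if rank == 0:
        w1, w2 = rng.standard_normal((64, 128)), rng.standard_normal((128, 128))
        x = rng.standard_normal((32, 64))        # local mini-batch
        act = dense(dense(x, w1), w2)            # first model partition
        comm.Send(np.ascontiguousarray(act, dtype=np.float64), dest=1, tag=0)
    elif rank == 1:
        w3, w4 = rng.standard_normal((128, 128)), rng.standard_normal((128, 10))
        act = np.empty((32, 128), dtype=np.float64)
        comm.Recv(act, source=0, tag=0)          # activations from the first partition
        out = dense(act, w3) @ w4                # second model partition
        print("output shape:", out.shape)
    # The backward pass would send gradients in the opposite direction; deciding
    # where to cut the model (the partition grain) is the hard part noted above.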

Page 39:

• ResNet-1001 with variable batch size
• Approach:
– 48 model partitions for 56 cores
– 512 model replicas for 512 nodes
– Total cores: 48 x 512 = 24,576
• Speedup:
– 253X on 256 nodes
– 481X on 512 nodes
• Scaling efficiency:
– 98% up to 256 nodes
– 93.9% for 512 nodes

Out-of-core Training with HyPar-Flow (512 nodes on TACC Frontera)

481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (Accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf

Page 40:

• The field of Pathology (like many other medical disciplines) is going digital
• Traditionally, a pathologist reviews a slide and carries out a diagnosis based on prior knowledge and training
• Experience matters
• Can Deep Learning be used to train on thousands of slides and:
– Let the computer carry out the diagnosis
– Narrow down the diagnosis and help a pathologist make a final decision
• Significant benefits in:
– Reducing diagnosis time
– Making pathologists more productive
– Reducing health care costs

AI-Driven Digital Pathology

Page 41:

• Pathology whole-slide image (WSI)
– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible
• Two main problems with tiles:
– Restricted tile size because of the GPU memory limitation
– Smaller tiles lose structural information
• Can we use model parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly:
– 7.25 hours (1 node, 4 GPUs) -> 27 minutes (32 nodes, 128 GPUs)

Exploiting Model Parallelism in AI-Driven Digital Pathology

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/

Page 42:

• Scalable distributed training is becoming increasingly important
• It requires high-performance middleware designs that exploit modern interconnects
• Provided a set of solutions to achieve scalable distributed training:
– MPI (MVAPICH2)-driven solution with Horovod for TensorFlow, PyTorch, and MXNet
– Optimized collectives for CPU-based training
– CUDA-aware MPI with optimized collectives for GPU-based training
– Out-of-core training and hybrid parallelism
• Will continue to enable the DL community to achieve scalability and high performance for their distributed training

Conclusions

Page 43:

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:

– Help and guidance with installation of the library

– Platform-specific optimizations and tuning

– Timely support for operational issues encountered with the library

– Web portal interface to submit issues and tracking their progress

– Advanced debugging techniques

– Application-specific optimizations and tuning

– Obtaining guidelines on best practices

– Periodic information on major fixes and updates

– Information on major releases

– Help with upgrading to the latest release

– Flexible Service Level Agreements
• Support being provided to Lawrence Livermore National Laboratory (LLNL) and KISTI, Korea

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

Page 44:

• High-performance solution for distributed training for your complex AI problems
• Features:
– Integrated package with TensorFlow, PyTorch, MXNet, Horovod, and MVAPICH2 MPI libraries
– Targeted for both CPU-based and GPU-based Deep Learning training
– Integrated profiling support across the stacks
– Support for x86 and OpenPOWER platforms
– Support for InfiniBand, RoCE, and NVLink interconnects
– Out-of-the-box optimal performance
– One-click deployment and execution
• Send an e-mail to [email protected] for a free trial!

X-ScaleAI Product

Page 45:

Funding Acknowledgments

Funding support by (sponsor logos omitted)
Equipment support by (vendor logos omitted)

Page 46:

Personnel Acknowledgments

Current Students (Graduate)

– A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

– C.-H. Chu (Ph.D.)

– J. Hashmi (Ph.D.)

– A. Jain (Ph.D.)

– K. S. Kandadi (Ph.D.)

Past Students
– A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– R. Biswas (M.S.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– S. Chakraborthy (Ph.D.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– R. Rajachandrasekar (Ph.D.)

– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

– J. Zhang (Ph.D.)

Past Research Scientist
– K. Hamidouche

– S. Sur

– X. Lu

Past Post-Docs
– D. Banerjee

– X. Besseron

– H.-W. Jin

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– Kamal Raj (M.S.)

– K. S. Khorassani (Ph.D.)

– P. Kousha (Ph.D.)

– A. Quentin (Ph.D.)

– B. Ramesh (Ph.D.)

– S. Xu (Ph.D.)

– J. Lin

– M. Luo

– E. Mancini

Past Programmers
– D. Bureddy

– J. Perkins

Current Research Specialist
– J. Smith

– S. Marcarelli

– A. Ruhela
– J. Vienne

Current Post-doc
– M. S. Ghazimeersaeed

– K. Manian

Current Students (Undergraduate)
– V. Gangal (B.S.)

– N. Sarkauskas (B.S.)

Past Research Specialist
– M. Arnold

Current Research Scientist
– A. Shafi

– H. Subramoni

– Q. Zhou (Ph.D.)

– H. Wang

Page 47:

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/