ODSC Workshop - Distributed TensorFlow on Hops
TRANSCRIPT
@ODSC
Distributed Deep Learning on Hops
Robin Andersson, Fabio Buso
RISE SICS AB | Logical Clocks AB
London | October 12th-14th 2017
Please register on odsc.hops.site
Big Data and AI
Why you are here
From: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Deep Learning with GPUs (on Hops)
Separate Clusters for Big Data and ML
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
Data Science in Enterprises Today
CTO: I need estimates for the ROI on these candidate features in our product.
Data Science Team: We are on it. We need to first sync up with IT and engineering.
Collaboration Overhead is High
Data Science Team: We need access to these Datasets.
Data Engineering: Ok.
Preparing Dataset samples for Data Science:
1. IT updates access rights on the Data Lake
2. Dataset samples are copied to the GPU cluster (some time later)
3. Run experiments
How it should be
Data Science: I need help to work on a project for the CTO.
IT: Here's someone who can help you out.
A Project brings together: Conda env, CPU/storage quotas, self-service, GDPR, Kafka topics, the Data Lake, the GPU cluster, and Elasticsearch.
HopsWorks Data Platform
HopsWorks
Each Project (e.g. Project X, Project Y) owns its Kafka topics and Project data.
HopsFS
- Open Source fork of Apache HDFS
- 16x faster than HDFS
- 37x more capacity than HDFS
- SSL/TLS instead of Kerberos
- Scale Challenge Winner (2017)
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
HopsYARN GPUs
Native GPU support in YARN - world first
Implications
- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning
TensorFlow first-class support in Hops
TensorFlow code runs inside Spark executors, one hyperparameter combination per executor (e.g. learning rate 0.003 / dropout 0.3, learning rate 0.001 / dropout 0.5, learning rate 0.002 / dropout 0.7).
HopsUtil
- Library for launching TensorFlow jobs
- Manages the TensorBoard lifecycle
- Helper functions for Spark/Kafka/HDFS/etc.
HopsUtil - Read data
from os import path
import tensorflow as tf
from hopsutil import hdfs

dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, ...)
HopsUtil - Initialize Pydoop HDFS API

The Pydoop HDFS API is a rich API that provides operations such as:
- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories, and the filesystem

Connect to HopsFS using HopsUtil:

from hopsutil import hdfs

pydoop_handle = hdfs.get()
HopsUtil - TensorBoard
from hopsutil import tensorboard
[...]
logdir = tensorboard.logdir()
sv = tf.train.Supervisor(is_chief=True, logdir=logdir, [...], save_model_secs=60)
HopsUtil - Hyperparameter searching
from hopsutil import tflauncher

def training(learning_rate, dropout):
    [....]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
tflauncher.launch(spark, training, params)
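Conceptually, tflauncher.launch fans the Cartesian product of the parameter lists out across Spark executors, one combination per task. A minimal pure-Python sketch of that grid expansion (grid_combinations is an illustrative helper, not part of HopsUtil):

```python
from itertools import product

def grid_combinations(params):
    """Expand a dict of value lists into one dict per combination —
    the unit of work a grid launcher would hand each executor."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in product(*(params[k] for k in keys))]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
combos = grid_combinations(params)
print(len(combos))  # 9 combinations for the 3x3 grid above
```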
HopsUtil - Logging
from hopsutil import hdfs
[...]
while not sv.should_stop() and step < steps:
    hdfs.log(sess.run(accuracy))
[...]
DEMO TIME! TensorFlow tour on HopsWorks
How to get started
How to get started (2)
How to get started (3)
TensorBoard
Dela - Search for interesting datasets
Dela - Import a Dataset
Dela
Peer-to-peer network of Hops clusters
Find and share interesting datasets
Exploits unused bandwidth and backs off when other network traffic appears
The Challenge
http://timdettmers.com/2017/08/31/deep-learning-research-directions
Experiment Time and Research Productivity
● Minutes, hours: interactive analysis!
● 1-4 days: interactivity replaced by many parallel experiments
● 1-4 weeks: high-value experiments only
● >1 month: don't even try!
Solution: Go distributed
State-of-the-Art in GPU Hardware
Nvidia DGX-1
SingleRoot Commodity GPU Cluster Computing
The budget side
Commodity Server*
➔ 10x Nvidia GTX 1080Ti (11 GB memory each)
➔ 256 GB RAM
➔ 2 Intel Xeon CPUs
➔ InfiniBand
➔ Single-root PCI complex
10 x Commodity Server = 150K Euro

Nvidia DGX-1
➔ 8x Nvidia Tesla V100 (16 GB memory each)
➔ 512 GB RAM
➔ 2 Intel Xeon CPUs
➔ InfiniBand
➔ NVLink
Price per DGX-1 = 150K Euro

*www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/
Distributed TensorFlow
Distributes the TensorFlow graph
Workers / parameter servers
Synchronous / asynchronous training
Model / data parallelism

Problems:
- Writing the clusterspec by hand
- Manually starting each process
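To see why the clusterspec is painful: in TensorFlow 1.x every process must be handed the same spec plus its own job name and task index, typically by hand for every experiment. A minimal sketch (hostnames are made up):

```python
import json

# The same spec must reach every process in the cluster.
cluster_spec = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# In TF 1.x this dict feeds tf.train.ClusterSpec(cluster_spec), and each
# process then starts tf.train.Server(cluster, job_name="worker",
# task_index=0) — with a different job_name/task_index per process.
print(json.dumps(cluster_spec, sort_keys=True))
```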
Introducing TensorFlowOnSpark by YAHOO!
Wrapper for Distributed TensorFlow
- Creates the clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”
TensorFlowOnSpark architecture
The Spark driver coordinates the executors: one Spark executor runs the parameter server, the others run workers, all reading and writing HopsFS.
Scaling TensorFlowOnSpark
Near linear scaling up to 8 workers
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
TensorFlowOnSpark on Hops
Our improved TensorFlowOnSpark - 1
Problem: Uses RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.
Solution: Hops provides real GPU scheduling!
Our improved TensorFlowOnSpark - 2
Problem: A worker will wait until GPUs become available, potentially forever!
Solution: GPU scheduling ensures that each GPU is allocated exclusively to one particular worker.
Our improved TensorFlowOnSpark - 3
Problem: Each parameter server allocates one GPU - this is a waste!
Solution: Only workers may use GPUs.
Conversion guide: TensorFlowOnSpark
- TFCluster.run(spark, training_fun, num_executors, num_ps, ...)
- Add PySpark and TensorFlowOnSpark imports
- Create your own FileWriter
- Replace tf.train.Server() with TFNode.start_cluster_server()

Full conversion guide from Distributed TensorFlow to TensorFlowOnSpark:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
DEMO TIME! Distributed TF on Spark
Distributed Stochastic Gradient Descent
SGD with Data Parallelism (Single Host)
Facebook: Scaling Synchronous SGD
June 2017: ImageNet training time down from 2 weeks to 1 hour
➔ ~90% scaling efficiency going from 8 to 256 GPUs
Learning rate heuristic / warm-up phase / large batches
Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
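The learning-rate heuristic can be sketched in a few lines: with k workers the target learning rate is k times the base rate, reached by a linear ramp rather than all at once (the paper warms up over the first 5 epochs). warmup_lr is an illustrative helper, not code from the paper:

```python
def warmup_lr(step, warmup_steps, base_lr, k):
    """Linear scaling rule with gradual warm-up: ramp linearly from
    base_lr to k * base_lr over warmup_steps, then hold the target."""
    target = base_lr * k
    if step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps

# e.g. 32 workers, base lr 0.1: starts at 0.1, ends at 3.2 after warm-up
print(warmup_lr(0, 1000, 0.1, 32), warmup_lr(1000, 1000, 0.1, 32))
```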
All-Reduce
N GPUs, K parameters
Communication cost: 2(N-1) * K/N elements sent per GPU - this approaches 2K, essentially independent of the number of GPUs
Overlaps communication and computation
Drawback: synchronous communication
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
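The cost formula can be checked with a toy simulation. The sketch below is a hedged, pure-Python model of ring all-reduce (a scatter-reduce phase followed by an all-gather phase), not Baidu's actual implementation; it also counts how many elements each rank sends:

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over n = len(vectors) ranks.

    Each rank's K-element vector is split into n chunks. After n-1
    scatter-reduce steps and n-1 all-gather steps, every rank holds the
    element-wise sum, having sent 2*(n-1)*K/n elements in total."""
    n = len(vectors)
    k = len(vectors[0])
    assert k % n == 0, "for simplicity, K must divide evenly into n chunks"
    chunk = k // n
    data = [[v[c * chunk:(c + 1) * chunk] for c in range(n)] for v in vectors]
    sent = [0] * n  # elements sent per rank

    # Phase 1: scatter-reduce. Rank r ends up with the fully reduced
    # chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n          # chunk rank r forwards this step
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], data[r][c])]
            sent[r] += chunk

    # Phase 2: all-gather. Reduced chunks travel once more around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            dst = (r + 1) % n
            data[dst][c] = list(data[r][c])
            sent[r] += chunk

    return [sum(data[r], []) for r in range(n)], sent
```

With 4 ranks and K = 8, each rank sends 2 * 3 * 8/4 = 12 elements, matching the 2(N-1) * K/N formula.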
Baidu All-Reduce - Performance scaling
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
Horovod - Better than Baidu All-Reduce?
Fork of Baidu All-Reduce
Improvements
1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline
Migrating existing code to run on Horovod
1. Run hvd.init().
2. Pin a server GPU to the process using config.gpu_options.visible_device_list; the local rank maps each process to a unique GPU.
3. Wrap the optimizer in hvd.DistributedOptimizer.
4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
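Put together, the four steps look roughly like this in a TF 1.x training script. A hedged sketch assuming TensorFlow 1.x and Horovod are installed; the optimizer choice and learning rate are illustrative:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                          # 1. initialize Horovod

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # 2. one GPU per process

opt = tf.train.AdagradOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)                 # 3. all-reduce the gradients

hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # 4. sync initial weights from rank 0

# A train_op = opt.minimize(loss) inside a
# tf.train.MonitoredTrainingSession(hooks=hooks, config=config)
# completes the loop; the model code itself is unchanged.
```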
Horovod/Baidu AllReduce
Provided as a service on HopsWorks
- Integration of All-Reduce with a Hadoop cluster: use YARN to schedule GPUs
- Scheduling of homogeneous GPUs and network: YARN supports node labels
- HopsFS authentication/authorization
- TensorBoard lifecycle management as in HopsUtil
The team
Active contributors: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Past contributors: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson, Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.
www.hops.io
github.com/hopshadoop
@hopshadoop