ODSC Workshop - Distributed TensorFlow on Hops
TRANSCRIPT
@ODSC
Distributed Deep Learning on Hops
Robin Andersson, Fabio Buso
RISE SICS AB | Logical Clocks AB
London | October 12th-14th 2017
Please register on odsc.hops.site
Big Data and AI
Why you are here
From: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Deep Learning with GPUs (on Hops)
Separate Clusters for Big Data and ML
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
Data Science in Enterprises Today
CTO: I need estimates for the ROI on these candidate features in our product.
Data Science Team: We are on it. We need to first sync up with IT and engineering.
Collaboration Overhead is High
Data Science Team: We need access to these Datasets.
Data Engineering: Ok.
Preparing Dataset samples for Data Science:
1. IT updates access rights on the Data Lake
2. Dataset samples are copied to the GPU cluster (some time later)
3. Run experiments
How it should be
Data Science: I need help to work on a project for the CTO.
IT: Here's someone who can help you out.
A Project brings together: Conda env, CPU/storage quotas, self-service, GDPR, Kafka topics, the Data Lake, the GPU cluster, and Elasticsearch.
HopsWorks Data Platform
HopsWorks
Each Project (e.g. Project X, Project Y) owns its Kafka topics and Project data.
HopsFS
- Open Source fork of Apache HDFS
- 16x faster than HDFS
- 37x more capacity than HDFS
- SSL/TLS instead of Kerberos
- Scale Challenge Winner (2017)
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
HopsYARN GPUs
Native GPU support in YARN - world first
Implications
- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning
TensorFlow first-class support in Hops
TensorFlow code runs inside Spark executors, one hyperparameter combination per executor (e.g. learning rate 0.003 / dropout 0.3, learning rate 0.001 / dropout 0.5, learning rate 0.002 / dropout 0.7).
HopsUtil
- Library for launching TensorFlow jobs
- Manages the TensorBoard lifecycle
- Helper functions for Spark/Kafka/HDFS/etc.
HopsUtil - Read data
from os import path
import tensorflow as tf
from hopsutil import hdfs

dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, ...)
HopsUtil - Initialize Pydoop HDFS API

The Pydoop HDFS API is a rich API that provides operations such as:
- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories, and the filesystem

Connect to HopsFS using HopsUtil:

from hopsutil import hdfs

pydoop_handle = hdfs.get()
HopsUtil - TensorBoard
from hopsutil import tensorboard
[...]
logdir = tensorboard.logdir()
sv = tf.train.Supervisor(is_chief=True, logdir=logdir, [...], save_model_secs=60)
HopsUtil - Hyperparameter searching
from hopsutil import tflauncher

def training(learning_rate, dropout):
    [....]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
tflauncher.launch(spark, training, params)
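Conceptually, tflauncher.launch fans the Cartesian product of the parameter lists out across Spark executors, one combination per task. A minimal pure-Python sketch of that grid expansion (grid_combinations is an illustrative helper, not part of HopsUtil):

```python
from itertools import product

def grid_combinations(params):
    """Expand a dict of value lists into one dict per combination —
    the unit of work a grid launcher would hand each executor."""
    keys = sorted(params)
    return [dict(zip(keys, values))
            for values in product(*(params[k] for k in keys))]

params = {'learning_rate': [0.001, 0.002, 0.003], 'dropout': [0.3, 0.5, 0.7]}
combos = grid_combinations(params)
print(len(combos))  # 9 combinations for the 3x3 grid above
```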
HopsUtil - Logging
from hopsutil import hdfs
[...]
while not sv.should_stop() and step < steps:
    hdfs.log(sess.run(accuracy))
[...]
DEMO TIME! TensorFlow tour on HopsWorks
How to get started
How to get started (2)
How to get started (3)
TensorBoard
Dela - Search for interesting datasets
Dela - Import a Dataset
Dela
Peer-to-peer network of Hops clusters
Find and share interesting datasets
Exploits unused bandwidth and backs off when other network traffic appears
The Challenge
http://timdettmers.com/2017/08/31/deep-learning-research-directions
Experiment Time and Research Productivity
● Minutes, hours: interactive analysis!
● 1-4 days: interactivity replaced by many parallel experiments
● 1-4 weeks: high-value experiments only
● >1 month: don't even try!
Solution: Go distributed
State-of-the-Art in GPU Hardware
Nvidia DGX-1
SingleRoot Commodity GPU Cluster Computing
The budget side
Commodity Server*
➔ 10x Nvidia GTX 1080Ti (11 GB memory each)
➔ 256 GB RAM
➔ 2 Intel Xeon CPUs
➔ InfiniBand
➔ Single-root PCI complex
10 x Commodity Server = 150K Euro

Nvidia DGX-1
➔ 8x Nvidia Tesla V100 (16 GB memory each)
➔ 512 GB RAM
➔ 2 Intel Xeon CPUs
➔ InfiniBand
➔ NVLink
Price per DGX-1 = 150K Euro

*www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/
Distributed TensorFlow
Distributes the TensorFlow graph
Workers / parameter servers
Synchronous / asynchronous training
Model / data parallelism

Problems:
- Writing the clusterspec by hand
- Manually starting each process
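To see why the clusterspec is painful: in TensorFlow 1.x every process must be handed the same spec plus its own job name and task index, typically by hand for every experiment. A minimal sketch (hostnames are made up):

```python
import json

# The same spec must reach every process in the cluster.
cluster_spec = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

# In TF 1.x this dict feeds tf.train.ClusterSpec(cluster_spec), and each
# process then starts tf.train.Server(cluster, job_name="worker",
# task_index=0) — with a different job_name/task_index per process.
print(json.dumps(cluster_spec, sort_keys=True))
```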
Introducing TensorFlowOnSpark by YAHOO!
Wrapper for Distributed TensorFlow
- Creates the clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”
TensorFlowOnSpark architecture
The Spark driver coordinates the executors: one Spark executor runs the parameter server, the others run workers, all reading and writing HopsFS.
Scaling TensorFlowOnSpark
Near linear scaling up to 8 workers
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
TensorFlowOnSpark on Hops
Our improved TensorFlowOnSpark - 1
Problem: Uses RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.
Solution: Hops provides real GPU scheduling!
Our improved TensorFlowOnSpark - 2
Problem: A worker will wait until GPUs become available, potentially forever!
Solution: GPU scheduling ensures that each GPU is allocated exclusively to one particular worker.
Our improved TensorFlowOnSpark - 3
Problem: Each parameter server allocates one GPU - this is a waste!
Solution: Only workers may use GPUs.
Conversion guide: TensorFlowOnSpark
- TFCluster.run(spark, training_fun, num_executors, num_ps, ...)
- Add PySpark and TensorFlowOnSpark imports
- Create your own FileWriter
- Replace tf.train.Server() with TFNode.start_cluster_server()

Full conversion guide from Distributed TensorFlow to TensorFlowOnSpark:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
DEMO TIME! Distributed TF on Spark
Distributed Stochastic Gradient Descent
SGD with Data Parallelism (Single Host)
Facebook: Scaling Synchronous SGD
June 2017: ImageNet training time down from 2 weeks to 1 hour
➔ ~90% scaling efficiency going from 8 to 256 GPUs
Learning rate heuristic / warm-up phase / large batches
Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
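The learning-rate heuristic can be sketched in a few lines: with k workers the target learning rate is k times the base rate, reached by a linear ramp rather than all at once (the paper warms up over the first 5 epochs). warmup_lr is an illustrative helper, not code from the paper:

```python
def warmup_lr(step, warmup_steps, base_lr, k):
    """Linear scaling rule with gradual warm-up: ramp linearly from
    base_lr to k * base_lr over warmup_steps, then hold the target."""
    target = base_lr * k
    if step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps

# e.g. 32 workers, base lr 0.1: starts at 0.1, ends at 3.2 after warm-up
print(warmup_lr(0, 1000, 0.1, 32), warmup_lr(1000, 1000, 0.1, 32))
```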
All-Reduce
N GPUs, K parameters
Communication cost: 2(N-1) * K/N elements sent per GPU - this approaches 2K, essentially independent of the number of GPUs
Overlaps communication and computation
Drawback: synchronous communication
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
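The cost formula can be checked with a toy simulation. The sketch below is a hedged, pure-Python model of ring all-reduce (a scatter-reduce phase followed by an all-gather phase), not Baidu's actual implementation; it also counts how many elements each rank sends:

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over n = len(vectors) ranks.

    Each rank's K-element vector is split into n chunks. After n-1
    scatter-reduce steps and n-1 all-gather steps, every rank holds the
    element-wise sum, having sent 2*(n-1)*K/n elements in total."""
    n = len(vectors)
    k = len(vectors[0])
    assert k % n == 0, "for simplicity, K must divide evenly into n chunks"
    chunk = k // n
    data = [[v[c * chunk:(c + 1) * chunk] for c in range(n)] for v in vectors]
    sent = [0] * n  # elements sent per rank

    # Phase 1: scatter-reduce. Rank r ends up with the fully reduced
    # chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n          # chunk rank r forwards this step
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], data[r][c])]
            sent[r] += chunk

    # Phase 2: all-gather. Reduced chunks travel once more around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            dst = (r + 1) % n
            data[dst][c] = list(data[r][c])
            sent[r] += chunk

    return [sum(data[r], []) for r in range(n)], sent
```

With 4 ranks and K = 8, each rank sends 2 * 3 * 8/4 = 12 elements, matching the 2(N-1) * K/N formula.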
Baidu All-Reduce - Performance scaling
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
Horovod - Better than Baidu All-Reduce?
Fork of Baidu All-Reduce
Improvements
1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline
Migrating existing code to run on Horovod
1. Run hvd.init().
2. Pin a server GPU to the process using config.gpu_options.visible_device_list; the local rank maps each process to a unique GPU.
3. Wrap the optimizer in hvd.DistributedOptimizer.
4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
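Put together, the four steps look roughly like this in a TF 1.x training script. A hedged sketch assuming TensorFlow 1.x and Horovod are installed; the optimizer choice and learning rate are illustrative:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                          # 1. initialize Horovod

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # 2. one GPU per process

opt = tf.train.AdagradOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)                 # 3. all-reduce the gradients

hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # 4. sync initial weights from rank 0

# A train_op = opt.minimize(loss) inside a
# tf.train.MonitoredTrainingSession(hooks=hooks, config=config)
# completes the loop; the model code itself is unchanged.
```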
Horovod/Baidu AllReduce
Provided as a service on HopsWorks
- Integration of All-Reduce with a Hadoop cluster: use YARN to schedule GPUs
- Scheduling of homogeneous GPUs and network: YARN supports node labels
- HopsFS authentication/authorization
- TensorBoard lifecycle management as in HopsUtil
The team
Active contributors: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Past contributors: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson, Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.
www.hops.io
github.com/hopshadoop
@hopshadoop