Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
HopsFS: Next generation HDFS
37x number of files
16x throughput
Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
Hops platform
Projects, Datasets, Users
HopsFS, HopsYARN, MySQL NDB Cluster
Spark, TensorFlow, Hive, Kafka, Flink
Jupyter, Zeppelin
Jobs, Grafana, ELK
REST API
Version 0.3.0 just released!
Python first
Conda Repo
Project Conda env
Search
Install/Remove
Python-3.6, pandas-1.4, Numpy-0.9
Environment usable by Spark/TensorFlow
Hops Python library: make development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle
Find big datasets - Dela*
● Discover, share and experiment with interesting datasets
● P2P network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth
*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
Scale out level 1: Parallel hyperparameter searching
Parallel Hyperparameter searching
from hops import tflauncher, util

# Argument names mirror the keys of args_dict
def model(learning_rate, dropout):
    …

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001], 'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)
tflauncher.launch(spark, model, args_dict_grid)
Starts 6 parallel experiments
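The count of six comes from the grid: three learning rates times two dropout values. A minimal sketch of that expansion in plain Python follows, assuming `util.grid_params` builds the Cartesian product of the listed values. The helper `expand_grid` is hypothetical and is not the Hops implementation.

```python
from itertools import product


def expand_grid(args_dict):
    """Return one {name: value} dict per combination of the listed values.

    Hypothetical helper for illustration only; not part of the hops library.
    """
    names = list(args_dict)
    value_lists = [args_dict[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


args_dict = {'learning_rate': [0.001, 0.0005, 0.0001], 'dropout': [0.45, 0.7]}
experiments = expand_grid(args_dict)

# One line per experiment, each a candidate for its own executor
for params in experiments:
    print(params)
print(len(experiments), "experiments")
```

Each dictionary corresponds to one independent training run, so the runs need no communication with each other and can be spread over as many executors as are available.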
Scale out level 2: Distributed training
TensorFlowOnSpark (TFoS) by Yahoo!
● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/Workers executed inside Spark executors
● Uses Spark for resource allocation
– Our version: exclusive GPU allocation
– Parameter server(s) do not get GPU(s)
● Manages TensorBoard
Run TFoS
def training_fun(argv, ctx):
    …
    TFNode.start_cluster_server()
    …

TFCluster.run(spark, training_fun, num_exec, num_ps…)
Full conversion guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
Scale out level: Master of the dark arts
Horovod
Parameter server (PS) architecture doesn’t scale
From: https://github.com/uber/horovod
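One way to see the scaling problem is to count bytes per training step. The sketch below rests on a deliberately simple model that is an assumption, not something stated on the slide: a single parameter server, every worker pushing a full gradient and pulling full parameters each step, and no sharding or compression. It compares the traffic through that one server with the traffic each worker sends in a ring all-reduce. The function names and the 100 MiB model size are illustrative.

```python
def parameter_server_traffic(num_workers, model_bytes):
    """Bytes through a single parameter server per step.

    The server receives one full gradient from every worker and sends
    one full set of parameters back to every worker.
    """
    return 2 * num_workers * model_bytes


def ring_allreduce_traffic(num_workers, model_bytes):
    """Bytes sent by each worker per step in a ring all-reduce.

    Each worker sends (N - 1) chunks of size model_bytes / N while reducing,
    then another (N - 1) chunks while distributing the result.
    """
    return 2 * (num_workers - 1) * model_bytes / num_workers


MODEL_BYTES = 100 * 1024 ** 2  # assumed 100 MiB of gradients, illustration only

for n in (2, 8, 64, 256):
    ps_mib = parameter_server_traffic(n, MODEL_BYTES) / 1024 ** 2
    ring_mib = ring_allreduce_traffic(n, MODEL_BYTES) / 1024 ** 2
    print(f"{n:4d} workers: PS node {ps_mib:8.0f} MiB, ring worker {ring_mib:6.1f} MiB")
```

Under this model the load on the parameter server grows linearly with the number of workers, while the per-worker load in the ring stays below twice the model size however many workers join.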
Horovod by Uber
● Based on previous work done by Baidu
● Organize workers in a ring
● Gradient updates distributed using All-Reduce
● Synchronous protocol
All-Reduce
GPU1: a0 | b0 | c0
GPU2: a1 | b1 | c1
GPU3: a2 | b2 | c2
All-Reduce
GPU1: a0 | b0 | c0 + c2
GPU2: a0 + a1 | b1 | c1
GPU3: a2 | b1 + b2 | c2
All-Reduce
GPU1: a0 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b1 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c2
All-Reduce
GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c0 + c1 + c2
All-Reduce
GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU2: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
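The All-Reduce slides above can be replayed as a small simulation. This is a minimal sketch in plain Python, with lists standing in for GPU buffers and numbers standing in for gradient chunks. It is not how Horovod is implemented, and the function name `ring_allreduce` is chosen here for illustration. Each worker sends one chunk to its right-hand neighbour per step: first the chunks are summed around the ring, then the finished sums are passed around until every worker holds all of them.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over N workers.

    buffers[i] is worker i's list of N chunks. Returns the buffers after
    the exchange, where every worker holds the element-wise sum.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each worker needs exactly one chunk per worker")
    data = [list(b) for b in buffers]

    # Phase 1, scatter-reduce: after N - 1 steps worker r holds the
    # complete sum for chunk (r + 1) % N.
    for step in range(n - 1):
        # Collect all messages first so the sends in a step are simultaneous.
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                 for r in range(n)]
        for dest, chunk, value in sends:
            data[dest][chunk] += value

    # Phase 2, all-gather: pass each completed chunk around the ring,
    # overwriting the partial sums.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dest, chunk, value in sends:
            data[dest][chunk] = value

    return data


# Three workers, matching GPU1..GPU3 with chunks a, b, c on the slides
gpus = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
for row in ring_allreduce(gpus):
    print(row)
```

With three workers the simulation takes two reduce steps and two gather steps, which matches the sequence of intermediate states shown on the slides.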
Hops AllReduce
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …

def main(_):
    hvd.init()
    # Wrap the existing optimizer so gradients are combined across workers
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
    …

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
Demo time!
Play with it → hops.io/?q=content/hopsworks-vagrant
Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop