multi-tenant deep learning and streaming as-a-service with … · 2019-05-20 · python (also...
Post on 06-Jul-2020
2 Views
Preview:
TRANSCRIPT
Multi-tenant Deep Learning and Streaming as-a-Service with HopsworksTheoflos Kakantousis (@theofloskak)COO – Logical Clocks AB
Big Data Moscow 2018
©2018 Logical Clocks AB. All Rights Reserved2
Deep Learning & Streaming-as-a-Service in Sweden
● Hopsworks
– Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– hops.site, 600+ users as of September 2018 ● RISE SICS ICE
– 250 kW Datacenter, ~1000 servers
https://www.sics.se/projects/sics-ice-data-center-in-lulea
©2018 Logical Clocks AB. All Rights Reserved3
[…] the general consensus seems to be that everyoneexpects some gain in performance numbers if the dataset size is increased dramatically [...]
Deep Learning needs Big data
Sun et Al. - Revisiting Unreasonable Efectiveness of Data in Deep Learning Era - 2017
Joel et Al. - Deep Learning Scaling is Predictable, Empirically - 2017
©2018 Logical Clocks AB. All Rights Reserved4
AI Hierarchy of Needs
DataEngineers
DataScientists
DataScientists?
DDL(Distributed
Deep Learning)
Deep Learning, RL
Machine Learning (ML)
Data Analytics
Data Pipelines
Big Data
Lots of GPUs
GPUs
Full-stack Data Science
©2018 Logical Clocks AB. All Rights Reserved6
Hopsworks
Hopsworks
Rest API
©2018 Logical Clocks AB. All Rights Reserved7
Hopsworks
Develop Train Test Deploy
MySQL Cluster
Hive
InfuxDB
ElasticSearch
KafkaProjects,Datasets,Users
HopsFS / YARN
Spark, Flink, Tensorfow
Jupyter
Jobs, Kibana, Grafana
RESTAPI
Big data needs scalable storage
©2018 Logical Clocks AB. All Rights Reserved9
HopsFS*
Metadata
Datanode
Namenode
● HDFS derivative with distributed metadata
– 37x increased capacity– 16x increased
throughput
HDFS Client
HDFS Client
Scale-out all layers
* HopsFS - https://goo.gl/yFCsGc
Scale Challenge Winner (2017)
Hops
©2018 Logical Clocks AB. All Rights Reserved10
HopsFS support for Small Files *
RAMNVMe Disk
Datanode
Namenode
> 64KB (Configurable)
< 1KB 1KB < > 64KB
● Integrates NVMe ● Open Images Dataset:
– 9m images– ~80% small fles (<64 KB)
NVMe Disk
Metadata layer - NDB
*Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
Multi-tenancy
©2018 Logical Clocks AB. All Rights Reserved12
Projects for Software-as-a-Service
Proj-42 Proj-X
Shared TopicTopic /Projs/My/Data
Proj-AllCompanyDB
©2018 Logical Clocks AB. All Rights Reserved13
Manage Projects like Github
©2018 Logical Clocks AB. All Rights Reserved14
Share like in Dropbox
Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc
©2018 Logical Clocks AB. All Rights Reserved15
Project Authorization
● Data Owner Privileges– Import/Export data– Manage Membership– Share DataSets, Topics
● Data Scientist Privileges
– Write and Run code● Delegate Administration of
privileges to users
©2018 Logical Clocks AB. All Rights Reserved16
Custom Python environments with Conda
Python libraries are usable by Spark/Tensorfow
©2018 Logical Clocks AB. All Rights Reserved17
TLS (not Kerberos) for security in Hops
● X.509 Certifcates for authentication● 1 Certifcate for each project
user● New App certifcate
generated for each job
● Store an audit trail of the operations (read/write/create/etc) users and apps perform on HopsFs
Resource Manager
Node Manager
HopsFs
Generate App Cert
Auth w/ App Cert
Project_user cert
©2018 Logical Clocks AB. All Rights Reserved18
TLS certifcate generation
alice@gmail.com
Users don’t see the certifcates,authenticate using:• LDAP• password • 2-Factor Authentication
Add/DelUsers
Distributed Database
Insert/Remove CertsProject Mgr
RootCA
HDFSSparkKafkaYARN
Cert Signing Requests
IntermediateCertifcate Authority
Hopsworks
Streaming-as-a-Service
©2018 Logical Clocks AB. All Rights Reserved20
ETL Workloads
ParquetHive
Hopsworks Jobs
trigger
Elastic
pipelines transform raw datato structured data
HopsFS
©2018 Logical Clocks AB. All Rights Reserved21
Streaming Analytics in Hopsworks
HopsFS YARN
HopsFS YARN
Grafana/InfluxDBGrafana/InfluxDB
Elastic/KibanaElastic/Kibana
Public Cloud or On-PremisePublic Cloud or On-Premise
Parquet / ORC
Data Src
Batch Analytics
Kafka
…...MySQLMySQL
©2018 Logical Clocks AB. All Rights Reserved22
Lifecycle of a Streaming Job
Developer
1.Discover: Schema Registry and Kafka Broker Endpoints2.Create: Kafka Properties file with certs and broker
details3.Create: Producer/Consumer using Kafka Properties
4.Download: the Schema for the Topic from the Schema Registry
5.Distribute: X.509 certs to all hosts on the cluster6.Cleanup securely
Operations
Facilitate dev+ops with hops-util https://github.com/logicalclocks/hops-util
©2018 Logical Clocks AB. All Rights Reserved23
Kafka Self-Service UI
Manage & Share• Topics• ACLs• Avro Schemas
Manage & Share• Topics• ACLs• Avro Schemas
©2018 Logical Clocks AB. All Rights Reserved24
Realtime Logs
● YARN aggregates logs on job completion– No good to us for Streaming
● Collect logs and make them searchable in real-time using Filebeat, Logstash, Elasticsearch, and Kibana
©2018 Logical Clocks AB. All Rights Reserved25
Realtime Logs
©2018 Logical Clocks AB. All Rights Reserved26
Resource Monitoring/Alerting
©2018 Logical Clocks AB. All Rights Reserved27
Jupyter Notebooks
©2018 Logical Clocks AB. All Rights Reserved28
Dela* – A Global Ecosystem for Datasets
Peer-to-Peer Search and Download for Huge DataSets(ImageNet, YouTube8M, MsCoCo, Reddit, etc)
*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
ML & Deep Learning-as-a-Service
©2018 Logical Clocks AB. All Rights Reserved30
HopsML Pipeline
©2018 Logical Clocks AB. All Rights Reserved31
HopsML Spark/TensorFlow Arch
Executor/Tf Executor/Tf
Driver
HopsFSTensorBoard Model Serving
Conda Envs
Conda Envs
Distributed Training
©2018 Logical Clocks AB. All Rights Reserved33
Deep Learning Hierarchy of Scale
DDLAllReduce
on GPU Servers
DDL with GPU Serversand Parameter Servers
Parallel Experiments on GPU Servers
Single GPU
Many GPUs on a Single GPU Server
Days/Hours
Days
Weeks
Minutes
Training Time for ImageNet
Hours
“My Model’s Training.”
Training
©2018 Logical Clocks AB. All Rights Reserved34
GPU Resource Requests in HopsYARN
HopsYARN HopsYARN
10 GPUs on 1 host
100 GPUs on 10 hosts with ‘Infiniband’
Hops supports a Hetrogenous Mix of GPUs
4 GPUs on any host
Experiments in Hopsworks
©2018 Logical Clocks AB. All Rights Reserved36
The boring part of the job
● Find good Hyperparameters for your model
● Test diferent confgurations● Automate this!
“I have to run a hundred experiments to fnd the best
model,” he complained, as he showed me his Jupyter notebooks.
“That takes time. Every experiment takes a lot of
programming, because there are so many diferent parameters.
[https://thomaswdinsmore.com/2018/01/30/predictions-for-2018/ ]
©2018 Logical Clocks AB. All Rights Reserved37
Experiments in TensorFlow/Hopsworks
● Run and evaluate multiple models in parallel on a subset of the dataset
Experiment 1 Experiment 2
Experiment 4Experiment 3
©2018 Logical Clocks AB. All Rights Reserved38
Reproducible Experiments
● Results tracking● Hyperparameter tracking● Jupyter notebook versioning● Conda Env versioning● WIP: Dataset versioning
©2018 Logical Clocks AB. All Rights Reserved39
Experiments Dashboard
©2018 Logical Clocks AB. All Rights Reserved40
TensorBoard (1)
©2018 Logical Clocks AB. All Rights Reserved41
TensorBoard (2)
©2018 Logical Clocks AB. All Rights Reserved42
HopsAPI*
● Python (also Java/Scala)– Manage TensorBoard, load/save models in HDFS – TensorFlow, Horovod, TensorFlowOnSpark– Parallel experiments
● Gridsearch● Model Architecture Search with Genetic Algorithms
– Secure Streaming Analytics with Kafka/Spark/Flink– SSL/TLS certs, Avro Schema, Endpoints for
Kafka/Zookeeper/etc.
* https://github.com/logicalclocks/hops-util-py
Model Serving
©2018 Logical Clocks AB. All Rights Reserved44
Standard serving infrastructure
Scale model serving with Kubernetes
Considered best practice by the community
Provide tools to easily:● Fault tolerance● Rolling release new models● Autoscaling
©2018 Logical Clocks AB. All Rights Reserved45
Model Monitoring
HopsFS
Serving infrastructure
Re-train and deploy new model
Model monitoring infrastructure
● Log model inference requests/results to Kafka● Spark monitors model performance and input data● When to retrain?
©2018 Logical Clocks AB. All Rights Reserved46
Model Serving on Kubernetes
©2018 Logical Clocks AB. All Rights Reserved47
Orchestrating Hops workfows
Data Collection
Experimentation Training ServingFeature
Extraction
Data Transformation & Verifcation
Test
Airflow (Hopsworks Operator)
Demo
©2018 Logical Clocks AB. All Rights Reserved49
Summary
● Build a single platform to cover the entire AI hierarchy of needs.
● Increase productivity of Data Scientists – Manage all your data pipelines and workflows
under a single roof– Have first-class support for Python / Streaming/
Deep Learning / ML / Data Governance / GPUs
Hopsworks → logicalclocks.comGitHub → github.com/logicalclocksTwitter → @logicalclocks
top related