pipelineai + aws sagemaker + distributed tensorflow + ai model training and serving - december 2017...
TRANSCRIPT
PIPELINE.AI: HIGH PERFORMANCE MODEL TRAINING & SERVING WITH GPUS…
…AND AWS SAGEMAKER, GOOGLE CLOUD ML, AZURE ML & KUBERNETES!
CHRIS FREGLY, FOUNDER @ PIPELINE.AI
RECENT PIPELINE.AI NEWS
[Slide images: announcements from Sept 2017 and Dec 2017]
INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Tech
§ Advanced Spark and TensorFlow Meetup
  § Please Join Our 60,000+ Global Members!!
Contact [email protected]
@cfregly
Global Locations
* San Francisco
* Chicago
* Austin
* Washington DC
* Düsseldorf
* London
INTRODUCTIONS: YOU
§ Software Engineer, Data Scientist, Data Engineer, Data Analyst
§ Interested in Optimizing and Deploying TF Models to Production
§ Nice to Have a Working Knowledge of TensorFlow (Not Required)
PIPELINE.AI IS 100% OPEN SOURCE
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Some VC’s Value GitHub Stars @ $15,000 Each (?!)
PIPELINE.AI OVERVIEW
450,000 Docker Downloads
60,000 Users Registered for GA
60,000 Meetup Members
40,000 LinkedIn Followers
2,200 GitHub Stars
12 Enterprise Beta Users
WHY HEAVY FOCUS ON MODEL SERVING?

Model Training
§ Batch & Boring
§ Offline in Research Lab
§ Pipeline Ends at Training
§ No Insight into Live Production
§ Small Number of Data Scientists
§ Optimizations Very Well-Known
§ 100's of Training Jobs per Day

Model Serving
§ Real-Time & Exciting!!
§ Online in Live Production
§ Pipeline Extends into Production
§ Continuous Insight into Live Production
§ Huge Number of Application Users
§ Many Optimizations Not Yet Utilized
§ 1,000,000's of Predictions per Sec
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
PACKAGE MODEL + RUNTIME AS ONE
§ Build Model with Runtime into Immutable Docker Image
§ Emphasize Immutable Deployment and Infrastructure
§ Same Runtime Dependencies in All Environments
  § Local, Development, Staging, Production
  § No Library or Dependency Surprises
§ Deploy and Tune Model + Runtime Together

Build Local Model Server A:

pipeline predict-server-build --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A \
                              --model-path=./models/tensorflow/mnist/
LOAD TEST LOCAL MODEL + RUNTIME
§ Perform Mini-Load Test on Local Model Server
§ Immediate, Local Prediction Performance Metrics
§ Compare to Previous Model + Runtime Variations

Start Local Model Server A:

pipeline predict-server-start --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A

Load Test Local Model Server A:

pipeline predict --model-endpoint-url=http://localhost:8080 \
                 --test-request-path=test_request.json \
                 --test-request-concurrency=1000
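The shape of such a mini-load test can be sketched in plain Python. This is an illustration only, not the `pipeline predict` implementation: `predict()` here is a dummy stand-in for an HTTP POST to the local model server, and the request counts are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def predict(request):
    """Stand-in for an HTTP POST to the local model server (hypothetical)."""
    time.sleep(0.001)  # simulate ~1 ms of model latency
    return {"prediction": 7, "confidence": 0.98}

def load_test(request, num_requests=200, concurrency=20):
    """Fire concurrent requests and report latency percentiles."""
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        predict(request)
        latencies.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(num_requests)))
    latencies.sort()
    return {
        "p50_ms": latencies[len(latencies) // 2] * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99)] * 1000,
    }

metrics = load_test({"image": [0.0] * 784})
```

Latency percentiles (not averages) are the useful comparison point across model + runtime variations, since tail latency is what production users feel.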
PUSH IMAGE TO DOCKER REGISTRY
§ Supports All Public + Private Docker Registries
  § DockerHub, Artifactory, Quay, AWS, Google, …
  § Or a Self-Hosted, Private Docker Registry

Push Image to Docker Registry:

pipeline predict-server-push --image-registry-url=<your-registry> \
                             --image-registry-repo=<your-repo> \
                             --model-type=tensorflow \
                             --model-name=mnist \
                             --model-tag=A
CLOUD-BASED OPTIONS
§ AWS SageMaker
  § Released Nov 2017 @ re:Invent
  § Custom Docker Images for Training & Serving (e.g. PipelineAI Images)
  § Distributed TensorFlow Training through the Estimator API
  § Traffic Splitting for A/B Model Testing
§ Google Cloud ML Engine
  § Mostly Command-Line Based
  § Driving the TensorFlow Open Source API (e.g. the Experiment API)
§ Azure ML
TUNE MODEL + RUNTIME AS SINGLE UNIT
§ Model Training Optimizations
  § Model Hyper-Parameters (e.g. Learning Rate)
  § Reduced Precision (e.g. FP16 Half Precision)
§ Post-Training Model Optimizations
  § Quantize Model Weights + Activations from 32-bit to 8-bit
  § Fuse Neural Network Layers Together
§ Model Runtime Optimizations
  § Runtime Configs (e.g. Request Batch Size)
  § Different Runtimes (e.g. TensorFlow Lite, Nvidia TensorRT)
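The 32-bit-to-8-bit weight quantization mentioned above can be illustrated with a toy linear quantizer. This is a deliberate simplification: real tools (e.g. the Graph Transform Tool) also calibrate activation ranges and fuse layers, and the example weights below are made up.

```python
def quantize_weights(weights):
    """Linearly map 32-bit float weights onto 8-bit integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_weights(q_weights, scale):
    """Recover approximate float weights at serving time."""
    return [q * scale for q in q_weights]

weights = [0.82, -0.44, 0.05, -1.27, 0.99]   # hypothetical FP32 weights
q, scale = quantize_weights(weights)
approx = dequantize_weights(q, scale)
# Each recovered weight is within scale/2 of the original 32-bit value,
# while the stored representation shrinks from 32 bits to 8 bits per weight.
```

The 4x size reduction is why quantization both shrinks the model and speeds up matrix math on integer-friendly hardware.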
POST-TRAINING OPTIMIZATIONS
§ Prepare Model for Serving
  § Simplify Network
  § Reduce Model Size
  § Quantize for Fast Matrix Math
§ Some Tools
  § Graph Transform Tool (GTT)
  § tfcompile

[Figure: linear regression fit, after training vs. after optimizing]

pipeline optimize --optimization-list=[quantize_weights,tfcompile] \
                  --model-type=tensorflow \
                  --model-name=mnist \
                  --model-tag=A \
                  --model-path=./tensorflow/mnist/model \
                  --output-path=./tensorflow/mnist/optimized_model
RUNTIME OPTION: TENSORFLOW LITE
§ Post-Training Model Optimizations
§ Currently Supports iOS and Android
§ On-Device Prediction Runtime
  § Low-Latency, Fast Startup
§ Selective Operator Loading
  § 70KB Min - 300KB Max Runtime Footprint
§ Supports Accelerators (GPU, TPU)
  § Falls Back to CPU without Accelerator
§ Java and C++ APIs
RUNTIME OPTION: NVIDIA TENSORRT
§ Post-Training Model Optimizations
  § Specific to Nvidia GPUs
§ GPU-Optimized Prediction Runtime
  § Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
DEPLOY MODELS SAFELY TO PROD
§ Deploy from CLI or Jupyter Notebook
§ Tear Down or Roll Back Models Quickly
§ Shadow Canary Deploy: e.g. 20% Live Traffic
§ Split Canary Deploy: e.g. 97-2-1% Live Traffic

Start Production Model Cluster B:

pipeline predict-cluster-start --model-runtime=tflite \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=B \
                               --traffic-split=2

Start Production Model Cluster C:

pipeline predict-cluster-start --model-runtime=tensorrt \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=C \
                               --traffic-split=1

Start Production Model Cluster A:

pipeline predict-cluster-start --model-runtime=tfserving_gpu \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=A \
                               --traffic-split=97
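The 97-2-1% split canary above boils down to weighted random routing per request. A minimal sketch (the `TRAFFIC_SPLIT` dict and `route_request` helper are illustrative names, not PipelineAI internals):

```python
import random

# Hypothetical weights matching the 97-2-1% split canary deploy.
TRAFFIC_SPLIT = {"A": 97, "B": 2, "C": 1}

def route_request():
    """Pick a model cluster for one request, proportional to its traffic weight."""
    models = list(TRAFFIC_SPLIT)
    weights = [TRAFFIC_SPLIT[m] for m in models]
    return random.choices(models, weights=weights, k=1)[0]

random.seed(0)
counts = {m: 0 for m in TRAFFIC_SPLIT}
for _ in range(100_000):
    counts[route_request()] += 1
# Roughly 97% of requests land on cluster A, 2% on B, 1% on C.
```

Because only 3% of live traffic ever reaches the new B and C clusters, a bad new model variant hurts few users and can be torn down quickly.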
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
  § Validation + Training Accuracy
  § CPU + GPU Utilization
§ Live Prediction Values
  § Compare Relative Precision
  § Newly-Seen, Streaming Data
§ Online, Real-Time Metrics
  § Response Time, Throughput
  § Cost ($) Per Prediction
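The cost-per-prediction metric is simple amortization arithmetic. A sketch with made-up numbers (the $3.06/hr instance price and 1,000 predictions/sec throughput are hypothetical, not measured):

```python
def cost_per_prediction(instance_cost_per_hour, predictions_per_second):
    """Amortize hourly instance cost over the predictions served in that hour."""
    predictions_per_hour = predictions_per_second * 3600
    return instance_cost_per_hour / predictions_per_hour

# Hypothetical: a $3.06/hr GPU instance sustaining 1,000 predictions/sec
cost = cost_per_prediction(3.06, 1000)
print(f"${cost * 1_000_000:.2f} per million predictions")  # $0.85 per million
```

Tracking this per model variant makes cloud and runtime choices directly comparable in dollars, not just milliseconds.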
VIEW REAL-TIME PREDICTION STREAM
§ Visually Compare Real-Time Predictions

[Figure: live dashboard of prediction inputs plus prediction results & confidences for Models A, B, and C]
PREDICTION PROFILING AND TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Steps in Real-Time Prediction
  1. transform_request()
  2. predict()
  3. transform_response()
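The three steps can be sketched as a handler that times each stage separately, which is what makes bottlenecks (often request/response transformation, not the model itself) visible. The step names come from the slide; the bodies below are dummy stand-ins, not PipelineAI's actual implementations.

```python
import json
import time

def transform_request(raw_body):
    """Step 1: decode the wire format into model inputs."""
    return json.loads(raw_body)["image"]

def predict(inputs):
    """Step 2: dummy stand-in for the actual TensorFlow model invocation."""
    return {"digit": 7, "confidence": 0.98}

def transform_response(outputs):
    """Step 3: encode model outputs back into the wire format."""
    return json.dumps(outputs)

def handle(raw_body):
    """Run all three steps, recording per-step timings for fine-grained metrics."""
    timings = {}
    start = time.perf_counter()
    inputs = transform_request(raw_body)
    timings["transform_request"] = time.perf_counter() - start

    start = time.perf_counter()
    outputs = predict(inputs)
    timings["predict"] = time.perf_counter() - start

    start = time.perf_counter()
    response = transform_response(outputs)
    timings["transform_response"] = time.perf_counter() - start
    return response, timings

response, timings = handle('{"image": [0.0, 0.1]}')
```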
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, ADAPTIVE TRAFFIC ROUTING
§ A/B Tests
  § Inflexible and Boring
§ Multi-Armed Bandits
  § Adaptive and Exciting!

Adjust Traffic Routing Dynamically:

pipeline traffic-router-split --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag-list=[A,B,C] \
                              --model-weight-list=[1,2,97]
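To make the bandit idea concrete, here is an epsilon-greedy router, one classic multi-armed bandit algorithm. The slide does not say which algorithm PipelineAI uses, and the per-model revenue numbers are invented; the point is only that traffic concentrates on the best-observed model instead of a fixed A/B split.

```python
import random

class EpsilonGreedyRouter:
    """Mostly route to the best-observed model; occasionally explore the rest."""

    def __init__(self, models, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in models}
        self.total_reward = {m: 0.0 for m in models}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))        # explore
        return max(self.counts, key=self._mean_reward)     # exploit

    def record(self, model, reward):
        self.counts[model] += 1
        self.total_reward[model] += reward

    def _mean_reward(self, model):
        return self.total_reward[model] / max(self.counts[model], 1)

random.seed(42)
router = EpsilonGreedyRouter(["A", "B", "C"])
# Hypothetical average revenue per prediction, held fixed for the sketch.
REVENUE_PER_PREDICTION = {"A": 0.10, "B": 0.12, "C": 0.15}
for _ in range(5000):
    m = router.choose()
    router.record(m, REVENUE_PER_PREDICTION[m])
# Traffic concentrates on model C, the highest-revenue variant.
```

Unlike a static A/B test, the routing keeps adapting if a model's observed reward drifts, which is exactly the "adaptive and exciting" property the slide contrasts with fixed splits.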
SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model using AI Bandit Algos
SHIFT TRAFFIC TO MIN(CLOUD CO$T)
§ Based on Cost ($) Per Prediction
§ Cost Changes Throughout the Day
  § Lose AWS Spot Instances
  § Google Cloud Becomes Cheaper
§ Shift Across Clouds & On-Prem
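The cost-based variant of the routing decision reduces to picking the current minimum of a per-prediction price table. A sketch with hypothetical prices and target names (not actual PipelineAI routing code):

```python
# Hypothetical per-prediction costs, refreshed as spot prices change.
COST_PER_PREDICTION = {
    "aws-spot": 0.0000009,
    "gcp": 0.0000012,
    "on-prem": 0.0000015,
}

def cheapest_target(costs):
    """Route new traffic to whichever cloud (or on-prem cluster) is cheapest."""
    return min(costs, key=costs.get)

target = cheapest_target(COST_PER_PREDICTION)      # currently "aws-spot"
# If spot capacity is lost and the effective AWS price spikes, routing flips:
COST_PER_PREDICTION["aws-spot"] = 0.0000020
target = cheapest_target(COST_PER_PREDICTION)      # now "gcp"
```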
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, CONTINUOUS MODEL TRAINING
§ The Holy Grail of Machine Learning
§ Q1 2018: PipelineAI Supports Continuous Model Training!
  § Kafka, Kinesis
  § Spark Streaming
PSEUDO-CONTINUOUS TRAINING
§ Identify and Fix Borderline Predictions (~50/50 Confidence)
§ Fix Along Class Boundaries
§ Retrain Newly-Labeled Data
§ Game-ify Labeling Process
§ Enable Crowd Sourcing
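Identifying borderline predictions is just a confidence filter around the decision boundary. A sketch for a binary classifier, where the +/-0.10 band width and the prediction records are hypothetical:

```python
def borderline(predictions, band=0.10):
    """Flag binary predictions whose confidence sits near the 50/50 decision
    boundary; these are the candidates for human re-labeling and retraining."""
    low, high = 0.5 - band, 0.5 + band
    return [p for p in predictions if low <= p["confidence"] <= high]

stream = [
    {"id": 1, "confidence": 0.97},
    {"id": 2, "confidence": 0.52},   # borderline: send to labelers
    {"id": 3, "confidence": 0.44},   # borderline: send to labelers
    {"id": 4, "confidence": 0.88},
]
to_label = borderline(stream)  # items 2 and 3
```

Routing only these low-confidence cases to (possibly crowd-sourced) labelers keeps the labeling workload small while concentrating new training data exactly where the class boundary is fuzzy.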
DEMO: TRAIN, DEPLOY, TEST MODEL
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
pipeline predict-server-build --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A \
                              --model-path=./models/tensorflow/mnist/
THANK YOU!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Reminder: VC’s Value GitHub Stars @ $15,000 Each (!!)
Contact [email protected]
@cfregly