PowerPoint presentation - AI & Big Data Congress

Post on 15-Oct-2021

WORKSHOP 2: MACHINE LEARNING MODELS IN PRODUCTION:

DIFFERENT APPROACHES AND THE ROLE OF ML LIFE CYCLE MANAGEMENT TOOLS

Rohit Kumar
Researcher, Data Science and Big Data Analytics Unit
Eurecat
eurecat.org | @Eurecat_News | @Eurecat_Events

BIG DATA & AI Congress 2019

MACHINE LEARNING MODELS IN PRODUCTION

Dr. Rohit Kumar

Researcher, Eurecat

Who is this guy!!

Physicist

Java Developer/System Architect (6+ years)

Big Data Researcher (5+ years)

I am not an ML expert or a Data Scientist!! I am just an engineer who can follow complex maths…

Quick Poll?

How many data scientists?

How many data engineers?

How many DevOps?

How many managers/architects?

Rockstars who do everything?

Quick Feedback: How many have this issue?

What I am going to talk about!!

Data wrangling, Training, Prediction

ML Life Cycle!! Not Data Science Life Cycle..

Acknowledgment: https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html by Julien Kervizic

Training

1. One-off training

Where do you save your notebook/code?

Where and how do you save your models?

Training

2. Batch training (continuous learning system)

[Diagram: Train, Predict]

Not simply retraining the model weights!! It also covers feature processing, feature selection, model selection and parameter optimization.
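A minimal sketch of what one batch-retraining run could look like when it also redoes parameter optimization, not just the weights. The toy one-feature ridge model, the function names and the `lam_grid` values are illustrative assumptions, not something from the talk; a real system would also redo feature processing and feature selection inside the same loop.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, lam):
    """Fit y ~ w*x + b in closed form, with an L2 penalty `lam` on w."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / (sxx + lam)
    b = y_bar - w * x_bar
    return w, b


def mse(model, xs, ys):
    """Mean squared error of a (w, b) model on the given data."""
    w, b = model
    return mean((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def retrain(train, valid, lam_grid=(0.0, 0.1, 1.0, 10.0)):
    """One retraining run: refit every candidate and keep the best on validation."""
    best = None
    for lam in lam_grid:
        model = fit_ridge_1d(train[0], train[1], lam)
        score = mse(model, valid[0], valid[1])
        if best is None or score < best["valid_mse"]:
            best = {"lam": lam, "model": model, "valid_mse": score}
    return best


# Toy batch generated from y = 3x + 2 (hypothetical data).
train = ([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
valid = ([4.0, 5.0], [14.0, 17.0])
best = retrain(train, valid)
```

Each scheduled run would call `retrain` on the latest batch and hand `best["model"]` to the prediction side.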

AutoML

AutoWEKA is an approach for the simultaneous selection of a machine learning algorithm and its hyperparameters; combined with the WEKA package it automatically yields good models for a wide variety of data sets.

TPOT is a data-science assistant which optimizes machine learning pipelines using genetic programming. (Good for Python users)

H2O AutoML provides automated model selection and ensembling for the H2O machine learning and data analytics platform. (Nice for Java users)

TransmogrifAI is an AutoML library running on top of Spark.

To manage different data workflows

To automate Machine Learning

Azure ML

Of course there are easy-to-use fancy platforms!! (if you have money and do not want to get your hands too dirty!!)

Training

3. Real-time training

Incremental Training using “Online Machine Learning Algorithms”
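As a rough illustration of incremental training, the sketch below updates a linear model one sample at a time with stochastic gradient descent. The class name, the `learn_one` method and the learning rate are assumptions made for this example; the talk does not name a specific online algorithm.

```python
class OnlineLinearRegression:
    """Linear model trained one (x, y) pair at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def learn_one(self, x, y):
        """Single gradient step; no past data needs to be kept around."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Simulated stream generated from y = 2x + 1 (hypothetical data).
model = OnlineLinearRegression(n_features=1)
stream = [([x], 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)] * 400
for x, y in stream:
    model.learn_one(x, y)
```

The model is usable for prediction at any point in the stream, which is what separates this from batch retraining.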

Quick Poll: What kind of ML training approach have you used?

A. One Off training.

B. Batch training, ad hoc.

C. Batch training using some platform.

D. Real time training

Saving your Models

In Python world:
1) Pickle
2) Joblib: significantly faster on large numpy arrays

In R world:
1) saveRDS/readRDS – can be assigned to a new model variable
2) save/load – returns the same model variable

In Java (H2O world):
1) POJO: Plain Old Java Object
2) MOJO: Model Object, Optimized
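For the Python case, a minimal pickle round trip could look like the sketch below. The dictionary standing in for a trained model and the file name `model.pkl` are placeholders chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for a trained model: just its learned parameters.
model = {"kind": "linear", "weights": [0.5, 2.0], "bias": 1.0}


def predict(model, features):
    """Apply the linear model to one feature vector."""
    return sum(w * f for w, f in zip(model["weights"], features)) + model["bias"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # hypothetical file name

    # Save: serialize the model object to disk.
    with path.open("wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Load: rebuild an equivalent object, e.g. in the prediction process.
    with path.open("rb") as f:
        restored = pickle.load(f)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.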

Saving your Models

“Change, which is the only constant in life, can be secured and facilitated by standardization.”

Saving your Models

ONNX (Open Neural Network Exchange) format: developed by Facebook and Microsoft in collaboration, it is an open-source standard for representing deep learning models that lets you transfer them between CNTK, Caffe2 and PyTorch.

Link: https://github.com/onnx/tutorials

TensorFlow PyTorch

Saving your Models

Saving your Models

PMML (Predictive Model Markup Language): developed by the Data Mining Group (DMG), a consortium of commercial and open-source data mining companies, and supported by IBM, AWS and Google.

PFA (Portable Format for Analytics): it is more flexible (using JSON instead of XML) and also handles data preparation along with the models themselves.

Saving your Models

MLWriter/MLReader: Native to Spark

Supports saving not only a Model but also an ML Pipeline.

Saving a pipeline model produces a JSON file for the metadata and Parquet files for the model data, e.g. coefficients.

Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y?

• Major versions: No guarantees, but best-effort.

• Minor and patch versions: Yes; (backwards compatible)

• Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.

Does a model or Pipeline in Spark version X behave identically in Spark version Y?

• Major versions: No guarantees, but best-effort.

• Minor and patch versions: Identical behavior, except for bug fixes.

Quick Poll: Machine Learning Deployment

A. One time PPT/PDF/Report for Managers.

B. Deployed on a server for batch prediction.

C. Deployed in a streaming platform for real time prediction.

Batch vs. Real-time Prediction

Load implications

Infrastructure implications

Cost implications

Evaluation implications

Batch Prediction

[Diagram: Data, Features extracted, Load model, Apply the model to the data, Save, Used by other systems]

Use a workflow manager like Airflow, or something as simple as crontab.
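The batch flow above can be sketched as a single job that a scheduler would run. The column names (`id`, `amount`, `n_items`), the pickled parameter dictionary and the in-memory CSV buffers are illustrative assumptions; in practice the input and output would be files or tables.

```python
import csv
import io
import pickle


def extract_features(row):
    """Turn one raw input row into the feature vector the model expects."""
    return [float(row["amount"]), float(row["n_items"])]


def predict(model, features):
    return sum(w * f for w, f in zip(model["weights"], features)) + model["bias"]


def run_batch(model_bytes, input_csv, output_csv):
    """Load the model, score every input row, and save the predictions."""
    model = pickle.loads(model_bytes)
    writer = csv.writer(output_csv)
    writer.writerow(["id", "prediction"])
    n_rows = 0
    for row in csv.DictReader(input_csv):
        writer.writerow([row["id"], predict(model, extract_features(row))])
        n_rows += 1
    return n_rows


# Hypothetical saved model and input data, kept in memory for the example.
model_bytes = pickle.dumps({"weights": [0.5, 2.0], "bias": 1.0})
input_csv = io.StringIO("id,amount,n_items\na,10,2\nb,4,1\n")
output_csv = io.StringIO()
n_scored = run_batch(model_bytes, input_csv, output_csv)
```

Other systems then read the saved predictions instead of calling the model directly.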

Real Time Prediction

1. Database Trigger based prediction

PostgreSQL supports native Python integration and can run Python scripts on trigger

MS SQL Server has Machine Learning Services (In-Database)

Teradata can run R/Python scripts through an external script command.

Oracle supports PMML models through its data mining extension.
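To show the trigger idea without any of the databases listed above, the sketch below swaps in SQLite from the Python standard library: a Python scoring function is registered with the connection and a trigger calls it on every insert. The table, column and function names are made up for the example, and the mechanism differs from how PostgreSQL or SQL Server run in-database scripts.

```python
import sqlite3

WEIGHTS = (0.5, 2.0)  # hypothetical trained parameters
BIAS = 1.0


def predict_score(amount, n_items):
    """Toy model: linear score from two columns of the inserted row."""
    return WEIGHTS[0] * amount + WEIGHTS[1] * n_items + BIAS


conn = sqlite3.connect(":memory:")

# Expose the Python function to SQL so the trigger can call it.
conn.create_function("predict_score", 2, predict_score)

conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount REAL, n_items INTEGER, score REAL)"
)
conn.execute(
    """
    CREATE TRIGGER score_on_insert AFTER INSERT ON orders
    BEGIN
        UPDATE orders
        SET score = predict_score(NEW.amount, NEW.n_items)
        WHERE id = NEW.id;
    END
    """
)

# Inserting a row is enough: the trigger fills in the prediction.
conn.execute("INSERT INTO orders (amount, n_items) VALUES (?, ?)", (10.0, 2))
score = conn.execute("SELECT score FROM orders WHERE id = 1").fetchone()[0]
```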

Real Time Prediction

2. Message queue based prediction service

A combination of Kafka for the message queue and Spark Streaming for processing.

Azure Event Hubs with Azure Functions.

Google Pub/Sub with Apache Beam.
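The pattern behind all three combinations is the same: events arrive on a queue, a consumer scores them, and predictions go out on another queue. The sketch below imitates that in one process with `queue.Queue` and a thread; it is a stand-in for a real broker, and the event fields and the threshold model are assumptions.

```python
import queue
import threading

STOP = object()  # sentinel telling the worker to shut down


def predict(event):
    """Toy model: flag events whose value reaches a threshold."""
    return 1 if event["value"] >= 0.5 else 0


def prediction_worker(inbox, outbox):
    """Consume events from `inbox` and publish predictions to `outbox`."""
    while True:
        event = inbox.get()
        if event is STOP:
            break
        outbox.put({"id": event["id"], "prediction": predict(event)})


inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=prediction_worker, args=(inbox, outbox))
worker.start()

# A producer publishes events as they happen.
events = [
    {"id": "e1", "value": 0.9},
    {"id": "e2", "value": 0.1},
    {"id": "e3", "value": 0.5},
]
for event in events:
    inbox.put(event)
inbox.put(STOP)
worker.join()

results = [outbox.get() for _ in events]
```

With a real broker the producer and the worker would be separate services, and scaling means adding more consumers.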

Real Time Prediction

3. Web services based deployment

Get unique IDs and let the web service pull the related data from a backend system and make the prediction.

Get the complete payload with data and make the prediction.

A combination of both.
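The three request styles above can be handled by one endpoint. The sketch below shows only the request handler, leaving out the HTTP framework; `FEATURE_STORE`, the JSON field names and the toy scoring rule are hypothetical.

```python
import json

# Hypothetical backend lookup keyed by unique ID.
FEATURE_STORE = {
    "u1": {"age": 32, "visits": 4},
    "u2": {"age": 40, "visits": 2},
}


def predict(features):
    """Toy model: linear score over two features."""
    return 0.25 * features["age"] + 0.5 * features["visits"]


def handle_request(body):
    """Accept an ID, a full feature payload, or both, and return a prediction."""
    request = json.loads(body)
    features = {}
    if "id" in request:
        # Style 1: pull the related data from the backend.
        features.update(FEATURE_STORE[request["id"]])
    # Style 2 (or 3): features sent in the payload override stored ones.
    features.update(request.get("features", {}))
    return json.dumps({"prediction": predict(features)})


by_id = json.loads(handle_request('{"id": "u1"}'))
by_payload = json.loads(handle_request('{"features": {"age": 40, "visits": 2}}'))
combined = json.loads(handle_request('{"id": "u1", "features": {"visits": 6}}'))
```

Sending only IDs keeps requests small but couples the service to the backend; sending the full payload does the opposite.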

Real Time Prediction

3.1 Using Functions!!

+ ML Net

Real Time Prediction

3.2 Using Container

Non-uniform environments across models.

Non-uniform library requirements across models.

Non-uniform resource requirements across models.

Scaling at model level.

Real Time Prediction

3.3 Directly In-App

Google ML Kit, or the likes of Caffe2 or Apple’s Core ML, let you use models within native applications.

TensorFlow.js or ONNX.js allow running models directly in the browser.

Real Time Prediction

3.4 Using Notebooks!!!!

So Far So Good!!

Why ML Life Cycle is more complicated than SDLC!

Variety of tools and packages at disposal.

Hard to track useful experiments.

Reproducibility is hard.

Deploying is hard.

ML Life Cycle Management Platforms

Azure ML services (launched in 2015)

ML Life Cycle Management Platforms

Amazon SageMaker (launched in 2017)

ML Life Cycle Management Platforms

Google AI (launched in 2018)

ML Life Cycle Management Platforms

Cloudera DS Workbench (launched in 2018)

ML Life Cycle Management Platforms

Databricks Managed MLflow (launched in 2019): MLflow in itself is a library and not a hosted platform.

ML Life Cycle Management Platforms
