automating machine learning workflows: a report from the trenches - jose a. ortega ruiz @ papis...

Automating Machine Learning Features and Workflows. PAPIs Connect Valencia, 2016

Upload: papisio

Post on 08-Jan-2017


Page 1: Automating Machine Learning Workflows: A Report from the Trenches - Jose A. Ortega Ruiz @ PAPIs Connect

Automating Machine Learning Features and Workflows

PAPIs Connect Valencia, 2016

Outline

Introduction: ML as a System Service

Feature Engineering Automation

Workflow Automation

Challenges and Outlook

Machine Learning as a System Service

The goal

Machine Learning as a system-level service

The means

- APIs: ML building blocks
- Abstraction layer over feature engineering
- Abstraction layer over algorithms
- Automation

Machine Learning Workflows

Dr. Natalia Konstantinova (http://nkonst.com/machine-learning-explained-simple-words/)

Machine Learning Workflows for real

Jeannine Takaki, Microsoft Azure Team

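Machine Learning Automation Today

from bigml.api import BigML

api = BigML()  # credentials are read from the environment

project = api.create_project({'name': 'ToyBoost'})

# `source` (a path or URL to the raw data) is assumed to be defined elsewhere
orig_source = api.create_source(source,
                                {"name": "ToyBoost",
                                 "project": project['resource']})
api.ok(orig_source)  # block until the source is ready

orig_dataset = api.create_dataset(orig_source, {"name": "Boost"})
api.ok(orig_dataset)

# `trainset` is assumed to hold the id of a training dataset prepared
# earlier; that step is not shown on the slide
trainset = api.get_dataset(trainset)

for loop in range(0, 10):
    api.ok(trainset)
    # train a model on the current, weighted training set
    model = api.create_model(trainset, {
        "name": "ToyBoost - Model%d" % loop,
        "objective_fields": ["letter"],
        "excluded_fields": ["weight"],
        "weight_field": "100011"})
    api.ok(model)
    # score the training set with the new model
    batchp = api.create_batch_prediction(model, trainset, {
        "name": "ToyBoost - Result%d" % loop,
        "all_fields": True,
        "header": True})
    api.ok(batchp)
    batchp = api.get_batch_prediction(batchp)
    batchp_dataset = api.get_dataset(batchp['object'])
    # the predictions become the training set of the next iteration
    trainset = api.create_dataset(batchp_dataset, {})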

Machine Learning Automation Today

Problems of current solutions

Complexity: Lots of details outside the problem domain

Reuse: No inter-language compatibility

Scalability: Client-side workflows hard to optimize

Not enough abstraction

Machine Learning Automation Tomorrow

Solution: Domain-specific languages

Domain-specific Expressions (sexps)

(if (missing? "height")
    (random-value "height")
    (field "height"))

(window "income" 10)

(within-percentiles? "age" 0.5 0.95)

(cond (> (field "score") (mean "score")) "above average"
      (< (field "score") (mean "score")) "below average"
      "mediocre")

Domain-specific Expressions (JSON)

["if", ["missing?", "height"],
       ["random-value", "height"],
       ["field", "height"]]

["window", "income", 10]

["within-percentiles?", "age", 0.5, 0.95]

["cond", [">", ["field", "score"], ["mean", "score"]], "above average",
         ["<", ["field", "score"], ["mean", "score"]], "below average",
         "mediocre"]

Abstraction via the Language

;; (if (missing? "height")
;;     (random-value "height")
;;     (field "height"))

(ensure-value "height")

(window "income" 10)

(within-percentiles? "age" 0.5 0.95)

;; (cond (> (field "score") (mean "score")) "above average"
;;       (< (field "score") (mean "score")) "below average"
;;       "mediocre")

(discretize "score" "above average" "below average" "mediocre")
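One way to picture this kind of abstraction is as a rewrite step: a short form is expanded into the longer expression it stands for before evaluation. The Python sketch below works on the JSON (nested list) form of the expressions. Treating `ensure-value` and `discretize` as plain expansions, the argument order of `discretize`, and the use of `<` for the "below" branch are all assumptions made for illustration; the slides do not say how Flatline implements them.

```python
def expand(expr):
    """Rewrite the shorthand forms into the longer expressions they replace.

    Assumption: "ensure-value" and "discretize" behave as simple expansions.
    """
    if not isinstance(expr, list) or not expr:
        return expr  # literals (strings, numbers) are left untouched
    head, *args = expr
    if head == "ensure-value" and len(args) == 1:
        name = args[0]
        return ["if", ["missing?", name],
                ["random-value", name],
                ["field", name]]
    if head == "discretize" and len(args) == 4:
        # Assumed argument order: field, label above, label below, default.
        name, above, below, default = args
        value, mean = ["field", name], ["mean", name]
        return ["cond", [">", value, mean], above,
                ["<", value, mean], below,
                default]
    # Any other operator: keep it and expand its arguments recursively.
    return [head] + [expand(a) for a in args]


expanded = expand(["ensure-value", "height"])
```

Because the expansion is purely structural, the same rewrite could run on a client or on a server without either side needing to know about the other.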

Abstraction via the User Interface

Remote for efficiency and reuse, local for discoverability
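To make the "local" side more concrete, here is a minimal Python sketch of a local evaluator for a small subset of the JSON-style expressions. It is not Flatline's implementation: only `field`, `missing?`, `mean`, `if`, `cond` and three comparisons are handled, and the row format (a list of dicts, with `None` marking a missing value) is an assumption chosen for the example.

```python
import operator
from statistics import mean

COMPARISONS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def evaluate(expr, rows, i):
    """Evaluate a JSON-style expression for row i of rows (a list of dicts)."""
    if not isinstance(expr, list):
        return expr  # literals evaluate to themselves
    op, *args = expr
    if op == "field":
        return rows[i].get(args[0])
    if op == "missing?":
        return rows[i].get(args[0]) is None
    if op == "mean":
        # Dataset-level summary: mean of the non-missing values of a field.
        return mean(r[args[0]] for r in rows if r.get(args[0]) is not None)
    if op == "if":
        test, then, otherwise = args
        branch = then if evaluate(test, rows, i) else otherwise
        return evaluate(branch, rows, i)
    if op in COMPARISONS:
        left, right = (evaluate(a, rows, i) for a in args)
        return COMPARISONS[op](left, right)
    if op == "cond":
        # (test, value) pairs followed by a default value.
        *pairs, default = args
        for test, value in zip(pairs[0::2], pairs[1::2]):
            if evaluate(test, rows, i):
                return evaluate(value, rows, i)
        return evaluate(default, rows, i)
    raise ValueError(f"unsupported operator: {op}")


rows = [{"score": 1, "height": None},
        {"score": 2, "height": 170},
        {"score": 3, "height": 180}]
label = ["cond", [">", ["field", "score"], ["mean", "score"]], "above average",
         ["<", ["field", "score"], ["mean", "score"]], "below average",
         "mediocre"]
labels = [evaluate(label, rows, i) for i in range(len(rows))]
```

Running the expression over a handful of sample rows like this is the kind of quick feedback a local implementation can give before the same expression is sent to the server.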

Flatline: A DSL for Feature Engineering

- Domain-specific: new fields from an input sliding window as declarative expressions
- Simple syntax: JSON → s-expressions
- Efficient: full server-side implementation
- Discoverable: in-browser client-side implementation
- Reusable: the same expressions usable from any language binding
- Bonus: applicable to filtering
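The correspondence between the JSON and s-expression forms is mechanical, which is part of what makes the syntax simple. A minimal Python sketch of the conversion follows; the function name `to_sexp` is hypothetical, and the rendering of booleans as `true`/`false` is an assumption.

```python
import json


def to_sexp(expr):
    """Render a JSON-style (nested list) expression as an s-expression string."""
    if isinstance(expr, list):
        if not expr:
            return "()"
        head, *args = expr
        # The operator is written bare; arguments are rendered recursively.
        return "(" + " ".join([str(head)] + [to_sexp(a) for a in args]) + ")"
    if isinstance(expr, bool):
        return "true" if expr else "false"  # assumed spelling of booleans
    if isinstance(expr, str):
        return json.dumps(expr)  # string arguments keep their double quotes
    return str(expr)


sexp = to_sexp(["if", ["missing?", "height"],
                ["random-value", "height"],
                ["field", "height"]])
```

Since every language binding can build nested lists and serialize them as JSON, the same expression can be produced from any client without a dedicated parser.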

Machine Learning Workflows

A DSL for Machine Learning Workflows? Absolutely!

Machine Learning Workflows

Same problems, only worse...

Complexity: Hairy logic and control flow

Reuse: More complex algorithms and behaviour, very hard to port to other languages

Scalability: Lots of iterations and intermediate resources, very hard to make efficient on the client side

Machine Learning Workflows

WhizzML, same solution, only better...

WhizzML: A sexp-based, domain-specific language

(define apple
  "https://s3.amazonaws.com/bigml-public/csv/nasdaq_aapl.csv")

(define source (create-and-wait-source {"remote" apple
                                        "name" "whizz"}))

(define dataset (create-and-wait-dataset {"source" source}))

(define anomaly (create-and-wait-anomaly {"dataset" dataset}))

(define input {"Open" 275 "High" 300 "Low" 250})

(define score
  (create-and-wait-anomalyscore {"anomaly" anomaly
                                 "input_data" input}))

(get (fetch score) "score")
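Each `create-and-wait-*` call in the script hides a create request followed by waiting until the resource is ready. As a rough illustration of that pattern, the Python sketch below polls a resource until it finishes or a timeout expires. The `create` and `fetch` callables, the status strings and the `FakeBackend` class are hypothetical stand-ins so the example runs on its own; they are not the BigML API.

```python
import time


def create_and_wait(create, fetch, args, timeout=60.0, interval=0.5):
    """Create a resource, then poll it until it is finished.

    `create(args)` returns a resource id; `fetch(id)` returns a dict with a
    "status" key. Both are hypothetical stand-ins for a real client.
    """
    resource_id = create(args)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(resource_id)["status"]
        if status == "finished":
            return resource_id
        if status == "faulty":
            raise RuntimeError(f"{resource_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{resource_id} not ready after {timeout}s")
        time.sleep(interval)


class FakeBackend:
    """In-memory stand-in: a resource finishes after a fixed number of polls."""

    def __init__(self, polls_until_done=2):
        self.polls_until_done = polls_until_done
        self.polls_left = {}

    def create(self, args):
        resource_id = f"source/{len(self.polls_left)}"
        self.polls_left[resource_id] = self.polls_until_done
        return resource_id

    def fetch(self, resource_id):
        if self.polls_left[resource_id] > 0:
            self.polls_left[resource_id] -= 1
            return {"status": "in-progress"}
        return {"status": "finished"}


backend = FakeBackend()
source_id = create_and_wait(backend.create, backend.fetch,
                            {"name": "whizz"}, interval=0.0)
```

Written on the client, this loop has to be repeated, or wrapped, for every resource type and every language binding, which is the kind of detail the workflow language keeps out of sight.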

WhizzML vs Flatline (as languages)

A better language:

- Better data structures (dictionaries, sets...)
- Better control flow: (tail) recursion, iteration, loops
- Better abstraction: procedures

WhizzML: Lambda Abstraction

Abstraction

(define (score-stock name input)
  (let (base "https://s3.amazonaws.com/bigml-public/csv"
        stock (str base "/" name)
        source (create-and-wait-source {"remote" stock})
        dataset (create-and-wait-dataset {"source" source})
        anomaly (create-and-wait-anomaly {"dataset" dataset}))
    (create-and-wait-anomalyscore {"anomaly" anomaly
                                   "input_data" input})))

WhizzML: Reusable Procedures

Abstraction

(score-stock "aapl" {"Open" 275 "High" 300 "Low" 250})

WhizzML: Server-side fortes

A better server-side:

- Better reusability: scripts, executions and libraries as first-class ML resources
- Higher efficiency gains: automatic parallelism
- More opportunities for UI extensions
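The slides do not describe how the automatic parallelism works. One common way to find steps that can run at the same time is to group them by dependency level, so that everything within a level is mutually independent. The Python sketch below shows that general idea only; the step names and the `parallel_levels` function are hypothetical.

```python
def parallel_levels(deps):
    """Group workflow steps into levels that could run concurrently.

    `deps` maps each step to the set of steps it needs first. Every
    dependency must itself be a key of `deps`; otherwise, as with a real
    cycle, no step becomes ready and a ValueError is raised.
    """
    remaining = {step: set(needs) for step, needs in deps.items()}
    levels = []
    while remaining:
        ready = sorted(s for s, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("dependency cycle or unknown step")
        levels.append(ready)
        for step in ready:
            del remaining[step]
        for needs in remaining.values():
            needs.difference_update(ready)
    return levels


levels = parallel_levels({
    "source": set(),
    "dataset": {"source"},
    "model": {"dataset"},
    "anomaly": {"dataset"},
    "evaluation": {"model", "dataset"},
})
```

Here the model and the anomaly detector land in the same level, so a server that sees the whole workflow could build them concurrently, something a client issuing one blocking call at a time cannot easily do.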

WhizzML Source Code as a Machine Learning Resource

{"library": {
   "imports": ["12343addb343f2890f23492d"],
   "source_code": "(define (mu2) (mu (g 3 8)))",
   "exports": [{"name": "mu2", "signature": []}]}}

{"script": {
   "parameters": [{"name": "remote_uri", "type": "string"},
                  {"name": "timeout", "type": "number",
                   "default": 10000}],
   "source_code": "(define id (create-source {\"remote\" remote_uri})) (wait id timeout)",
   "outputs": [{"name": "id", "type": "source-id"}]}}

Rich metadata, reuse and shareability of WhizzML code
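Declared parameters with types and defaults are what make a script callable like a function. As an illustration of how such metadata can be used, the Python sketch below binds caller inputs to a script's declared parameters. The `bind_inputs` function and its checks are assumptions for the example, and only the two types that appear in the script above are checked.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def bind_inputs(parameters, inputs):
    """Match caller inputs against declared parameters, applying defaults."""
    bound = {}
    for param in parameters:
        name = param["name"]
        if name in inputs:
            value = inputs[name]
        elif "default" in param:
            value = param["default"]
        else:
            raise ValueError(f"missing input: {name}")
        check = TYPE_CHECKS.get(param["type"])  # unknown types are not checked
        if check is not None and not check(value):
            raise TypeError(f"{name} should be a {param['type']}")
        bound[name] = value
    unknown = set(inputs) - set(bound)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    return bound


parameters = [{"name": "remote_uri", "type": "string"},
              {"name": "timeout", "type": "number", "default": 10000}]
bound = bind_inputs(parameters, {"remote_uri": "https://example.com/data.csv"})
```

The URL is a placeholder. The point is that a caller only has to supply `remote_uri`; `timeout` falls back to the default declared in the script's metadata.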

Executions as a Machine Learning Resource

{"execution": {"script_id": "1a2232bf3498f95dde",
               "username": "bittwidler",
               "tlp": 4,
               "resource_limits": {"total": 50,
                                   "source": 10,
                                   "dataset": 5,
                                   "model": 10},
               "max_execution_time": 3600,
               "max_execution_steps": 10000,
               "max_recursion_depth": 1024}}

WhizzML: Client-side fortes

A better client-side:

- Better interactive experience: read-eval-print loop
- Scripts usable from the user’s machine
- Interoperability: Java, JavaScript and NodeJS REPLs
- Challenge: behavioural coherence between the server and client sides

Challenges

Solved

- Local REPL and remote shared implementation
- Automatic parallelization
- Error reporting
- Traceability: stack traces and stepwise execution

Open

- Better error management (dynamic typing, type inferencer)
- Resumable workflows
- Data locality: optimizing repeated access to the same datasets