Parallel Tuning of Machine Learning Algorithms, Thesis Proposal


Page 1: Parallel Tuning of Machine Learning Algorithms, Thesis Proposal

Parallel auto-tuning of machine learning algorithms Gianmario Spacagna [email protected]

16 October 2012

(877) 769-3047 (408) 404-0152 fax [email protected]

AgilOne, Inc. 1091 N Shoreline Blvd. #250 Mountain View, CA 94043

Page 2

Motivation

• Increase the revenue of cloud service providers → keep the cost curve linear w.r.t. the expected exponential income growth.
• Technically achievable through scalability:
  • Scalability in terms of resources → distributed parallel computing (Hadoop).
  • Scalability in terms of multi-tenancy → the same system running for several customers.
  • Scalability in terms of auto-configuration → avoiding manual tuning operations.

[Chart: income and cost curves]

Page 3

Good Work Flow

Good data → ML algorithm → good results!

Tuning (adjusting the configuration) feeds back into the ML algorithm.

Page 4

General Tuning Diagram

Test data → run the algorithm with configuration X → are the results good?
• yes → tuned.
• no → change configuration X and run again.
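The loop in the diagram can be sketched in Scala (the language the project plan targets). The names and signatures below are illustrative assumptions, not the actual system's API:

```scala
// Illustrative sketch of the general tuning loop; all names are assumptions.
object TuningLoop {
  type Config = Map[String, Double]

  def run(
      initial: Config,
      evaluate: Config => Double,    // runs the algorithm on test data, returns a score
      isGood: Double => Boolean,     // "are results good?"
      nextConfig: Config => Config,  // "change configuration X"
      maxIter: Int = 100): Config = {
    var conf = initial
    var iter = 0
    while (!isGood(evaluate(conf)) && iter < maxIter) {
      conf = nextConfig(conf)        // propose the next configuration
      iter += 1
    }
    conf                             // "tuned" (or iteration budget exhausted)
  }
}
```

For example, a toy run that increments a parameter until a score threshold is met terminates after a few iterations.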

Page 5

Tuning of Machine Learning Algorithms

• We need tuning when:
  • A new algorithm or version is released.
  • We want to improve accuracy and/or performance.
  • A new customer comes and the system must be customized for the new dataset and requirements.

We need to make it smart, automatic and scalable!

Page 6

Vision

Magic Box

Request:
• Data set
• Application (prediction, clustering, classification, ...)
• Algorithm (ANN, LR, K-means, ...)
• Fitness metrics (std. dev., prob. of false positives, clustering coeff., randomness, ...)
• Goal constraints (x > 0.9 & 0.3 < y < 0.5)

Response:
• Best algorithm
• Optimal configuration
• Metrics evaluation
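As a sketch, this request/response pair could be modeled as Scala domain objects; every field name here is an illustrative assumption, not the system's real API:

```scala
// Hypothetical domain model for the "magic box" request and response.
case class TuningRequest(
    dataset: String,                                // identifier of the data set
    application: String,                            // prediction, clustering, classification, ...
    algorithms: Seq[String],                        // candidate algorithms: ANN, LR, K-means, ...
    fitnessMetrics: Seq[String],                    // e.g. std. dev., clustering coefficient
    goalConstraints: Map[String, (Double, Double)]) // metric -> (min, max) bounds

case class TuningResponse(
    bestAlgorithm: String,
    optimalConfiguration: Map[String, Double],
    metricsEvaluation: Map[String, Double])
```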

Page 7

Architecture Design

[Architecture diagram: Upper Applications API, Initializer, Controller, Scheduler, and one Executor per algorithm (ANN, LR, K-Means); each Executor is paired with an Evaluator and a Data Sampler, and Executors run either locally or on a Hadoop cloud service.]

Page 8

Upper Applications API

Tasks:
• Interfaces between the system and the upper applications layer.
• Parses requests and results and generates the related output domain objects.

Possible data formats:
• JSON
• STDIN/OUT

Page 9

Initializer

Tasks:
• Generates the initial set of configurations.

Possible implementations:
• Random points
• Latin Hypercube
• Dataset similarity
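Of these, Latin Hypercube sampling is easy to sketch: n samples in d dimensions such that, per dimension, the unit interval is split into n equal strata and every stratum is hit exactly once. A minimal illustration (not the project's actual implementation):

```scala
import scala.util.Random

// Latin Hypercube sampling sketch: n points in [0, 1)^dims with
// one point per stratum per dimension.
object LatinHypercube {
  def sample(n: Int, dims: Int, rng: Random = new Random(42)): Seq[Seq[Double]] = {
    // For each dimension, an independent permutation of the n strata.
    val perms = Seq.fill(dims)(rng.shuffle((0 until n).toVector))
    (0 until n).map { i =>
      (0 until dims).map { d =>
        // Uniform point inside stratum perms(d)(i), each stratum of width 1/n.
        (perms(d)(i) + rng.nextDouble()) / n
      }
    }
  }
}
```

Compared with purely random points, this guarantees coverage of every parameter's range even for small initial sets.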

Page 10

Controller

Tasks:
• Compares and generates configurations.
• Decides the convergence of the tuning.
• Adapts the data sampling request.

Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
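As a baseline, the random-search variant of the controller fits in a few lines. The fitness signature and all names below are assumptions for illustration only:

```scala
import scala.util.Random

// Random-search controller sketch: sample configurations uniformly
// within the given parameter ranges and keep the best one seen.
object RandomSearchController {
  def tune(
      ranges: Map[String, (Double, Double)],  // parameter -> (min, max)
      fitness: Map[String, Double] => Double, // higher is better
      budget: Int,
      rng: Random = new Random(0)): (Map[String, Double], Double) = {
    val candidates = Seq.fill(budget) {
      ranges.map { case (p, (lo, hi)) => p -> (lo + rng.nextDouble() * (hi - lo)) }
    }
    candidates.map(c => (c, fitness(c))).maxBy(_._2)
  }
}
```

Grid search, Genetic Algorithms, or Stochastic Kriging would plug into the same interface, differing only in how candidates are proposed.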

Page 11

Scheduler

Tasks:
• Checks whether the requests are covered by the available services.
• Schedules and parallelizes request executions.
• Optimizes resources.
• Collects evaluated results.

Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
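For instance, the load-balanced policy can be sketched as a greedy assignment of each request (given an estimated duration) to the least-loaded executor. This is an illustrative model, not the system's scheduler:

```scala
import scala.collection.mutable

// Load-balanced scheduling sketch: each request goes to the executor
// with the least accumulated load; ties favor the first executor listed.
object LoadBalancedScheduler {
  def schedule(durations: Seq[Double],
               executors: Seq[String]): Map[String, Seq[Double]] = {
    val load = mutable.Map(executors.map(_ -> 0.0): _*)
    val plan = mutable.Map(executors.map(_ -> Vector.empty[Double]): _*)
    durations.foreach { d =>
      val target = executors.minBy(load) // least-loaded executor so far
      load(target) += d
      plan(target) :+= d
    }
    plan.toMap
  }
}
```

"First available" and "oldest idle" would replace only the `minBy` selection rule.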

Page 12

Executor

Tasks:
• Executes the provided algorithm with the specified configuration.

Sub-components:
• Evaluator: evaluates results according to the specified fitness metrics.
• Data Sampler: down- and up-sampling of data.

Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service
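A sketch of the Evaluator sub-component: compute the requested fitness metrics over a result vector and check them against (min, max) goal constraints. The metric names and formulas here are illustrative assumptions:

```scala
// Evaluator sketch: named fitness metrics plus a goal-constraint check.
object Evaluator {
  val metrics: Map[String, Seq[Double] => Double] = Map(
    "mean"   -> (xs => xs.sum / xs.size),
    "stddev" -> { xs =>
      val m = xs.sum / xs.size
      math.sqrt(xs.map(x => math.pow(x - m, 2)).sum / xs.size)
    }
  )

  // Compute only the metrics the request asked for.
  def evaluate(results: Seq[Double], requested: Seq[String]): Map[String, Double] =
    requested.map(name => name -> metrics(name)(results)).toMap

  // True iff every constrained metric falls within its (min, max) bounds.
  def satisfies(evaluation: Map[String, Double],
                constraints: Map[String, (Double, Double)]): Boolean =
    constraints.forall { case (name, (lo, hi)) =>
      evaluation.get(name).exists(v => lo <= v && v <= hi)
    }
}
```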

Page 13

Tuning Diagram

Test data → run the algorithm with configuration X → are the results good?
• yes → tuned.
• no → change configuration X and run again.

Test execution is handled by the Scheduler and Executor; test control by the Initializer and Controller.

Page 14

SUNS: Simple, Unclever and Not Scalable
• Initializer: random points
• API format: STDIN/OUT
• Controller: random search or grid search
• Scheduler: serialized
• Executor: K-Means, local

Page 15

SNS: Smart but Not Scalable
• Initializer: Latin Hypercube
• API format: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: serialized
• Executor: K-Means, local

Page 16

VSNS: Very Smart but Not Scalable
• Initializer: dataset similarity
• API format: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: serialized
• Executor: K-Means, local

Page 17

VSS: Very Smart and Scalable
• Initializer: dataset similarity
• API format: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: first available
• Executor: K-Means, Hadoop

Page 18

VSVSO: Very Smart, Very Scalable and Optimized
• Initializer: dataset similarity
• API format: STDIN/OUT or JSON
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: load balanced
• Executor: K-Means, Hadoop, with Data Sampler

Page 19

Thesis

It is possible to build an intelligent system based on Genetic Algorithms/Stochastic Kriging that automatically selects and tunes machine learning algorithms, such as K-Means and LR, parallelizing the work on a Hadoop cluster to scale in a cost-efficient manner.

Page 20

Project Plan

Order of priorities:
1. Design the entire application in Scala in a testable and expandable way.
2. Implement the Genetic Algorithm or the Stochastic Kriging controller.
3. Implement the Latin Hypercube initializer.
4. Test with local-instance algorithms (K-Means and/or LR).
5. Develop and test at least one algorithm in MapReduce fashion using Hadoop.
6. Test with a real AgilOne cluster of servers.
7. Implement the Dataset Similarity initializer.
8. Implement the Data Sampler.

Page 21

Questions, feedback, suggestions?

Page 22

Thank you!
