Open Machine Learning

Posted on 07-May-2015


Category: Technology


DESCRIPTION

This talk explores the possibility of turning machine learning research into open science and proposes concrete approaches to achieve this goal.

TRANSCRIPT

The open experiment database: meta-learning for the masses

Joaquin Vanschoren @joavanschoren

The Polymath story

Tim Gowers

Machine Learning: are we doing it right?

Computer Science

• The scientific method
• Make a hypothesis about the world

• Generate predictions based on this hypothesis

• Design experiments to verify/falsify the prediction

• Predictions verified: hypothesis might be true

• Predictions falsified: hypothesis is wrong

Computer Science

• The scientific method (for ML)
• Make a hypothesis about (the structure of) given data
• Generate models based on this hypothesis
• Design experiments to measure the accuracy of the models
• Good performance: it works (on this data)
• Bad performance: it doesn't work (on this data)
• Aggregates ("it works 60% of the time") are not useful

How can the data on which an algorithm works well be characterized? What is the effect of parameter settings?

Meta-Learning

• The science of understanding which algorithms work well on which types of data

• Hard: thorough understanding of data and algorithms

• Requires good data: extensive experimentation

• Why is this separate from other ML research?
• A thorough algorithm evaluation = a meta-learning study

• Original authors know algorithms and data best, have large sets of experiments, and are (presumably) interested in knowing on which data their algorithms work well (or not)

Meta-Learning

With the right tools, can we make everyone a meta-learner?

[Diagram: large sets of experiments connect ML algorithm design and meta-learning, feeding algorithm selection, algorithm characterization, data characterization, bias-variance analysis, learning curves, data insight, algorithm insight, algorithm comparison, shared datasets, and source code.]

Open Machine Learning

Open science

World-wide Telescope

Open science

Microarray Databases

Open science

GenBank

Open machine learning?

• We can also be 'open'
• Simple, common formats to describe experiments, workflows, algorithms, ...
• Platform to share, store, query, interact
• We can go (much) further
• Share experiments automatically (open-source ML tools)
• Experiment on the fly (cheap, no expensive instruments)
• Controlled experimentation (experimentation engine)

Formalizing machine learning

• Unique names for algorithms, datasets, evaluation measures, data characterizations,... (ontology)

• Based on DMOP, OntoDM, KDOntology, EXPO,...

• Simple, structured way to describe algorithm setups, workflows and experiment runs

• Detailed enough to reproduce all experiments

Run

A run is the execution of a predefined setup: it connects the setup to its input data, its output data, and the machine it ran on.

Also: start time, author, status, ...
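As a rough illustration of how such a run record could be captured in code (a sketch with assumed field names, not the database's actual schema), in Python:

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Run:
    # Execution of a predefined setup: links the setup, data, and machine.
    setup_id: str                                           # which setup was executed
    machine: str                                            # where it ran
    input_data: List[str] = field(default_factory=list)     # e.g. dataset references
    output_data: List[str] = field(default_factory=list)    # e.g. evaluations, predictions
    start_time: Optional[datetime] = None
    author: str = ""
    status: str = "finished"

# Illustrative usage (identifiers are placeholders):
run = Run(setup_id="1:mainFlow", machine="lab-machine-01", input_data=["http://..."])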

Setup

A setup is a plan of what we want to do. Setups can be algorithm setups, function setups (f(x)), workflow setups, or experiments. They are hierarchical (composed via "part of" relations), parameterized (via parameter settings), and can be abstract or concrete.

Algorithm Setup

An algorithm setup is a fully defined algorithm configuration: an implementation (linked to the abstract algorithm and, where relevant, the mathematical function it computes), its parameter settings (concrete values for the implementation's parameters), nested function setups (e.g. a kernel), and algorithm qualities. Algorithms, functions, parameters, and measures all carry unique names.

Roles: learner, base-learner, kernel, ...
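A minimal sketch of what a fully defined algorithm setup could look like when written down as plain data; the key names and the abstract names ("SupportVectorMachine", "RBFKernel") are assumptions for illustration:

# Concrete configuration of Weka's SMO learner with an RBF kernel.
algorithm_setup = {
    "role": "learner",
    "implementation": "Weka.SMO",
    "algorithm": "SupportVectorMachine",           # abstract algorithm name (assumed)
    "parameter_settings": [{"parameter": "C", "value": 0.01}],
    "components": [                                # nested setups ("part of")
        {
            "role": "kernel",
            "implementation": "Weka.RBF",
            "mathematical_function": "RBFKernel",  # abstract function name (assumed)
            "parameter_settings": [{"parameter": "G", "value": 0.01}],
        }
    ],
}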


Workflow Setup

A workflow setup is composed of algorithm setups (its components), connections between them (each with a source and a target), and parameters (its inputs).

Also: ports, datatype.

Workflow Example

[Diagram: workflow 1:mainFlow takes a url parameter and outputs evaluations and predictions (Weka.Instances). 2:loadData (Weka.ARFFLoader, location=http://..., logRuns=true) feeds data into 3:crossValidate (Weka.Evaluation, F=10, S=1, logRuns=true), which wraps 4:learner (Weka.SMO, C=0.01, logRuns=false) and its 5:kernel (Weka.RBF, G=0.01); the eval and pred outputs become the workflow's evaluations and predictions.]
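To make the structure concrete, here is a sketch of the example workflow as plain Python data: components, connections, and inputs. The key names are assumptions for illustration, not the real schema; the placement of S=1 on the cross-validation is also assumed.

workflow = {
    "name": "1:mainFlow",
    "inputs": {"url": "http://..."},                         # workflow-level parameter
    "components": {
        "2:loadData": {"implementation": "Weka.ARFFLoader",
                       "parameters": {"location": "http://..."},
                       "logRuns": True},
        "3:crossValidate": {"implementation": "Weka.Evaluation",
                            "parameters": {"F": 10, "S": 1},  # folds, seed (assumed)
                            "logRuns": True},
        "4:learner": {"implementation": "Weka.SMO",
                      "parameters": {"C": 0.01},
                      "logRuns": False,
                      "part_of": "3:crossValidate"},
        "5:kernel": {"implementation": "Weka.RBF",
                     "parameters": {"G": 0.01},
                     "part_of": "4:learner"},
    },
    "connections": [
        {"source": "2:loadData", "port": "data",
         "target": "3:crossValidate", "target_port": "data"},
        {"source": "3:crossValidate", "port": "eval",
         "target": "1:mainFlow", "target_port": "evaluations"},
        {"source": "3:crossValidate", "port": "pred",
         "target": "1:mainFlow", "target_port": "predictions"},
    ],
}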


Experiment Setup

An experiment is a setup composed of workflow setups and experiment variables (<X>): labeled tuples that can be referenced in the setups.

Also: experiment design, description, literature reference, author, ...
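A small sketch of how an experiment setup might reference a variable <X> so that the workflow above is run once per value; all key names, labels, and URLs below are illustrative placeholders:

experiment = {
    "description": "SVM-RBF benchmark over several datasets",
    "design": "full factorial",
    "author": "...",
    "workflow": "1:mainFlow",
    "variables": {
        # <X>: labeled tuples that can be referenced from setups
        "X": [
            {"dataset_url": "http://.../dataset1.arff"},
            {"dataset_url": "http://.../dataset2.arff"},
        ],
    },
    # Bind the loader's location parameter to the variable.
    "bindings": {"2:loadData.location": "<X>.dataset_url"},
}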

Run

A run's input and output data can be datasets, evaluations, models, or predictions. Output data link back to the run that produced them (their source), and datasets carry data qualities.

ExpML

[The example workflow above (1:mainFlow with Weka.ARFFLoader, Weka.Evaluation, Weka.SMO, and the Weka.RBF kernel), shown as it is written out in ExpML, the experiment markup format.]

Demo (preview)

Learning curves

[Plot: predictive accuracy (0.2 to 1.0) versus percentage of the original dataset size (10% to 100%) for RandomForest, C4.5, LogisticRegression, RacedIncrementalLogitBoost(Stump), NaiveBayes, and SVM-RBF.]
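For readers who want to reproduce a curve like this outside the database, a brief sketch that trains on growing fractions of a dataset and records predictive accuracy; scikit-learn models stand in here for the Weka learners named in the legend:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
learners = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM-RBF": SVC(kernel="rbf"),
}
for name, estimator in learners.items():
    # Mean cross-validated accuracy at 10%, 20%, ..., 100% of the training data.
    sizes, _, test_scores = learning_curve(
        estimator, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
        scoring="accuracy")
    print(name, dict(zip(sizes, np.round(test_scores.mean(axis=1), 3))))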

Examples

• When does one algorithm outperform another?
• Bias-variance profile and the effect of dataset size (e.g. boosting vs. bagging)

Taking it further

Seamless integration

• Web service for sharing and querying experiments (see the sketch after this list)

• Integrate experiment sharing in ML tools (WEKA, KNIME, RapidMiner, R, ....)

• Mapping implementations, evaluation measures,...

• Online platform for custom querying, community interaction

• Semantic wiki: algorithm/data descriptions, rankings, ...
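As a sketch of what querying shared experiments over such a web service could look like from a script; the endpoint path, parameters, and response fields below are purely hypothetical assumptions, not the actual expdb API:

import requests

BASE = "http://expdb.cs.kuleuven.be/api"   # hypothetical API root, for illustration only

response = requests.get(
    BASE + "/evaluations",
    params={"implementation": "Weka.SMO", "measure": "predictive_accuracy"},
)
response.raise_for_status()
for record in response.json():             # assumed JSON: one record per run
    print(record["dataset"], record["value"])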

Experimentation Engine

• Controlled experimentation (cf. Delve, MLComp), sketched after this list
• Download datasets, build training/test sets
• Feed training and test sets to algorithms, retrieve predictions/models
• Run a broad set of evaluation measures
• Benchmarking (cross-validation), learning curve analysis, bias-variance analysis, workflows(!)
• Compute data properties for new datasets
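A minimal sketch of one controlled run such an engine might perform: obtain a dataset, fix the splits, feed them to an algorithm, and compute several evaluation measures. scikit-learn is used purely as a stand-in; the function and measure names are assumptions.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# 1. Obtain a dataset (a bundled one stands in for a repository download).
X, y = load_breast_cancer(return_X_y=True)

# 2. Build fixed training/test splits so every algorithm sees the same folds.
folds = list(StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y))

def run_setup(estimator):
    # 3. Feed the splits to the algorithm and retrieve predictions.
    measures = {"predictive_accuracy": [], "f_measure": [], "area_under_roc_curve": []}
    for train_idx, test_idx in folds:
        model = estimator.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        scores = model.decision_function(X[test_idx])
        # 4. Run a broad set of evaluation measures on the predictions.
        measures["predictive_accuracy"].append(accuracy_score(y[test_idx], predictions))
        measures["f_measure"].append(f1_score(y[test_idx], predictions))
        measures["area_under_roc_curve"].append(roc_auc_score(y[test_idx], scores))
    return {name: float(np.mean(values)) for name, values in measures.items()}

# The SVM-RBF setup from the workflow example (C=0.01, G=0.01).
print(run_setup(SVC(kernel="rbf", C=0.01, gamma=0.01)))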

Why would you use it? (seeding)

• Let the system run the experiments for you

• Immediate, highly detailed benchmarks (no repeats)

• Up to date, detailed results (vs. static, aggregated in journals)

• All your results organized online (private?), anytime, anywhere

• Interact with people (weird results?)

• Get credit for all your results (e.g. citations), unexpected results

• Visibility, new collaborations

• Check whether your algorithm is really the best (e.g. active testing)

• On which datasets does it perform well/badly?

Question

Is open machine learning possible?

http://expdb.cs.kuleuven.be

Thanks

Gracias

Xie Xie

Danke

Dank U

Merci

Efharisto

Dhanyavaad

Grazie

Spasiba

Kia ora

Tesekkurler

Diolch

Köszönöm

Arigato

Hvala

Toda
