Automating Machine Learning
Is it feasible?
Manuel Martin Salvador
Smart Technology Research Group
Bournemouth University
June 2nd, 2016
Index
1. Recent life-changing applications of Machine Learning
2. Multicomponent Predictive Systems (MCPS)
3. Automating the composition and optimisation of MCPS
4. Adapting MCPS to changing environments
5. Conclusion and future work
Recent life-changing applications of Machine Learning
Gene Discovery
Source: http://msgeneticslab.med.ubc.ca/gene-discovery/
Dessa Sadovnick and Carles Vilariño-Güell, University of British Columbia
A mutation in the NR1H3 gene can trigger Multiple Sclerosis
Microsoft Seeing AI
Source: https://www.youtube.com/watch?v=R2mC-NUAmMk
Autonomous Vehicles
Source: https://www.youtube.com/watch?v=dk3oc1Hr62g
Instant Translation
Source: https://www.skype.com/en/features/skype-translator/
Multicomponent Predictive Systems
Predictive Modelling
Labelled Data → Supervised Learning Algorithm → Predictive Model
Classification and Regression
Data is imperfect
Missing values, noise, high dimensionality, outliers
Image sources:
Question mark: http://commons.wikimedia.org/wiki/File:Question_mark_road_sign,_Australia.jpg
Noise: http://www.flickr.com/photos/benleto/3223155821/
Outliers: http://commons.wikimedia.org/wiki/File:Diagrama_de_caixa_com_outliers_and_whisker.png
3D plot: http://salsahpc.indiana.edu/plotviz/
Multicomponent Predictive System (MCPS)
Data → Preprocessing → Predictive Model → Postprocessing → Predictions
Multicomponent Predictive System (MCPS)
Data → Preprocessing → Predictive Model → Postprocessing → Predictions
(An MCPS can contain several preprocessing steps and predictive models running in parallel.)
How to model MCPS?
Function composition: Not enough for modelling parallel paths.
Directed Acyclic Graph: Not enough to model process state.
Petri net: Very flexible and robust mathematical background.
(Figure: expressive power increases from function composition, Y = h(g(f(X))), where X flows through f, g and h to Y, to DAGs and Petri nets.)
Petri net
A mathematical modelling language invented in 1939 by Carl Adam Petri.
Elements: token, place, transition, arc.
Formally, N = (P, T, F), where P is the set of places, T the set of transitions and F the flow relation (arcs).
Example of Petri net
A patient (token) moves through the net: Reception → Check in → Waiting Room → Call in → Consulting Room → Examination and diagnosis → Exit
Modelling MCPS as Petri net
A Petri net is an MCPS iff all the following conditions apply:
The Petri net is a workflow net.
The Petri net is well-handled and acyclic.
The places P\{i,o} have only a single input and a single output.
The Petri net is 1-sound.
The Petri net is safe.
All the transitions with multiple inputs or outputs are AND-join or AND-split, respectively.
The token at i is a set of unlabelled instances and the token at o is a set of predictions for such instances.
The Petri net can be hierarchical.
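The token-flow idea behind these conditions can be sketched in a few lines of Python. This is a toy, illustrative model of a linear MCPS only (i → preprocessing → predictive model → o); all class and transition names are invented for the example, and a real MCPS uses WEKA filters and predictors.

```python
# Toy Petri net for a linear MCPS: the token is the data set itself,
# consumed from the input place i and deposited, transformed, at o.
class PetriNet:
    def __init__(self, places, transitions, arcs):
        self.places = {p: [] for p in places}   # place name -> list of tokens
        self.transitions = transitions          # transition name -> function on a token
        self.arcs = arcs                        # transition name -> (input place, output place)

    def put(self, place, token):
        self.places[place].append(token)

    def fire(self, name):
        src, dst = self.arcs[name]
        token = self.places[src].pop(0)                         # consume one token
        self.places[dst].append(self.transitions[name](token))  # produce one token

net = PetriNet(
    places=["i", "p1", "o"],
    transitions={"preprocess": lambda xs: [x for x in xs if x is not None],
                 "predict":    lambda xs: [x > 0 for x in xs]},
    arcs={"preprocess": ("i", "p1"), "predict": ("p1", "o")},
)
net.put("i", [1, None, -2, 3])   # token at i: a set of unlabelled instances
net.fire("preprocess")
net.fire("predict")
print(net.places["o"])           # [[True, False, True]]: predictions at o
```

Firing each transition exactly once moves the single token from i to o, which is what 1-soundness and safeness guarantee for a well-formed MCPS.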
Hierarchical MCPS with parallel paths
Example: between the input place i and the output place o (connected by dummy transitions), a RandomSubspace meta-predictor expands into parallel paths of Random Feature Selection followed by a Decision Tree, whose outputs are aggregated by a Mean transition.
Any questions so far?
Automating the composition and optimisation of MCPS
Algorithm Selection: What are the best algorithms to process my data?
Hyperparameter Optimisation: How to tune the hyperparameters to get the best performance?
CASH problem for MCPS
Combined Algorithm Selection and Hyperparameter configuration problem: find the MCPS and hyperparameter settings that minimise an objective function (e.g. classification error), estimated by k-fold cross validation over training and validation datasets.
Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proc. of the 19th ACM SIGKDD (2013) 847–855
Martin Salvador, M., Budka, M., Gabrys, B.: Automatic composition and optimisation of multicomponent predictive systems. IEEE Transactions on Knowledge and Data Engineering, under review; available at http://bit.ly/automatic-mcps-paper (submitted on 01/04/2016)
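The CASH objective can be written as follows (notation follows the Auto-WEKA paper cited above; here each candidate "algorithm" A is a whole MCPS): choose the algorithm and its hyperparameters that minimise the average loss over k cross-validation folds.

```latex
(A^*, \lambda^*) \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}}
\frac{1}{k} \sum_{i=1}^{k}
\mathcal{L}\!\left(A^{(j)}_{\lambda},\ \mathcal{D}^{(i)}_{\mathrm{train}},\ \mathcal{D}^{(i)}_{\mathrm{valid}}\right)
```

where \(\mathcal{A}\) is the set of candidate MCPSs, \(\Lambda^{(j)}\) the hyperparameter space of \(A^{(j)}\), and \(\mathcal{L}\) the objective function (e.g. classification error) on each train/validation split.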
Search space
Three search spaces of increasing size. All contain WEKA predictors and meta-predictors; NEW and FULL additionally include preprocessing components: missing value handling, outlier detection and handling, data transformation, dimensionality reduction and sampling.
Number of hyperparameters:
PREV: 756
NEW: 1186
FULL: 1564
Optimisation strategies
Grid search: exhaustive exploration of the whole search space. Not feasible in high-dimensional spaces.
Random search: explores the search space randomly for a given time.
Bayesian optimisation: assumes there is a function between the hyperparameters and the objective, and tries to explore the most promising parts of the search space.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential Model-Based Optimization for General Algorithm Configuration. Learning and Intelligent Optimization, 6683 LNCS, 507–523.
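The contrast between the strategies can be sketched with random search over a toy objective. The objective function below stands in for the k-fold CV error of an MCPS; the hyperparameter names and the budget are purely illustrative.

```python
import random

# Toy stand-in for the CV error of an MCPS at a given configuration.
def cv_error(config):
    return (config["C"] - 0.3) ** 2 + (config["depth"] - 5) ** 2 / 100

def random_search(budget, seed=0):
    """Sample configurations uniformly at random; keep the best seen."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(budget):
        config = {"C": rng.uniform(0, 1), "depth": rng.randint(1, 20)}
        error = cv_error(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error

config, error = random_search(budget=500)
print(config, error)
```

Bayesian optimisation spends the same budget differently: it fits a surrogate model of `cv_error` from the configurations evaluated so far (a Random Forest in SMAC, Parzen estimators in TPE) and samples the next configuration where the surrogate looks most promising, instead of sampling uniformly.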
Auto-WEKA for MCPS
WEKA methods as search space.
One-click black box: Data + Time Budget → MCPS
Our contribution
● Recursive extension of complex hyperparameters in the search space.
● Composition and optimisation of MCPSs (including WEKA filters, predictors and meta-predictors).
https://github.com/dsibournemouth/autoweka
Evaluated strategies
1. WEKA-Def: all the predictors and meta-predictors are run with WEKA's default hyperparameter values.
2. Random search: the search space is explored randomly.
3. SMAC: Sequential Model-based Algorithm Configuration incrementally builds a Random Forest as surrogate model.
4. TPE: Tree-structured Parzen Estimator incrementally builds a surrogate model from kernel density (Parzen) estimators.
Hutter, F., Hoos, H.H., Leyton-Brown, K. (2011). Sequential Model-Based Optimization for General Algorithm Configuration. Learning and Intelligent Optimization, LNCS 6683, 507–523.
Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for Hyper-Parameter Optimization. In: Advances in NIPS 24 (2011) 1–9.
Experiments
21 datasets (classification problems)
Budget: 30 CPU-hours (per run)
25 runs with different seeds
Timeout: 30 minutes
Memout: 3 GB RAM
Training and testing process
Holdout error (% misclassification)
Convergence analysis
10-fold CV error of the best solutions over time (each colour is a different run/seed)
MCPS similarity analysis
The similarity of two MCPSs is measured as a weighted Hamming distance over their transitions (with a weight for the i-th transition and the Hamming distance computed at the i-th transition).
Observed patterns for the FULL search space: low error variance with high MCPS similarity; low error variance with low MCPS similarity; high error variance with low MCPS similarity.
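A weighted Hamming distance of this kind can be sketched as follows. The representation (a list of component/hyperparameter pairs), the weights and the half-weight penalty for a hyperparameter-only change are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative MCPS distance: weighted Hamming distance over transitions,
# normalised to [0, 1]. Penalties and weights are assumptions for the sketch.
def mcps_distance(a, b, weights):
    """a, b: lists of (component_name, hyperparams); weights: one per transition."""
    assert len(a) == len(b) == len(weights)
    total = sum(weights)
    dist = 0.0
    for (name_a, hp_a), (name_b, hp_b), w in zip(a, b, weights):
        if name_a != name_b:
            dist += w        # different component: full mismatch
        elif hp_a != hp_b:
            dist += w / 2    # same component, different hyperparameters
    return dist / total

mcps1 = [("ReplaceMissingValues", {}), ("J48", {"C": 0.25})]
mcps2 = [("ReplaceMissingValues", {}), ("J48", {"C": 0.10})]
mcps3 = [("ReplaceMissingValues", {}), ("NaiveBayes", {})]
print(round(mcps_distance(mcps1, mcps2, [1, 2]), 3))  # 0.333 (hyperparameter change)
print(round(mcps_distance(mcps1, mcps3, [1, 2]), 3))  # 0.667 (different predictor)
```

Averaging such pairwise distances over the 25 runs gives the similarity figures compared against the error variance above.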
MCPS similarity analysis: clustering
Waveform dataset and SMAC strategy
Available software for Bayesian optimisation
SMAC: Sequential Model-based Algorithm Configuration.
Auto-WEKA: toolbox including random search, SMAC and TPE for WEKA predictors.
Auto-WEKA for MCPS: our extension of Auto-WEKA for MCPSs.
Auto-Sklearn: toolbox for automating scikit-learn.
Spearmint: Python library for Bayesian optimisation with Gaussian Processes.
Hyperopt: Python library for random search and TPE.
HPOLib: common interface for SMAC, Spearmint and Hyperopt.
BayesOpt: C++ library for Bayesian optimisation.
MOE: Metric Optimization Engine (by Yelp).
Any questions so far?
Adapting MCPS to changing environments
Maintaining an MCPS
Data distribution can change over time and affect predictions:
External factors (e.g. weather conditions, new regulations)
Internal factors (e.g. quality of materials, equipment deterioration)
Source: INFER project
Training and testing process
1. Training data is provided
2. Best MCPS found is selected
3. New batch of unlabelled data requires prediction
4. MCPS generates predictions
5. True labels are provided
6. Predictive accuracy is reported
7. MCPS is adapted using the last batch of labelled data
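The seven steps above form a batch predict-then-adapt loop, sketched below. The `fit`, `predict` and `adapt` callables are placeholders for a real MCPS and optimiser; the majority-class "model" in the usage example is a deliberately trivial stand-in.

```python
# Sketch of the batch train/predict/adapt loop described above.
def evaluate_on_stream(batches, fit, predict, adapt):
    """batches: list of (X, y) pairs; returns per-batch accuracy."""
    X0, y0 = batches[0]
    model = fit(X0, y0)                  # steps 1-2: train, select best MCPS
    accuracies = []
    for X, y in batches[1:]:
        preds = predict(model, X)        # steps 3-4: predict on new batch
        acc = sum(p == t for p, t in zip(preds, y)) / len(y)
        accuracies.append(acc)           # steps 5-6: labels arrive, report accuracy
        model = adapt(model, X, y)       # step 7: adapt on last labelled batch
    return accuracies

# Toy usage: the "model" is just the majority class of the batch it last saw.
from collections import Counter
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda m, X: [m] * len(X)
adapt = lambda m, X, y: fit(X, y)
batches = [([0] * 4, ["a"] * 3 + ["b"]), ([0] * 4, ["b"] * 4), ([0] * 4, ["b"] * 4)]
print(evaluate_on_stream(batches, fit, predict, adapt))  # [0.0, 1.0]
```

The Batch and Cumulative strategies below differ only in what `adapt` retrains on: the last batch alone, or all labelled data seen so far; the +SMAC variants also re-run the hyperparameter optimisation at each adaptation.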
Evaluated strategies
Datasets from chemical production processes
Average classification error (%)
Average classification error per batch (%)
Strategies: Baseline, Batch, Batch+SMAC, Cumulative, Cumulative+SMAC
drier: batch adaptation doesn't help! :(
thermalox: batch adaptation does help! :)
MCPS similarity analysis
Batch+SMAC (catalyst dataset): same components, only hyperparameters are adapted.
Cumulative+SMAC (catalyst dataset): large difference between batches.
Conclusion and future work
Automating machine learning is becoming a reality. There is a variety of open-source software, but also commercial products (e.g. SigOpt and IBM Watson).
Domain experts still play a crucial role (e.g. in defining the search space).
Smart techniques to reduce the search space are needed.
Maintaining MCPSs in a production environment is key for success.
There is a gap in adaptive surrogate models for Bayesian optimisation methods.
Thanks!
Publications with Marcin Budka and Bogdan Gabrys:
● “Towards automatic composition of Multicomponent Predictive Systems” - HAIS 2016 (published) http://bit.ly/towards-mcps-paper
● “Automatic composition and optimisation of Multicomponent Predictive Systems” - IEEE TKDE (under review) http://bit.ly/automatic-mcps-paper
● “Adapting Multicomponent Predictive Systems using hybrid adaptation Strategies with Auto-WEKA in process industry” - AutoML at ICML 2016 (accepted) http://bit.ly/adapting-mcps-paper
● “Effects of change propagation resulting from adaptive preprocessing in Multicomponent Predictive Systems” - KES 2016 (accepted) http://bit.ly/change-propagation-mcps-paper
Slides available at http://www.slideshare.net/draxus
Contact: Manuel Martin Salvador [email protected]