ana m martinez - amidst toolbox- scalable probabilistic machine learning with flink

33
A Java Toolbox for Scalable Probabilistic Machine Learning 12-14 SEP 2016, Berlin Ana M. Martínez Aalborg University [email protected]

Upload: flink-forward

Post on 14-Apr-2017

102 views

Category:

Data & Analytics


0 download

TRANSCRIPT

AMIDST Toolbox – Flink Forward 1

A Java Toolbox for Scalable Probabilistic Machine Learning

12-14 SEP 2016, Berlin

Ana M. Martínez Aalborg University

[email protected]

AMIDST Toolbox – Flink Forward 2

Who are we?

THE AMIDST CONSORTIUM

3

Academic Industrial

AMIDST Toolbox – Flink Forward

4

Running Use Case

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

5

Predicting Defaulting Clients

Predicts probability a customer will default within 2 years

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

§  Daily data for millions of clients §  Tons of missing data. §  Odd distributions.

6 AMIDST Toolbox – Flink Forward

7

Toolbox presentation

AMIDST Toolbox – Flink Forward

GENERAL DESCRIPTION

8 AMIDST Toolbox – Flink Forward

AMIDST APPROACH

9

Data Knowledge

Openbox Models

Blackbox Inference Engine (Powered by Flink)

AMIDST Toolbox – Flink Forward

10

Main Features

AMIDST Toolbox – Flink Forward

PGMS

11

Probabilistic graphical models (PGMs) Specify your model using probabilistic graphical models with latent variables and temporal dependencies

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

12 AMIDST Toolbox – Flink Forward

PGMS

13

Custom Gaussian Mixture Model Hij defines a local mixture. Hi defines a global mixture.

AMIDST Toolbox – Flink Forward

PGMS

14

RUNNING CODE EXAMPLE

//Set-up Flink session. final ExecutionEnvironment env =

ExecutionEnvironment.getExecutionEnvironment(); //Load the data stream String filename = "hdfs://dataFlink_month0.arff"; DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the model Model model = new CustomGaussianMixture(data.getAttributes());

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

15

Scalable Learning Perform Bayesian inference on your probabilistic models with powerful approximate and scalable algorithms.

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

16

d-VMP Algorithm - Coded as iterative map-reduce task A state-of-the-art distributed variational message passing algorithm.

AMIDST Toolbox – Flink Forward

SCALABLE INFERENCE

17

RUNNING CODE EXAMPLE

AMIDST Toolbox – Flink Forward

//Set-up Flink session. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); //Load the data stream String filename = “hdfs://dataFlink_month0.arff"; DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false); //Build the model Model model = new CustomGaussianMixture(data.getAttributes());

//Learn the model model.updateModel(data);

DATA STREAMS

18

Data Streams Update your models when new data is available. This makes our toolbox appropriate for learning from data streams.

AMIDST Toolbox – Flink Forward

DATA STREAMS

19

//Set-up Flink session. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); //Load the data stream String filename = “hdfs://dataFlink_month0.arff"; DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false); //Build the model Model model = new CustomGaussianMixture(data.getAttributes()); //Learn the model model.updateModel(data); //Update your model for(int i=1; i<12; i++) { filename = “dataFlink_month"+i+".arff"; data = DataFlinkLoader.loadDataFromFolder(env, filename,false); model.updateModel(data); }

RUNNING CODE EXAMPLE

AMIDST Toolbox – Flink Forward

RUNNING USE CASE

20

Predicting Defaulting Clients §  Old BCC’s models based on logistic regression got an AUC of 0.816. §  AMIDST’s models gets an AUC of 0.952.

AMIDST Toolbox – Flink Forward

SCALABILITY ANALYSIS

21

Scalability analysis Use your defined models to process massive data sets in a distributed computer cluster using Flink.

AMIDST Toolbox – Flink Forward

SCALABILITY ANALYSIS

22

One billion node probabilistic model Experiment on a Flink cluster with 16 nodes on AWS.

AMIDST Toolbox – Flink Forward

SCALABILITY ANALYSIS

§  Speedup(withrespectto2nodes)

AMIDST Toolbox – Flink Forward 23/32

0

1

2

3

4

5

6

7

8

4nodes 8nodes 16nodes

MODULAR DESIGN

24

Modular Design

The AMIDST Toolbox has been designed following a modular structure. This makes easier:

§  The maintenance and enhancement of the software §  The integration with external software: HUGIN, MOA, Weka, R.

AMIDST Toolbox – Flink Forward

MODULAR DESIGN

25 AMIDST Toolbox – Flink Forward

26

Running Use Case II

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

27

Tracking Concept Drift Detects changes in customer profiles during Spanish financial crisis

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

28

Hidden Variables are used to capture changes in customer profile

MODEL

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

29

RUNNING CODE

AMIDST Toolbox – Flink Forward

//Set-up Flink session. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); //Load the data stream String filename = “hdfs://dataFlink_month0.arff"; DataFlink<DataInstance> data =

DataFlinkLoader.loadDataFromFolder(env, filename, false);

//Build the model Model model = new ConceptDriftDetector(data.getAttributes()); //Learn the model model.updateModel(data); //Update your model for(int i=1; i<12; i++) { filename = “dataFlink_month"+i+".arff"; data = DataFlinkLoader.loadDataFromFolder(env, filename,false); model.updateModel(data); System.out.println(model.getPosteriorDistribution(“hiddenVar”).

toString()); }

CONCEPT DRIFT DETECTION

30

Hidden Variable Captures Concept Drift Drift Pattern: Seasonal + Global trend

RESULTS

AMIDST Toolbox – Flink Forward

CONCEPT DRIFT DETECTION

31

Unemployment Rate main driver of Concept Drift Hidden Variable correlates with unemployment rate (rho = 0.961)

RESULTS

AMIDST Toolbox – Flink Forward

COLLABORATE

32

Unemployment Rate main driver of Concept Drift

Hidden Variable correlates with unemployment rate (rho = 0.961)

www.amidsttoolbox.com github.com/amidst/toolbox

Appache License 2.0

AMIDST Toolbox – Flink Forward

AMIDST Toolbox – Flink Forward 33

Thanks for your attention

@ [email protected]

@AmidstToolbox

www www.amidsttoolbox.com