best practices machine learning final

© 2013 Datameer, Inc. All rights reserved. Best Practices for Big Data Analytics with Machine Learning


Page 1: Best practices machine learning final


Best Practices for Big Data Analytics with Machine Learning

Page 2: Best practices machine learning final

Dr. Alex Guazzelli has co-authored the first book on PMML, the Predictive Model Markup Language. At Zementis, Dr. Guazzelli is responsible for developing core technology and analytical solutions for Big Data and real-time scoring. Most recently, Dr. Guazzelli started teaching a class on standards for predictive analytics at UC San Diego Extension.

About our Speakers

Dr. Alex Guazzelli, Zementis Vice President, Analytics (@DrAlexGuazzelli)

Page 3: Best practices machine learning final

• Came from Informatica
• Worked with start-ups
• Informatica purchased to bring data solutions to market
• Data quality
• Master data management
• B2B
• Data security solutions

About our Speakers

• Over 15 years of enterprise software experience

• Co-authored 4 patents

• Worked in a variety of engineering, marketing and sales roles

• Bachelor of Science degree in Management Science and Engineering from Stanford University

Karen Hsu, Datameer Senior Director, Product Marketing (@Karenhsumar)

Page 4: Best practices machine learning final

Agenda

• Considerations

• Best Practices

• Demonstration

• Q&A

Page 5: Best practices machine learning final


Considerations

Page 6: Best practices machine learning final

Considerations

Target Users: Business, IT, Data Scientist

Questions: Descriptive, Predictive, Prescriptive

Page 7: Best practices machine learning final

■ Visual

Business Professional

Clustering

Decision Trees

Dependencies

+ More!

Target Users

Page 8: Best practices machine learning final

IT

▪Flexible, powerful

Target Users

Page 9: Best practices machine learning final

▪ Algorithms
▪ SAS, SPSS, R

Data Scientist

Target Users

Page 10: Best practices machine learning final

■ Descriptive machine learning…
– Tells you what has happened

Descriptive Predictive Prescriptive

Questions

Page 11: Best practices machine learning final

■ Predictive machine learning…
– Answers the question: what will happen?

Descriptive Predictive Prescriptive

Questions

Page 12: Best practices machine learning final

■ Prescriptive machine learning…
– What will happen, when it will happen, why it will happen
– Predict what will happen and prescribe how to take advantage of this future

Descriptive Predictive Prescriptive

Questions

Page 13: Best practices machine learning final


Best Practices

Page 14: Best practices machine learning final

Lean Analytics

Page 15: Best practices machine learning final

Data Preparation

Page 16: Best practices machine learning final

Descriptive Analytics

Page 17: Best practices machine learning final

Predictive Analytics

Descriptive vs. Predictive Analytics

Descriptive Analytics answers “What happened?” Predictive Analytics answers “What will happen next?”

Predictive Analytics helps you discover patterns in the past, which can signal what is ahead.

Predictive analytics is able to discover hidden patterns in historical data that the human expert may not see. It is in fact the result of mathematics applied to data. As such, it benefits from clever mathematical techniques as well as good data.


Page 18: Best practices machine learning final

Example: Predicting Churn

Matt - Churned 2 days ago

Scott - “Liked” our company last week

John - ??

Page 19: Best practices machine learning final

Churn-related features

Matt
• 3 complaints in last 6 months
• Opened 2 support tickets in last 4 weeks
• Spent a total of $1,234 buying merchandise
• Spent a total of $123 in services
• Purchased 2 items in last 4 weeks
• Is 34 years old
• Is a male
• Lives in Los Angeles
• ...

Scott
• No complaints in last 6 months
• Opened 1 support ticket in last 4 weeks
• Spent a total of $9,876 buying merchandise
• Spent a total of $987 in services
• Purchased 12 items in last 4 weeks
• Is 54 years old
• Is a male
• Lives in Chicago
• ...
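The churn-related features on this slide can be read as one record per customer. Below is a minimal sketch, assuming Python, of how such records might be turned into numeric vectors that a model can consume. The class name, field names, and one-hot encoding of the city are assumptions made for illustration; only the values for Matt and Scott come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ChurnFeatures:
    """One customer's churn-related features (field names are hypothetical)."""
    complaints_6m: int        # complaints in last 6 months
    tickets_4w: int           # support tickets opened in last 4 weeks
    spend_merchandise: float  # total spent on merchandise, in dollars
    spend_services: float     # total spent on services, in dollars
    purchases_4w: int         # items purchased in last 4 weeks
    age: int
    is_male: bool
    city: str


# Values copied from the slide.
matt = ChurnFeatures(3, 2, 1234.0, 123.0, 2, 34, True, "Los Angeles")
scott = ChurnFeatures(0, 1, 9876.0, 987.0, 12, 54, True, "Chicago")

# Assumed list of known cities, used for one-hot encoding.
CITIES = ["Chicago", "Los Angeles"]


def to_vector(c, cities):
    """Flatten a record into floats; the city becomes a one-hot block."""
    numeric = [
        float(c.complaints_6m), float(c.tickets_4w), c.spend_merchandise,
        c.spend_services, float(c.purchases_4w), float(c.age),
        1.0 if c.is_male else 0.0,
    ]
    one_hot = [1.0 if c.city == name else 0.0 for name in cities]
    return numeric + one_hot


matt_vec = to_vector(matt, CITIES)
scott_vec = to_vector(scott, CITIES)
print(matt_vec)
print(scott_vec)
```

In practice the numeric columns would usually also be scaled before training, since dollar amounts and complaint counts live on very different ranges.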

Page 20: Best practices machine learning final

Big Data

An ever-expanding ocean of data containing people and sensor data (lots and lots of it):

• Transaction records
• Social media
• Climate information
• Mobile GPS signals
• Healthcare
• Smart Grid

90% of the data today created in last 2 years

Breadth and Depth

Digital Breadcrumbs

Page 21: Best practices machine learning final

Churn-related “Big Data” features

Matt
• 12 friends listed as customers
• 2 complaints from friends in last 6 months
• Average age of friends is 41 years old
• 2 friends churned in last 30 days
• No purchases for same items as friends
• 1 website visit in last 7 days
• 2 website pages opened during last visit
• Opened 3 newsletters in last 6 months
• ...

Scott
• 34 friends listed as customers
• 1 complaint from friends in last 6 months
• Average age of friends is 62 years old
• No friends churned in last 30 days
• Purchased same 2 items as friends in last 2 months
• 3 website visits in last 7 days
• 5 website pages opened during last visit
• Opened 12 newsletters in last 6 months
• ...
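Features such as "website visits in last 7 days" or "friends churned in last 30 days" are typically derived from raw event logs rather than stored directly. The sketch below, assuming Python, shows one way such time-window counts could be computed. The function names and all dates are invented for illustration; they were chosen only so that the counts match the values listed for Matt.

```python
from datetime import date, timedelta


def visits_in_last_days(visit_dates, today, days=7):
    """Count visits that fall within the `days` days ending at `today`."""
    cutoff = today - timedelta(days=days)
    return sum(1 for d in visit_dates if cutoff < d <= today)


def friends_churned_recently(friend_churn_dates, today, days=30):
    """Count friends whose churn date (None = still a customer) is recent."""
    cutoff = today - timedelta(days=days)
    return sum(
        1 for d in friend_churn_dates
        if d is not None and cutoff < d <= today
    )


# Hypothetical raw events for one customer (dates are made up).
today = date(2013, 6, 30)
visits = [date(2013, 6, 1), date(2013, 6, 20), date(2013, 6, 28)]
friend_churn = [None, date(2013, 6, 15), date(2013, 3, 2), date(2013, 6, 29)]

visits_7d = visits_in_last_days(visits, today)
churned_30d = friends_churned_recently(friend_churn, today)
print(visits_7d, churned_30d)
```

On real Big Data volumes the same aggregation would run as a distributed job, but the windowing logic stays the same.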

Page 22: Best practices machine learning final

Building a predictive model ...

Model Training: Data (churn-related features, labeled Churned / Not-churned) is used to train a Predictive Model, which then produces a Prediction.

[Figure: neural network with Input Layer, Hidden Layer, and Output Layer]

Modeling techniques:
• Neural Networks
• Linear/Logistic Regression
• Support Vector Machines
• Scorecards
• Decision Trees
• Clustering
• Association Rules
• K-Nearest Neighbors
• Naive Bayes Classifiers
• ...
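To make the training step concrete, here is a minimal sketch, assuming Python, of one of the listed techniques: logistic regression fitted with stochastic gradient descent. The training rows and labels are made-up toy data, not results from the webinar, and the function names are hypothetical. A real project would use a tested library and far more data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of log loss w.r.t. the linear score
            bias -= lr * error
            weights = [w - lr * error * v for w, v in zip(weights, x)]
    return weights, bias


def churn_probability(x, weights, bias):
    """Score one customer: estimated probability of churn."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy data (invented): [complaints in last 6 months, purchases in last 4 weeks]
rows = [[3, 2], [4, 1], [5, 0], [0, 12], [1, 8], [0, 10]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = not churned

weights, bias = train_logistic(rows, labels)
for x, y in zip(rows, labels):
    print(x, y, round(churn_probability(x, weights, bias), 3))
```

The same two-phase shape (fit on labeled history, then score new customers) applies to the other techniques in the list; only the model internals change.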

Page 23: Best practices machine learning final

Why not several models?

Model Ensemble

Raw Inputs, then Data Pre-Processing, then Model 1, Model 2, ..., Model n, then Voting, then Prediction

• Scores from all models are computed
• Voting: Majority Voting, Weighted Voting, Weighted Average, etc.
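The three combination rules named on this slide can be sketched in a few lines. This is an illustrative Python sketch, not the implementation used by any product mentioned in the deck. The labels, scores, and weights are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the most models.

    On a tie, the label that appears first in `labels` wins.
    """
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Return the label with the largest total weight behind it."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)


def weighted_average(scores, weights):
    """Combine numeric scores (e.g. churn probabilities) into one score."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical outputs of three models for one customer.
labels = ["churn", "no-churn", "churn"]
scores = [0.9, 0.3, 0.6]
weights = [0.2, 0.7, 0.1]  # assumed trust in each model

print(majority_vote(labels))
print(weighted_vote(labels, weights))
print(weighted_average(scores, weights))
```

Note that the rules can disagree: here two of three models say "churn", but the most trusted model says "no-churn" and carries the weighted vote.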

Page 24: Best practices machine learning final

End Goal: Predicting churn ...

Model Deployment and Execution in

Churn-related Features

Big Data

Predictive Churn Model

Churn Risk Score

Page 25: Best practices machine learning final

From Model Building to Model Deployment (Traditionally ...)

Scientist’s Desktop: SAS, R, IBM SPSS, Perl, Python

Production Environment: Java, .NET, C, SQL

Lost in Translation

SAS, R, IBM SPSS …

Great for model building but not for scoring, even more so when it comes to Hadoop

Page 26: Best practices machine learning final

From Model Building to Model Deployment (with PMML)

Model Building:

• Angoss
• BigML
• FICO Model Builder
• IBM SPSS
• KNIME
• KXEN
• MicroStrategy
• Open Data
• Pervasive DataRush
• RapidMiner
• R / Rattle
• SAS
• SAP Business Objects
• Salford Systems
• StatSoft STATISTICA
• SQL Server
• TIBCO Spotfire
• Custom Code, etc.

PMML (models)

Model Deployment and Execution:

• Universal PMML Plug-in (UPPI)
• Datameer Server

Deploy in minutes ...

Page 27: Best practices machine learning final

PMML is an XML-based language used to define statistical and data mining models and to share these between compliant applications.

It is a mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models.

PMML eliminates the need for custom model deployment and ensures reliability.

PMML defines a standard not only to represent data-mining models, but also data handling and data transformations (pre- and post-processing).

Predictive Model Markup Language

Models

Data Transformations
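To give a feel for what a PMML file looks like, the sketch below embeds a small hand-written PMML 4.1 regression model in a Python string and scores one record against it. The model, its field names, and its coefficients are hypothetical, and the fragment has not been validated against the official schema. The tiny `score` function handles only this one model shape; it is not how the UPPI or any other PMML engine is implemented.

```python
import math
import xml.etree.ElementTree as ET

# Hand-written, hypothetical churn model: a single regression table whose
# linear output is squashed by the logit normalization into a 0..1 score.
PMML_DOC = """<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Hypothetical churn model for illustration"/>
  <DataDictionary numberOfFields="3">
    <DataField name="complaints" optype="continuous" dataType="double"/>
    <DataField name="tickets" optype="continuous" dataType="double"/>
    <DataField name="churn_risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="complaints"/>
      <MiningField name="tickets"/>
      <MiningField name="churn_risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-1.0">
      <NumericPredictor name="complaints" coefficient="0.8"/>
      <NumericPredictor name="tickets" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_1"}


def score(pmml_text, record):
    """Evaluate the single RegressionTable in `pmml_text` for one record."""
    root = ET.fromstring(pmml_text)
    model = root.find("p:RegressionModel", NS)
    table = model.find("p:RegressionTable", NS)
    y = float(table.get("intercept"))
    for pred in table.findall("p:NumericPredictor", NS):
        y += float(pred.get("coefficient")) * record[pred.get("name")]
    if model.get("normalizationMethod") == "logit":
        y = 1.0 / (1.0 + math.exp(-y))
    return y


# Matt's complaint and ticket counts from the earlier churn slide.
matt_score = score(PMML_DOC, {"complaints": 3, "tickets": 2})
print(round(matt_score, 4))
```

Because the model is plain XML, the tool that builds it and the engine that scores it do not need to share a programming language, which is the portability argument the slide makes.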

Page 28: Best practices machine learning final

Neural Networks (neural gas, radial-basis and backpropagation)

Support Vector Machines (for classification and regression)

Naive Bayes Classifier (for continuous and categorical inputs)

Rule Set Models

Clustering Models (2-step clustering, distribution and center-based)

Decision Trees (for classification and regression)

General Regression Models (Cox, General and Generalized Linear Models)

Regression Models (Linear, Logistic and Polynomial Regression Models)

Scorecards (with support for Reason Codes)

Restricted Boltzmann Machines

Association Rules

Multiple Models (with the possibility of having models spread over multiple PMML files)

Model Ensemble (including Random Forest Models and Boosted Trees)

Model Segmentation

Model Chaining

Model Composition

Model Cascade

UPPI: Supported Techniques

© Zementis, Inc. - Confidential

Page 29: Best practices machine learning final

Demonstration Flow

• Descriptive (Karen)
• Predictive Modeling (Alex)
• Predictive Production (Karen)
• Prescriptive (Karen)

Page 30: Best practices machine learning final


Descriptive Analytics

Page 31: Best practices machine learning final

Descriptive Analytics
▪ Answers: What caused people to churn?
▪ Clustering
▪ Column Dependencies
▪ Decision Tree


Page 33: Best practices machine learning final


Predictive Analytics

Page 34: Best practices machine learning final

Predictive Analytics
▪ Who will churn?


Page 36: Best practices machine learning final


Prescriptive Analytics

Page 37: Best practices machine learning final

Prescriptive Analytics
▪ Who will churn? Why will they churn?
▪ What can we do to support that outcome?


Page 39: Best practices machine learning final

Q&A

Page 40: Best practices machine learning final

Next Steps:


More about Datameer and Big Data: www.datameer.com

More about Zementis: www.zementis.com

Contact us:
Alex Guazzelli [email protected]
Karen Hsu [email protected]