© 2013 Datameer, Inc. All rights reserved.
Best Practices for Big Data Analytics with Machine Learning
About our Speakers
Dr. Alex Guazzelli
Zementis Vice President, Analytics (@DrAlexGuazzelli)
Dr. Alex Guazzelli has co-authored the first book on PMML, the Predictive Model Markup Language. At Zementis, Dr. Guazzelli is responsible for developing core technology and analytical solutions for Big Data and real-time scoring. Most recently, Dr. Guazzelli started teaching a class on standards for predictive analytics at UC San Diego Extension.
• Came from Informatica
• Worked with start-ups
• Informatica purchased to bring data solutions to market
• Data quality
• Master data management
• B2B
• Data security solutions
About our Speakers
Karen Hsu
Datameer Senior Director, Product Marketing (@Karenhsumar)
• Over 15 years of enterprise software experience
• Co-authored 4 patents
• Worked in a variety of engineering, marketing, and sales roles
• Bachelor of Science degree in Management Science and Engineering from Stanford University
Agenda
• Considerations
• Best Practices
• Demonstration
• Q&A
Considerations
Target Users: Business, IT, Data Scientist
Questions: Descriptive, Predictive, Prescriptive
Target Users: Business Professional
■ Visual
■ Clustering, Decision Trees, Dependencies, + More!
Target Users: IT
▪ Flexible, powerful
Target Users: Data Scientist
▪ Algorithms
▪ SAS, SPSS, R
Questions
■ Descriptive machine learning…
– Tells you what has happened
■ Predictive machine learning…
– Answers the question of what will happen
■ Prescriptive machine learning…
– What will happen, when it will happen, why it will happen
– Predict what will happen and prescribe how to take advantage of this future
Best Practices
Lean Analytics
Data Preparation
Descriptive Analytics
Predictive Analytics
Descriptive vs. Predictive Analytics
Descriptive Analytics answers “What happened?” Predictive Analytics answers “What will happen next?”
Predictive Analytics helps you discover patterns in the past, which can signal what is ahead.
Predictive analytics is able to discover hidden patterns in historical data that the human expert may not see. It is in fact the result of mathematics applied to data. As such, it benefits from clever mathematical techniques as well as good data.
Example: Predicting Churn
Matt - Churned 2 days ago
Scott - “Liked” our company last week
John - ??
Churn-related features
Matt
• 3 complaints in last 6 months
• Opened 2 support tickets in last 4 weeks
• Spent a total of $1,234 buying merchandise
• Spent a total of $123 in services
• Purchased 2 items in last 4 weeks
• Is 34 years old
• Is male
• Lives in Los Angeles
• ...
Scott
• No complaints in last 6 months
• Opened 1 support ticket in last 4 weeks
• Spent a total of $9,876 buying merchandise
• Spent a total of $987 in services
• Purchased 12 items in last 4 weeks
• Is 54 years old
• Is male
• Lives in Chicago
• ...
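Before a learning algorithm can use profiles like Matt's and Scott's, each customer is typically turned into a fixed-length numeric vector. The following is a minimal sketch of that step, not the method used in the presentation: the field names (`complaints_6m`, `tickets_4w`, and so on), the `Customer` class, and the one-hot encoding of the city are all assumptions made for illustration. Only the values themselves come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical record holding the churn-related features from the slide."""
    name: str
    complaints_6m: int        # complaints in last 6 months
    tickets_4w: int           # support tickets opened in last 4 weeks
    merchandise_spend: float  # total spent on merchandise, in dollars
    services_spend: float     # total spent on services, in dollars
    items_4w: int             # items purchased in last 4 weeks
    age: int
    is_male: bool
    city: str


# Assumed city vocabulary for one-hot encoding (illustration only).
CITIES = ["Chicago", "Los Angeles"]


def to_vector(c: Customer) -> list:
    """Flatten a customer into numbers; the city becomes one 0/1 slot per known city."""
    one_hot = [1.0 if c.city == city else 0.0 for city in CITIES]
    return [
        float(c.complaints_6m),
        float(c.tickets_4w),
        c.merchandise_spend,
        c.services_spend,
        float(c.items_4w),
        float(c.age),
        1.0 if c.is_male else 0.0,
    ] + one_hot


matt = Customer("Matt", 3, 2, 1234.0, 123.0, 2, 34, True, "Los Angeles")
scott = Customer("Scott", 0, 1, 9876.0, 987.0, 12, 54, True, "Chicago")

print(to_vector(matt))
print(to_vector(scott))
```

In practice the numeric columns would usually also be scaled so that large dollar amounts do not dominate small counts.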
Big Data
An ever-expanding ocean of data containing people and sensor data (lots and lots of it):
90% of the data today was created in the last 2 years
Breadth and Depth
• Transaction records
• Social media
• Climate information
• Mobile GPS signals
• Healthcare
• Smart Grid
• Digital Breadcrumbs
Churn-related “Big Data” features
Matt
• 12 friends listed as customers
• 2 complaints from friends in last 6 months
• Average age of friends is 41 years old
• 2 friends churned in last 30 days
• No purchases for same items as friends
• 1 website visit in last 7 days
• 2 website pages opened during last visit
• Opened 3 newsletters in last 6 months
• ...
Scott
• 34 friends listed as customers
• 1 complaint from friends in last 6 months
• Average age of friends is 62 years old
• No friends churned in last 30 days
• Purchased same 2 items as friends in last 2 months
• 3 website visits in last 7 days
• 5 website pages opened during last visit
• Opened 12 newsletters in last 6 months
• ...
Building a predictive model ...
Model Training
Data (churn-related features) → Predictive Model → Prediction (Churned / Not-churned)
[Figure: neural network with Input Layer, Hidden Layer, and Output Layer]
• Neural Networks
• Linear/Logistic Regression
• Support Vector Machines
• Scorecards
• Decision Trees
• Clustering
• Association Rules
• K-Nearest Neighbors
• Naive Bayes Classifiers
• ...
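The training step can be illustrated with logistic regression, one of the techniques listed above. The following is a minimal sketch in plain Python, assuming a made-up toy training set with two features (complaints in the last 6 months, items purchased in the last 4 weeks) and labels of 1 for churned and 0 for not churned. The function names, learning rate, and epoch count are illustrative choices, not values from the presentation.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_risk(weights, bias, x):
    """Predicted probability of churn for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.05, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = churn_risk(weights, bias, x) - y  # prediction minus label
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up toy data (illustration only):
# [complaints in last 6 months, items purchased in last 4 weeks]
rows = [[3, 2], [4, 1], [2, 0], [5, 3], [0, 12], [1, 8], [0, 6], [1, 10]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = churned, 0 = not churned

weights, bias = train_logistic(rows, labels)

# Score a customer whose outcome is unknown (hypothetical feature values).
john = [2, 3]
print("Churn risk for John:", round(churn_risk(weights, bias, john), 3))
```

The same pattern applies to the other techniques: historical features and known outcomes go in, and a function that scores new customers comes out.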
Why not several models?
Model Ensemble
Raw Inputs → Data Pre-Processing → Model 1, Model 2, ..., Model n → Voting → Prediction
Scores from all models are computed and then combined: Majority Voting, Weighted Voting, Weighted Average, etc.
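The combination schemes named on the ensemble slide can be sketched in a few lines of Python. This is an illustration under stated assumptions: the three model outputs, their scores, and the weights are hypothetical values, and the function names are chosen here for clarity.

```python
from collections import Counter


def majority_vote(labels):
    """Return the class predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Return the class with the largest total weight behind it."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def weighted_average(scores, weights):
    """Combine numeric scores into one score, weighting each model."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical outputs of three models for one customer.
predictions = ["churn", "no-churn", "churn"]
scores = [0.8, 0.4, 0.6]     # each model's churn probability
weights = [1.0, 3.0, 1.0]    # assumed trust placed in each model

print(majority_vote(predictions))
print(weighted_vote(predictions, weights))
print(weighted_average(scores, weights))
```

With these inputs, majority voting and weighted voting disagree: two of three models say "churn", but the single heavily weighted model outweighs them. This is why the choice of combination scheme matters.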
End Goal: Predicting churn ...
Model Deployment and Execution in Big Data
Churn-related Features → Predictive Churn Model → Churn Risk Score
Lost in Translation
Scientist’s Desktop: SAS, R, IBM SPSS, Perl, Python
Production Environment: Java, .NET, C, SQL
From Model Building to Model Deployment (Traditionally ...)
SAS, R, IBM SPSS …
Great for model building but not for scoring, even more so when it comes to Hadoop
From Model Building to Model Deployment (with PMML)
Model Building:
Angoss
BigML
FICO Model Builder
IBM SPSS
KNIME
KXEN
Microstrategy
Open Data
Pervasive DataRush
RapidMiner
R / Rattle
SAS
SAP Business Objects
Salford Systems
StatSoft STATISTICA
SQL Server
TIBCO Spotfire
Custom Code, etc.
Model Deployment and Execution:
PMML (models) → Universal PMML Plug-in (UPPI) → Datameer Server
Deploy in minutes ...
Predictive Model Markup Language
PMML is an XML-based language used to define statistical and data mining models and to share these between compliant applications.
It is a mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models.
PMML eliminates the need for custom model deployment and ensures reliability.
PMML defines a standard not only to represent data-mining models, but also data handling and data transformations (pre- and post-processing).
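To make the idea of an XML-based model format concrete, the sketch below reads a small hand-written PMML regression model with Python's standard library and scores one record. The document is an illustration written for this example, not output from any of the tools named in the presentation. The field names and coefficients are made up, and the scoring code handles only this one simple case (numeric predictors with a `logit` normalization). It is not a substitute for a full PMML scoring engine.

```python
import math
import xml.etree.ElementTree as ET

# Hand-written, minimal PMML document (illustrative values only).
PMML_DOC = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Hand-written illustration, not exported from a real tool"/>
  <DataDictionary numberOfFields="3">
    <DataField name="complaints" optype="continuous" dataType="double"/>
    <DataField name="purchases" optype="continuous" dataType="double"/>
    <DataField name="churn_risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="complaints"/>
      <MiningField name="purchases"/>
      <MiningField name="churn_risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-1.0">
      <NumericPredictor name="complaints" coefficient="0.5"/>
      <NumericPredictor name="purchases" coefficient="-0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

root = ET.fromstring(PMML_DOC.strip())
table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)

# Pull the model parameters out of the XML.
intercept = float(table.get("intercept"))
coefficients = {
    p.get("name"): float(p.get("coefficient"))
    for p in table.findall("pmml:NumericPredictor", NS)
}


def score(record):
    """Linear predictor followed by the logit normalization 1 / (1 + exp(-z))."""
    z = intercept + sum(coef * record[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


print(coefficients)
print(score({"complaints": 2.0, "purchases": 0.0}))
```

Because the model is plain XML, the same file can be produced by one tool and consumed by another, which is the portability argument the slide makes.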
Models
Data Transformations
Neural Networks (neural gas, radial-basis and backpropagation)
Support Vector Machines (for classification and regression)
Naive Bayes Classifier (for continuous and categorical inputs)
Rule Set Models
Clustering Models (2-step clustering, distribution and center-based)
Decision Trees (for classification and regression)
General Regression Models (Cox, General and Generalized Linear Models)
Regression Models (Linear, Logistic and Polynomial Regression Models)
Scorecards (with support for Reason Codes)
Restricted Boltzmann Machines
Association Rules
Multiple Models (with the possibility of having models spread over multiple PMML files)
Model Ensemble (including Random Forest Models and Boosted Trees)
Model Segmentation
Model Chaining
Model Composition
Model Cascade
UPPI: Supported Techniques
© Zementis, Inc. - Confidential
Demonstration Flow
• Descriptive (Karen)
• Predictive Modeling (Alex)
• Prescriptive (Karen)
• Predictive Production (Karen)
Descriptive Analytics
▪ Answers: What caused people to churn?
▪ Clustering
▪ Column Dependencies
▪ Decision Tree
Predictive Analytics
▪ Who will churn?
Prescriptive Analytics
▪ Who will churn? Why will they churn?
▪ What can we do to support that outcome?
Q&A
Next Steps:
More about Datameer and Big Data: www.datameer.com
More about Zementis: www.zementis.com
Contact us:
Alex Guazzelli [email protected]
Karen Hsu [email protected]