Real-Time Big Data Analytics: From Deployment to Production
DESCRIPTION
As the Big Data market has evolved, the focus has shifted from data operations (storage, access and processing of data) to data science (understanding, analyzing and forecasting from data). And as new models are developed, organizations need a process for deploying analytics from research into the production environment. In this talk, we'll describe the five stages of real-time analytics deployment:
• Data distillation
• Model development
• Model validation and deployment
• Model refresh
• Real-time model scoring
We'll review the technologies supporting each stage, and how Revolution Analytics software works with the entire analytics stack to bring Big Data analytics to real-time production environments.
TRANSCRIPT
Revolution Confidential
David M Smith, @revodavid
VP Marketing and Community, Revolution Analytics
Real-Time Predictive Analytics with Big Data: from deployment to production
Today we’ll discuss:
• What is “real-time predictive analytics with big data”?
• Case study: UpStream Software
• Real-time big-data stack
• Five phases of deployment to production
• Just what do we mean by “Big Data” and “Real-Time”, anyway?
• Recommendations
• Q&A
Real-Time Predictive Analytics with Big Data
• Real Time
  • Milliseconds? Seconds? Hours? Days?
  • In production, continuously updated
• Big Data
  • Size, flow rate, diversity
  • Conflict with ‘real time’
• Predictive Analytics
  • Rear view: description, aggregation, tabulation
  • Prediction, inference, statistical modeling
Case Study
www.upstreamsoftware.com
From: “How Big Data is Changing Retail Marketing Analytics”
http://bit.ly/upstream-webinar
“Given that our data sets are already in the terabytes and are growing
rapidly, we depend on Revolution R Enterprise’s scalability and fast
performance — we saw about a 4x performance improvement on 50
million records. It works brilliantly.”
John Wallace, CEO Upstream Software
UpStream Software
UpStream’s Big Data Analytics engine analyzes and optimizes marketing mix for UpStream’s retailer clients:
• Customer analytics and segmentation
• Revenue/action attribution among channels and marketing programs
• Customized Next Best Action per individual prospect or customer
• Event-Triggered Marketing: responses to consumer actions
Big Data
• Demographics: consumer, product, market
• Actions: web clicks, email clicks, mobile app usage, call center logs, social, search …
• Outcomes: impressions, touches, orders (retail, online, mobile)
Predictive Analytics
• ETL
• N marketing channels
• Behavioral variables
• Promotional data
• Overlay data
• Time-to-event models
• Exploratory data analysis
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day per customer
UPSTREAM DATA FORMAT
CUSTOM VARIABLES (PMML)
Real Time
http://www.infoworld.com/d/big-data/williams-sonoma-uses-big-data-zero-in-customers-206641
November 8, 2012
Predictive vs Descriptive Analytics
Retail sales are influenced by stores near the home
Newer customers are more sensitive to marketing
Google keywords perform worse than you think
Display advertising performs better than you think
Online sales benefit more from seasonal events than retail stores
Real-Time Big Data Predictive Analytics Stack
Data Layer
• Structured data: RDBMS, NoSQL, HBase, Impala
• Unstructured data: Hadoop MapReduce
• Streaming data: web, social, sensors, operational systems
• Some descriptive analytics done here
Analytics Layer
• Predictive analytics technology
  • Development environment: build models
  • Production environment: deploy real-time scoring
  • Engine for dynamic analytics
• Local data mart
  • Static, periodically updated from data layer
  • Improves performance
Integration Layer
• Connective tissue between end-user applications and analytics engine
  • Engines for flow control, real-time scoring
  • Rules engine / CEP engine
• API for Dynamic Analytics
  • Brokers communication between app developers and data scientists
Decision Layer
• End-user interface to analytics system
  • Customers
  • Operations
  • Business analysts
  • C-suite
• Familiar & simple interfaces
  • Supports a range of end-user technologies
  • Level of UI complexity dictated by need
Real-Time Deployment
Five phases of deploying real-time predictive analytics with big data to production:
1. Data Distillation
2. Model Development
3. Validation and Deployment
4. Real-time Scoring
5. Model Refresh
Phase 1: Data Distillation
The data in the Data Layer isn’t yet ready for predictive modeling. We need to:
• Extract features from unstructured text
  • Topic modeling, sentiment analysis
• Combine disparate data sources
• Filter for populations of interest
• Select relevant features and outcomes for modeling
• Export to the data mart for predictive modeling
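The distillation steps can be sketched in a few lines. This is a minimal Python illustration, not the R tooling the talk describes: the records, the word lists and the function names (`sentiment`, `distill`, `export_csv`) are all hypothetical, and the "sentiment" feature is a toy word count standing in for real text analysis.

```python
import csv
import io

# Hypothetical raw inputs standing in for two Data Layer sources.
customers = [
    {"id": 1, "segment": "retail", "age": 34},
    {"id": 2, "segment": "online", "age": 51},
    {"id": 3, "segment": "retail", "age": 27},
]
events = [
    {"id": 1, "text": "love the new catalog", "ordered": 1},
    {"id": 2, "text": "terrible checkout experience", "ordered": 0},
    {"id": 3, "text": "great prices, great service", "ordered": 1},
]

POSITIVE = {"love", "great"}      # assumed word lists, for illustration only
NEGATIVE = {"terrible", "bad"}

def sentiment(text):
    """Toy text feature: positive minus negative word count."""
    words = text.replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def distill(customers, events, segment):
    """Combine sources, filter one population, select features and outcome."""
    by_id = {c["id"]: c for c in customers}
    rows = []
    for e in events:
        c = by_id.get(e["id"])
        if c is None or c["segment"] != segment:
            continue
        rows.append({"id": c["id"], "age": c["age"],
                     "sentiment": sentiment(e["text"]), "ordered": e["ordered"]})
    return rows

def export_csv(rows):
    """Write the distilled rows as a structured file for the data mart."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "age", "sentiment", "ordered"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

mart_rows = distill(customers, events, segment="retail")
print(export_csv(mart_rows))
```

Keeping each step a separate function is one way to make the extract repeatable, since the same script can be re-run whenever the data mart is refreshed.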
Example: Data Distillation in Hadoop
[Diagram: unstructured data (log files, event streams, scraped text) → HDFS load → Map-Reduce (rmr) → structured data → Analytics Data Mart]
Revolution R Enterprise for Hadoop: bit.ly/r-hadoop
Automating the Extraction Process
This process needs to be repeatable and maintainable
• Create a re-usable R script:
  • ASCII: rxDataStep
  • SQL: rxImport, ROracle, RMySQL
  • NoSQL: Rcassandra, rmongodb
  • Hadoop: rmr, rhbase, Rhive
  • Appliances: nza, teradataR, PL/R
  • Feeds: rjson, XML, RCurl, twitteR, python
• R script runs from Analytics Layer
  • Processing: in Data Layer
  • Output: structured file for Data Mart
Data Distillation Process
[Diagram; output label: XDF]
Phase 2: Model Development
Goal: create a predictive model that is
• Powerful
• Robust
• Comprehensible
• Implementable
Key requirements for Data Scientists:
• Flexibility
• Productivity
• Speed
• Reproducibility
The Model Development Cycle
[Diagram: Data Mart → cycle of sampling, aggregation, feature selection, variable transformation, model estimation, model refinement, model comparison / benchmarking → Predictive Model]
Download the white paper “R is Hot”: bit.ly/r-is-hot
Technology considerations
• Moving Big Data is slow
  • So just move it once, to data mart near the development and production environments
• Static data needed for modeling cycle
  • But have a plan to refresh and update
• Data layer not optimized for predictive analytics
  • But good for descriptive (data distillation)
• Revolution R Enterprise optimizations for the analytics layer
  • R; XDF file format; Parallel External Memory Algorithms
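The idea behind an external memory algorithm can be sketched as follows: read the data one block at a time and keep only running sums, so the full data set never has to fit in memory. This is a generic Python illustration of the technique for a one-variable linear fit; it is not Revolution R Enterprise's implementation, and `chunks` and `fit_line_chunked` are hypothetical names.

```python
def chunks(rows, size):
    """Yield successive fixed-size chunks, as if read block by block from disk."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def fit_line_chunked(chunk_iter):
    """Fit y = a + b*x from running sums, holding one chunk in memory at a time."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunk_iter:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Synthetic data on the line y = 3 + 2x, processed 100 rows at a time.
data = [(float(x), 3.0 + 2.0 * x) for x in range(1000)]
intercept, slope = fit_line_chunked(chunks(data, size=100))
print(round(intercept, 6), round(slope, 6))
```

Because the sums from separate chunks simply add, each chunk could in principle be summarized on a different worker and the partial sums merged afterwards, which is what makes this style of algorithm a natural fit for parallel execution.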
Predictive Modeling Algorithms
Algorithm | Example Applications | Big Data
Data Step | ETL, data distillation, record/variable selection, variable transformation |
Descriptive Statistics | Exploratory Data Analysis, Data Validation |
Tables & Cubes | Reporting, contingency analysis |
Correlation / Covariance | Factor Analysis, Value at Risk |
Linear Regression | Forecasting, Net present value estimation |
Logistic Regression | Response modeling, offer selection |
Generalized Linear Models | Capital reserve estimation, climate modeling |
K-means clustering | Customer Segmentation |
Decision Trees | Dynamic pricing, classification, variable importance |
Model Prediction | Real-time Scoring (decisions, offers, actions) |
R CRAN packages | Everything else |
Parallel & distributed computing with R | Simulations, By-Group analysis, ensemble models, custom applications |
High performance with distributed clusters
[Diagram: Data Scientists / Modelers; Grid computing cluster]
Revolution R Enterprise:
• Platform LSF
• Microsoft HPC Server
• Microsoft Azure
• Ad-hoc grids (SNOW)
• Shared SMP server
• Hadoop / HDFS
Phase 3: Model Validation and Deployment
• Scoring rules map parameters to outputs
  • Parameters: information known at real time (prospect ID, product, webpage, …)
  • Scores: values inferred in real time (prices, recommendations, actions, …)
• Validation: backtest with historical data
• Deployment: code running in real time
Validation for production
• Refresh data mart
• Rebuild model, withholding a validation set
  • Random sampling
• Create accuracy measure
  • ROC, positive rate, false positive rate
• Measure and monitor
[Diagram: Historical Data → Scoring Rules → Scores, compared with Real Outcomes]
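A backtest of this kind can be sketched as below. The code is a minimal Python illustration, assuming binary outcomes and numeric scores; the function names and the sample scores are hypothetical, and the ROC summary is the standard pairwise definition of area under the curve rather than anything specific to the talk.

```python
import random

def split_holdout(rows, fraction, seed):
    """Randomly withhold a validation set from the modeling data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]   # (training, validation)

def rates(scores, outcomes, threshold):
    """True positive rate and false positive rate at one score threshold."""
    tp = sum(s >= threshold and o == 1 for s, o in zip(scores, outcomes))
    fp = sum(s >= threshold and o == 0 for s, o in zip(scores, outcomes))
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    return tp / pos, fp / neg

def auc(scores, outcomes):
    """Area under the ROC curve: chance that a positive outscores a negative."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical backtest: scores from the scoring rules vs. real outcomes.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
outcomes = [1, 1, 0, 1, 0, 0]
tpr, fpr = rates(scores, outcomes, threshold=0.5)
print(tpr, fpr, auc(scores, outcomes))
```

Fixing the random seed keeps the withheld set reproducible between runs, which makes accuracy figures comparable when they are monitored over time.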
Deployment with R code
• Scoring rules captured as “R model objects”
• Move R code directly from development environment to production server
• No recoding: lowest cost, greatest speed & reliability
• Most flexibility (not limited in model choice)
• Easy to validate with custom accuracy measures
[Diagram: Historical Data + Outcomes → R Model Object; New Data → R Model Object → Scores]
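The "no recoding" pattern can be illustrated with a small stand-in. The sketch below uses Python and JSON rather than R model objects, purely to show the shape of the workflow: fit once in development, serialize the fitted object, then load and score in production with the same scoring function. All names and data are hypothetical.

```python
import json

def fit(history):
    """Development: least-squares line through (x, outcome) pairs."""
    n = len(history)
    mx = sum(x for x, _ in history) / n
    my = sum(y for _, y in history) / n
    slope = (sum((x - mx) * (y - my) for x, y in history)
             / sum((x - mx) ** 2 for x, _ in history))
    return {"kind": "linear", "intercept": my - slope * mx, "slope": slope}

def predict(model, x):
    """Shared scoring code: identical in development and production."""
    return model["intercept"] + model["slope"] * x

# Development environment: fit on historical data, then serialize the object.
history = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0), (4.0, 18.0)]
serialized = json.dumps(fit(history))

# Production server: load the object and score new data with the same code.
deployed = json.loads(serialized)
print(predict(deployed, 5.0))
```

Since `predict` is never rewritten for production, any accuracy check used during development can be run unchanged against the deployed object.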
Managing and Scaling Deployed R Code
Revolution R Enterprise includes RevoDeployR
• Code management & tracking
• Security & resource allocation
• Scalability for real-time demands
• Web Services API
  • Scoring and dynamic analytics
  • Data visualization, custom reporting
  • Ad-hoc analytics
  • Interactive data apps
Code Conversion for Scoring
• Business rules (e.g. IBM ILOG)
  • Create directly with R code from model object
• Recode in other languages
  • SQL
  • C++, etc.
• Low latency
• Slow and costly deployment
• Potential for errors
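As an illustration of recoding a model into SQL, the sketch below renders the coefficients of a logistic model as a SQL expression and then checks the SQL result against the same formula computed directly. It is a Python sketch under stated assumptions: the coefficients, the `prospects` table and its column names are hypothetical, and SQLite is used only as a convenient in-memory database (an `EXP` function is registered explicitly because not every SQLite build provides one).

```python
import math
import sqlite3

def logistic_to_sql(intercept, coefficients, table):
    """Render a logistic scoring rule as a SQL SELECT over named columns."""
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"{float(weight)!r} * {column}")
    linear = " + ".join(terms)
    return (f"SELECT customer_id, 1.0 / (1.0 + EXP(-({linear}))) AS score "
            f"FROM {table}")

def score_python(intercept, coefficients, row):
    """Reference implementation of the same scoring rule."""
    z = intercept + sum(w * row[c] for c, w in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients.
intercept = -1.5
coefficients = {"recent_clicks": 0.8, "days_since_order": -0.05}
sql = logistic_to_sql(intercept, coefficients, table="prospects")

conn = sqlite3.connect(":memory:")
conn.create_function("EXP", 1, math.exp)  # make sure EXP is available
conn.execute("CREATE TABLE prospects "
             "(customer_id INTEGER, recent_clicks REAL, days_since_order REAL)")
conn.execute("INSERT INTO prospects VALUES (1, 3.0, 10.0)")
customer_id, sql_score = conn.execute(sql).fetchone()
py_score = score_python(intercept, coefficients,
                        {"recent_clicks": 3.0, "days_since_order": 10.0})
print(sql)
print(sql_score, py_score)
```

Generating the SQL from the fitted coefficients, and cross-checking it against a reference score, is one way to reduce the recoding errors that hand translation invites.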
Scoring via PMML
• Some predictive models can be expressed in the PMML standard (www.dmg.org)
  • Not all, but growing all the time
• Easily generate PMML with R “pmml” package
• PMML Scoring Engine: Zementis Adapa
• Databases and appliances may support PMML
  • But supported standards vary
[Diagram: R Model Object → pmml → PMML; Data → PMML → Scores]
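To give a feel for what a PMML document carries, the sketch below writes a linear model as an abbreviated PMML `RegressionModel` and reads it back with a toy scorer. This is a hand-rolled Python illustration, not the output of the R "pmml" package or the behavior of any real scoring engine; the document is deliberately minimal and has not been validated against the DMG schema, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_1"

def linear_model_to_pmml(intercept, coefficients, target):
    """Build an abbreviated PMML RegressionModel document as a string."""
    pmml = ET.Element("PMML", version="4.1", xmlns=PMML_NS)
    ET.SubElement(pmml, "Header", description="illustrative linear model")
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(schema, "MiningField", name=target, usageType="predicted")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, weight in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name,
                      coefficient=str(weight))
    return ET.tostring(pmml, encoding="unicode")

def score_from_pmml(document, inputs):
    """Toy scoring engine: read the regression table back and apply it."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(document).find(".//p:RegressionTable", ns)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", ns):
        total += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return total

document = linear_model_to_pmml(10.0, {"x": 2.0}, target="y")
print(document)
print(score_from_pmml(document, {"x": 5.0}))
```

The point of the format is that the producer and the scorer share only the document, so the model can be fitted in one tool and scored in another.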
Phase 4: Real-Time Scoring
Scoring triggered by Decision Layer, brokered by Integration Layer
• R code
  • Revolution R Enterprise Server
  • In-appliance (e.g. IBM PureData System for Analytics)
• SQL
  • In-database
• Rules
• Compiled code
  • Bespoke engines
• May be using hardware from data layer
  • But not the actual data
Real-Time Scoring: review
[Diagram labels: SQL, iLog, Near Real Time]
Phase 5: Model Refresh
• Deployed scoring rules no longer connected to Data Layer or Data Mart
  • Enables real-time performance
  • Need to refresh using recent data
• Model refresh process
  • Use data extract script
  • Re-run model script
  • Statistical review OR automated validation
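The refresh process with automated validation can be sketched as a single function that re-runs the extract and model steps and deploys the new model only if it clears an accuracy bar. This is a minimal Python illustration; `refresh`, the stand-in `extract`, `fit` and `accuracy` functions, the data and the threshold are all hypothetical.

```python
def refresh(extract, fit, accuracy, deployed, min_accuracy):
    """Re-run extract and model steps; deploy only if validation passes."""
    training, validation = extract()
    candidate = fit(training)
    score = accuracy(candidate, validation)
    if score >= min_accuracy:
        return candidate, score, True    # automated validation passed: deploy
    return deployed, score, False        # keep current model, flag for review

def extract():
    """Stand-in for the data extract script: (x, outcome) pairs."""
    training = [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]
    validation = [(1.5, 0), (6.5, 1), (3.0, 0), (5.5, 1)]
    return training, validation

def fit(rows):
    """Stand-in for the model script: cutoff midway between class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return {"cutoff": (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2}

def accuracy(model, rows):
    """Fraction of validation rows classified correctly."""
    return sum((x >= model["cutoff"]) == bool(y) for x, y in rows) / len(rows)

current = {"cutoff": 0.0}
model, score, was_deployed = refresh(extract, fit, accuracy,
                                     deployed=current, min_accuracy=0.9)
print(model, score, was_deployed)
```

When the candidate fails the check, the function leaves the existing model in place, which is the point at which a statistical review would take over from the automated path.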
Scheduled Batch Updates
• Periodic model refresh
  • Weekly
  • Daily
  • Hourly
• Automated validate/deploy process
[Diagram: Excel; RevoDeployR Server; Web Services API; Scheduler; Revolution R Enterprise Production Server Cluster]
Re-developing the model
Even then, the underlying structure of the data may change:
• Important variables become non-significant
• Non-significant variables become important!
• New data sources become available
Model accuracy drift is a warning sign
• Return to Phase 2 or Phase 1
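One simple way to watch for accuracy drift is to compare a rolling accuracy over recent predictions against the accuracy measured at validation time. The Python sketch below assumes outcomes arrive after the fact and can be matched to earlier predictions; the class name `DriftMonitor`, the baseline, the tolerance and the window size are all hypothetical choices.

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy of deployed scores against real outcomes."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline      # accuracy measured at validation time
        self.tolerance = tolerance    # allowed drop before raising a warning
        self.recent = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether the latest prediction matched the real outcome."""
        self.recent.append(predicted == actual)

    def accuracy(self):
        return sum(self.recent) / len(self.recent)

    def drifted(self):
        """True once a full window falls more than `tolerance` below baseline."""
        full = len(self.recent) == self.recent.maxlen
        return full and self.accuracy() < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.9, tolerance=0.1, window=5)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.drifted())   # window is full and every prediction matched

for predicted, actual in [(1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.drifted())
```

A warning from a monitor like this says only that accuracy has fallen; deciding whether to re-fit the same model or revisit the features and data sources remains a modeling decision.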
How big is ‘Big Data’?
[Scale diagram: megabytes; gigabytes, terabytes; petabytes, exabytes. Flow rates: kilobytes / second, megabytes / second]
How fast is ‘Real Time’?
[Scale diagram: weeks, days; hours, minutes; seconds or less; milliseconds. Refresh labels: daily, hourly]
Recommendations
• Architect your Predictive Analytics Stack
  • Best of breed vs single-vendor stack
  • Diversify data sources and decision apps
• R = Predictive Analytics
  • Revolution R Enterprise provides:
    • Analytics Layer: big data, performance
    • Integration Layer: scalability, reliability
• Get Help to Get Started
  • Consider an SI partner for architecture, best practices and implementation advice
Resources
• Revolution R Enterprise: R for Big Data: www.revolutionanalytics.com/products
• Revolution Analytics Consulting Services: www.revolutionanalytics.com/services
• Contact us: bit.ly/hey-revo
• RHadoop: Connecting R and Hadoop: bit.ly/r-hadoop
• Contact David Smith: [email protected], @revodavid, blog.revolutionanalytics.com
Thank you.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.