real-time big data analytics: from deployment to production

40
Revolution Confidential David M Smith @ revodavid VP Marketing and Community Revolution Analytics R eal-Time Predictive Analytics with B ig Data from deployment to production

Upload: revolution-analytics

Post on 26-Jan-2015

111 views

Category:

Documents


2 download

DESCRIPTION

As the Big Data market has evolved, the focus has shifted from data operations (storage, access and processing of data) to data science (understanding, analyzing and forecasting from data). And as new models are developed, organizations need a process for deploying analytics from research into the production environment. In this talk, we'll describe the five stages of real-time analytics deployment: Data distillation Model development Model validation and deployment Model refresh Real-time model scoring We'll review the technologies supporting each stage, and how Revolution Analytics software works with the entire analytics stack to bring Big Data analytics to real-time production environments.

TRANSCRIPT

Page 1: Real-time Big Data Analytics: From Deployment to Production

Revolution Confidential

David M S mith @ revodavidV P Marketing and C ommunityR evolution Analytic s

R eal-Time P redictive A nalytics with B ig Datafrom deployment to produc tion

Page 2: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialToday we’ll dis c us s :

What is “real-time predictive analytics with big data?” Case study: UpStream Software

Real-time big-data stack Five phases of deployment to production Just what do we mean by “Big Data” and

“Real-Time”, anyway? Recommendations Q&A

2

Page 3: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR eal-Time P redictive A nalytics with B ig Data

Real Time Milliseconds? Seconds? Hours? Days? In production, continuously updated

Big Data Size, flow rate, diversity Conflict with ‘real time’

Predictive Analytics Rear view: description, aggregation, tabulation Prediction, inference, statistical modeling

3

Page 4: Real-time Big Data Analytics: From Deployment to Production

Revolution Confidential

C as e S tudy

www.upstreamsoftware.com

From: “How Big Data is Changing Retail Marketing Analytics”

http://bit.ly/upstream-webinar

“Given that our data sets are already in the terabytes and are growing

rapidly, we depend on Revolution R Enterprise’s scalability and fast

performance — we saw about a 4x performance improvement on 50

million records. It works brilliantly.”

John Wallace, CEO Upstream Software

Page 5: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialUpS tream S oftware

5

UpStream’s Big Data Analytics engine analyzes and optimizes marketing mix for UpStream’s retailer clients Customer analytics and segmentation Revenue/ action attribution among channels and

marketing programs Customized Next Best Action per individual

prospect or customer Event-Triggered Marketing: responses to

consumer actions

Page 6: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialB ig Data

Demographics: consumer, product, market Actions: web clicks, email clicks, mobile app usage,

call center logs, social, search … Outcomes: impressions, touches, orders (retail, online,

mobile)

6

Page 7: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP redic tive A nalytic s

• ETL• N marketing channels• Behavioral variables• Promotional data• Overlay data

• Time-to-event models• Exploratory data analysis• GAM survival models

• Scoring for inference• Scoring for prediction

• 5 billion scores per day per customer

UPSTREAM DATA FORMAT

CUSTOM VARIABLES (PMML)

Page 8: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR eal T ime

http://www.infoworld.com/d/big-data/williams-sonoma-uses-big-data-zero-in-customers-206641November 8, 2012 8

Page 9: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP redic tive vs Des c riptive A nalytic s

Retail sales are influenced by stores near the home

Newer customers are more sensitive to marketing

Google keywords perform worse than you think

Display advertising performs better than you think

Online sales benefit more from seasonal events than retail stores

9

Page 10: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR eal-time B ig DataP redic tive A nalytic s S tac k

10

Page 11: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialData L ayer

11

Structured data RDBMS, NoSQL, Hbase, Impala

Unstructured data Hadoop MapReduce

Streaming data Web, social, sensors, operational systems

Some descriptive analytics done here

Page 12: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialA nalytic s layer

Predictive analytics technology Development environment: build models Production environment: Deploy real-time scoring Engine for dynamic analytics

Local data mart Static, periodically updated from data layer Improves performance

12

Page 13: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialIntegration L ayer

Connective tissue between end-user applications and analytics engine Engines for flow control, real-time scoring Rules engine / CEP engine

API for Dynamic Analytics Brokers communication between app developers

and data scientists

13

Page 14: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialDec is ion L ayer

End-user interface to analytics system Customers Operations Business analysts C-suite

Familiar & simple interfaces Supports a range of end-user technologies Level of UI complexity dictated by need

14

Page 15: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR eal-Time Deployment

Five phases of deploying real-time predictive analytics with big data to production:

1. Data Distillation2. Model Development3. Validation and Deployment4. Real-time Scoring5. Model refresh

16

Page 16: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP has e 1: Data Dis tillation

The data in the Data Layer isn’t yet ready for predictive modeling. We need to: Extract features from unstructured text Topic modeling, sentiment analysis

Combine disparate data sources Filter for populations of interest Select relevant features and outcomes for

modeling Export to the data mart for predictive modeling

17

Page 17: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialE xample: Data Dis tillation in Hadoop

18

UnstructuredData

Analytics Data Mart

Structured Data

Log Files

Event Streams

Scraped Text

HDFS Load Map-Reduce

Revolution R Enterprise for Hadoop: bit.ly/r-hadoop

rmr

Page 18: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialA utomating the E xtrac tion P roc es s This process needs to be repeatable and

maintainable Create a re-usable R script: ASCII: rxDataStep SQL: rxImport, ROracle, RMySQL NoSQL: Rcassandra, rmongodb Hadoop: rmr, rhbase, Rhive Appliances: nza, teradataR, PL/R Feeds: rjson, XML, RCurl, twitteR, python

R script runs from Analytics Layer Processing: in Data Layer Output: structured file for Data Mart

19

Page 19: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialData Dis tillation P roc es s

20

XDF

Page 20: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP has e 2: Model development

Goal: create a predictive model that is Powerful Robust Comprehensible Implementable

Key requirements for Data Scientists: Flexibility Productivity Speed Reproducibility

21

Page 21: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialT he Model Development C yc le

22

Variable Transformation

Model Estimation

Model Refinement

Model Comparison / Benchmarking

Feature SelectionSampling

AggregationData Mart Predictive

Model

Download the White PaperR is Hotbit.ly/r-is-hot

Page 22: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialTec hnology c ons iderations

Moving Big Data is slow So just move it once, to data mart near the development

and production environments Static data needed for modeling cycle But have a plan to refresh and update

Data layer not optimized for predictive analytics But good for descriptive (data distillation)

Revolution R Enterprise optimizations for the analytics layer R; XDF file format; Parallel External Memory Algorithms

23

Page 23: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP redictive Modeling A lgorithms

Algorithm Example Applications Big Data

Data Step ETL, data distillation, record/variable selection, variable transformation

Descriptive Statistics Exploratory Data Analysis, Data Validation

Tables & Cubes Reporting, contingency analysis

Correlation / Covariance Factor Analysis, Value at Risk

Linear regression Forecasting, Net present value estimation

Logistic Regression Response modeling, offer selection

Generalized LinearModels Capital reserve estimation, climate modeling

K-means clustering Customer Segmentation

Decision Trees Dynamic pricing, classification, variable importance

Model Prediction Real-time Scoring (decisions, offers, actions)

R CRAN packages Everything else

Parallel & distributed computing with R

Simulations, By-Group analysis, ensemble models, custom applications

24

Page 24: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialHigh performance with dis tributed c lus ters

25

Data Scientists / Modelers Grid computing cluster

Revolution R EnterprisePlatform LSFMicrosoft HPC ServerMicrosoft AzureAd-hoc grids (SNOW)Shared SMP serverHadoop / HDFS

Page 25: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP has e 3: Model Validation and Deployment

Scoring rules map parameters to outputs Parameters: information known at real time prospect ID, product, webpage, …

Scores: values inferred in real time prices, recommendations, actions, …

Validation: backtest with historical data Deployment: code running in real time

26

Page 26: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialValidation for produc tion

Refresh data mart Rebuild model, withholding a validation set Random sampling

Create accuracy measure ROC, positive rate, false positive rate

Measure and monitor

27

His

toric

al

Dat

a

Sco

res

Scoring Rules

Rea

l Out

com

es

Page 27: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialDeployment with R c ode

Scoring rules captured as “R model objects” Move R code directly from development

environment to production server No recoding: Lowest cost, greatest speed &

reliability Most flexibility (not limited in model choice) Easy to validate with custom accuracy measures

28

New

D

ata

Sco

resR Model

Object

His

toric

al

Dat

a

Out

com

es R Model Object

Page 28: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialManaging and S caling Deployed R C ode

29

Revolution R Enterprise includes RevoDeployR Code management & tracking Security & resource allocation Scalability for real-time demands Web Services API

Scoring and dynamic analytics Data visualization custom reporting Ad-hoc analytics Interactive data apps

Page 29: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialC ode C onvers ion for S c oring

Business rules (e.g. IBM ILOG) Create directly with R code from model object

Recode in other languages SQL C++, etc.

Low latency Slow and costly

deployment Potential for errors

30

Page 30: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialS c oring via P MML Some predictive models can be expressed in

the PMML standard (www.dmg.org) Not all, but growing all the time

Easily generate PMML with R “pmml” package PMML Scoring Engine: Zementis Adapa Databases and appliances may support PMML But supported standards vary

31

Dat

a

Sco

res

PMMLPMMLR Model Object pmml

Page 31: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP has e 4: R eal-Time S c oring

Scoring triggered by Decision Layer, brokered by Integration Layer R code

Revolution R Enterprise Server In-appliance (e.g IBM PureData System for Analytics)

SQL In-database

Rules Compiled code

Bespoke engines

May be using hardware from data layer But not the actual data

32

Page 32: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR eal-TimeS c oring: review

33

SQL

iLog

Near Real Time

Page 33: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialP has e 5: Model R efres h

Deployed scoring rules no longer connected to Data Layer or Data Mart Enables real-time performance Need to refresh using recent data

Model refresh process Use data extract script Re-run model script Statistical review OR automated validation

34

Page 34: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialS c heduled B atc h Updates

Periodic model refresh Weekly Daily Hourly

Automated validate/deploy process

35

ExcelRevoDeployR Server

Web Services API

Scheduler

Revolution R EnterpriseProduction Server Cluster

Page 35: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR e-developing the model

Even then, the underlying structure of the data may change: Important variables become non-significant Non-significant variables become important! New data sources become available

Model accuracy drift is a warning sign Return to Phase 2 or Phase 1

36

Page 36: Real-time Big Data Analytics: From Deployment to Production

Revolution Confidential

How big is ‘B ig data’?

38

Petabytes Exabytes

Gigabytes Terabytes

Kilobytes / second

Megabytes / second

Megabytes

Page 37: Real-time Big Data Analytics: From Deployment to Production

Revolution Confidential

How fas t is ‘R eal time’?

39

Weeks Days

Hours Minutes

Seconds or less

Milliseconds

Daily Hourly

Daily Hourly

Page 38: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR ec ommendations

Architect your Predictive Analytics Stack Best of breed vs single-vendor stack Diversify data sources and decision apps

R = Predictive AnalyticsRevolution R Enterprise provides: Analytics Layer: big data, performance Integration Layer: scalability, reliability

Get Help to Get Started Consider an SI partner for architecture, best

practices and implementation advice

40

Page 39: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialR es ourc es Revolution R Enterprise : R for Big Data www.revolutionanalytics.com/products

Revolution Analytics Consulting Services www.revolutionanalytics.com/services Contact us: bit.ly/hey-revo

Rhadoop : Connecting R and Hadoop bit.ly/r-hadoop

Contact David Smith [email protected] @revodavid blog.revolutionanalytics.com

41

Page 40: Real-time Big Data Analytics: From Deployment to Production

Revolution ConfidentialT hank you.

42

www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.