architectural elements enabling r for big data · transparent data access, analysis, and...

25
Copyright © 2017 Oracle and/or its affiliates. All rights reserved. | Architectural Elements Enabling R for Big Data Mark Hornick Director, Advanced Analytics and Machine Learning [email protected] @MarkHornick blogs.oracle.com/R July 2017

Upload: others

Post on 09-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Architectural Elements Enabling R for Big Data Mark Hornick Director, Advanced Analytics and Machine Learning [email protected] @MarkHornick blogs.oracle.com/R July 2017

Page 2: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

2

Page 3: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. | 3

https://en.wikipedia.org/wiki/Big_data

https://www.google.com/search?q=big+data

Volume

Variety

Velocity Variability

Veracity

Value

Page 4: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. | 4

Transparent Data Access, Analysis, and Exploration

Scalable Machine Learning

Data & Task Parallelism

Production Deployment

Architectural elements supporting…

Page 5: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Transparent data access, analysis, and exploration

5

All Processing

Light local Processing

Maintain language features and interface – translate R invocations Offload processing to more powerful machines and environments using data.frame proxies Avoid data movement

RDBMS SQL

Hadoop HiveQL Spark

Java, Python, etc

Main Processing

Main Processing

Main Processing

Data Source

OAA/ORE, dplyr ORAAH, sparklyr ORAAH

Data must fit in RAM with enough RAM left over to do work Moving Big Data takes time

Page 6: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Proxy objects for Big Data

6

data.frame

Proxy data.frame

Inherits from

Page 7: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Transparency Examples

7

library(ORE)

ore.connect("rquser", "orcl", "localhost",

"rquser", all=TRUE)

ore.ls()

# lists ONTIME_S, which is ore.frame

df <- with(ONTIME_S,

ONTIME_S[DEST=="SFO"|DEST=="BOS",1:21])

df$LRGDELAY <- ifelse(df$ARRDELAY>20,1,0)

head(df)

summary(df)

hist(MY_TABLE$ARRDELAY,breaks=100)

merge (TEST_DF1, TEST_DF2,

by.x="x1", by.y="x2")

# OREdplyr in ORE 1.5.1

select(FLIGHTS, year, month, dep_delay)

rename(FLIGHTS, tail_num = tailnum)

filter(FLIGHTS, month == 1, day == 1)

arrange(FLIGHTS, year, month, day)

mutate(FLIGHTS, speed=air_time/distance)

# OAAgraph with ORE 1.5.1

graph <- oaa.graph(EDGES1,NODES1)

betweenness(graph, name="betweenness")

pagerank(graph, 0.0001, 0.85, 100,

name="PR")

oaa.create(graph, nodeTableName="NODE2",

nodeProperties = c("PR"))

head(NODE2)

ore.frame

Page 8: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Scalable Machine Learning

8

All Processing

Data .Rdata

Light local Processing

Offload model build and score to parallel algorithms and powerful machines Maintain R ML interface using formula with transformations, interaction terms, etc. Proxy objects to reference data

RDBMS Hadoop Spark

Main Processing

Main Processing

Main Processing

PGX

Main Processing

OAA/ORE ORAAH,sparklyr OAAgraph ORAAH

Page 9: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Linear Model Performance Comparison

9

Algorithm Threads

Used*

Memory

required**

Time for Data

Loading***

Time for

Computation Total

Relative

Performance

Open-Source R Linear Model (lm) 1 220Gb 1h3min 43min 1h46min 1x

Oracle R Enterprise lm (ore.lm) 1 - - 42.8min 42.8min 2.47X

Oracle R Enterprise lm (ore.lm) 32 - - 1min34s 1min34s 67.7X

Oracle R Enterprise lm (ore.lm) 64 - - 57.97s 57.97s 110X

Oracle R Enterprise lm (ore.lm) 128 - - 41.69s 41.69s 153X

• Predict “Total Revenue” of a customer based on 31 numeric variables as predictors,

on 184 million records using SPARC T5-8, 4TB of RAM

• Data in an Oracle Database table

* Open-source R lm() is single threaded

** Data moved into the R memory *** Time to load 40Gb of raw data into the open-source R memory

Page 10: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Linear Model Performance Comparison

10

Algorithm Threads

Used*

Memory

required**

Time for Data

Loading***

Time for

Computation Total

Relative

Performance

Open-Source R Linear Model (lm) 1 220Gb 1h3min 43min 1h46min 1x

Oracle R Enterprise lm (ore.lm) 1 - - 42.8min 42.8min 2.47X

Oracle R Enterprise lm (ore.lm) 32 - - 1min34s 1min34s 67.7X

Oracle R Enterprise lm (ore.lm) 64 - - 57.97s 57.97s 110X

Oracle R Enterprise lm (ore.lm) 128 - - 41.69s 41.69s 153X

• Predict “Total Revenue” of a customer based on 31 numeric variables as predictors,

on 184 million records using SPARC T5-8, 4TB of RAM

• Data in an Oracle Database table

* Open-source R lm() is single threaded

** Data moved into the R memory *** Time to load 40Gb of raw data into the open-source R memory

Page 11: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. | 11

Not all parallel implementations are the same Compare performance with varying Spark memory footprints

Benchmark on single X5-2 Node with 74 threads and 256 GB of Total RAM, Spark 1.6.0 on CDH 5.8.0

15GB ”Ontime” airline dataset with 123million records, predicting 8,926 total coefficients

30 63 65 67 86 112

160 226

DNF 999+

WNR -

100

200

300

400

500

600

700

800

900

1,000

48 24 12 8 4

GLM in Spark Performance

ORAAH GLM

Spark Mllib GLM

21 21 24 28 29

126 138

264

908

WNR -

100

200

300

400

500

600

700

800

900

1,000

48 24 12 8 4

LM in Spark Performance

ORAAH LM

Spark MLlib LM

Spark Context memory in GB Spark Context memory in GB

Tim

e in

Sec

on

ds

Tim

e in

Sec

on

ds

Page 12: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Data and Task Parallelism

12

All Processing

Data .Rdata

Store UDF & Invoke

Execute user-defined R function on back-end servers Data-parallel and task-parallel execution Automated management of parallel R engines with load balancing Auto-partition and load data by value or count Still use CRAN packages in UDF Client optional – one less failure point

Spawn & control R engines Provide function and data

R Script Repository

Parallel multicore snow, snowFT …

Hand-code spawning of R engines and feed data

to R engines

Moving Big Data is slow or doesn’t fit in RAM

R Object Repository

… …

OAA/ORE

Page 13: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Oracle R Enterprise Embedded R Execution

13

ore.tableApply(

IRIS_TABLE, # ore.frame

function(dat,datastore) { # user-defined function

library(e1071)

dat$Species <- as.factor(dat$Species)

mod<-naiveBayes(Species ~ ., dat)

ore.save(mod,name=datastore) # save in database

},

datastore="NB_Model-1") # pass name of datastore

scoreNBmodel <- function(dat, datastore) {

library(e1071)

ore.load(datastore)

dat$PRED <- predict(mod, newdata = dat)

dat

}

IRIS_PRED <- IRIS_TABLE[1,]

IRIS_PRED$PRED <- "A" # define form of result data.frame

res <- ore.rowApply(IRIS_TABLE, scoreNBmodel, datastore = "NB_Model-1",

parallel=4, FUN.VALUE=IRIS_PRED, rows=10)

class(res) # ore.frame

ore.create(res, table="IRIS_SCORES") # materialize scores

table(IRIS_SCORES$Species, IRIS_SCORES$PRED) # compute confusion matrix

No parallelism

Data parallel by chunk

Page 14: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Oracle R Enterprise Embedded R Execution

14

DAT <- ONTIME_S[ONTIME_S$DEST %in%

c("BOS","SFO","LAX","ORD","ATL","PHX","DEN"),]

modList <- ore.groupApply(

X=DAT, INDEX=DAT$DEST, parallel=3,

function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat))

Data parallel by partition

Page 15: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Deployment

15

Use SQL to invoke R from enterprise applications Seamlessly return data.frames, images, XML as database tables via SQL Schedule R function execution Facilitate hand-off between Data Scientists and App Developers

Spawn R Engines

Enterprise Application

C, C++, Java,

python

Data Stores

rJava API RServe rpy2 system() cron …

Enterprise Application

C, C++, Java,

python

SQL

Results data.frame

images R objects

Datastore

R Script Repository

Invoke R easily from non-R environments

OAA/ORE

Page 16: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Deploy R using SQL • Store named R function in Script Repository

from R or SQL

• Return values

– Images as PNG BLOB column

– data.frame content as database table

– XML with data.frame and images

• Invoke same UDF from R or SQL

16

begin

sys.rqScriptDrop('RandomRedDots');

sys.rqScriptCreate('RandomRedDots',

'function(){

id <- 1:10

plot(1:100, rnorm(100), pch = 21,

bg = "red", cex = 2, main="Random Red Dots")

data.frame(id=id, val=id / 100)

}');

end;

select ID, IMAGE

from table(rqEval( NULL,'PNG','RandomRedDots'));

select id, val

from table(rqEval( NULL,'select 1 id, 1 val from dual',

'RandomRedDots'));

-- Return structured and image content within XML string

select *

from table(rqEval(NULL, 'XML', 'RandomRedDots'));

# In R, invoke same function by name

ore.doEval(FUN.NAME='RandomRedDots')

Page 17: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Deploy R using SQL

17

begin

sys.rqScriptDrop('RandomRedDots');

sys.rqScriptCreate('RandomRedDots',

'function(){

id <- 1:10

plot(1:100, rnorm(100), pch = 21,

bg = "red", cex = 2, main="Random Red Dots")

data.frame(id=id, val=id / 100)

}');

end;

select ID, IMAGE

from table(rqEval( NULL,'PNG','RandomRedDots'));

select id, val

from table(rqEval( NULL,'select 1 id, 1 val from dual',

'RandomRedDots'));

-- Return structured and image content within XML string

select *

from table(rqEval(NULL, 'XML', 'RandomRedDots'));

# In R, invoke same function by name

ore.doEval(FUN.NAME='RandomRedDots')

Page 18: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Architectural Elements: Enabling R for Big Data • Leverage powerful back-ends for the

heavy lifting…transparently from R

• Use parallel, distributed ML and graph algorithms

• Enable parallelism quickly and easily for big data processing

• Immediately leverage data scientist R scripts and results in production environments

18

Transparent Data Access, Analysis, and Exploration

Scalable Machine Learning

Data & Task Parallelism

Production Deployment

Page 19: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Oracle R Enterprise Component of Oracle Advanced Analytics option to Oracle Database, Oracle Exadata Cloud Service, Oracle Database Cloud Service

• Use Oracle Database as HPC environment

• Use in-database parallel and distributed machine learning algorithms

• Manage R scripts and R objects in Oracle Database

• Integrate R results into applications and dashboards via SQL

19

Oracle Database User tables

In-db stats

Database Server Machine

SQL Interfaces SQL*Plus, SQLDeveloper, …

R Client

Oracle R Enterprise

Page 20: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Oracle Advanced Analytics In-Database Machine Learning Algorithms*— SQL, R and UI Access

Decision Tree Logistic Regression (GLM) Naïve Bayes Support Vector Machine (SVM) Random Forest

Multiple Regression (GLM) Support Vector Machine (SVM) Linear Model Generalized Linear Model Multi-Layer Neural Networks Stepwise Linear Regression

Minimum Description Length Unsupervised pair-wise KL div.

Hierarchical k-Means Orthogonal Partitioning Clustering Expectation-Maximization

Nonnegative Matrix Factorization Principal Component Analysis Singular Value Decomposition Explicit Semantic Analysis

Apriori – Association Rules

1 Class Support Vector Machine

Single & Double Exp. Smoothing

Classification Clustering

Clustering Regression Anomaly Detection Feature Extraction

Ability to run any R package via Embedded R Execution

* supports partitioned models, text mining

20

Regression

Attribute Importance

Feature Extraction & Creation

Anomaly Detection

Time Series

Market Basket Analysis

Predictive Queries

Open Source R Algorithms

Page 21: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Oracle Spatial and Graph: PGX graph functions For use with OAAgraph integrated with Oracle R Enterprise

Detecting Components and Communities

Tarjan’s, Kosaraju’s, Weakly Connected Components, Label Propagation (w/ variants), Soman and Narang’s

Ranking and Walking

Pagerank, Personalized Pagerank, Betweenness Centrality (w/ variants), Closeness Centrality, Degree Centrality, Eigenvector Centrality, HITS, Random walking and sampling (w/ variants)

Evaluating Community Structures

∑ ∑

Conductance, Modularity Clustering Coefficient (Triangle Counting) Adamic-Adar

Path-Finding

Hop-Distance (BFS) Dijkstra’s, Bi-directional Dijkstra’s Bellman-Ford’s

Link Prediction SALSA (Twitter’s Who-to-follow)

Other Classics Vertex Cover Minimum Spanning-Tree(Prim’s)

21

Page 22: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Cluster with Oracle R Advanced Analytics for Hadoop

Oracle R Advanced Analytics for Hadoop Component of Oracle Big Data Connectors, Oracle Big Data Cloud Service

R Client

SQL Client

HQL

Oracle Database with Advanced Analytics option

R interface to HQL Basic Statistics, Data Prep, Joins and View creation

Parallel, distributed algorithms with

Spark Caching and Spark MLlib algs

Use of Open-source R packages within custom R Mappers / Reducers

R Analytics Oracle R Advanced Analytics

for Hadoop

SQL Developer Other SQL Apps

Page 23: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

• Logistic Regression ORAAH

Regression Classification

• Neural Networks ORAAH

Attribute Importance

• Principal Component Analysis

Clustering

• Hierarchical k-Means

Feature Extraction

• Non-negative Matrix Factorization

• Collaborative Filtering (LMF)

23

Basic Statistics

• Correlation/Covariance

• GLM ORAAH

• Hierarchical k-Means • Gaussian Mixture Models

• Principal Component Analysis

• LASSO • Ridge Regression • Support Vector Machine • Random Forest • Linear Regression

Machine Learning algorithms

Oracle R Advanced Analytics for Hadoop 2.7.0

• Logistic Regression • Random Forests • Decision Trees • Support Vector Machine

• SVD

Page 24: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction

Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |

Learn More about Oracle’s Advanced Analytics R Technologies…

24

http://oracle.com/goto/R

Page 25: Architectural Elements Enabling R for Big Data · Transparent data access, analysis, and exploration 5 All Processing ... In-Database Machine Learning Algorithms* ... Feature Extraction