Copyright © 2017 Oracle and/or its affiliates. All rights reserved. |
Architectural Elements Enabling R for Big Data
Mark Hornick
Director, Advanced Analytics and Machine Learning
@MarkHornick
blogs.oracle.com/R
July 2017
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
https://en.wikipedia.org/wiki/Big_data
https://www.google.com/search?q=big+data
Volume
Variety
Velocity
Variability
Veracity
Value
Architectural elements supporting…
• Transparent Data Access, Analysis, and Exploration
• Scalable Machine Learning
• Data & Task Parallelism
• Production Deployment
Transparent data access, analysis, and exploration
All processing in R:
• Data must fit in RAM with enough RAM left over to do work
• Moving Big Data takes time
Light local processing, main processing at the data source:
• Maintain language features and interface – translate R invocations
• Offload processing to more powerful machines and environments using data.frame proxies
• Avoid data movement
[Diagram labels: Data Source; RDBMS, SQL; Hadoop, HiveQL, Spark; Java, Python, etc.; R packages shown: OAA/ORE, dplyr; ORAAH, sparklyr; ORAAH]
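The idea of translating R invocations instead of moving data can be sketched in a few lines of base R. This is only an illustration, not how ORE, dplyr, or sparklyr are implemented: `proxy_table`, `filter_rows`, and `to_sql` are hypothetical names, and the operator mapping covers simple comparisons only.

```r
# Hypothetical proxy: holds a table name and pending filters, never the rows.
proxy_table <- function(name) {
  structure(list(table = name, where = character(0)), class = "sql_proxy")
}

# Record a filter instead of evaluating it; the R expression is kept as text.
filter_rows <- function(p, cond) {
  p$where <- c(p$where, deparse(substitute(cond)))
  p
}

# Translate the recorded operations into one SQL statement for the back end.
to_sql <- function(p) {
  sql <- paste("SELECT * FROM", p$table)
  if (length(p$where) > 0) {
    # Naive mapping from R syntax to SQL syntax (simple comparisons only)
    cond <- gsub("==", "=", p$where, fixed = TRUE)
    cond <- gsub("\"", "'", cond, fixed = TRUE)
    sql <- paste(sql, "WHERE", paste(cond, collapse = " AND "))
  }
  sql
}

flights <- proxy_table("ONTIME_S")
q <- filter_rows(flights, DEST == "SFO")
q <- filter_rows(q, ARRDELAY > 20)
sql <- to_sql(q)
print(sql)
```

Nothing is computed in R until the generated statement is sent to the data source, which is the point of the proxy approach: only the (usually small) result needs to travel back.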
Proxy objects for Big Data
Proxy data.frame inherits from data.frame
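A minimal base R sketch of a proxy that declares data.frame as its parent class, so that familiar calls dispatch to methods answering from the back end. The class name `proxy_frame` and the in-memory `backend` list are assumptions made for the demo; a real proxy would issue queries to a database instead.

```r
# Stand-in for a remote database: a named list of tables (demo assumption).
backend <- list(ONTIME_S = data.frame(DEST = c("SFO", "BOS", "LAX"),
                                      ARRDELAY = c(25, 5, 40)))

# Hypothetical proxy: carries only a table name and inherits from data.frame.
proxy_frame <- function(table) {
  structure(list(table = table), class = c("proxy_frame", "data.frame"))
}

# Methods answer from the back end, so the table is never copied into the proxy.
dim.proxy_frame <- function(x) dim(backend[[unclass(x)$table]])
head.proxy_frame <- function(x, n = 6L, ...) head(backend[[unclass(x)$table]], n)

p <- proxy_frame("ONTIME_S")
inherits(p, "data.frame")   # the proxy is accepted wherever a data.frame class is checked
nrow(p)                     # dispatches to dim.proxy_frame
first_rows <- head(p, 2)    # only two rows are materialized locally
first_rows
```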
Transparency Examples
library(ORE)
ore.connect("rquser", "orcl", "localhost",
            "rquser", all=TRUE)
ore.ls()
# lists ONTIME_S, which is an ore.frame
df <- with(ONTIME_S,
           ONTIME_S[DEST=="SFO"|DEST=="BOS",1:21])
df$LRGDELAY <- ifelse(df$ARRDELAY>20,1,0)
head(df)
summary(df)
hist(MY_TABLE$ARRDELAY,breaks=100)
merge(TEST_DF1, TEST_DF2,
      by.x="x1", by.y="x2")

# OREdplyr in ORE 1.5.1
select(FLIGHTS, year, month, dep_delay)
rename(FLIGHTS, tail_num = tailnum)
filter(FLIGHTS, month == 1, day == 1)
arrange(FLIGHTS, year, month, day)
mutate(FLIGHTS, speed=air_time/distance)

# OAAgraph with ORE 1.5.1
graph <- oaa.graph(EDGES1,NODES1)
betweenness(graph, name="betweenness")
pagerank(graph, 0.0001, 0.85, 100,
         name="PR")
oaa.create(graph, nodeTableName="NODE2",
           nodeProperties = c("PR"))
head(NODE2)
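For comparison, the same filtering and derived-column syntax runs unchanged on an ordinary in-memory data.frame. The small `ontime` table below is a made-up stand-in for ONTIME_S, used only so the snippet runs without a database.

```r
# Small local stand-in for the ONTIME_S table (made-up values)
ontime <- data.frame(DEST = c("SFO", "BOS", "LAX", "SFO"),
                     ARRDELAY = c(35, 10, 50, -5),
                     stringsAsFactors = FALSE)

# Same expressions as in the ore.frame example, here evaluated in local memory
df <- with(ontime, ontime[DEST == "SFO" | DEST == "BOS", 1:2])
df$LRGDELAY <- ifelse(df$ARRDELAY > 20, 1, 0)
head(df)
```

With a proxy object in place of `ontime`, the code stays the same and only the place where the work happens changes.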
Scalable Machine Learning
All processing in R:
• Data: .Rdata
Light local processing, main processing on the back end:
• Offload model build and score to parallel algorithms and powerful machines
• Maintain R ML interface using formula with transformations, interaction terms, etc.
• Proxy objects to reference data
[Diagram labels: RDBMS, Hadoop, Spark, PGX; R packages shown: OAA/ORE; ORAAH, sparklyr; OAAgraph; ORAAH]
Linear Model Performance Comparison
Algorithm                         Threads used*  Memory required**  Time for data loading***  Time for computation  Total    Relative performance
Open-Source R Linear Model (lm)   1              220 GB             1h3min                    43min                 1h46min  1x
Oracle R Enterprise lm (ore.lm)   1              -                  -                         42.8min               42.8min  2.47x
Oracle R Enterprise lm (ore.lm)   32             -                  -                         1min34s               1min34s  67.7x
Oracle R Enterprise lm (ore.lm)   64             -                  -                         57.97s                57.97s   110x
Oracle R Enterprise lm (ore.lm)   128            -                  -                         41.69s                41.69s   153x
• Predict “Total Revenue” of a customer based on 31 numeric variables as predictors, on 184 million records using SPARC T5-8, 4 TB of RAM
• Data in an Oracle Database table
* Open-source R lm() is single threaded
** Data moved into the R memory
*** Time to load 40 GB of raw data into the open-source R memory
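The relative-performance column can be reproduced from the total times in the table: each figure is the open-source lm() total divided by the ore.lm total for that thread count. A small worked calculation, using only the numbers quoted above:

```r
# Total times from the table, converted to seconds
baseline <- (1 * 60 + 46) * 60                 # open-source lm(): 1h46min
ore_total <- c("1"   = 42.8 * 60,              # ore.lm, 1 thread: 42.8min
               "32"  = 1 * 60 + 34,            # ore.lm, 32 threads: 1min34s
               "64"  = 57.97,                  # ore.lm, 64 threads
               "128" = 41.69)                  # ore.lm, 128 threads

# Speedup relative to open-source lm(), named by thread count
speedup <- baseline / ore_total
round(speedup, 2)
```

The results agree with the table's 2.47x, 67.7x, 110x, and 153x up to rounding. Note that the comparison includes the data-loading time for open-source R, which ore.lm avoids because the data stays in the database.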
Not all parallel implementations are the same
Compare performance with varying Spark memory footprints
• Benchmark on a single X5-2 node with 74 threads and 256 GB of total RAM, Spark 1.6.0 on CDH 5.8.0
• 15 GB “Ontime” airline dataset with 123 million records, predicting 8,926 total coefficients
[Chart: GLM in Spark Performance. Series: ORAAH GLM, Spark MLlib GLM. x-axis: Spark Context memory in GB (48, 24, 12, 8, 4). y-axis: Time in Seconds (up to 1,000). Data labels: 30, 63, 65, 67, 86, 112, 160, 226, DNF 999+, WNR]
[Chart: LM in Spark Performance. Series: ORAAH LM, Spark MLlib LM. x-axis: Spark Context memory in GB (48, 24, 12, 8, 4). y-axis: Time in Seconds (up to 1,000). Data labels: 21, 21, 24, 28, 29, 126, 138, 264, 908, WNR]
Data and Task Parallelism
All processing in R:
• Data: .Rdata
• Parallel packages: multicore, snow, snowFT, …
• Hand-code spawning of R engines and feeding data to R engines
• Moving Big Data is slow, or the data doesn’t fit in RAM
Store UDF & invoke; the back end spawns and controls R engines and provides function and data:
• Execute user-defined R function on back-end servers
• Data-parallel and task-parallel execution
• Automated management of parallel R engines with load balancing
• Auto-partition and load data by value or count
• Still use CRAN packages in UDF
• Client optional – one less failure point
[Diagram labels: R Script Repository, R Object Repository, OAA/ORE]
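For contrast, here is what the hand-coded approach looks like with R's bundled `parallel` package: the client spawns the R engines itself, ships the function and data to them, and cleans up afterwards. This is a task-parallel sketch on the built-in `mtcars` data; the function name `fit_degree` and the choice of two workers are illustrative assumptions.

```r
library(parallel)

# One task: fit a polynomial of degree k and report its R-squared.
fit_degree <- function(k, dat) {
  summary(lm(mpg ~ poly(wt, k), data = dat))$r.squared
}

# Spawn two local R engines by hand.
cl <- makeCluster(2)

# Feed function, task list, and data to the engines; collect the results.
r2 <- parSapply(cl, 1:4, fit_degree, dat = mtcars)

# The client is responsible for shutting the engines down.
stopCluster(cl)

round(r2, 3)
```

Everything here (engine lifetime, data shipping, failure handling) is the client's job, which is the burden the embedded execution model is meant to remove.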
Oracle R Enterprise Embedded R Execution
# No parallelism: build the model on the whole table
ore.tableApply(
  IRIS_TABLE,                                 # ore.frame
  function(dat,datastore) {                   # user-defined function
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    mod <- naiveBayes(Species ~ ., dat)
    ore.save(mod,name=datastore)              # save in database
  },
  datastore="NB_Model-1")                     # pass name of datastore

scoreNBmodel <- function(dat, datastore) {
  library(e1071)
  ore.load(datastore)
  dat$PRED <- predict(mod, newdata = dat)
  dat
}
IRIS_PRED <- IRIS_TABLE[1,]
IRIS_PRED$PRED <- "A"                         # define form of result data.frame

# Data parallel by chunk: score the table 10 rows at a time
res <- ore.rowApply(IRIS_TABLE, scoreNBmodel, datastore = "NB_Model-1",
                    parallel=4, FUN.VALUE=IRIS_PRED, rows=10)
class(res)                                    # ore.frame
ore.create(res, table="IRIS_SCORES")          # materialize scores
table(IRIS_SCORES$Species, IRIS_SCORES$PRED)  # compute confusion matrix
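The chunk-wise scoring pattern can be tried locally in base R. This sketch is an analogue, not ORE code: it uses `lm` on the built-in `iris` data instead of `naiveBayes` (to avoid depending on e1071), and a sequential `lapply` stands in for the parallel R engines.

```r
# Build the model once on all rows (the step without parallelism)
mod <- lm(Sepal.Length ~ ., data = iris)

# Scoring function applied to one chunk of rows
score_chunk <- function(dat, mod) {
  dat$PRED <- predict(mod, newdata = dat)
  dat
}

# Partition by row count: 10 rows per chunk, as with rows = 10
chunk_id <- ceiling(seq_len(nrow(iris)) / 10)
chunks <- split(iris, chunk_id)

# Each chunk is independent, so each could go to a separate R engine
scored <- lapply(chunks, score_chunk, mod = mod)
res <- do.call(rbind, scored)
dim(res)
```

Because each row's prediction depends only on that row and the saved model, scoring chunk by chunk gives the same predictions as scoring the whole table at once.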
Oracle R Enterprise Embedded R Execution
DAT <- ONTIME_S[ONTIME_S$DEST %in%
                c("BOS","SFO","LAX","ORD","ATL","PHX","DEN"),]
# Data parallel by partition: one model per destination
modList <- ore.groupApply(
  X=DAT, INDEX=DAT$DEST, parallel=3,
  function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat))
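A local base R analogue of partition-wise model building, using `split` and `lapply` on the built-in `mtcars` data. The grouping column (`cyl`) and the formula are illustrative choices, not taken from the slide; the structure (one model per partition value, returned as a named list) mirrors the group-apply pattern.

```r
# Partition the data by value of a column, then fit one model per partition
dat <- mtcars
mod_list <- lapply(split(dat, dat$cyl),
                   function(d) lm(mpg ~ wt + hp, data = d))

# The result is a list of models named by partition value
names(mod_list)
wt_coefs <- sapply(mod_list, function(m) coef(m)[["wt"]])
wt_coefs
```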
Deployment
Enterprise application (C, C++, Java, Python) spawns R engines against the data stores:
• rJava API, Rserve, rpy2, system(), cron, …
Enterprise application (C, C++, Java, Python) invokes R through SQL with OAA/ORE:
• Invoke R easily from non-R environments
• Use SQL to invoke R from enterprise applications
• Seamlessly return data.frames, images, XML as database tables via SQL
• Schedule R function execution
• Facilitate hand-off between Data Scientists and App Developers
[Diagram labels: Results (data.frame, images, R objects), Datastore, R Script Repository]
Deploy R using SQL
• Store named R function in Script Repository from R or SQL
• Return values
– Images as PNG BLOB column
– data.frame content as database table
– XML with data.frame and images
• Invoke same UDF from R or SQL
-- Store the named R function in the Script Repository
begin
  sys.rqScriptDrop('RandomRedDots');
  sys.rqScriptCreate('RandomRedDots',
    'function(){
       id <- 1:10
       plot(1:100, rnorm(100), pch = 21,
            bg = "red", cex = 2, main="Random Red Dots")
       data.frame(id=id, val=id / 100)
     }');
end;
/

-- Return the plot as a PNG BLOB column
select ID, IMAGE
from table(rqEval( NULL,'PNG','RandomRedDots'));

-- Return the data.frame content as a table with the given column types
select id, val
from table(rqEval( NULL,'select 1 id, 1 val from dual',
                   'RandomRedDots'));

-- Return structured and image content within XML string
select *
from table(rqEval(NULL, 'XML', 'RandomRedDots'));

# In R, invoke same function by name
ore.doEval(FUN.NAME='RandomRedDots')
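To see what the stored function produces, it can be run locally in plain R without a database. The sketch below defines the same function body and calls it on a file-less PDF device (`pdf(file = NULL)`), so the plot is drawn but no file or window is created; that device choice is an assumption made to keep the example headless.

```r
# Same function body as stored in the Script Repository
RandomRedDots <- function() {
  id <- 1:10
  plot(1:100, rnorm(100), pch = 21,
       bg = "red", cex = 2, main = "Random Red Dots")
  data.frame(id = id, val = id / 100)
}

# Draw on a null PDF device: the image is a side effect, the data.frame is the value
pdf(file = NULL)
res <- RandomRedDots()
invisible(dev.off())

str(res)
```

The returned data.frame has two numeric columns, `id` and `val`, which is the shape described to SQL by the `'select 1 id, 1 val from dual'` argument; the plot is the part returned as a PNG or embedded in XML.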
Architectural Elements: Enabling R for Big Data
• Transparent Data Access, Analysis, and Exploration
– Leverage powerful back-ends for the heavy lifting…transparently from R
• Scalable Machine Learning
– Use parallel, distributed ML and graph algorithms
• Data & Task Parallelism
– Enable parallelism quickly and easily for big data processing
• Production Deployment
– Immediately leverage data scientist R scripts and results in production environments
Oracle R Enterprise
Component of Oracle Advanced Analytics option to Oracle Database, Oracle Exadata Cloud Service, Oracle Database Cloud Service
• Use Oracle Database as HPC environment
• Use in-database parallel and distributed machine learning algorithms
• Manage R scripts and R objects in Oracle Database
• Integrate R results into applications and dashboards via SQL
[Diagram labels: R Client; Oracle R Enterprise; SQL Interfaces: SQL*Plus, SQLDeveloper, …; Database Server Machine; Oracle Database; User tables; In-db stats]
Oracle Advanced Analytics
In-Database Machine Learning Algorithms*: SQL, R and UI Access
• Classification: Decision Tree, Logistic Regression (GLM), Naïve Bayes, Support Vector Machine (SVM), Random Forest
• Regression: Multiple Regression (GLM), Support Vector Machine (SVM), Linear Model, Generalized Linear Model, Multi-Layer Neural Networks, Stepwise Linear Regression
• Attribute Importance: Minimum Description Length, Unsupervised pair-wise KL div.
• Clustering: Hierarchical k-Means, Orthogonal Partitioning Clustering, Expectation-Maximization
• Feature Extraction & Creation: Nonnegative Matrix Factorization, Principal Component Analysis, Singular Value Decomposition, Explicit Semantic Analysis
• Market Basket Analysis: Apriori – Association Rules
• Anomaly Detection: 1 Class Support Vector Machine
• Time Series: Single & Double Exp. Smoothing
• Predictive Queries: Clustering, Regression, Anomaly Detection, Feature Extraction
• Open Source R Algorithms: Ability to run any R package via Embedded R Execution
* supports partitioned models, text mining
Oracle Spatial and Graph: PGX graph functions
For use with OAAgraph integrated with Oracle R Enterprise
• Detecting Components and Communities: Tarjan’s, Kosaraju’s, Weakly Connected Components, Label Propagation (w/ variants), Soman and Narang’s
• Ranking and Walking: Pagerank, Personalized Pagerank, Betweenness Centrality (w/ variants), Closeness Centrality, Degree Centrality, Eigenvector Centrality, HITS, Random walking and sampling (w/ variants)
• Evaluating Community Structures: Conductance, Modularity, Clustering Coefficient (Triangle Counting), Adamic-Adar
• Path-Finding: Hop-Distance (BFS), Dijkstra’s, Bi-directional Dijkstra’s, Bellman-Ford’s
• Link Prediction: SALSA (Twitter’s Who-to-follow)
• Other Classics: Vertex Cover, Minimum Spanning-Tree (Prim’s)
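As an illustration of one of the ranking algorithms listed, here is a textbook PageRank power iteration in base R. It is a sketch for intuition only, not the PGX implementation; the defaults mirror the values 0.0001, 0.85, and 100 used in the earlier pagerank() call, on the assumption that those are the convergence tolerance, damping factor, and maximum iteration count. The function name and the tiny edge list are made up.

```r
# Textbook PageRank by power iteration (hypothetical helper, not a PGX API).
# edges: data.frame with integer columns from/to; n: number of nodes.
pagerank_sketch <- function(edges, n, tol = 1e-4, damping = 0.85, max_iter = 100) {
  out_deg <- tabulate(edges$from, nbins = n)
  pr <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    # Each node shares its rank equally among its out-links
    contrib <- pr[edges$from] / out_deg[edges$from]
    incoming <- as.numeric(tapply(contrib,
                                  factor(edges$to, levels = seq_len(n)), sum))
    incoming[is.na(incoming)] <- 0          # nodes with no in-links
    # Rank held by nodes without out-links is spread over all nodes
    dangling <- sum(pr[out_deg == 0])
    new_pr <- (1 - damping) / n + damping * (incoming + dangling / n)
    converged <- sum(abs(new_pr - pr)) < tol
    pr <- new_pr
    if (converged) break
  }
  pr
}

# Tiny made-up graph: node 3 receives links from nodes 1, 2, and 4
edges <- data.frame(from = c(1, 1, 2, 3, 4), to = c(2, 3, 3, 1, 3))
pr <- pagerank_sketch(edges, n = 4)
round(pr, 3)
```

In this graph node 3 ends up with the highest rank and node 4, which nothing links to, with the lowest; the ranks sum to 1.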
Oracle R Advanced Analytics for Hadoop
Component of Oracle Big Data Connectors, Oracle Big Data Cloud Service
• R interface to HQL: Basic Statistics, Data Prep, Joins and View creation
• Parallel, distributed algorithms with Spark Caching and Spark MLlib algs
• Use of Open-source R packages within custom R Mappers / Reducers
[Diagram labels: R Client (R Analytics, Oracle R Advanced Analytics for Hadoop); SQL Client (SQL Developer, Other SQL Apps); HQL; Cluster with Oracle R Advanced Analytics for Hadoop; Oracle Database with Advanced Analytics option]
Oracle R Advanced Analytics for Hadoop 2.7.0
Machine Learning algorithms
• Logistic Regression ORAAH
Regression
Classification
• Neural Networks ORAAH
Attribute Importance
• Principal Component Analysis
Clustering
• Hierarchical k-Means
Feature Extraction
• Non-negative Matrix Factorization
• Collaborative Filtering (LMF)
Basic Statistics
• Correlation/Covariance
• GLM ORAAH
• Hierarchical k-Means
• Gaussian Mixture Models
• Principal Component Analysis
• LASSO
• Ridge Regression
• Support Vector Machine
• Random Forest
• Linear Regression
• Logistic Regression
• Random Forests
• Decision Trees
• Support Vector Machine
• SVD
Learn More about Oracle’s Advanced Analytics R Technologies…
http://oracle.com/goto/R