Recent Developments in SparkR for Advanced Analytics
Xiangrui Meng [email protected]
2016/07/15 - Silicon Valley Machine Learning Meetup
About Apache Spark
• General engine for large-scale data processing, generalizing the MapReduce model
• Resilient Distributed Dataset (RDD) with in-memory & on-disk caching
• Concise APIs in Python, Scala, Java, and R
• Apache open source license
• Spark 2.0 coming soon!
[Diagram: the Spark core engine, with Spark SQL, Streaming, MLlib, GraphX, and SparkR on top.]
Apache Spark is the Taylor Swift of Big Data Software
- Derrick Harris, Fortune
About Databricks
• Founded by the team who created Apache Spark
• Offers a hosted service
  • Apache Spark in the cloud
  • Notebooks
  • Cluster management
  • Production environment
• Free Community Edition
  • http://databricks.com/try
About Me
• Software Engineer at Databricks
  • tech lead of machine learning and data science
• Committer and PMC member of Apache Spark
• Ph.D. from Stanford in computational mathematics
Outline
• Introduction to SparkR
• Descriptive analytics in SparkR
• Predictive analytics in SparkR
• Future directions
Introduction to SparkR
Bridging the gap between R and Big Data
SparkR
• Introduced in Spark 1.4
• Wrappers over DataFrames and DataFrame-based APIs
• SparkR's APIs are made similar to existing ones in R (or R packages), rather than to the Python/Java/Scala APIs:
  • R is very convenient for analytics and users love it.
  • Scalability is the main issue, not the API.
DataFrame-based APIs
• Storage: S3 / HDFS / local / …
• Data sources: csv / parquet / json / …
• DataFrame operations:
  • select / subset / groupBy / agg / collect / …
  • rand / sample / avg / var / …
• Conversion to/from R data.frame
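A minimal end-to-end sketch of these operations, using the Spark 1.x-style API shown throughout this talk (the path and the age column are hypothetical):

df <- read.df(sqlContext, "s3://bucket/people.json", source = "json")
adults <- subset(df, df$age >= 18)     # filter rows on the workers
sampled <- sample(adults, FALSE, 0.1)  # 10% Bernoulli sample
local <- collect(sampled)              # bring back as an R data.frame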
DataFrame-based APIs
Compute average age per department
• df.groupBy("dept").avg("age")   # Python
• df.groupBy("dept").avg("age")   // Scala
• df.groupBy("dept").avg("age");  // Java
• avg(groupBy(df, "dept"), "age")        # R
• df %>% groupBy("dept") %>% avg("age")  # R with magrittr
+----+--------+
|dept|avg(age)|
+----+--------+
| eng|    25.0|
| ops|    30.0|
+----+--------+
SparkR Architecture
[Diagram: the R process communicates with the Spark driver JVM through the R backend; the driver coordinates JVM workers, which read from the data sources.]
Data Conversion between R and SparkR
[Diagram: SparkR::createDataFrame() ships an R data.frame to the JVM as a Spark DataFrame via the R backend; SparkR::collect() brings a Spark DataFrame back to R.]
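For example, a minimal round trip using R's built-in faithful dataset and the Spark 1.x entry point:

sdf <- createDataFrame(sqlContext, faithful)  # R data.frame -> Spark DataFrame
rdf <- collect(sdf)                           # Spark DataFrame -> R data.frame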
Descriptive Analytics
Big Data at a glance in SparkR
Summary Statistics
• count, min, max, mean, standard deviation, variance
  describe(df)
  df %>% groupBy("dept") %>% agg(avgAge = avg(df$age))
• covariance, correlation
  df %>% select(covar_samp(df$x, df$y), corr(df$x, df$y))
• skewness, kurtosis
  df %>% select(skewness(df$x), kurtosis(df$x))
Sampling Algorithms
• Bernoulli sampling (without replacement)
  df %>% sample(FALSE, 0.01)
• Poisson sampling (with replacement)
  df %>% sample(TRUE, 0.01)
• stratified sampling
  df %>% sampleBy("key", c(positive = 1.0, negative = 0.1))
Approximate Algorithms
• frequent items [Karp03]
  df %>% freqItems(c("title", "gender"), support = 0.01)
• approximate quantiles [Greenwald01]
  df %>% approxQuantile("value", c(0.1, 0.5, 0.9), relErr = 0.01)
• single pass with the aggregation pattern
• trade-off between accuracy and space
Implementation: Aggregation Pattern
split + aggregate + combine in a single pass
• split the data into multiple partitions
• compute a partially aggregated result on each partition
• combine the partial results into the final result
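A sketch of the pattern in plain R, computing a mean in one pass (the partitioning is simulated locally here; Spark performs the same steps across machines):

x <- rnorm(1e6)
partitions <- split(x, cut(seq_along(x), 4))                      # split into 4 partitions
partials <- lapply(partitions, function(p) c(sum(p), length(p)))  # partial (sum, count) per partition
combined <- Reduce(`+`, partials)                                 # combine the partial results
combined[1] / combined[2]                                         # final result: the mean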
Implementation: High-Performance
• new online update formulas for summary statistics (see the sketch below)
• code generation to achieve high performance
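For intuition, a one-pass (online) update of mean and variance in plain R, in the spirit of Welford's algorithm; MLlib's formulas are of this flavor, extended to the higher moments needed for skewness and kurtosis:

update <- function(state, x) {            # fold one value into the running state
  n <- state$n + 1
  delta <- x - state$mean
  mean <- state$mean + delta / n
  m2 <- state$m2 + delta * (x - mean)     # running sum of squared deviations
  list(n = n, mean = mean, m2 = m2)
}
state <- Reduce(update, rnorm(1000), list(n = 0, mean = 0, m2 = 0))
state$m2 / (state$n - 1)                  # sample variance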
kurtosis of 1 billion values on a Macbook Pro (2 cores):
scipy.stats 250s
octave 120s
CRAN::moments 70s
SparkR / Spark / PySpark 5.5s
Predictive Analytics
Enabling large-scale machine learning in SparkR
MLlib + SparkR
MLlib and SparkR integration started in Spark 1.5.
API design choices:
1. mimic the methods implemented in R or R packages
   • no new methods to learn
   • similar but not the same / shadows existing methods
   • inconsistent APIs
2. create a new set of APIs
MLlib Algorithm Coverage
• Classification
  • Logistic regression
  • Naive Bayes
  • Streaming logistic regression
  • Linear SVMs
  • Decision trees
  • Random forests
  • Gradient-boosted trees
  • Multilayer perceptron
• Regression
  • Ordinary least squares
  • Ridge regression
  • Lasso
  • Isotonic regression
  • Decision trees
  • Random forests
  • Gradient-boosted trees
  • Survival regression
  • Streaming linear methods
• Frequent pattern mining
  • FP-growth
  • PrefixSpan
• Clustering
  • Gaussian mixture models
  • K-Means
  • Streaming K-Means
  • Latent Dirichlet Allocation
  • Power Iteration Clustering
  • Bisecting k-means
• Statistics
  • Pearson correlation
  • Spearman correlation
  • Online summarization
  • Chi-squared test
  • Kernel density estimation
  • Kolmogorov–Smirnov test
• Linear algebra
  • Local dense & sparse vectors & matrices
  • Distributed matrices
    • Block-partitioned matrix
    • Row matrix
    • Indexed row matrix
    • Coordinate matrix
  • Matrix decompositions
• Recommendation
  • Alternating Least Squares
• Feature extraction & selection
  • Word2Vec
  • Chi-Squared selection
  • Hashing term frequency
  • Inverse document frequency
  • Normalizer
  • Standard scaler
  • Tokenizer
  • One-Hot Encoder
  • StringIndexer
  • VectorIndexer
  • VectorAssembler
  • Binarizer
  • Bucketizer
  • ElementwiseProduct
  • PolynomialExpansion
• Model import/export
• Pipelines
List based on Spark 1.6
Generalized Linear Models (GLMs)
• Linear models are simple but extremely popular.
• A GLM is specified by the following:
  • a distribution of the response (from the exponential family),
  • a link function g such that g(E[y | x]) = x^T w.
• Fitting maximizes the sum of log-likelihoods of the training data over the coefficients w.
Distributions and Link Functions
In Spark 2.0, SparkR supports all of the response families supported by R's glm.
Model                  Distribution   Link
linear least squares   normal         identity
logistic regression    binomial       logit
Poisson regression     Poisson        log
gamma regression       gamma          inverse
…                      …              …
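As a sketch, these rows map onto spark.glm as follows (df and the columns y, x are assumed; the family strings follow R's naming):

m1 <- spark.glm(df, y ~ x, family = "gaussian")  # identity link
m2 <- spark.glm(df, y ~ x, family = "binomial")  # logit link
m3 <- spark.glm(df, y ~ x, family = "poisson")   # log link
m4 <- spark.glm(df, y ~ x, family = "Gamma")     # inverse link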
GLMs in SparkR
# Create the DataFrame for training
df <- read.df(sqlContext, "path/to/training")

# Fit a Gaussian linear model
model <- glm(y ~ x1 + x2, data = df, family = "gaussian")  # mimics R's glm
model <- spark.glm(df, y ~ x1 + x2, family = "gaussian")

# Get the model summary
summary(model)

# Make predictions
predict(model, newDF)
Implementation: SparkR::glm
`SparkR::glm` is a thin wrapper over an ML pipeline that consists of the following stages:
• RFormula, which itself embeds an ML pipeline for feature preprocessing and encoding,
• an estimator (GeneralizedLinearRegression).
[Diagram: RWrapper = RFormula + GLM estimator.]

Implementation: SparkR::glm

[Diagram: expanded view of the RWrapper pipeline; RFormula unrolls into StringIndexer and VectorAssembler stages for feature encoding, with IndexToString to decode predicted labels, followed by the GLM estimator.]
Implementation: R Formula
• R provides model formulae to express models.
• We support the following R formula operators in SparkR (examples below):
  • `~` separates the target and the terms
  • `+` concatenates terms; "+ 0" removes the intercept
  • `-` removes a term; "- 1" removes the intercept
  • `:` interaction (multiplication for numeric values, or binarized categorical values)
  • `.` all columns except the target
• The implementation is in Scala.
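A few illustrative formulae (a sketch; df and the column names are assumed):

model <- spark.glm(df, y ~ x1 + x2)      # y on x1 and x2, with an intercept
model <- spark.glm(df, y ~ x1 + x2 - 1)  # the same model, without an intercept
model <- spark.glm(df, y ~ x1:x2)        # interaction of x1 and x2
model <- spark.glm(df, y ~ .)            # y on all other columns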
GLM: Row-based Distributed Storage
[Diagram: rows of (weight w, features x, label y) stored across partitions; each partition holds a contiguous block of rows.]
GLM: Gradient Descent Methods
• Stochastic gradient descent (SGD):
  • trade-offs between the merge scheme and convergence
• Mini-batch SGD:
  • hard to sample mini-batches efficiently
  • communication overhead when merging gradients
• Batch gradient descent (a toy step is sketched below):
  • slow convergence
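For reference, a toy full-batch gradient step for linear least squares in plain R (illustration only; MLlib computes the same gradient as a distributed aggregate):

gd_step <- function(w, X, y, lr = 0.1) {
  grad <- t(X) %*% (X %*% w - y) / nrow(X)  # full-batch least-squares gradient
  w - lr * grad                             # step against the gradient
}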
GLM: Quasi-Newton methods
• Newton's method converges much faster than GD, but it requires second-order information:
  • L-BFGS works for smooth objectives. It approximates the inverse Hessian using only first-order information.
  • OWL-QN works for objectives with L1 regularization.
• MLlib calls the L-BFGS/OWL-QN implementations in breeze.
Direct Methods for Linear Least Squares
• Linear least squares has an analytic solution: w = (X^T X)^{-1} X^T y.
• The solution can be computed directly via the normal equations or through a QR factorization, both of which are implemented in Spark.
  • requires only a single pass over the data
  • efficient when the number of features is small (< 4000)
  • provides R-like model summary statistics
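Both routes are easy to see in plain R (a local sketch with random data; Spark does the equivalent on distributed matrices):

X <- cbind(1, matrix(rnorm(200), 100, 2))  # 100 rows: intercept + 2 features
y <- rnorm(100)
w_qr <- qr.solve(X, y)                     # least squares via QR factorization
w_ne <- solve(t(X) %*% X, t(X) %*% y)      # least squares via the normal equations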
GLM: Iteratively Re-weighted Least Squares
• Generalized linear models with an exponential-family response can be solved via iteratively re-weighted least squares (IRLS):
  • linearize the objective at the current solution
  • solve the weighted linear least squares problem
  • repeat the above steps until convergence
• efficient when the number of features is small (< 4000)
• provides R-like model summary statistics
• This is also how R implements glm.
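A compact local IRLS sketch for logistic regression in plain R, showing the linearize-and-reweight loop (illustration only; MLlib solves each weighted subproblem distributively):

irls <- function(X, y, iters = 25) {
  w <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    mu <- as.vector(1 / (1 + exp(-X %*% w)))   # current predictions
    wt <- mu * (1 - mu)                        # working weights
    z <- as.vector(X %*% w) + (y - mu) / wt    # linearized (working) response
    w <- qr.solve(sqrt(wt) * X, sqrt(wt) * z)  # weighted least squares step
  }
  w
}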
Standardization
To match the results of R and glmnet, the most popular R package for GLMs, we provide options to standardize the features and labels before training: each label is scaled by 1/delta and the j-th feature column by 1/sigma_j, where delta is the stddev of the labels and sigma_j is the stddev of the j-th feature column.
Implementation: Test against R
Besides normal tests, we also verify our implementation using R.
/*
  df <- as.data.frame(cbind(A, b))
  for (formula in c(b ~ . -1, b ~ .)) {
    model <- lm(formula, data=df, weights=w)
    print(as.vector(coef(model)))
  }

  [1] -3.727121 3.009983
  [1] 18.08 6.08 -0.60
*/
val expected = Seq(
  Vectors.dense(0.0, -3.727121, 3.009983),
  Vectors.dense(18.08, 6.08, -0.60))
ML Models in SparkR (Spark 2.0)
• generalized linear models (GLMs)
  • glm / spark.glm (stats::glm)
• accelerated failure time (AFT) model for survival analysis
  • spark.survreg (survival)
• k-means clustering
  • spark.kmeans (stats::kmeans)
• Bernoulli naive Bayes
  • spark.naiveBayes (e1071)
Model Persistence in SparkR
• model persistence supported for all ML models in SparkR • thin wrappers over pipeline persistence from MLlib
model <- spark.glm(df, x ~ y + z, family = "gaussian")
write.ml(model, path)
model <- read.ml(path)
summary(model)
• saved models can be passed along to Scala/Java engineers
Work with R Packages in SparkR
• There are ~8500 community packages on CRAN. • It is impossible for SparkR to match all existing features.
• Not every dataset is large. • Many people work with small/medium datasets.
• SparkR helps in those scenarios by: • connecting to different data sources, • filtering or downsampling big datasets, • parallelizing training/tuning tasks.
Work with R Packages in SparkR
df <- sqlContext %>% read.df(…) %>% collect()
points <- data.matrix(df)
run_kmeans <- function(k) {
kmeans(points, centers=k)
}
kk <- 1:6
lapply(kk, run_kmeans)       # R's apply
spark.lapply(kk, run_kmeans) # parallelize the tasks on Spark
summary(this.talk)
• SparkR enables big data analytics in R
  • descriptive analytics on top of DataFrames
  • predictive analytics from the MLlib integration
• SparkR works well with existing R packages
Thanks to the Apache Spark community for developing and maintaining SparkR: Alteryx, Berkeley AMPLab, Databricks, Hortonworks, IBM, Intel, etc., and individual contributors!
Future Directions
• CRAN release of SparkR (WIP)
• more consistent APIs with existing R packages: dplyr, etc.
• better R formula support (WIP)
• more algorithms from MLlib: decision trees, ALS, etc. (WIP)
• better integration with existing R packages: gapply / UDFs (WIP)
• integration with Spark packages: GraphFrames, CoreNLP, etc.
We’d greatly appreciate feedback from the R community!
Try Apache Spark with Databricks
http://databricks.com/try
• Download a companion notebook of this talk at: http://dbricks.co/1rbujoD
• Try latest version of Apache Spark and preview of Spark 2.0
Thank you.

• SparkR user guide on the Apache Spark website
• SparkR/MLlib roadmap for Spark 2.1
• Databricks Community Edition and blog posts