© 2015 IBM Corporation

Running the R Language on Hadoop: Big R

What is Open Source R? What is CRAN?

R is a powerful programming language and environment for statistical computing and graphics.

R offers a rich analytics ecosystem:

− Full analytics life-cycle
• Data exploration
• Statistical analysis
• Modeling, machine learning, simulations
• Visualization

− Highly extensible via user-submitted packages
• Tap into an innovation pipeline fed by highly regarded statisticians
• Currently 4700+ statistical packages in the repository
• Easily accessible via CRAN, the Comprehensive R Archive Network

− R is one of the fastest-growing data analysis tools
• Deeply knowledgeable and supportive analytics community
• The most popular software used in data analysis competitions
• Gaining speed in corporate, government, and academic settings

The Explainer: Data in Hadoop

[Figure: You (the R user) and the distributed data.]

Data in Hadoop: Open Source R on a single node

[Figure: You (the R user) on a single node, alongside the distributed data.]

Challenges with Running Large-Scale Analytics

TRADITIONAL APPROACH: Analyze small subsets of information
[Figure: the analyzed information is a small part of all available information.]

BIG DATA APPROACH: Analyze all information
[Figure: all available information is analyzed.]

Various Approaches to integrate R with Hadoop

Key Challenges of using R with Hadoop:

• In R, processing is fundamentally memory bound: data frames and matrices are loaded into memory and all processing happens there.

• This makes it hard to integrate R with a fundamentally distributed processing paradigm such as MapReduce (Hadoop).

Various Approaches to integrate R with Hadoop

• RHIPE – Open source framework that integrates R with Hadoop through MapReduce coding on the client side.

• RHadoop/RMR – Another open source framework that integrates R with Hadoop on the client side.

• Hadoop Streaming – Open source, part of the Hadoop framework. Invokes an R script in MapReduce through the I/O stream.

• Oracle Enterprise R Connector for Hadoop – Licensed product. Essentially an adaptation of Oracle R that works with any Hadoop distribution through MapReduce coding on the client side.

• Big R – Licensed product. Proprietary mechanism for integrating with Hadoop without the need for MapReduce-style R code.
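The Hadoop Streaming approach can be sketched in plain R. This is a minimal local sketch, not a cluster job: the mapper and reducer are ordinary functions over lines of text, standing in for scripts that would read from stdin and write tab-separated key/value lines to stdout. The function names and the sample rows are made up for illustration; the column positions (city in field 1, temperatures in fields 3 and 6) are borrowed from the RHIPE sample in this deck.

```r
# Mapper: one CSV line in, one "city<TAB>dailyAverage" line out.
map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  daily_avg <- mean(as.numeric(fields[c(3, 6)]))
  paste(fields[1], daily_avg, sep = "\t")
}

# Reducer: average all values that share a key.
reduce_lines <- function(lines) {
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  values <- as.numeric(vapply(parts, `[`, character(1), 2))
  tapply(values, keys, mean)
}

# Local stand-in for the cluster: map every line, then reduce.
# The rows below are invented sample data.
input <- c("Lima,2015-01-01,18,x,y,26",
           "Lima,2015-01-02,20,x,y,28",
           "Quito,2015-01-01,10,x,y,20")
mapped   <- vapply(input, map_line, character(1), USE.NAMES = FALSE)
city_avg <- reduce_lines(mapped)
print(city_avg)
```

In an actual streaming job, each script would loop over `readLines(file("stdin"))` rather than over an in-memory vector, and Hadoop would handle the sort between the two phases.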

Sample Code – R with RHIPE for Hadoop integration

Original R Code

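# Read the CSV; the temperature columns are numeric, all others are text.
# (Columns 3 and 4 as on this slide; the RHIPE and Big R versions use 3 and 6.)
tempData <- read.table("temperature.csv", header = TRUE, sep = ",",
                       colClasses = ifelse(1:10 %in% c(3, 4), "numeric", "character"))

maxMin <- tempData[ , c("minTemp", "maxTemp")]

tempData$avgTempDay <- rowMeans(maxMin)

avgTempCity <- aggregate(tempData$avgTempDay, by = list(city = tempData$city), FUN = mean)

# write() cannot serialize a data frame, so use write.table().
write.table(avgTempCity, file = "output.csv", sep = ",", row.names = FALSE)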

R Code using RHIPE package

library(Rhipe)
rhinit(TRUE, TRUE)

# Map: emit (city, daily average) for every input line.
map <- expression({
  process_line <- function(currentLine) {
    fields <- unlist(strsplit(currentLine, ","))
    maxMin <- c(as.double(fields[3]), as.double(fields[6]))
    rhcollect(fields[1], toString(mean(maxMin)))
  }
  lapply(map.values, process_line)
})

# Reduce: collect the daily averages per city and emit their mean.
reduce <- expression(
  pre    = { means <- numeric(0) },
  reduce = { means <- c(means, as.numeric(unlist(reduce.values))) },
  post   = { rhcollect(reduce.key, toString(mean(means))) }
)

input_file <- "temperature.csv"
output_dir <- "output.csv"

job <- rhmr(jobname = "TempAvg", map = map, reduce = reduce,
            ifolder = input_file, ofolder = output_dir,
            inout = c("text", "sequence"))
rhex(job)

Overview of BigInsights

[Figure: the BigInsights stack, with feature modules on top of the IBM Open Platform.]

IBM BigInsights Data Scientist
• Text Analytics
• Machine Learning on Big R
• Big R (R support)

IBM BigInsights Analyst
• Industry standard SQL (Big SQL)
• Spreadsheet-style tool (BigSheets)

IBM BigInsights Enterprise Management
• POSIX Distributed Filesystem
• Multi-workload, multi-tenant scheduling

IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, Hbase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider)

Free Quick Start (non production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist features
• Community support

IBM BigInsights brings efficient integration of R with Big R

• R as a big data query language
− Outside-in execution

• R as a statistical language for deep computing
− Inside-out execution
− Partitioning of large data (“divide”)
− Parallel cluster execution of pushed down R code (“conquer”)
− Almost any R package can run in this environment

• R as the gateway to scalable machine learning
− A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R

[Figure: R clients (with R packages), a scalable ML engine, embedded R execution (with R packages), and the data sources. Pull data (summaries) to the R client, or push R functions right onto the data.]

Big R Architecture

[Figure: Big R architecture, comprising an R user interface and three numbered components: scalable algorithms, scalable data processing, and native R functions.]

Sample Code – BigR

Code using Big R

library(bigr)

# Proxy to the CSV file in HDFS.
temperatureData <- bigr.frame(dataSource = "DEL", dataPath = "/user/temperature.csv", header = TRUE)

coltypes(temperatureData) <- ifelse(1:10 %in% c(3, 6), "numeric", "character")

# Same logic as the original R code, applied to one group at a time.
buildAvgTempFunc <- function(df) {
  maxMin <- df[ , c("minTemp", "maxTemp")]
  df$avgTempDay <- rowMeans(maxMin)
  avgTempCity <- aggregate(df$avgTempDay, by = list(city = df$city), FUN = mean)
  return(data.frame(avgTempCity))
}

# Run the function once per city; the last argument is a template for the output columns.
avgTemperature <- groupApply(temperatureData, temperatureData$city, buildAvgTempFunc,
                             data.frame(city = "city", average_temperature = 1.0))

bigr.persist(avgTemperature, dataSource = "DEL", dataPath = "/user/output.csv", header = TRUE, del = ",")

• This code (using Big R) achieves the same result as the original R code, on the same dataset stored as a CSV file in HDFS.

• Note that the function buildAvgTempFunc contains the same R code as the original.

• The groupApply function is specific to the bigr package. Similar useful functions are rowApply and tableApply.
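For readers without a cluster at hand, the group-wise pattern behind groupApply can be approximated in base R with split, lapply and rbind. This is only a local analogy under stated assumptions: it says nothing about how the bigr package implements groupApply, the sample data frame is invented, and the per-group function is a simplified variant of the one on the slide.

```r
# Invented in-memory sample standing in for temperature.csv.
tempData <- data.frame(
  city    = c("Lima", "Lima", "Quito", "Quito"),
  minTemp = c(18, 20, 10, 12),
  maxTemp = c(26, 28, 20, 22)
)

# Per-group logic: average the daily means of one city's rows.
buildAvgTempFunc <- function(df) {
  avgTempDay <- rowMeans(df[, c("minTemp", "maxTemp")])
  data.frame(city = df$city[1], average_temperature = mean(avgTempDay))
}

# Split-apply-combine: one data frame per city, one function call per group.
groups <- split(tempData, tempData$city)
avgTemperature <- do.call(rbind, lapply(groups, buildAvgTempFunc))
rownames(avgTemperature) <- NULL
print(avgTemperature)
```

The point of the pattern is that the per-group function only ever sees one group's rows, which is what makes it safe to run the groups on different machines.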

User Experience for Big R

[Screenshot: an annotated Big R session with four callouts.]

• Connect to the BigInsights (BI) cluster
• Data frame proxy to a large data file
• Data transformation step
• Run scalable linear regression on the cluster

3 Key Capabilities in Big R

1. Use of R as a language on Big Data
− Scalable data processing

2. Running native R functions in Hadoop
− Can leverage existing R assets (code and CRAN packages)

3. Running scalable algorithms beyond R in Hadoop
− Wide and growing class of algorithms
− R-like syntax to develop new algorithms and customize existing algorithms

End-to-end integration of R into BigInsights Hadoop

Big R Data Structures: Proxy to entire dataset

data <- bigr.frame(…)

Appears and acts as if all of the data were on your laptop (capability 1: scalable data processing).
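The proxy idea can be illustrated locally. The sketch below is a hypothetical stand-in, not the bigr implementation: `proxy_frame` and `count_rows` are invented names, and a temporary CSV file plays the role of a large file in HDFS. The object holds only a path, and an operation on it streams over the file in chunks instead of loading it.

```r
# A proxy holds only a reference to the data, never the rows themselves.
proxy_frame <- function(path) structure(list(path = path), class = "proxy_frame")

# An operation on the proxy streams over the file in fixed-size chunks.
count_rows <- function(x, chunk_size = 1000) {
  con <- file(x$path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                      # skip the header line
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break
    total <- total + length(chunk)
  }
  total
}

# Stand-in for a large remote file: the built-in mtcars data written to disk.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

data <- proxy_frame(path)
n_rows <- count_rows(data, chunk_size = 10)
print(n_rows)
```

Memory use depends on `chunk_size` rather than on the file size, which is the property a proxy object relies on.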

Out-of-box Big R Functions: Seamlessly compile into MapReduce

dataCorrelation <- cor(data)

A MapReduce job runs over the entire dataset (capability 1: scalable data processing).
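One way a correlation matrix can be computed in a map/reduce style is from per-partition summaries that add up. The sketch below shows that general technique in base R; it is an assumption for illustration, not a description of what Big R generates, and the helper names are invented. The textbook formula used here is also not the most numerically stable choice for very large data.

```r
# Map side: summarize one partition as count, column sums and cross-products.
partition_stats <- function(m) {
  list(n = nrow(m), sums = colSums(m), cross = crossprod(m))
}

# Reduce side: summaries combine by plain addition, in any order.
combine_stats <- function(a, b) {
  list(n = a$n + b$n, sums = a$sums + b$sums, cross = a$cross + b$cross)
}

# Final step: turn the combined summary into a correlation matrix.
stats_to_cor <- function(s) {
  means <- s$sums / s$n
  covariance <- (s$cross - s$n * tcrossprod(means)) / (s$n - 1)
  cov2cor(covariance)
}

set.seed(1)
data <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
partitions <- list(data[1:40, ], data[41:75, ], data[76:100, ])

combined <- Reduce(combine_stats, lapply(partitions, partition_stats))
dataCorrelation <- stats_to_cor(combined)
print(round(dataCorrelation, 3))
```

Each partition contributes only a few small matrices, so the amount of data moved between nodes does not grow with the number of rows.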

Big R Partitioned Execution: Run R functions on partitions of data

[Figure: twelve R instances running across the cluster.]

Each map stands up an instance of R (capability 2: native R functions).

models <- rowApply(... some R function…)

Big R Partitioned Execution: How rowApply works (on 4 nodes)

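[Figure: logical representation of the dataset, divided into four partitions of rows, each with its own R instance.]

Each partition of rows stands up an instance of R (capability 2: native R functions).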
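Partitioned execution can be mimicked on one machine. In this sketch, which assumes nothing about the real rowApply signature, the rows of the built-in `mtcars` data are divided into four partitions and an ordinary R function is applied to each; `partition_rows` and `fit_partition` are invented names.

```r
# Assign each row to one of n_partitions contiguous blocks.
partition_rows <- function(df, n_partitions) {
  ids <- ceiling(seq_len(nrow(df)) * n_partitions / nrow(df))
  split(df, ids)
}

# Any ordinary R function can run per partition; here, a small linear model.
fit_partition <- function(part) coef(lm(mpg ~ wt, data = part))

parts  <- partition_rows(mtcars, 4)     # four partitions, echoing the four nodes
models <- lapply(parts, fit_partition)  # on a cluster these calls would run in parallel
print(models)
```

Because each call touches only its own partition, the `lapply` could be swapped for a parallel or distributed equivalent without changing `fit_partition`.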

SystemML from Big R: Statistics and machine learning at scale

model <- bigr.lm(…)

Optimized MapReduce jobs run over the entire dataset (capability 3: scalable algorithms).
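To give a feel for how a regression can run over the entire dataset without any node holding all of it, the sketch below solves least squares through the normal equations, accumulating X'X and X'y per partition. This is a generic technique shown under assumptions, not the algorithm inside bigr.lm; the helper name, the use of `mtcars`, and the choice of predictors are all illustrative.

```r
# Per-partition pieces of the normal equations: X'X and X'y.
partition_normal_eq <- function(part) {
  X <- cbind(intercept = 1, wt = part$wt, hp = part$hp)
  list(xtx = crossprod(X), xty = crossprod(X, part$mpg))
}

# Four partitions of eight rows each.
parts  <- split(mtcars, rep(1:4, each = 8))
pieces <- lapply(parts, partition_normal_eq)

# Sum the small matrices across partitions, then solve once on a single node.
xtx  <- Reduce(`+`, lapply(pieces, `[[`, "xtx"))
xty  <- Reduce(`+`, lapply(pieces, `[[`, "xty"))
beta <- drop(solve(xtx, xty))
print(beta)
```

The summed matrices are 3 x 3 and 3 x 1 regardless of how many rows the data has, so only the accumulation step needs the cluster.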

Rich Functionality in Big R

Category: Big R functions

Capability 1: scalable data processing
− Connection: connect, disconnect, …
− HDFS: listfs, rmfs
− Types: bigr.frame, bigr.vector
− Functions on types: dim, nrow, colnames, coltypes, head, tail, na.string, na.omit, sort, summary
− Coercion and casting: as.bigr.frame, as.data.frame, ….vector, as.integer, as.logical, as.numeric
− Built-in functions
  • Arithmetic: +, -, *, /, ^
  • Mathematical: abs, acos, asin, atan, ceiling, floor, exp, …
  • String: grepl, substr
  • Statistical: cor, cov, mean, sd
− Miscellaneous: attach, pull, random, sample, ifelse
− Visualization: histogram

Capability 2: native R functions
− Apply R functions: groupApply, tableApply, rowApply

Capability 3: scalable algorithms
− Run scalable algorithms: bigr.lm, bigr.svm, bigr. … (see subsequent slide)

Scalable Machine Learning Algorithms in Big R

Category, description, and Big R function (capability 3: scalable algorithms):

Descriptive Statistics
− Univariate: bigr.univariateStats()
− Bivariate: bigr.bivariateStats()
− Stratified Bivariate: bigr.bivariateStats()

Classification
− Logistic Regression (multinomial): bigr.logistic.regression()
− Multi-Class SVM: bigr.svm()
− Naïve Bayes (multinomial): bigr.naive.bayes()

Clustering
− k-Means: bigr.kmeans()

Regression
− Linear Regression, system of equations: bigr.lm()
− Linear Regression, CG (conjugate gradient descent): bigr.lm()
− Generalized Linear Models (GLM): bigr.glm()
  • Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial and Bernoulli
  • Links for all distributions: identity, log, sq. root, inverse, 1/µ²
  • Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit

Predict
− Scoring: bigr.predict()

Transformation
− Dummy coding, binning, scaling, missing value imputation: bigr.transform()

What’s behind running Big R’s Scalable Algorithms?

• A high-level declarative language with R-like syntax shields your algorithm investment from platform progression

• Cost-based compilation of algorithms to generate execution plans
− Compilation and parallelization
  • Based on data characteristics
  • Based on cluster and machine characteristics
− In-memory single-node and MR execution

• Enables algorithm developer productivity for building additional algorithms (scalability, numeric stability and optimizations)

[Figure: execution targets: Hadoop cluster, or in-memory on a single node.]

Declarative analytics:
1) Future-proof algorithm investment
2) Automatic performance tuning
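The idea of choosing an execution plan from data and machine characteristics can be shown with a deliberately tiny rule: estimate the dense size of an operand and compare it with a memory budget. This is a toy stand-in and not SystemML's optimizer; the function name, the 8 bytes per cell, and the 1 GiB budget are all assumptions made for the example.

```r
# Toy plan selection: run in memory if the dense operand fits the budget.
choose_plan <- function(rows, cols, memory_budget_bytes, bytes_per_cell = 8) {
  estimate <- rows * cols * bytes_per_cell
  if (estimate <= memory_budget_bytes) "in-memory single node" else "MapReduce"
}

budget <- 1024^3   # hypothetical 1 GiB budget

plan_small <- choose_plan(10e6, 10, budget)    # about 0.8 GB of doubles
plan_large <- choose_plan(300e6, 10, budget)   # about 24 GB of doubles
print(c(plan_small, plan_large))
```

A real cost-based compiler weighs many more factors (sparsity, operator fusion, cluster capacity), but the same operation can still land on different plans purely because the input grew.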

K-Means

Input data: X = 10m x 500 (135 GB text file), K = 10

Steps:
• Compute distance matrix D
• Minimum distance for each record, minD
• Find all closest centroids for each record, P
• Compute normalized P that accounts for records with multiple closest centroids
• Compute new centers, C

[Figure: data flow from the input data (X) through the steps above. Execution labels, in the order they appear: 1 MR Job, In Memory, In Memory, In Memory, 1 MR Job. Intermediate sizes: 10m x 10 (dense: ~800 MB), 10m x 1 (dense: ~80 MB), 10 x 500 (dense: ~40 KB at 8 bytes per cell).]
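The k-means steps named on this slide (D, minD, P, normalized P, C) can be written as a handful of matrix operations. The following is a small in-memory sketch of one iteration in base R, assuming squared Euclidean distance and that no cluster ends up empty; it is not the SystemML script, and `kmeans_step` and the toy data are invented for illustration.

```r
# One k-means iteration in matrix form.
kmeans_step <- function(X, C) {
  # D[i, k]: squared distance from record i to centroid k (n x K).
  D <- outer(rowSums(X^2), rowSums(C^2), "+") - 2 * X %*% t(C)
  minD <- apply(D, 1, min)      # minimum distance per record (length n)
  P <- (D <= minD) * 1          # 1 where centroid k is a closest centroid (ties allowed)
  P <- P / rowSums(P)           # normalize so tied records are shared evenly
  t(P) %*% X / colSums(P)       # new centers (K x m): weighted mean of assigned records
}

# Toy data: two obvious clusters of two points each.
X <- rbind(c(0, 0), c(0, 2), c(10, 10), c(10, 12))
C <- rbind(c(1, 1), c(9, 9))

new_C <- kmeans_step(X, C)
print(new_C)
```

At the scale on the slide, D and P have one row per record while C stays tiny, which is why the steps that touch X or D are the expensive ones.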

K-Means

Input data: X = 300m x 500 (4 TB text file), K = 10

Steps:
• Compute distance matrix D
• Minimum distance for each record, minD
• Find all closest centroids for each record, P
• Compute normalized P that accounts for records with multiple closest centroids
• Compute new centers, C

[Figure: data flow from the input data (X) through the steps above. Execution labels, in the order they appear: 1 MR Job, 1 MR Job, 1 MR Job, 4 MR Jobs, 2 MR Jobs. Intermediate sizes: 300m x 10 (dense: ~24 GB), 300m x 1 (dense: ~2.4 GB), 10 x 500 (dense: ~40 KB at 8 bytes per cell).]
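The dense sizes quoted on the two K-Means slides follow from rows x columns x bytes per cell. The sketch below redoes that arithmetic, assuming 8-byte double-precision cells and decimal units; `dense_bytes` is an invented helper. Under that assumption the 10 x 500 centroid matrix comes to about 40 KB.

```r
# Dense matrix footprint, assuming 8-byte double-precision cells.
dense_bytes <- function(rows, cols, bytes_per_cell = 8) rows * cols * bytes_per_cell

d_small_mb <- dense_bytes(10e6, 10) / 1e6    # 10m x 10, in MB
min_small  <- dense_bytes(10e6, 1) / 1e6     # 10m x 1, in MB
d_large_gb <- dense_bytes(300e6, 10) / 1e9   # 300m x 10, in GB
min_large  <- dense_bytes(300e6, 1) / 1e9    # 300m x 1, in GB
centers_kb <- dense_bytes(10, 500) / 1e3     # 10 x 500, in KB

print(c(d_small_mb, min_small, d_large_gb, min_large, centers_kb))
```

Going from 10m to 300m records multiplies every per-record intermediate by 30, while the centroid matrix does not change at all.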

Matrix Factorization Sample – Scalability and Performance

Physical Cluster
• 5 machines, each 2x4 (16 HWT), 64 GB RAM
• 1.5 TB storage, 1 GbE

Hadoop Cluster
• Map capacity: 80
• Reduce capacity: 10
• -Xmx1024m
• SystemML

Execution modes shown:
• All operations execute on a single machine: 0 MR Jobs
• Hybrid execution (majority of operations execute on a single machine): 4 MR Jobs
• Hybrid execution (majority of operations execute in map-reduce): 6 MR Jobs
