Introduction to R for Data Mining (Feb 2013)

Revolution Confidential. Introduction to R for Data Mining, 2013 Webinar Series. Joseph B. Rickert, February 14, 2013.

Upload: revolution-analytics

Post on 26-Jan-2015


DESCRIPTION

Presented: Thursday, February 14, 2013
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

We at Revolution Analytics are often asked “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

Find a way of orienting yourself in the open source R world
Have a definite application area in mind
Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

Provide an orientation to R’s data mining resources
Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data
Show the simple R commands to accomplish these same tasks without the GUI
Demonstrate how to build on these fundamental skills to gain further competence in R
Move away from using small test data sets and show how, with the same level of skill, one can analyze some fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software, as well as students who are new to data mining, should come away with a plan for getting started with R.

TRANSCRIPT

Revolution Confidential

Introduction to R for Data Mining

2013 Webinar Series

Joseph B. Rickert

February 14, 2013

First Polling Question

What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other

My goal for today’s webinar is to convince you that:

R is a serious platform for data mining

Revolution R Enterprise is the platform for serious data mining

Seriously, it is not difficult to learn enough R to do some serious data mining

A word about Data Mining

We assume that you know a little bit about data mining and that this is your context for learning R

Data Mining

Applications

Credit Scoring

Fraud Detection

Ad Optimization

Targeted Marketing

Gene Detection

Recommendation systems

Social Networks

Actions

Acquire Data

Prepare

Classify

Predict

Visualize

Optimize

Interpret

Algorithms

CART

Random Forests

SVM

KMeans

Hierarchical clustering

Ensemble Techniques

WHAT IS R? Getting Orientated

R is:

The way to do statistical computing
A full-blown programming language
The home of nearly every data mining algorithm known to data science
A vibrant world-wide community

R was written in the early 1990s by Robert Gentleman and Ross Ihaka.

Since 1997 a core group of about 20 developers has guided the evolution of the language.

R is organized into libraries of functions called packages

CRAN
R download: Base, Recommended packages
User contributed packages

R Package Growth: 4,332 packages as of 2/13/13
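As a small illustration of how packages are used in practice, the sketch below checks whether a package is installed before attaching it. The helper name `load_if_available` is hypothetical (it is not part of R or of the webinar scripts), and the `install.packages()` call is commented out because it needs network access to a CRAN mirror.

```r
# Hypothetical helper: attach a package only if it is already installed.
load_if_available <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
    TRUE
  } else {
    message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
    FALSE
  }
}

# install.packages("rpart")   # downloads from CRAN; uncomment if needed

# Base packages such as 'stats' are always available.
stats_loaded <- load_if_available("stats")

# Count the packages installed in the local library paths.
n_installed <- nrow(installed.packages())
```

Calling `load_if_available()` with the name of any user contributed package returns `FALSE` and prints a hint instead of stopping with an error when the package is missing.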

THE STRUCTURE OF R FACILITATES LEARNING

Learning R

Learning R?

Levels of R Skill

R developer     Write production grade code
R contributor   Write an R package
R programmer    Write code and algorithms
R user          Use R functions
R aware         Use a GUI

Hours of use: 10 to 10,000 (The Malcolm Gladwell “Outlier” Scale)

Basic Machine Learning Functions

             Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
             kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
             rpart         rpart         Recursive partitioning and regression trees
             ksvm          kernlab       Support Vector Machine
             apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
             randomForest  randomForest  Random Forests classification and regression
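To make the table concrete, here is a minimal sketch that calls three of the listed functions from the stats package (`hclust`, `kmeans`, `glm`) on R's built-in iris data. The data set, the choice of three clusters, and the variable names are assumptions made for illustration; they are not taken from the webinar scripts.

```r
# Built-in iris data: 150 flowers, 4 numeric measurements, 3 species.
flowers <- iris
measurements <- scale(flowers[, 1:4])  # standardize before distance-based clustering

# Hierarchical clustering (stats::hclust) on Euclidean distances.
hc <- hclust(dist(measurements), method = "complete")
hc_groups <- cutree(hc, k = 3)         # cut the dendrogram into 3 clusters

# K-means clustering (stats::kmeans); set a seed because the starts are random.
set.seed(42)
km <- kmeans(measurements, centers = 3, nstart = 25)

# Compare the k-means assignments with the known species labels.
print(table(kmeans = km$cluster, species = flowers$Species))

# Logistic regression (stats::glm) for a two-class problem:
# is the flower of species virginica or not?
flowers$is_virginica <- as.integer(flowers$Species == "virginica")
logit_fit <- glm(is_virginica ~ Sepal.Length + Petal.Width,
                 data = flowers, family = binomial)
pred_class <- as.integer(predict(logit_fit, type = "response") > 0.5)
accuracy <- mean(pred_class == flowers$is_virginica)
```

The other functions in the table (`rpart`, `ksvm`, `apriori`, `ada`, `randomForest`) follow a similar pattern of passing a formula or a data object, but each lives in its own package that must be installed and loaded first.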

Noteworthy Data Mining Packages

Package  Comment
caret    Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle   A very intuitive GUI for data mining that produces useful R code
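A hedged sketch of what model building with caret can look like is given below. It assumes the caret and rpart packages are installed and skips the example otherwise; the iris data, the 70/30 split, and the 5-fold cross-validation are illustrative choices, not the contents of the webinar's caret script.

```r
# Runs only if caret and rpart are installed; otherwise fit stays NULL.
fit <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)
  set.seed(123)

  # Stratified 70/30 split on the outcome variable.
  in_train <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
  training <- iris[in_train, ]
  testing  <- iris[-in_train, ]

  # 5-fold cross-validation to tune the tree's complexity parameter.
  ctrl <- trainControl(method = "cv", number = 5)
  fit  <- train(Species ~ ., data = training,
                method = "rpart", trControl = ctrl)

  # Classify the held-out rows and report the share predicted correctly.
  preds <- predict(fit, newdata = testing)
  print(mean(preds == testing$Species))
}

# rattle opens a GUI window, so it is started interactively:
# library(rattle); rattle()
```

Changing the `method` argument of `train()` is the usual way to swap in a different model type while keeping the same resampling setup.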

TIME TO RUN SOME CODE
Doing a lot with a little R

Script
1 GETTING STARTED.R
2 ROLL with RATTLE.R
3 IN THE TREES.R
4 INTRO to CARET.R
5 BIG DATA with RevoScaleR.R
6 WORDCLOUD.R

The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
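The scripts themselves are in the gist and are not reproduced here. As a rough idea of what a tree-building script involves, the following sketch grows a classification tree with rpart on the built-in iris data. It is an illustration written under those assumptions, not the author's IN THE TREES script.

```r
library(rpart)   # recommended package, normally installed with R

set.seed(2013)
# Hold out 30% of the built-in iris data for testing.
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
training  <- iris[train_idx, ]
testing   <- iris[-train_idx, ]

# Grow a classification tree predicting species from the four measurements.
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = training, method = "class")
print(tree)

# Classify the held-out flowers and tabulate predictions against the truth.
predicted <- predict(tree, newdata = testing, type = "class")
confusion <- table(predicted = predicted, actual = testing$Species)
print(confusion)
test_accuracy <- sum(diag(confusion)) / sum(confusion)
```

Plotting the fitted tree with `plot(tree); text(tree)` is a quick way to inspect the splits.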

Second Polling Question

What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees, or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other

Third Polling Question (insert after running script IN THE TREES)

What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other

Working with Big Data

RevoScaleR and Revolution R Enterprise

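Too Big for Open Source R

# Read the whole XDF file into an in-memory data frame, then fit a
# logistic regression with glm(); with big data this may exhaust memory.
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
model <- glm(default ~ ., data = mortDF, family = "binomial")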

RevoScaleR brings the power of Big Data to R

Distributed Statistical Algorithms
Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform

Communications Framework
Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database)

Data Source API
API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks

R Language Interface
Familiar, high-productivity programming paradigm for R users

RevoScaleR PEMAs: Parallel External Memory Algorithms

[Diagram: an XDF File divided into blocks (Block 1, Block 2, ..., Block i, Block i + 1, Block i + 2). Read blocks and compute intermediate results in parallel, iterating as necessary (1st pass, 2nd pass, 3rd pass). Per-block results (Block 1 results, ..., Block i results, Block i+1 results, Block i+2 results) are carried through to the results from the last block.]

R based algorithms
Work on blocks of data
Inherently parallel and distributed
Do not require all data to be in memory at one time
Can deal with distributed and streaming data
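The block-at-a-time idea can be illustrated in plain R. The sketch below is not RevoScaleR's implementation; it is a minimal example, on simulated data, of fitting a linear regression by accumulating the sufficient statistics X'X and X'y one block at a time, so that only a single block needs to be in memory. The data, block size, and variable names are all assumptions made for the illustration.

```r
# Simulated data standing in for a file too large to load at once.
set.seed(1)
n <- 10000
big <- data.frame(x1 = rnorm(n), x2 = runif(n))
big$y <- 2 + 3 * big$x1 - 1.5 * big$x2 + rnorm(n, sd = 0.5)

block_size <- 1000
starts <- seq(1, n, by = block_size)

# Accumulators for the sufficient statistics X'X and X'y.
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)

for (s in starts) {
  rows  <- s:min(s + block_size - 1, n)
  block <- big[rows, ]                 # in practice: read the next block from disk
  X <- cbind(1, block$x1, block$x2)    # design matrix for this block only
  xtx <- xtx + crossprod(X)            # add this block's X'X
  xty <- xty + crossprod(X, block$y)   # add this block's X'y
}

# Combine: solve the normal equations once all blocks are processed.
beta_blocks <- drop(solve(xtx, xty))

# Reference fit using all the data in memory, for comparison.
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = big)))
print(rbind(beta_blocks, beta_lm))
```

Because each block's contribution is computed independently and then summed, the loop body could be run on different cores or machines. Linear regression needs only one pass over the data; iterative fits such as logistic regression repeat the pass until the estimates converge, which is where the multiple passes in the diagram come in.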

WHERE TO GO FROM HERE?
More than code, R is a community

Continuing to Learn R

Resources
RevoJoe: How to Learn R
More R Documentation
The R Journal
Books
Reference Card
and more

Classes
Coursera
Revolution Analytics

Examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)

Some Books

The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529