Using R to win Kaggle Data Mining Competitions Chris Raimondi November 1, 2012


Page 1: Using R to win Kaggle  Data Mining Competitions

Using R to win Kaggle Data Mining Competitions

Chris Raimondi
November 1, 2012

Page 2: Using R to win Kaggle  Data Mining Competitions

Overview of talk
• What I hope you get out of this talk
• Life before R
• Simple model example
• R programming language
• Background/Stats/Info
• How to get started
• Kaggle

Page 3: Using R to win Kaggle  Data Mining Competitions

Overview of talk
• Individual Kaggle competitions
• HIV Progression
• Chess
• Mapping Dark Matter
• dunnhumby's Shopper Challenge
• Online Product Sales

Page 4: Using R to win Kaggle  Data Mining Competitions

What I want you to leave with
• Belief that you don’t need to be a statistician to use R, nor do you need to fully understand machine learning in order to use it
• Motivation to use Kaggle competitions to learn R
• Knowledge of how to start

Page 5: Using R to win Kaggle  Data Mining Competitions

My life before R
• Lots of Excel
• Had tried programming in the past – got frustrated
• Read NY Times article in January 2009 about R & Google
• Installed R, but gave up after a couple minutes
• Months later…

Page 6: Using R to win Kaggle  Data Mining Competitions

My life before R
• Using Excel to run PageRank calculations that took hours and were very messy
• Was experimenting with Pajek – a Windows-based network/link analysis program
• Was looking for a similar program that did PageRank calculations
• Revisited R as a possibility

Page 7: Using R to win Kaggle  Data Mining Competitions

My life before R
• Came across “R Graph Gallery”
• Saw this graph…

Page 8: Using R to win Kaggle  Data Mining Competitions
Page 9: Using R to win Kaggle  Data Mining Competitions

Addicted to R in one line of code

pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21, bg=c("red", "green3", "blue")[unclass(iris$Species)])

“pairs” = function
“iris” = data frame

Page 10: Using R to win Kaggle  Data Mining Competitions

What do we want to do with R?

• Machine learning
a.k.a. – or more specifically
• Making models

We want to TRAIN a model on a set of data with KNOWN answers/outcomes, in order to PREDICT the answer/outcome for similar data where the answer is not known.

Page 11: Using R to win Kaggle  Data Mining Competitions
Page 12: Using R to win Kaggle  Data Mining Competitions

How to train a model

R allows for the training of models using probably over 100 different machine learning methods.

To train a model you need to provide:
1. Name of the function – which machine learning method
2. Name of the dataset
3. What your response variable is and what features you are going to use
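As a minimal sketch of those three ingredients, the example below uses lm (linear regression, which ships with base R) on the built-in iris data. The choice of lm and of these particular columns is only an illustration, not something taken from the talk.

```r
# 1. Function: lm (linear regression)
# 2. Dataset:  iris (built into R)
# 3. Formula:  response on the left of "~", features on the right
fit <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)

coef(fit)                            # fitted intercept and two slopes
head(predict(fit, newdata = iris))   # predictions for the first few rows
```

Most modelling functions in R follow the same pattern, so swapping the method often means changing little more than the function name.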

Page 13: Using R to win Kaggle  Data Mining Competitions

Example machine learning methods available in R

Bagging                       Partial Least Squares
Boosted Trees                 Principal Component Regression
Elastic Net                   Projection Pursuit Regression
Gaussian Processes            Quadratic Discriminant Analysis
Generalized additive model    Random Forests
Generalized linear model      Recursive Partitioning
K Nearest Neighbor            Rule-Based Models
Linear Regression             Self-Organizing Maps
Nearest Shrunken Centroids    Sparse Linear Discriminant Analysis
Neural Networks               Support Vector Machines

Page 14: Using R to win Kaggle  Data Mining Competitions

Code used to train decision tree

library(party)
irisct <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

Or use “.” to mean everything else - as in…

irisct <- ctree(Species ~ ., data = iris)

Page 15: Using R to win Kaggle  Data Mining Competitions

That’s it

You’ve trained your model – to make predictions with it – use the “predict” function – like so:

my.prediction <- predict(irisct, iris2)

To see a graphic representation of it – use “plot”.

plot(irisct)

plot(irisct, tp_args = list(fill = c("red", "green3", "blue")))
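The slides never define iris2. One way to make the predict step concrete, sketched here as an assumption rather than the speaker's setup, is to hold out some rows of iris, train on the rest, and score the held-out rows. The names test.rows and iris.train are made up for this example.

```r
library(party)

set.seed(123)
test.rows  <- sample(nrow(iris), 30)   # 30 rows held out at random
iris.train <- iris[-test.rows, ]
iris2      <- iris[test.rows, ]        # stands in for the slide's undefined iris2

irisct <- ctree(Species ~ ., data = iris.train)
my.prediction <- predict(irisct, newdata = iris2)

# Compare predictions with the known answers for the held-out rows
accuracy <- mean(my.prediction == iris2$Species)
table(predicted = my.prediction, actual = iris2$Species)
```

In a Kaggle setting the held-out answers are not available, so a split like this is how you estimate accuracy before submitting.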

Page 16: Using R to win Kaggle  Data Mining Competitions
Page 17: Using R to win Kaggle  Data Mining Competitions
Page 18: Using R to win Kaggle  Data Mining Competitions

R background
• Statistical Programming Language
• Since 1996
• Powerful – used by companies like Google, Allstate, and Pfizer.
• Over 4,000 packages available on CRAN
• Free
• Available for Linux, Mac, and Windows

Page 19: Using R to win Kaggle  Data Mining Competitions

Learn R – Starting Tonight

• Buy “R in a Nutshell”
• Download and install R
• Download and install RStudio
• Watch 2.5 minute video on front page of rstudio.com
• Use read.csv to read a Kaggle data set into R
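A minimal sketch of the read.csv step. Since no real competition file is available here, the example first writes a small CSV to a temporary folder. The file name train.csv is hypothetical, and with a real download you would pass its path directly.

```r
# Stand-in for a downloaded Kaggle file (hypothetical name "train.csv")
csv.path <- file.path(tempdir(), "train.csv")
write.csv(iris, csv.path, row.names = FALSE)

# Read it into a data frame; text columns become factors
train <- read.csv(csv.path, stringsAsFactors = TRUE)

str(train)       # column names and types
summary(train)   # quick look at each column
```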

Page 20: Using R to win Kaggle  Data Mining Competitions

Learn R – Continue Tomorrow

• Train a model using Kaggle data
• Make a prediction using that model
• Submit the prediction to Kaggle
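The three steps can be sketched end to end as below. Everything specific is an assumption for illustration: the built-in mtcars data stands in for a competition data set, lm stands in for whatever method you choose, and the id/prediction column names and submission.csv file name are hypothetical (each competition specifies its own submission format).

```r
set.seed(1)
test.rows <- sample(nrow(mtcars), 8)
train <- mtcars[-test.rows, ]   # rows with known answers
test  <- mtcars[test.rows, ]    # rows we pretend have no answer

# Train a model
fit <- lm(mpg ~ wt + hp, data = train)

# Make a prediction
submission <- data.frame(id = rownames(test),
                         prediction = predict(fit, newdata = test))

# Write the file that would be uploaded
out.path <- file.path(tempdir(), "submission.csv")
write.csv(submission, out.path, row.names = FALSE)
```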

Page 21: Using R to win Kaggle  Data Mining Competitions

Learn R – This Weekend
• Install the Caret package
• Start reading the four Caret vignettes
• Use the “train” function in Caret to train a model, select a parameter, and make a prediction with this model

Page 22: Using R to win Kaggle  Data Mining Competitions

Buy This Book: R in a Nutshell

• Excellent Reference
• 2nd Edition released just two weeks ago
• In stock at Amazon for $37.05
• Extensive chapter on machine learning

Page 23: Using R to win Kaggle  Data Mining Competitions

R Studio

Page 24: Using R to win Kaggle  Data Mining Competitions
Page 25: Using R to win Kaggle  Data Mining Competitions

R Tip

Read the vignettes – some of them are golden.

There is a correlation between the quality of an R package and its associated vignette.

Page 26: Using R to win Kaggle  Data Mining Competitions
Page 27: Using R to win Kaggle  Data Mining Competitions

What is Kaggle?
• Platform/website for predictive modeling competitions
• Think middleman – they provide the tools for anyone to host a data mining competition
• Makes it easy for competitors as well – they know where to go to find the data/competitions
• Community/forum to find teammates

Page 28: Using R to win Kaggle  Data Mining Competitions

Kaggle Stats
• Competitions started over 2 years ago
• 55+ different competitions
• Over 60,000 Competitors
• 165,000+ Entries
• Over $500,000 in prizes awarded

Page 29: Using R to win Kaggle  Data Mining Competitions

Why Use Kaggle?
• Rich Diverse Set of Competitions
• Real World Data
• Competition = Motivation
• Fame
• Fortune

Page 30: Using R to win Kaggle  Data Mining Competitions

Who has Hosted on Kaggle?

Page 31: Using R to win Kaggle  Data Mining Competitions

Methods used by competitors

Source: kaggle.com

Page 32: Using R to win Kaggle  Data Mining Competitions

Predict HIV Progression

Prizes: 1st $500.00

Objective: Predict (yes/no) if there will be an improvement in a patient's HIV viral load.

Training Data: 1,000 Patients

Testing Data: 692 Patients

Page 33: Using R to win Kaggle  Data Mining Competitions

Answer     Various Features
Response   PR Seq       RT Seq       VL-t0   CD4-t0
1          CCTCAGATCA   TACCTTAAAT   4.7     473
1          CACTCTAAAT   CTTAAATTTY   5.0     7
0          AAGAAATCTG   CCTCAGATCA   3.2     349
0          AAGAAATCTG   CTCTTTGGCA   5.1     51
0          AAGAAATCTG   GAGAGATCTG   3.7     77
0          CACTCTAAAT   CTTAAATTTY   5.7     206
0          AAGAAATCTG   TCTAAATTTC   3.9     144
0          CACTTTAAAT   TCTAAACTTT   4.4     496
0          AAGAAATCTG   CTCTTTGGCA   3.4     252
1          TGGAAGAAAT   CTCTTTGGCA   5.5     7
1          TTCGTCACAA   CTCTTTGGCA   4.3     109
0          AAGAGATCTG   CTCTTTGGCA   5.0     70
0          ACTAAATTTT   CTCTTTGGCA   5.0     570
0          CCTCAAATCA   CTCTTTGGCA   4.0     217
1          CCTCAGATCA   TCTAAATTTC   2.8     730
0          ATTAAATTTT   CTCTTTGGCA   4.5     56
0          ATTAAATTTT   TACTTTAAAT   5.1     21
1          CCTCAGATCA   CTCTTTGGCA   5.5     249
0          CCTCAAATCA   CTTAAATTTT   4.0     269
1          AAGGAATCTG   CCTCAGATCA   4.6     165
0          AAGAAATCTG   TCTAAATTTC   3.9     144
0          CACTTTAAAT   TCTAAACTTT   4.4     496
0          AAGAAATCTG   CTCTTTGGCA   3.4     252
1          TGGAAGAAAT   CTCTTTGGCA   5.5     91

Training Set: Response is known.
Test Set: Response is N/A (10 rows shown), split between the Public Leaderboard and the Private Leaderboard.

Page 34: Using R to win Kaggle  Data Mining Competitions

Predict HIV Progression

Page 35: Using R to win Kaggle  Data Mining Competitions

Predict HIV Progression

Features Provided:
1. PR: 297 letters long – or N/A
2. RT: 193 – 494 letters long
3. CD4: Numeric
4. VLt0: Numeric

Features Used:
1. PR1-PR97: Factor
2. RT1-RT435: Factor
3. CD4: Numeric
4. VLt0: Numeric
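The slides do not say exactly how the sequence strings became the PR1, PR2, … factor columns. One plausible approach, shown here only as an assumption, is to split each string into one column per position. The three strings are the 10-letter PR snippets from the table above, not full sequences.

```r
# 10-letter PR snippets copied from the slide (real sequences are far longer)
seqs <- c("CCTCAGATCA", "CACTCTAAAT", "AAGAAATCTG")

# One row per sequence, one column per letter position
letters.by.pos <- do.call(rbind, strsplit(seqs, ""))

pr <- as.data.frame(letters.by.pos, stringsAsFactors = TRUE)
names(pr) <- paste0("PR", seq_len(ncol(pr)))   # PR1, PR2, ..., PR10

str(pr)   # each position is now a factor a model can use
```

This only works directly when all strings have the same length; sequences of varying length (like RT here) would first need padding or alignment.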

Page 36: Using R to win Kaggle  Data Mining Competitions

Predict HIV Progression

Concepts / Packages:
• Caret
  • train
  • rfe
• randomForest

Page 37: Using R to win Kaggle  Data Mining Competitions

Random Forest

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
5.1           3.5          1.4           0.2
4.9           3            1.4           0.2
4.7           3.2          1.3           0.2
4.6           3.1          1.5           0.2
5             3.6          1.4           0.2
5.4           3.9          1.7           0.4
4.6           3.4          1.4           0.3
5             3.4          1.5           0.2
4.4           2.9          1.4           0.2
4.9           3.1          1.5           0.1
5.4           3.7          1.5           0.2
4.8           3.4          1.6           0.2
4.8           3            1.4           0.1
4.3           3            1.1           0.1
5.8           4            1.2           0.2
5.7           4.4          1.5           0.4
5.4           3.9          1.3           0.4
5.1           3.5          1.4           0.3
5.7           3.8          1.7           0.3
5.1           3.8          1.5           0.3

Tree 1:

Take a bootstrap sample of rows from the data set (drawn with replacement, so it contains roughly 63.2% of the distinct rows)

For each node – take mtry random features – in this case 2 would be the default

Tree 2:

Take a different bootstrap sample, again containing roughly 63.2% of the distinct rows

And so on…
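Where the 63.2% and the default of 2 come from can be checked numerically. When n rows are drawn with replacement from n rows, the expected share of distinct rows is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. For classification, randomForest's default mtry is floor(sqrt(p)) for p features. The simulation below is a base-R sketch and does not call randomForest itself.

```r
set.seed(2012)
n <- nrow(iris)   # 150 rows

# Share of distinct rows in a bootstrap sample, averaged over many draws
unique.frac <- mean(replicate(2000,
  length(unique(sample(n, n, replace = TRUE))) / n))

theory <- 1 - (1 - 1 / n)^n   # close to 1 - exp(-1) for large n

# Default number of features tried at each node for 4 predictors
mtry.default <- floor(sqrt(4))

c(simulated = unique.frac, theory = theory, limit = 1 - exp(-1))
```

The rows left out of each tree's sample (about 36.8%) are the "out-of-bag" rows that random forests use for a built-in error estimate.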

Page 38: Using R to win Kaggle  Data Mining Competitions

Caret – train

library(caret)

TrainData <- iris[,1:4]
TrainClasses <- iris[,5]

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,  # the printed results list 10 values of k (5 to 23)
                 trControl = trainControl(method = "cv", number = 10))

Page 39: Using R to win Kaggle  Data Mining Competitions

Caret – train

> knnFit1
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...

Resampling results across tuning parameters:

Page 40: Using R to win Kaggle  Data Mining Competitions

Caret – train

  k   Accuracy  Kappa  Accuracy SD  Kappa SD
  5   0.94      0.91   0.0663       0.0994
  7   0.967     0.95   0.0648       0.0972
  9   0.953     0.93   0.0632       0.0949
  11  0.953     0.93   0.0632       0.0949
  13  0.967     0.95   0.0648       0.0972
  15  0.967     0.95   0.0648       0.0972
  17  0.973     0.96   0.0644       0.0966
  19  0.96      0.94   0.0644       0.0966
  21  0.96      0.94   0.0644       0.0966
  23  0.947     0.92   0.0613       0.0919

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
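Once train has picked a value of k, the fitted object can be inspected and used for prediction. The sketch below refits the same kind of model with tuneLength = 10, so that ten values of k are tried as in the table above. The seed is arbitrary, so the selected k and the accuracies may differ from the slide.

```r
library(caret)

set.seed(1)
fit <- train(iris[, 1:4], iris[, 5],
             method = "knn",
             preProcess = c("center", "scale"),
             tuneLength = 10,
             trControl = trainControl(method = "cv", number = 10))

fit$bestTune   # the selected value of k
fit$results    # accuracy and kappa for every k that was tried

# Predict with the final model (here on the training rows, for illustration only)
pred <- predict(fit, newdata = iris[, 1:4])
table(predicted = pred, actual = iris$Species)
```

Predicting on the training rows overstates accuracy; the cross-validated figures in fit$results are the more honest estimate.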

Page 41: Using R to win Kaggle  Data Mining Competitions
Page 42: Using R to win Kaggle  Data Mining Competitions

Benefits of winning
• Cold hard cash
• Several newspaper articles
• Quoted in Science magazine
• Prestige
• Easier to find people willing to team up
• Asked to speak at STScI
• Perverse pleasure in telling people the team that came in second worked at….

Page 43: Using R to win Kaggle  Data Mining Competitions

IBM Thomas J. Watson Research Center

Page 44: Using R to win Kaggle  Data Mining Competitions

Chess Ratings Comp

Prizes: 1st $10,000.00

Objective: Given 100 months of data, predict game outcomes for months 101 – 105.

Training Data Provided:
1. Month
2. White Player #
3. Black Player #
4. White Outcome – Win/Draw/Lose (1/0.5/0)

Page 45: Using R to win Kaggle  Data Mining Competitions

How do I convert the data into a flat 2D representation?

Think:
1. What are you trying to predict?
2. What features will you use?

Page 46: Using R to win Kaggle  Data Mining Competitions

[Table: one row per game. Columns: Outcome | White Feature 1–4 | Black Feature 1–4 | White/Black 1–4 | Game Feature 1–2]

Outcome values shown: 1, 0.5, 1, 1, 0, 1, 0.5, 1, 0, 0

Example features (column labels): Percentage of Games Won; Number of Games Won as White; Number of Games Played; White Games Played/Black Games Played; Type of Game Played
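A rough sketch of how such a flat table could be built in base R. The game records are invented for illustration, and the two per-player features (games played and average score) are just examples of the kind listed on the slide, not the speaker's actual feature set.

```r
# Toy game history in the competition's format (values invented)
games <- data.frame(
  month   = c(1, 1, 2, 2, 3, 3),
  white   = c(1, 2, 3, 1, 2, 3),
  black   = c(2, 3, 1, 3, 1, 2),
  outcome = c(1, 0.5, 0, 1, 0.5, 1)   # white's result: win 1, draw 0.5, loss 0
)

# One row per player per game, with that player's own score
long <- rbind(
  data.frame(player = games$white, score = games$outcome),
  data.frame(player = games$black, score = 1 - games$outcome)
)

# Per-player features
n.games   <- tapply(long$score, long$player, length)
avg.score <- tapply(long$score, long$player, mean)
stats <- data.frame(player    = as.integer(names(n.games)),
                    n_games   = as.vector(n.games),
                    avg_score = as.vector(avg.score))

# Flat 2D table: one row per game, outcome plus white and black features
w <- match(games$white, stats$player)
b <- match(games$black, stats$player)
flat <- data.frame(
  outcome       = games$outcome,
  white_n_games = stats$n_games[w],
  white_avg     = stats$avg_score[w],
  black_n_games = stats$n_games[b],
  black_avg     = stats$avg_score[b]
)
flat$avg_diff <- flat$white_avg - flat$black_avg   # a combined white/black feature
```

In a real entry the features for a game should come only from earlier months; computing them from the same games being predicted, as this toy does, leaks the answer into the features.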

Page 47: Using R to win Kaggle  Data Mining Competitions

Packages/Concepts Used:

1. igraph
2. 1st real function

Page 48: Using R to win Kaggle  Data Mining Competitions
Page 49: Using R to win Kaggle  Data Mining Competitions

Mapping Dark Matter

Prizes: 1st ~$3,000.00

Objective: “Participants are provided with 100,000 galaxy and star pairs. A participant should provide an estimate for the ellipticity for each galaxy.”

The prize will be an expenses paid trip to the Jet Propulsion Laboratory (JPL) in Pasadena, California to attend the GREAT10 challenge workshop "Image Analysis for Cosmology".

Page 50: Using R to win Kaggle  Data Mining Competitions

dunnhumby's Shopper Challenge

Prizes:
1st $6,000.00
2nd $3,000.00
3rd $1,000.00

Objective:
• Predict the next date that the customer will make a purchase
AND
• Predict the amount of the purchase to within £10.00

Page 51: Using R to win Kaggle  Data Mining Competitions

Data Provided

For 100,000 customers: April 1, 2010 – June 19, 2011
1. customer_id
2. visit_date
3. visit_spend

For 10,000 customers: April 1, 2010 – March 31, 2011
4. customer_id
5. visit_date
6. visit_spend

Page 52: Using R to win Kaggle  Data Mining Competitions

Really two different challenges:

1) Predict next purchase date
Max of ~42.73% obtained

2) Predict purchase amount to within £10.00
Max of ~38.99% obtained

If independent: 42.73% * 38.99% = 16.66%

In reality – max obtained was 18.83%
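The arithmetic behind the comparison, as a small sketch. The variable names are made up. The reading that the gap suggests the two sub-problems are not independent is an interpretation, since the two maxima may have come from different entries.

```r
p.date   <- 0.4273   # best share of correct dates
p.amount <- 0.3899   # best share of amounts within 10 pounds

# If getting the date right told you nothing about the amount
p.both.independent <- p.date * p.amount
round(100 * p.both.independent, 2)

# Observed best combined score relative to the independence baseline
p.both.observed <- 0.1883
p.both.observed / p.both.independent
```

The observed score is about 13% higher than the independence baseline, which is consistent with customers whose dates are predictable also tending to have predictable spend.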

Page 53: Using R to win Kaggle  Data Mining Competitions

dunnhumby's Shopper Challenge

Packages Used & Concepts Explored:

1st competition with real dates
• zoo
• arima
• forecast

SVD
• svd
• irlba

Page 54: Using R to win Kaggle  Data Mining Competitions

SVD
Singular value decomposition

Page 55: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) = U x D x V^T. The original matrix has rows Row 1 … Row N and columns Col 1 … Col N. The columns of U hold the row features, ordered from 1st to Nth most important; D is an N x N diagonal matrix with entries 1, 2, 3, 4, …, N; the rows of V^T (N x N) hold the column features, ordered 1st to Nth.]

Page 56: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping only the 1st most important component]

x <- read.jpeg("test.image.2.jpg")
im <- imagematrix(x, type = "grey")

im.svd <- svd(im)

u <- im.svd$u
d <- diag(im.svd$d)
v <- im.svd$v

Page 57: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping only the 1st most important component]

new.u <- as.matrix(u[, 1:1])
new.d <- as.matrix(d[1:1, 1:1])
new.v <- as.matrix(v[, 1:1])

new.mat <- new.u %*% new.d %*% t(new.v)

new.im <- imagematrix(new.mat, type = "grey")
plot(new.im, useRaster = TRUE)
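The same idea generalizes from 1 component to k. The sketch below uses only base R and the built-in volcano matrix (87 x 61) in place of the slide's image, which is not available here. The helper name rank_k_approx is made up for this example.

```r
# Rebuild a matrix from its first k singular values and vectors
rank_k_approx <- function(m, k) {
  s <- svd(m)
  # drop = FALSE keeps u and v as matrices even when k = 1
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

m <- volcano   # 87 x 61 elevation matrix that ships with R

# Relative reconstruction error for a few values of k
ks  <- c(1, 2, 5, 20)
err <- sapply(ks, function(k) {
  norm(m - rank_k_approx(m, k), type = "F") / norm(m, type = "F")
})
data.frame(k = ks, relative_error = err)
```

The error shrinks as k grows, and with all components (k = 61 here) the original matrix is recovered up to rounding.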

Page 58: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping only the 1st most important component]

Page 59: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping the 1st and 2nd most important components]

Page 60: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping the 1st through 3rd most important components]

Page 61: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping the 1st through 4th most important components]

Page 62: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping the 1st through 5th most important components]

Page 63: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping the 1st through 6th most important components]

Page 64: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (807 x 1209) ~ U x D x V^T, keeping all components, 1st through 807th most important]

Page 65: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (100,000 x 365) = U x D x V^T. Rows of the original matrix are customers (Cust 1, Cust 2, …) and columns are days (Day 1 … Day N). U (100,000 x 365) holds the customer features, D is 365 x 365 with entries 1, 2, 3, 4, …, N, and V^T (365 x 365) holds the day features, each ordered from 1st to Nth most important.]

Page 66: Using R to win Kaggle  Data Mining Competitions

[Diagram: D, a 365 x 365 diagonal matrix with entries 1, 2, 3, 4, …, N, shown with the Original Matrix (100,000 x 365)]

Page 67: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix (100,000 x 365). U[,1] = 100,000 x 1 is the 1st most important customer feature; the 1st day feature from V^T is 365 x 1.]

Page 68: Using R to win Kaggle  Data Mining Competitions

V^T, 1st = 365 x 1 [first 28 shown]

Page 69: Using R to win Kaggle  Data Mining Competitions

V^T, 2nd = 365 x 1 [first 28 shown]

Page 70: Using R to win Kaggle  Data Mining Competitions

V^T, 3rd = 365 x 1 [first 28 shown]

Page 71: Using R to win Kaggle  Data Mining Competitions

V^T, 4th = 365 x 1 [first 28 shown]

Page 72: Using R to win Kaggle  Data Mining Competitions

V^T, 5th = 365 x 1 [first 28 shown]

Page 73: Using R to win Kaggle  Data Mining Competitions

V^T, 6th = 365 x 1 [first 28 shown]

Page 74: Using R to win Kaggle  Data Mining Competitions

V^T, 7th = 365 x 1 [first 28 shown]

Page 75: Using R to win Kaggle  Data Mining Competitions

V^T, 8th = 365 x 1 [all 365 shown]
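A sketch of how a customer x day matrix like the one above could be built and decomposed. The visit log is simulated (50 customers rather than 100,000), the 0/1 "visited that day" encoding is an assumption (the slides do not say what the cells hold), and base svd is used. For the full-size matrix, a truncated SVD such as irlba::irlba(m, nv = 3) would be the lighter option.

```r
set.seed(42)
start <- as.Date("2010-04-01")

# Simulated visit log with the competition's three columns
visits <- data.frame(
  customer_id = sample(1:50, 2000, replace = TRUE),
  visit_date  = start + sample(0:364, 2000, replace = TRUE),
  visit_spend = round(runif(2000, 5, 100), 2)
)

# Customer x day matrix: 1 if the customer visited on that day, else 0
day.index <- as.integer(visits$visit_date - start) + 1
m <- matrix(0, nrow = 50, ncol = 365)
m[cbind(visits$customer_id, day.index)] <- 1

# Keep only the first 3 singular vectors on each side
s <- svd(m, nu = 3, nv = 3)
customer.features <- s$u   # 50 x 3, one row per customer
day.features      <- s$v   # 365 x 3, one row per day

head(day.features[, 1], 28)   # first 28 entries of the 1st day feature
```

With real shopping data, plotting the first 28 entries of a day feature is a quick way to see weekly patterns, which is presumably why the slides show 28 values.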

Page 76: Using R to win Kaggle  Data Mining Competitions

Online Product Sales

Prizes:
1st $15,000.00
2nd $5,000.00
3rd $2,500.00

Objective: “[P]redict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign.”

Page 77: Using R to win Kaggle  Data Mining Competitions

Online Product Sales

Packages/Concepts Explored:

1. Data analysis – looking at data closely
2. gbm
3. Teams

Page 78: Using R to win Kaggle  Data Mining Competitions

Online Product Sales

Looking at data closely

... 6532 6532 6661 6661 7696 7701 7701 8229 8412 8895 9596 9596 9772 9772 ...

        Cat_1=0  Cat_1=1
6274    1        1
6532    1        1
6661    1        1
7696    0        1
7701    1        1
8229    1        0
8412    1        0
8895    1        0
9596    1        1
9772    1        1
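A count table like this takes one call to table in base R. The column name value is hypothetical (the slide does not name the feature), and the Cat_1 assignments are chosen to match the counts shown above for the values visible in the sorted list.

```r
# Sorted feature values from the slide, with a Cat_1 flag for each row
d <- data.frame(
  value = c(6532, 6532, 6661, 6661, 7696, 7701, 7701,
            8229, 8412, 8895, 9596, 9596, 9772, 9772),
  Cat_1 = c(0, 1, 0, 1, 1, 0, 1,
            0, 0, 0, 0, 1, 0, 1)
)

# How often each value occurs under Cat_1 = 0 and Cat_1 = 1
counts <- table(d$value, d$Cat_1, dnn = c("value", "Cat_1"))
counts

# Values that occur exactly once under each category
paired <- rownames(counts)[counts[, "0"] == 1 & counts[, "1"] == 1]
paired
```

The slide does not spell out the conclusion, but a table like this makes it easy to spot values that show up once in each category, the kind of pattern that looking at the data closely is meant to surface.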

Page 79: Using R to win Kaggle  Data Mining Competitions

Online Product Sales

On the public leaderboard:

Page 80: Using R to win Kaggle  Data Mining Competitions

Online Product Sales

On the private leaderboard:

Page 81: Using R to win Kaggle  Data Mining Competitions

Thank You!

Questions?

Page 82: Using R to win Kaggle  Data Mining Competitions

Extra Slides

Page 83: Using R to win Kaggle  Data Mining Competitions

R Code for Dunnhumby Time Series

Page 84: Using R to win Kaggle  Data Mining Competitions
Page 85: Using R to win Kaggle  Data Mining Competitions
Page 86: Using R to win Kaggle  Data Mining Competitions
Page 87: Using R to win Kaggle  Data Mining Competitions

[Diagram: Original Matrix = U x D x V^T, with dimension labels 4 X 150, 4 X 150, 4 X 4, and 4 X 4]

> my.svd <- svd(iris[,1:4])
> objects(my.svd)
[1] "d" "u" "v"
> my.svd$d
[1] 95.959914 17.761034  3.460931  1.884826
> dim(my.svd$u)
[1] 150   4
> dim(my.svd$v)
[1] 4 4
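As a quick check that the three pieces really are a decomposition of the data, the sketch below multiplies them back together. It uses only base R, and the variable names are made up for this example.

```r
x <- as.matrix(iris[, 1:4])   # 150 x 4
s <- svd(x)

# U (150 x 4) times D (4 x 4 diagonal) times V^T (4 x 4)
rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

max(abs(rebuilt - x))   # effectively zero: only rounding error remains

# Share of the total squared singular values carried by each component
round(s$d^2 / sum(s$d^2), 4)
```

The first component dominates, which is why keeping only a handful of components can still reproduce most of a matrix.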