Random Forests: R vs Python, by Linda Uruchurtu

Post on 26-Jan-2015


RANDOM FORESTS: R vs PYTHON, R & PYTHON

Having fun when starting out in data analysis

WHO

LINDA URUCHURTU @lindauruchurtu

Consultant at DBi Web Analytics & Data Consultancy

Physicist by training

OUTLINE OF THIS TALK

• Motivation

• Random Forests: R & Python

• Example: EMI music set

• Concluding remarks

MOTIVATION

STARTING OUT IN DATA ANALYSIS

• Online: blogs, GitHub, MOOCs, Kaggle, Data Tau, Cross Validated, Stackoverflow...

• Books

• School work

TOO MANY RESOURCES

WHICH LANGUAGE SHOULD I USE?

POPULAR QUESTION

LET’S ASK GOOGLE

• Programmed in C

• Used MATLAB at Uni

• Spent a long time playing with symbolic langs Mathematica & Maple

START BY WHAT YOU KNOW & ASK YOUR FRIENDS

MY EXPERIENCE

P.S. I had not met the iPython notebook.

BIG REVEAL: I AM AN AVID R USER

MY EXPERIENCE (cont)


• Don’t have a web dev background

• Surrounded by people doing Stats

• Pick the right tool for the task at hand

TL;DR - CAN BE CONFUSING FOR A NEWBIE

LANGUAGE WARS

Too many articles about:

• “Python Displacing R As The Programming Language For Data Analysis”

• “Is Python really supplanting R for data work?”

• “10 Reasons Python Rocks for Research”

• “Why Python is steadily eating other languages' lunch”

• “Why I’m betting on Julia”

• “What are the advantages of using Python over R?”

• “Why Python with Coffee is better than R with Ice Cream”

[FAVE LANG] is BETTER BECAUSE I SAY SO

LANGUAGE WARS

However, it is good to have a general understanding of the + and - of the various data analysis tools, in order to pick the right tool for the job.

• R has EVERYTHING you need for performing statistical analysis.

• R / MATLAB / Python are great for prototyping

• Python is a full featured programming language

• Easier to incorporate Python outcomes into a full data product workflow

DEFINE THE PROBLEM

Time better spent defining the problem and determining what is the best way to solve it

GOOD TO HAVE A BIG BAG OF TRICKS

Re-do R analysis using Python data analysis stack

WILL IT PYTHON? CREDIT: SLENDER MEANS

PYTHON SCIKIT LEARN

IT IS PRETTY AWESOME

• Library of Machine Learning Algorithms

• Open source

• API

• Python, Numpy & Co

• Accessible, many models, documentation & examples

EXAMPLE

CHOOSING A PROBLEM

Always a good idea to look for a data set that is interesting to you.

1 Choose a data set

2 Formulate a question

3 Formulate a hypothesis

4 Build Model to answer question and Test

SCIENTIFIC METHOD FTW

STEP 1: CHOOSING A DATA SET

EMI MUSIC “ONE MILLION INTERVIEW SET”

• One of the largest preference data sets in the world.

• Extract used in the Data Science London hackathon and available on KAGGLE as four separate data sets.

FOUR DATA SETS

• TRAIN / TEST - artist, track, userID, time & ratings

• WORDS - userID, heard_of, own_artist_music, like_artist, 82 adjectives

• USERS - userID, gender, age, working status, region, music, list_own (hours per day), list_back (hours per day), 19 user habits questions (0-100)

USERS

KEY STRING

1 “Music is important to me but not necessarily most important”

2 “I like music but it does not feature heavily in my life”

3 “Music means a lot to me and it is a passion of mine”

4 “Music has no particular interest to me”

5 “Music is important to me but not necessarily more important than other hobbies”

6 “Music is no longer as important as it used to be”

WORDS DATASET

UNINSPIRED, AGGRESSIVE, UNATTRACTIVE, BORING, CHEAP, IRRELEVANT, WAY OUT, ANNOYING, CHEESY, UNORIGINAL, OUTDATED, UNAPPROACHABLE...

82 ADJECTIVES

WHOLESOME

LEGENDARY

OLD

PIONEER DARK

WORDLY

NOSTALGIC

PROGRESSIVE

ICONIC

USERS

19 MUSIC HABIT QUESTIONS: Rate (0-100) whether user agrees with the statements:

“I enjoy actively searching for and discovering music that I have never heard before”

“I am not willing to pay for music”

“I like to be at the cutting edge of new music”

“I love tech”


STEP 2: FORMULATE A QUESTION

MOTIVATION

• PRODUCTION - Cheaper to produce (lower barriers to entry for budding artists).

• DISTRIBUTION - Internet has made music more accessible. Artists can decide where and how to sell.

• CONSUMPTION - People’s listening habits have changed due to the internet and to the change in devices.

TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.

PROBLEMS

• ARTISTS - Easier to produce music, harder to make themselves known or earn a living.

• RECORD COMPANIES - People buy per song, easy for listener to consume without paying. Wider competition field.

• LISTENERS - Too many choices. Discovery is difficult.

QUESTIONS

• Can one predict the rating of a song?

• What factors are important to determine how much a person likes a song?

• What is the minimal set of factors that are needed to determine how much a person likes a song?

STEP 3: FORMULATE A HYPOTHESIS

FIRST ATTEMPT

• Regression problem

• Turn categorical variables into numeric variables

• Consider ALL features and pick machine learning algorithm to do the job.

CAN ONE PREDICT THE RATING OF A SONG?

FIRST ATTEMPT

• Because exploratory analysis revealed that ratings are highly clustered, we can look at five different scores and formulate the problem as a classification one.

CAN ONE PREDICT THE RATING OF A SONG?

We split ratings 0-100 into 5 intervals, so each becomes a class, and we label these.
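The split into five classes can be sketched as follows. The slides do not give the interval boundaries, so equal-width bins of 20 points are an assumption here, and `rating_to_class` is a hypothetical helper name.

```python
def rating_to_class(rating, n_classes=5, lo=0, hi=100):
    """Map a rating in [lo, hi] to a class label 1..n_classes (equal-width bins, assumed)."""
    if not lo <= rating <= hi:
        raise ValueError("rating outside the expected range")
    width = (hi - lo) / n_classes
    # A rating exactly at the top edge (100) is assigned to the last class.
    return min(int((rating - lo) // width) + 1, n_classes)


ratings = [0, 19, 20, 50, 79.5, 100]
labels = [rating_to_class(r) for r in ratings]
```

If exploratory analysis shows the ratings cluster around particular values, bin edges placed between the clusters may be a better choice than equal widths.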

STEP 4: BUILD A MODEL

RANDOM FORESTS

RANDOM FORESTS

Highly versatile ensemble method - combines several models into one.

• Random Forests are built by aggregating trees.

• Can be used for regression & classification problems.

• They are resistant to overfitting as trees are added and can handle a large number of features

• They also output a list of features that are believed to be important in predicting the variable

A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)

RANDOM FORESTS

THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)

MOVIES

20 QUESTIONS

WILL JAMIE LIKE X?

BRIENNE IS THE DECISION TREE FOR JAMIE’S MOVIES PREFERENCES

RANDOM FORESTS

THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)

Ask Tywin, Cersei, Tyrion... Jamie gives each of them slightly different info.

THEY FORM A BAGGED FOREST OF JAMIE’S MOVIES PREFERENCES

Jamie demands getting different questions every time.

THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIES PREFERENCES

RANDOM FORESTS

• A tree of maximal depth is grown on a bootstrap sample of size n of the training set. There is no pruning.

• A number m << p is specified such that at each node, m variables are sampled at random out of p. The best split on these variables is used to split the node into two subnodes.

• Final classification is given by majority voting of the ensemble of trees in the forest.

• Only two “free” parameters: number of trees and number of variables in the random subset at each node.
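The three random ingredients listed above (bootstrap rows, a random subset of variables per node, majority voting) can be sketched in plain Python. This is an illustration only: no tree is actually grown, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, sample_size, rng):
    """Draw sample_size row indices with replacement; also return the rows left out."""
    in_bag = [rng.randrange(n_rows) for _ in range(sample_size)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(p, m, rng):
    """Pick m of the p variable indices at random (no repeats) for one node split."""
    return rng.sample(range(p), m)


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into the forest's final class."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(n_rows=10, sample_size=10, rng=rng)
features = candidate_features(p=9, m=3, rng=rng)
winner = majority_vote([3, 1, 3, 2, 3])
```

A real implementation repeats the variable sampling at every node of every tree and searches the sampled variables for the best split.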

RANDOM FORESTS

OUT-OF-BAG (OOB) ERROR

The observations left out of the bootstrap sample used to build a tree become a test set for that tree. The OOB error estimate is given by the misclassification error (MSE for regression), averaged over all samples.
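To get a feel for how much data ends up out-of-bag, the short simulation below draws bootstrap samples and measures the share of rows never selected. It assumes the bootstrap sample is as large as the training set; with a smaller sample size, as in the runs reported later, the share would be larger.

```python
import math
import random


def oob_fraction(n_rows, rng):
    """Fraction of rows never drawn in one bootstrap sample of size n_rows."""
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(drawn) / n_rows


rng = random.Random(42)
simulated = sum(oob_fraction(2000, rng) for _ in range(20)) / 20
expected = math.exp(-1)  # limit of (1 - 1/n)^n for large n, about 0.368
```

Roughly a third of the rows are therefore available to test each tree without setting aside a separate validation set.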

VARIABLE IMPORTANCE

Determined by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged.
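A minimal sketch of the permutation idea, assuming a single already-fitted model and one evaluation set rather than per-tree OOB data. `toy_model` is a hypothetical stand-in for a forest and deliberately uses only the first variable.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model predictions on rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, rng):
    """Increase in MSE after shuffling one column while the others stay fixed."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline


def toy_model(row):
    # Hypothetical fitted model: the prediction depends on variable 0 only.
    return 2.0 * row[0]


rng = random.Random(1)
rows = [[float(i), rng.random()] for i in range(50)]
targets = [2.0 * r[0] for r in rows]
importances = [permutation_importance(toy_model, rows, targets, c, rng) for c in (0, 1)]
```

Shuffling the variable the model relies on raises the error, while shuffling the ignored one leaves it unchanged, which is the signal the importance measure reports.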

RANDOM FORESTS IN R & PYTHON

randomForest PACKAGE

• Various implementations - randomForest, CARET, PARTY, BIGRF

• We follow the KISS procedure - KEEP IT SIMPLE S.

• One can test various values of mtry and the number of trees.

Used randomForest package 4.6-7 with R 2.15. Defaults are n=500 trees & mtry = p/3 for regression & sqrt(p) for classification.
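The default mtry rule quoted above can be written out as a small helper. The rounding down follows the randomForest defaults; `default_mtry` is a hypothetical name, and p = 114 is used only as an illustration, assuming every column of the data frame is a predictor.

```python
import math


def default_mtry(p, task):
    """Default number of variables tried per split for p predictors."""
    if task == "regression":
        return max(p // 3, 1)
    if task == "classification":
        return math.floor(math.sqrt(p))
    raise ValueError("task must be 'regression' or 'classification'")


mtry_regression = default_mtry(114, "regression")          # 38
mtry_classification = default_mtry(114, "classification")  # 10
```

Note that the scikit-learn regression run reported later sets max_features='sqrt', whereas the R run appears to use the default p/3 rule, so the two regressions may not be tuned identically.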

RANDOM FORESTS IN R & PYTHON

SCIKIT LEARN

Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.

COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4 GB 1067 MHz DDR3 running OS X 10.6.8

For the comparison we will build “small” forests and focus on the following simple metrics:

• Training Time

• RSQ (R-squared) & RMSE (Regression)

• Accuracy (Classification)
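For reference, the two regression metrics can be computed as below. The slides do not say how RSQ was calculated, so the usual definition 1 - SS_res / SS_tot is assumed; the numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [10, 30, 50, 70, 90]
y_pred = [20, 30, 40, 70, 100]
error = rmse(y_true, y_pred)
rsq = r_squared(y_true, y_pred)
```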

RANDOM FORESTS IN R

RESULTS REGRESSION

Split data into training and test sets. Each dataframe has 82,714 rows and 114 columns.

Parameters: 60 trees, sample of 50,000.

Training time: 39.39 min

RMSE: 14.587

RSQ: 0.581

rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)

RANDOM FORESTS IN PYTHON

RESULTS REGRESSION

Split data into training and test sets. Each dataframe has 82,714 rows and 114 columns.

Parameters: 60 trees, sample of 50,000.

Training time: 3 min 7 sec

RMSE: 14.687

RSQ: 0.575

rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')

RANDOM FORESTS IN R & PYTHON

R

PYTHON / SCIKIT LEARN

RANDOM FORESTS IN R

FEATURE IMPORTANCE

FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Beautiful | Talented
Boring | Like Artist
Q16 | Catchy
Catchy | Beautiful
Talented | Boring
Q9 | Track
Q19 | Distinctive
None of these | Cool
Age | Q11
Track | Q12

Q16 - I would be willing to pay for the opp to buy new music pre-release

Q9 - I am out of touch with new music

Q19 - I like to know about music before other people

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Like artist - To what extent do you like or dislike listening to this artist?

RANDOM FORESTS IN R

FEATURE IMPORTANCE

RANDOM FORESTS IN PYTHON

FEATURE IMPORTANCE

FEATURE | IMPORTANCE IN R RANDOM FOREST
Distinctive | 7
Catchy | 3
Like Artist | 2
Fun | -
Talented | 1
Beautiful | 4
Original | -
Unoriginal | -
Q11 | 9
Own Artist Music | -

Own Artist Music - Do you have this artist in your music collection?

Q11 - Pop music is fun

RANDOM FORESTS IN R & PYTHON

RESULTS REGRESSION

Model | RMSE
R Random Forest | 14.587
Python Scikit Learn Random Forest | 14.687
Linear Regression | 16.23
Multiple Linear Regs | 15.53

RANDOM FORESTS IN R

RESULTS CLASSIFICATION

Training time: 8.75 min

OOB error rate: 44.01%

Accuracy: 0.567

ratings_train <- as.factor(ratings_train)

rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)

1 2 3 4 5

1 16777 4863 1633 139 37

2 5760 12411 6213 504 89

3 1485 5559 13144 1880 329

4 176 888 4094 2592 625

5 59 204 1008 856 1388

RANDOM FORESTS IN PYTHON

RESULTS CLASSIFICATION

Training time: 2.56 min

OOB Score: 0.1964

Accuracy: 0.566

rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True)

1 2 3 4 5

1 16930 4682 1758 129 53

2 5517 12369 6475 506 106

3 1500 5367 13448 1737 275

4 186 791 4171 2598 561

5 48 161 999 880 1466

Precision: 0.564

Recall: 0.5653

F1 Score: 0.5611
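For readers on a recent scikit-learn, a comparable classifier can be set up roughly as follows. This is a sketch on synthetic data, not the EMI set: later scikit-learn releases dropped the compute_importances argument (feature_importances_ is always available after fitting), and n_jobs=-1 is an added choice to use all cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: 5 classes, 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# 60 trees as in the slides; oob_score=True keeps the out-of-bag estimate.
rf = RandomForestClassifier(n_estimators=60, oob_score=True,
                            n_jobs=-1, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_               # out-of-bag accuracy estimate
importances = rf.feature_importances_      # impurity-based importances, sum to 1
top_features = importances.argsort()[::-1][:5]  # indices of the 5 largest
```

In current releases oob_score_ is reported as an accuracy, so it is read as "higher is better", unlike the OOB error rate printed by R.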

RANDOM FORESTS IN R

FEATURE IMPORTANCE

FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Q9 | Track
Q7 | Q11
Q5 | Q12
Q6 | Age
Age | Q6
Q10 | Q17
listBACK | Q9
Q19 | Q16
listOWN | Q4
Q16 | Q13

Q16 - I would be willing to pay for the opp to buy new music pre-release

Q9 - I am out of touch with new music

Q19 - I like to know about music before other people

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Q7 - I enjoy music primarily from going out to dance

Q5 - I used to know where to find music

Q6 - I am not willing to pay for music

Q10 - My music collection is a source of pride

Q4 - I would like to buy new music but I don’t know what to buy

Q17 - I find seeing a new artist a useful way of discovering new music

RANDOM FORESTS IN PYTHON

FEATURE IMPORTANCE

FEATURE | IMPORTANCE IN R RANDOM FOREST
Q11 | 2
Q12 | 3
Age | 4
Q6 | 5
Q17 | 6
Q5 | -
Q4 | 9
Q10 | -
Q16 | 7
Q7 | -

Q16 - I would be willing to pay for the opp to buy new music pre-release

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Q5 - I used to know where to find music

Q6 - I am not willing to pay for music

Q10 - My music collection is a source of pride

Q4 - I would like to buy new music but I don’t know what to buy

Q17 - I find seeing a new artist a useful way of discovering new music

RANDOM FORESTS IN R

CONFUSION MATRIX

      1      2      3     4     5   CLASS ERROR
1 16777   4863   1633   139    37   28.45%
2  5760  12411   6213   504    89   50.31%
3  1485   5559  13144  1880   329   41.31%
4   176    888   4094  2592   625   69.09%
5    59    204   1008   856  1388   60.51%

RANDOM FORESTS IN PYTHON

CONFUSION MATRIX

      1      2      3     4     5   CLASS ERROR
1 16930   4682   1758   129    53   28.12%
2  5517  12369   6475   506   106   50.47%
3  1500   5367  13448  1737   275   39.77%
4   186    791   4171  2598   561   68.73%
5    48    161    999   880  1466   58.75%
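The accuracy and the per-class error column can be recomputed directly from the scikit-learn confusion matrix. The sketch assumes each row holds one true class, which is the reading under which the row-wise error rates match the percentages shown.

```python
confusion = [
    [16930, 4682, 1758, 129, 53],
    [5517, 12369, 6475, 506, 106],
    [1500, 5367, 13448, 1737, 275],
    [186, 791, 4171, 2598, 561],
    [48, 161, 999, 880, 1466],
]


def accuracy(matrix):
    """Share of all observations that sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def class_errors(matrix):
    """For each row, the share of observations that fall off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


acc = accuracy(confusion)
errors_pct = [round(100 * e, 2) for e in class_errors(confusion)]
```

The result reproduces the reported accuracy of 0.566 and shows that classes 4 and 5, the least populated ones, are misclassified most often.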

STEP 2: (Re)FORMULATE A HYPOTHESIS

FEATURE SELECTION

PRINCIPAL COMPONENT ANALYSIS - WORDS

Determine which features account for most of the variance.

FEATURE | PC1 | PC2
Distinctive | 0.20 | -0.059
Authentic | 0.19 | -0.046
Talented | 0.19 | -0.083
Credible | 0.19 | -0.084
Stylish | 0.18 | -0.094
Annoying | -0.06 | -0.065
Intrusive | -0.06 | -0.058
Irrelevant | -0.059 | -0.087
Uninspired | -0.056 | -0.092
Noisy | -0.053 | -0.13

FEATURE SELECTION

Make a simple model choosing meaningful variables

WORDS - Annoying, Depressing, Boring, Catchy, Talented, Distinctive, Beautiful, Superstar, Soulful and Popular.

QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.

RESULTS

• Running time in R ~ 15 min.

• RMSE = 14.791 / Public leader board 13.076

FULL MODEL

REDUCED MODEL

COMMENTS

Random Forests have been shown to be biased towards highly correlated variables. Using conditional inference trees ameliorates that bias (see the party PACKAGE in R).

SCIKIT learn’s implementation has an n_jobs parameter to parallelise training. For a similar feature in R, see the bigRF package.

CONCLUDING REMARKS

CONCLUDING REMARKS

PICK THE TOOL THAT IS BEST FOR THE JOB

We solved a problem using both R and PYTHON (via Scikit learn). Clearly, the constraints for addressing a given problem might differ and would dictate the implementation of choice.

WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS

Both R and PYTHON (via SCIKIT LEARN) implementations have added functions that allow the user to explore the resulting model and its performance.

CONCLUDING REMARKS

RANDOM FORESTS ARE GREAT

They give great accuracy, can handle many features, do not require cross validation and even estimate which variables are important.

KEEP AN EYE OUT FOR INTERESTING DATA

Having data that you are interested in leads to more interesting questions and reasons to explore new methods and add a new trick to your bag.

CONCLUDING REMARKS

EMI DATASET IS GREAT TO TEST RIDE

The set has a lot of behavioural information on a subject about which everyone has some intuition.

TO DO’s - WILL IT PYTHON?

Prediction using SVMs and other Matrix Factorisation techniques. Full factor analysis, etc.

THANKS!
