Anthony Goldbloom
CEO, Kaggle
e-mail [email protected]
twitter @antgoldbloom
Predictive modeling competitions
Photo by mikebaird, www.flickr.com/photos/mikebaird
making data science a sport
1. Motivation 2. Why compete? 3. How it works 4. R on Kaggle 5. The Heritage Health Prize
Global competitions
1½ weeks 70.8%
Competition closes 77%
State of the art 70%
Predicting HIV viral load
Mismatch between those with data and those with the skills to analyse it
Crowdsourcing
Countless approaches. Hard to know which will work
Additional slides
Not MIT, not SAS … UoL?
Tourism Forecasting Competition
[Chart: forecast error (MASE) against the existing model at Aug 9, 2 weeks later, 1 month later, and competition end]
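The tourism chart reports forecast error as MASE (mean absolute scaled error): the mean absolute error of the forecasts divided by the in-sample mean absolute error of a naive forecast that repeats the value from `m` periods earlier. The sketch below shows that calculation only. The function name, argument names, and sample numbers are hypothetical, and it is not claimed to be the exact scoring code used in the competition.

```python
def mase(train, actual, forecast, m=1):
    """Mean absolute scaled error.

    train    -- historical series the forecaster was given
    actual   -- held-out true values
    forecast -- predicted values for the held-out period
    m        -- seasonal period for the naive benchmark (1 = non-seasonal)
    """
    if len(train) <= m:
        raise ValueError("train must be longer than the seasonal period m")
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equal length")

    # In-sample MAE of the naive forecast "same value as m periods ago".
    naive_mae = sum(abs(train[i] - train[i - m])
                    for i in range(m, len(train))) / (len(train) - m)
    if naive_mae == 0:
        raise ValueError("naive benchmark error is zero; MASE is undefined")

    # Out-of-sample MAE of the submitted forecast.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / naive_mae


# Made-up numbers for illustration only.
example = mase(train=[10, 12, 14, 16], actual=[18, 20], forecast=[17, 22])
print(example)  # 0.75
```

A value below 1 means the forecast beat the naive benchmark on average, which is why a falling line on a MASE chart indicates improving entries.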
Chess Ratings Competition
[Chart: error rate (RMSE) against the existing model (Elo) at Aug 4, 1 month later, 2 months later, and today]
Our User Base
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling
Users apply different techniques
1. Motivation 2. Why compete? 3. How it works 4. R on Kaggle 5. The Heritage Health Prize
• Clean, real-world data
• Professional reputation & experience
• Interactions with experts in related fields
• Prizes
Why Participants Compete
More fun than Sudoku
1. Motivation 2. Why compete? 3. How it works 4. R on Kaggle 5. The Heritage Health Prize
Competitions are judged based on predictive accuracy
Competition Mechanics
Competitions are judged on objective criteria
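To make "judged on predictive accuracy" concrete, here is a minimal sketch of scoring entries against a held-back solution and ranking them. It uses RMSE, the metric shown on the chess ratings chart, purely as an example. The function names, the dict-based data layout, and the sample entries are all hypothetical; this is not Kaggle's actual scoring code.

```python
import math


def rmse_score(solution, submission):
    """Root mean squared error between a submission and the held-back solution.

    Both arguments are dicts mapping a row id to a numeric value.
    """
    missing = set(solution) - set(submission)
    if missing:
        raise ValueError(f"submission is missing {len(missing)} row(s)")
    squared_errors = [(submission[row_id] - truth) ** 2
                      for row_id, truth in solution.items()]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def leaderboard(solution, submissions):
    """Rank entries by score; lower RMSE is better."""
    scores = {name: rmse_score(solution, sub) for name, sub in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up data for illustration only.
solution = {1: 0.0, 2: 1.0}
entries = {
    "team_b": {1: 1.0, 2: 0.0},
    "team_a": {1: 0.0, 2: 1.0},
}
ranking = leaderboard(solution, entries)
print(ranking)  # [('team_a', 0.0), ('team_b', 1.0)]
```

Because the score is computed mechanically from the held-back answers, every entry is ranked by the same criterion regardless of who submitted it or which technique produced it.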
1. Motivation 2. Why compete? 3. How it works 4. R on Kaggle 5. The Heritage Health Prize
R on Kaggle
R on Kaggle among academics
R on Kaggle among Americans
Number | Name | Winner | Packages
4 | HIV Progression Prediction | Chris Raimondi | Caret (RFE and RandomForest)
5 | Informs 2010 | Cole Harris | GLM, NNET
6 | Chess Rating | Yannis Sismanis |
7 | Tourism Forecasting Part 2 | Phil Brierley | Forecast
10 | R Package Recommendation | Max Lin | Stats, ROCR, GGPlot, GGPlot2
13 | Ford Stay Alert | Edward | Stats
Who Uses R and How
1. Motivation 2. Why compete? 3. How it works 4. R on Kaggle 5. The Heritage Health Prize
MembId AgeAtFirstClaim Sex
25872 19- Oct F
MembId DaysInHospital
25872 0
MembId | ProviderId | Vendor | PCP | YearSvc | Specialty | Place | PayDelay | LengthOfStay | DSFS | PrimaryConditionGroup | CharIndex | ClaimID
25872 | 171278567 | 7891165 | 294037 | Y1 | Internal | Office | 22 | | 0- 1 month | RESPR4 | 1- 2 | 1
25872 | 376108719 | 5024957 | 294037 | Y1 | Laboratory | Independent Lab | 23 | | 0- 1 month | MSC2a3 | 0 | 2
25872 | 171278567 | 7891165 | 294037 | Y1 | Internal | Office | 16 | | 1- 2 months | RESPR4 | 1- 2 | 3
25872 | 171278567 | 7891165 | 294037 | Y1 | Internal | Office | 19 | | 2- 3 months | RESPR4 | 1- 2 | 4
25872 | 171278567 | 7891165 | 294037 | Y1 | Internal | Office | 21 | | 3- 4 months | RESPR4 | 1- 2 | 5
25872 | 171278567 | 7891165 | 294037 | Y1 | Internal | Office | 21 | | 4- 5 months | RESPR4 | 1- 2 | 6
25872 | 376108719 | 5024957 | 294037 | Y1 | Laboratory | Independent Lab | 11 | | 7- 8 months | METAB3 | 1- 2 | 7
Mmm… how do I put this into R?
Some SQL Magic
Gives us a flat record
MembId DaysInHospital AgeAtFirstClaim Sex maxlos numclaims inhosp urgent
25872 0 19- Oct F 7 0 0
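The slides do not show the SQL itself. The sketch below is one way the three tables could be collapsed into a single row per member, using Python's built-in `sqlite3` so it runs without a database server. The table names (`members`, `days_in_hospital`, `claims`), the cut-down claims schema, the integer `LengthOfStay`, and the definitions of `maxlos`, `numclaims`, `inhosp`, and `urgent` (including the 'Inpatient Hospital' and 'Urgent Care' labels) are assumptions made for illustration, not the speaker's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (MembId INTEGER, AgeAtFirstClaim TEXT, Sex TEXT);
CREATE TABLE days_in_hospital (MembId INTEGER, DaysInHospital INTEGER);
-- Simplified claims table: only the columns this sketch needs.
CREATE TABLE claims (MembId INTEGER, Specialty TEXT, Place TEXT, LengthOfStay INTEGER);
""")

# Sample member copied from the slides (values as shown there).
conn.execute("INSERT INTO members VALUES (25872, '19- Oct', 'F')")
conn.execute("INSERT INTO days_in_hospital VALUES (25872, 0)")
claims = [
    (25872, "Internal", "Office", None),
    (25872, "Laboratory", "Independent Lab", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Laboratory", "Independent Lab", None),
]
conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", claims)

# One row per member: demographics, the target, and claim aggregates.
FLATTEN_SQL = """
SELECT m.MembId,
       d.DaysInHospital,
       m.AgeAtFirstClaim,
       m.Sex,
       MAX(c.LengthOfStay)                                AS maxlos,
       COUNT(c.MembId)                                    AS numclaims,
       COALESCE(SUM(c.Place = 'Inpatient Hospital'), 0)   AS inhosp,
       COALESCE(SUM(c.Place = 'Urgent Care'), 0)          AS urgent
FROM members m
JOIN days_in_hospital d ON d.MembId = m.MembId
LEFT JOIN claims c ON c.MembId = m.MembId
GROUP BY m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex
"""
flat = conn.execute(FLATTEN_SQL).fetchall()
print(flat)  # [(25872, 0, '19- Oct', 'F', None, 7, 0, 0)]
```

For member 25872 this gives 7 claims, no inpatient or urgent-care claims, and a NULL `maxlos` because `LengthOfStay` is empty on every claim row. The result can be exported to CSV and read into R as an ordinary data frame.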
Voila, an entry!
Photo by gidzy, www.flickr.com/photos/gidzy
What could the world’s best analysts find in your data?
e-mail [email protected]
+61438400053