Introduction to R Package Recommendation System Competition

R Recommendation System Contest

John Myles White

March 10, 2011

Kaggle

Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.

Kaggle Features

Kaggle provides every contest with:

• Centralized data downloads

• Public and private leaderboards using RMSE, AUC and other metrics

• Public discussion forums for participants to use

Recent Kaggle Contests

• Tourism Forecasting

• Chess Ratings: Elo versus the Rest of the World

• INFORMS 2010: Short Term Stock Price Movements

Current and Upcoming Kaggle Contests

• Arabic Writer Identification

• Don’t Overfit: Dealing with Many Variables and Few Observations

• Heritage Health Prize

Advice on Running Kaggle Contests

• Stay involved: respond to forum posts quickly and make the contest seem alive

• Don’t use a prediction task where near-perfect accuracy can be achieved

Mistakes We Made

• Netflix Prize: 0.8616 RMSE

• R Recommendation Contest: 0.9882 AUC

The R Recommendation System Contest

• Contestants must be able to predict whether a user U will have a package P installed on their system

Full Data Set

• Outcomes: List of all packages installed on 52 R users’ systems

• Predictors: Metadata about 2485 CRAN packages
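One way to picture how outcomes and predictors fit together is a long table with one row per (user, package) pair. The sketch below is only an illustration on toy data: the install lists, the package names, and the `DependencyCount` column are hypothetical, and the contest's actual preprocessing is not described on the slides.

```r
# Hypothetical per-user install lists (toy data, not the contest data).
installed <- list(
  user1 = c("ggplot2", "plyr"),
  user2 = c("plyr", "Matrix", "lme4")
)
packages <- c("ggplot2", "plyr", "Matrix", "lme4", "zoo")

# One row per (User, Package) pair.
pairs <- expand.grid(Package = packages, User = names(installed),
                     stringsAsFactors = FALSE)

# Outcome: 1 if the user has the package installed, 0 otherwise.
pairs$Installed <- as.integer(
  mapply(function(u, p) p %in% installed[[u]], pairs$User, pairs$Package)
)

# Hypothetical package-level metadata, joined onto every pair.
metadata <- data.frame(Package = packages,
                       DependencyCount = c(3, 1, 1, 4, 1))
full.data <- merge(pairs, metadata, by = "Package")

head(full.data)
```

Each row then carries the binary outcome `Installed` alongside the package's metadata, which is the shape a logistic regression expects.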

Metadata

• Dependencies

• Suggests

• Imports

• Views

• Core

• Recommended

• Maintainer

• Maintainer’s Package Count

Training Data / Test Data Split

• Uniform random split over rows in full data set

• Training Set: 99373 rows

• Test Set: 33125 rows
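The two row counts sum to 132498, and floor(0.75 * 132498) is 99373, so the split looks like roughly 75/25. A minimal sketch of such a uniform random split over rows follows. The 0.75 fraction is inferred from the counts rather than stated on the slide, and the data frame here is a toy stand-in.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Toy stand-in for the full data set.
full.data <- data.frame(Row = 1:1000,
                        Installed = rbinom(1000, 1, 0.1))

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = floor(0.75 * n))

training.data <- full.data[training.rows, ]
test.data <- full.data[-training.rows, ]

c(training = nrow(training.data), test = nrow(test.data))
```

Because the split is over rows rather than over users, the same user typically appears in both sets.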

Additional Metadata

• LDA topic assignments for CRAN packages

• Used 25 topics

• Used all documentation: manuals, vignettes, etc.

Example Models

1. Package Metadata

2. Package Metadata + Per User Intercepts

3. Package Metadata + Per User Intercepts + Package Topic Assignments

Example Model 1

library('ProjectTemplate')
try(load.project())

# Logistic regression on package metadata only.
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 2

# Package metadata plus one intercept per user via factor(User).
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))
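To see what `factor(User)` contributes, it can help to look at the design matrix R builds from such a formula. The sketch below uses a hypothetical three-user toy data frame. Only the column names `User` and `LogDependencyCount` come from the slides.

```r
# Toy data with three hypothetical users.
toy <- data.frame(User = c(1, 1, 2, 2, 3, 3),
                  LogDependencyCount = c(0.0, 0.7, 1.1, 0.0, 0.7, 1.4))

# factor(User) expands into one 0/1 indicator column per user beyond the
# first; the first user is the baseline absorbed into the intercept.
design <- model.matrix(~ LogDependencyCount + factor(User), data = toy)

colnames(design)
```

Each indicator's coefficient shifts the log-odds of installation for that user relative to the baseline user, which is what "per user intercepts" amounts to in this parameterization.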

Example Model 3

# Package metadata, per-user intercepts, and the package's LDA topic.
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User) +
                   Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))

Model Performance

• Model 1: ~0.80 AUC

• Model 2: ~0.95 AUC

• Model 3: > 0.95 AUC
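A comparison like this can be sketched by scoring held-out rows and computing AUC from ranks. Everything below is simulated under stated assumptions (20 hypothetical users, a single metadata predictor, arbitrary effect sizes), so it does not reproduce the contest figures. It only illustrates why per-user intercepts can raise AUC when users differ in how many packages they install.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Rank-based AUC (Mann-Whitney statistic); assumes labels are coded 0/1.
auc <- function(labels, scores) {
  r <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Simulated data: users differ in their baseline install rate.
n.users <- 20
n.packages <- 200
sim <- expand.grid(Package = 1:n.packages, User = 1:n.users)
user.effect <- rnorm(n.users, sd = 1.5)
package.feature <- rnorm(n.packages)
sim$LogDependencyCount <- package.feature[sim$Package]
sim$Installed <- rbinom(
  nrow(sim), 1,
  plogis(-1 + sim$LogDependencyCount + user.effect[sim$User])
)

# Uniform random split over rows.
train <- sample(nrow(sim), floor(0.75 * nrow(sim)))

fit1 <- glm(Installed ~ LogDependencyCount,
            data = sim[train, ], family = binomial(link = 'logit'))
fit2 <- glm(Installed ~ LogDependencyCount + factor(User),
            data = sim[train, ], family = binomial(link = 'logit'))

# Score the held-out rows with each model.
auc1 <- auc(sim$Installed[-train],
            predict(fit1, newdata = sim[-train, ], type = "response"))
auc2 <- auc(sim$Installed[-train],
            predict(fit2, newdata = sim[-train, ], type = "response"))

c(metadata.only = auc1, with.user.intercepts = auc2)
```

The rank-based formula is equivalent to the probability that a randomly chosen installed pair is scored above a randomly chosen non-installed pair, with ties counted as one half.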

Unexploited Structure in Data

Future Work

What makes a package useful?

• Need subjective ratings

• Some packages are only installed because they’re dependencies for other popular packages

Future Work

Get a better data sample:

• Contest only used data from 52 users

• But we do have complete data for those users

• But data was not a random sample of R users

Future Work

• Do more with LDA to categorize R packages

• Prediction task allows us to evaluate “quality” of topic count and topic assignments

Future Work

• Build up various package-package similarity matrices for conditional recommendations
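One possible package-package similarity is Jaccard similarity over co-installation. The slides do not say which similarity measures were intended, so this choice is an assumption, and the installation matrix below is hypothetical toy data.

```r
# Toy user-by-package installation matrix (1 = installed); hypothetical data.
installs <- matrix(c(1, 1, 0, 0,
                     1, 1, 1, 0,
                     0, 1, 1, 0,
                     0, 0, 1, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("user", 1:4),
                                   c("ggplot2", "plyr", "lme4", "Matrix")))

# Co-installation counts: entry (i, j) is the number of users with both
# packages; the diagonal holds each package's install count.
co.installs <- crossprod(installs)

# Jaccard similarity: users with both / users with at least one.
# (Assumes every package is installed by at least one user.)
counts <- diag(co.installs)
jaccard <- co.installs / (outer(counts, counts, "+") - co.installs)

# Conditional recommendation: rank other packages by similarity to one
# the user already has.
recommendations <- sort(jaccard["plyr", colnames(jaccard) != "plyr"],
                        decreasing = TRUE)
recommendations
```

Other matrices could be built the same way from the metadata, for example from shared dependencies or shared topic assignments, and compared on the prediction task.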

Future Work

• Can we understand the clustering in the network structure graph?

Resources

For more information, see

• The original Dataists’ contest announcement

• GitHub project page
