Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
TRANSCRIPT
Contrasting Offline and Online Results when Evaluating Recommendation Algorithms
Marco Rossetti, Trainline Ltd., London (previously University of Milano-Bicocca)
Fabio Stella, Department of Informatics, Systems and Communication, University of Milano-Bicocca
Markus Zanker, Faculty of Computer Science, Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with internal validity (results were not reproducible across different open-source frameworks).
• Diverging results from offline and online evaluations have also been identified, raising questions about external validity (e.g. Cremonesi et al. 2012, Beel et al. 2013, Garcin et al. 2014, Ekstrand et al. 2014, Maksai et al. 2015).
• Proposition:
  • Compare the performance of an offline experiment with an online evaluation.
  • Use a within-users experimental design, where we can test for differences in paired samples (a paired-test sketch follows below).
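The within-users design means every user scores every algorithm, so per-user differences can be tested directly with a paired test. A minimal sketch, assuming per-user precision scores and a Wilcoxon signed-rank test (the slides do not state which paired test was used; the data here are hypothetical):

```python
# Minimal sketch of a within-users (paired-samples) comparison.
# Each user contributes one precision score per algorithm, so the
# per-user differences can be tested directly. Data are hypothetical.
from scipy.stats import wilcoxon

# Per-user online precision for two algorithms (same users, same order).
prec_mf80 = [0.6, 0.4, 0.8, 0.6, 0.2, 0.6, 0.4, 0.8, 0.6, 0.4]
prec_pop  = [0.4, 0.4, 0.6, 0.2, 0.2, 0.4, 0.2, 0.6, 0.4, 0.2]

# Wilcoxon signed-rank test on the paired differences; it makes no
# normality assumption, which suits bounded precision scores.
stat, p_value = wilcoxon(prec_mf80, prec_pop)
print(f"W = {stat}, p = {p_value:.3f}")
```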
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy measurements predict the relative ranking according to an accuracy measurement in a user-centric evaluation? (See the rank-correlation sketch after this list.)
2. Does the relative ranking of algorithms based on offline measurements of predictive accuracy for long-tail items produce comparable results to a user-centric evaluation?
3. Do offline accuracy measurements allow us to predict the utility of recommendations in a user-centric evaluation?
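One way to make RQ1 concrete is to rank-correlate the offline and online scores of the four algorithms. A minimal sketch using Kendall's tau and the precision-on-all-items numbers reported later in this deck (the deck itself does not compute this statistic):

```python
# Rank-correlate offline and online algorithm scores (values taken
# from the results slides later in this deck).
from scipy.stats import kendalltau

algorithms = ["I2I", "MF80", "MF400", "POP"]
offline    = [0.438, 0.504, 0.454, 0.340]  # offline precision, all items
online     = [0.546, 0.598, 0.604, 0.516]  # online precision, all items

tau, p = kendalltau(offline, online)
print(f"Kendall tau = {tau:.2f} (p = {p:.2f})")
```

On these numbers tau comes out positive but imperfect: the MF80/MF400 order flips between offline and online, which is exactly the kind of discrepancy the research questions probe.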
Study Design
• Phase 1: collected likes on ML movies from 241 users
  • On average 137 ratings per user
• Phase 2: the same users evaluated 4 algorithms, 5 recommendations each
  • On average 17.4 + 2 recommendations
  • 122 users returned, 100 after cleaning
Offline and Online Evaluations
[Diagram: the ML1M dataset and the collected likes are used to train the algorithms; the offline evaluation uses all-but-1 validation, while the online evaluation uses the users' answers.]

Algorithms compared:
• POP: Popularity
• MF80: Matrix Factorization with 80 factors
• MF400: Matrix Factorization with 400 factors
• I2I: Item-to-Item K-Nearest Neighbors

Metrics:
• precision on all items (offline and online)
• precision on long-tail items (offline and online)
• useful recommendations (online only)

(A precision@5 sketch follows below.)
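For context, a minimal sketch of how an offline precision@5 score can be computed against held-out liked items; the study's exact all-but-1 scoring may differ, and all names and data here are hypothetical:

```python
# Minimal sketch of precision@k scored against held-out liked items.
# This is a generic illustration; the study's exact all-but-1
# protocol may score lists differently. All names are hypothetical.
def precision_at_k(recommended, held_out_likes, k=5):
    """Fraction of the top-k recommendations the user is known to like."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in held_out_likes)
    return hits / k

recommended = [10, 42, 7, 99, 3]   # top-5 list from one algorithm
held_out    = {42, 3, 250}         # user's withheld liked movies
print(precision_at_k(recommended, held_out))  # 0.4
```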
Precision on All Items
[Diagrams: pairwise significance of the differences among MF400, MF80, POP, and I2I; offline comparisons at p = 0.05, online comparisons at p = 0.05 with one at p = 0.1.]

Precision on all items (offline vs. online):

Algorithm   Offline   Online
I2I         0.438     0.546
MF80        0.504     0.598
MF400       0.454     0.604
POP         0.340     0.516
Precision on Long Tail Items
[Diagram: pairwise significance of the differences among MF80, MF400, POP, and I2I, all at p = 0.05. Offline and online agree on the ranking for precision on long-tail items.]

Precision on long-tail items (offline vs. online):

Algorithm   Offline   Online
I2I         0.280     0.356
MF80        0.018     0.054
MF400       0.360     0.628
POP         0.000     0.000
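A sketch of how precision restricted to long-tail items can be computed, assuming the long tail is everything outside a most-popular head (a common convention; the slide does not state the cutoff used). It also shows why a pure popularity recommender scores 0.000 here: it never recommends tail items. Names and data are hypothetical:

```python
# Sketch of precision restricted to long-tail recommendations.
# The "long tail" is assumed to be all items outside the most
# popular head; the study's exact cutoff is not given on the slide.
from collections import Counter

def long_tail_precision(recommended, liked, item_popularity, head_size):
    head = {i for i, _ in item_popularity.most_common(head_size)}
    tail_recs = [i for i in recommended if i not in head]
    if not tail_recs:
        return 0.0  # e.g. POP recommends only head items -> 0.000
    return sum(1 for i in tail_recs if i in liked) / len(tail_recs)

popularity = Counter({1: 900, 2: 850, 3: 40, 4: 12, 5: 5})
print(long_tail_precision([1, 3, 4], liked={3}, item_popularity=popularity,
                          head_size=2))  # 0.5: one of two tail recs liked
```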
Useful Recommendations
[Diagram: pairwise significance of the differences among MF400, I2I, MF80, and POP, all at p = 0.05.]

Useful recommendations (online only):

Algorithm   Online
I2I         0.126
MF80        0.082
MF400       0.116
POP         0.026
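A sketch of the "useful recommendations" notion (useful means relevant and novel, as the concluding slide puts it), assuming per-item user feedback of the form (relevant, already known); the slides only report the resulting means:

```python
# Sketch of "useful recommendations": a recommendation counts as
# useful when the user finds it relevant AND did not already know it.
# The per-user answer format here is an assumption.
def useful_fraction(answers):
    """answers: list of (relevant: bool, already_known: bool) per item."""
    useful = sum(1 for relevant, known in answers if relevant and not known)
    return useful / len(answers)

# One user's feedback on a 5-item list: 3 relevant, but 2 of those
# were movies the user already knew about.
answers = [(True, True), (True, False), (False, False),
           (True, True), (False, True)]
print(useful_fraction(answers))  # 0.2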
Conclusions
• Comparison of different algorithms online and offline based on a within-users experimental design.
• The algorithm performing best according to a traditional offline accuracy measurement was significantly worse when it comes to useful (i.e. relevant and novel) recommendations measured online.
• Academia and industry should keep investigating this topic in order to find the best possible way to validate offline evaluations.
Thank you!
Marco Rossetti, Trainline Ltd., London (@ross85)