1
Expedia Hotel Recommendation
Capstone Project – Spring 2016
Presented by: Gurpreet Dhillon
2
Introduction
Which hotel will the user choose?
3
Dataset
• Logs of customer behavior
• Source: Kaggle Competitions
• 37,670,293 observations of 24 attributes
4
Dataset
Column name | Description | Data type
date_time | Timestamp | string
site_name | ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...) | int
posa_continent | ID of the continent associated with site_name | int
user_location_country | ID of the country where the customer is located | int
user_location_region | ID of the region where the customer is located | int
user_location_city | ID of the city where the customer is located | int
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double
user_id | ID of the user | int
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint
is_package | 1 if the click/booking was generated as part of a package (i.e. combined with a flight), 0 otherwise | int
channel | ID of a marketing channel | int
srch_ci | Check-in date | string
srch_co | Check-out date | string
srch_adults_cnt | Number of adults specified in the hotel room | int
srch_children_cnt | Number of (extra occupancy) children specified in the hotel room | int
srch_rm_cnt | Number of hotel rooms specified in the search | int
srch_destination_id | ID of the destination where the hotel search was performed | int
srch_destination_type_id | Type of destination | int
hotel_continent | Hotel continent | int
hotel_country | Hotel country | int
hotel_market | Hotel market | int
is_booking | 1 if a booking, 0 if a click | tinyint
cnt | Number of similar events in the context of the same user session | bigint
hotel_cluster | ID of a hotel cluster | int
5
Approach
• Data Exploration and Cleaning
• Train and Test dataset creation
• Model creation
• Model selection
6
Data Exploration and Cleaning
• Use the data.table package in R
• Convert attributes to the correct data types
• Handle missing values
• Univariate Analysis: Find how the data is distributed
• Bivariate Analysis: Find correlations between variables and Hotel Clusters
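The type conversion and missing-value steps might look roughly like the sketch below. The slides did this with data.table in R; this version uses only the Python standard library, and the sample rows, the `clean_row` helper, and the choice to keep a missing distance as `None` are illustrative assumptions, not the presenter's code.

```python
import csv
import io
from datetime import datetime

# Two made-up rows using a few of the column names from the dataset slide.
SAMPLE_CSV = """date_time,user_id,is_booking,orig_destination_distance,hotel_cluster
2014-08-11 07:46:59,12,0,2234.2641,1
2014-08-11 08:22:12,12,1,,1
"""

def clean_row(row):
    """Convert string fields to typed values (hypothetical helper)."""
    distance = row["orig_destination_distance"]
    return {
        "date_time": datetime.strptime(row["date_time"], "%Y-%m-%d %H:%M:%S"),
        "user_id": int(row["user_id"]),
        "is_booking": row["is_booking"] == "1",
        # An empty field means the distance could not be calculated.
        "orig_destination_distance": float(distance) if distance else None,
        "hotel_cluster": int(row["hotel_cluster"]),
    }

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
missing = sum(r["orig_destination_distance"] is None for r in rows)
print(f"{len(rows)} rows, {missing} with missing distance")
```

Keeping the missing distance as an explicit `None` leaves the later choice (drop, impute, or treat as its own category) open.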
7
Data Exploration
8
Hotel cluster distribution with hotel continent
9
Hotel Cluster correlation with Distance from Origin
10
Training and Testing Dataset creation
• Select 100 users with maximum number of observations as sample
• Verify that this is a representative sample
• Select 80% sample data for Training ML models
• Select 20% sample data for Testing ML models
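A minimal sketch of the sampling and 80/20 split, in standard-library Python rather than the R used in the project. The slides do not say how rows were assigned to each side, so the seeded row-level shuffle here is an assumption, and `top_users` and `train_test_split` are hypothetical helper names.

```python
import random
from collections import Counter

def top_users(observations, n_users):
    """Return the ids of the n_users users with the most observations."""
    counts = Counter(obs["user_id"] for obs in observations)
    return {user for user, _ in counts.most_common(n_users)}

def train_test_split(observations, train_fraction=0.8, seed=42):
    """Shuffle rows and split them; a row-level random split is an assumption."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up log: user 1 has 6 rows, user 2 has 3 rows, user 3 has 1 row.
log = [{"user_id": u, "hotel_cluster": i % 5}
       for i, u in enumerate([1] * 6 + [2] * 3 + [3])]

keep = top_users(log, n_users=2)
sample = [obs for obs in log if obs["user_id"] in keep]
train, test = train_test_split(sample)
print(len(sample), len(train), len(test))
```

With the real data, `n_users` would be 100 and `log` would hold the cleaned observations.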
11
Model Creation
• Challenge: Large number of clusters
• Solution: H2O package in R
• H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”
• The R H2O package communicates with the H2O JVM over a REST API
• Data is not in R; R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O
12
Model Creation
Random Forest
• Resample the data over and over
• For each sample train a new classifier
• Different classifiers overfit the data in a different way
• Average out differences through voting
Performance Metrics:
Accuracy 0.2066556
Logloss 3.379106
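The resample, train, and vote idea can be sketched in a few lines of standard-library Python. This is plain bagging with a deliberately simple lookup classifier, not H2O's random forest (which grows decision trees and also samples features); the data pairs and the helper names are made up for illustration.

```python
import random
from collections import Counter

def train_lookup(sample):
    """Base classifier: most common cluster per destination in this sample."""
    by_destination = {}
    for destination, cluster in sample:
        by_destination.setdefault(destination, Counter())[cluster] += 1
    fallback = Counter(cluster for _, cluster in sample).most_common(1)[0][0]
    table = {d: c.most_common(1)[0][0] for d, c in by_destination.items()}
    return lambda destination: table.get(destination, fallback)

def train_bagged(data, n_models=25, seed=0):
    """Train one classifier per bootstrap resample of the data."""
    rng = random.Random(seed)
    return [train_lookup([rng.choice(data) for _ in data])
            for _ in range(n_models)]

def predict(models, destination):
    """Majority vote across the ensemble."""
    votes = Counter(model(destination) for model in models)
    return votes.most_common(1)[0][0]

# Made-up (srch_destination_id, hotel_cluster) pairs.
data = [(10, 3)] * 8 + [(10, 7)] * 2 + [(20, 5)] * 9 + [(20, 3)] * 1
models = train_bagged(data)
print(predict(models, 10), predict(models, 20))
```

Each bootstrap sample sees a slightly different mix of rows, so individual classifiers can disagree; the vote smooths those differences out.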
13
Model Creation
GBM
• Boosting method which builds on weak classifiers
• Add a classifier at a time
• Next classifier is trained to improve the already trained ensemble
Performance Metrics:
Accuracy 0.2697774
Logloss 2.16233
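To illustrate "add a classifier at a time", here is a small gradient-boosting sketch in standard-library Python. It is simplified to one numeric feature, regression with squared error, and one-split "stumps" as the weak learners; the project itself used H2O's GBM for multi-class prediction, and the data and the function names below are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to the current ensemble's residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def ensemble(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - ensemble(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return ensemble

def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training points with three rough levels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
print(mse(boost(xs, ys, n_rounds=1), xs, ys),
      mse(boost(xs, ys, n_rounds=20), xs, ys))
```

The training error falls as rounds are added, because every new stump is fitted to whatever the existing ensemble still gets wrong.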
14
Model Creation
Deep Learning
15
Model Creation
Deep Learning
• Input Layer: Training observations are fed in here
• Hidden Layers: Intermediate layers which help the Neural Network learn the complicated relationships in the data
• Output Layer: Produces the final prediction from the output of the preceding layers
Performance Metrics:
Accuracy 0.2837518
Log Loss 1.545127
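How an observation flows from the input layer through a hidden layer to the output layer can be sketched as a single forward pass. The weights below are made up and untrained, the layer sizes are arbitrary, and training (which H2O handles in the project) is not shown; this is only an illustration of the three layer roles.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs plus bias per unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]

def relu(values):
    """Keep positive activations, zero out negative ones."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_w, hidden_b, output_w, output_b):
    hidden = relu(dense(features, hidden_w, hidden_b))   # hidden layer
    return softmax(dense(hidden, output_w, output_b))    # output layer

# Made-up weights: 2 input features, 3 hidden units, 2 output classes.
hidden_w = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5], [-0.5, 1.0, 0.2]]
output_b = [0.0, 0.0]

probabilities = forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
print(probabilities)
```

For the hotel problem the output layer would have one unit per hotel cluster, and the predicted cluster is the one with the highest probability.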
16
Results
Model/Metric | Random Forest | GBM | Deep Learning
Accuracy | 0.2066556 | 0.2697774 | 0.2837518
Kappa | 3.595714 | 3.616561 | 3.570305
R^2 | 0.9989192 | 0.9991958 | 0.9994364
Logloss | 3.379106 | 2.16233 | 1.545127
Accuracy: The fraction of instances that are correctly classified.
Kappa: Comparison of the overall accuracy to the expected random chance accuracy.
R^2: The proportion of variance in the dependent variable (hotel_cluster) that is explained by the independent variables.
Log loss: Quantifies the accuracy of a classifier by penalizing false classifications, more heavily the more confident the wrong prediction.
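For reference, the textbook forms of three of these metrics can be computed as in the sketch below. It uses made-up labels and probabilities, not the project's data, and the way H2O computed the reported figures may differ from these definitions.

```python
import math
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def cohen_kappa(actual, predicted):
    """(observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    return (observed - expected) / (1 - expected)

def log_loss(actual, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(actual, probabilities):
        p = min(max(row.get(label, 0.0), eps), 1 - eps)
        total -= math.log(p)
    return total / len(actual)

# Made-up labels and predicted probabilities for three hotel clusters.
actual = [0, 1, 2, 1]
probabilities = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.1, 1: 0.6, 2: 0.3},
    {0: 0.3, 1: 0.5, 2: 0.2},
    {0: 0.2, 1: 0.5, 2: 0.3},
]
predicted = [max(row, key=row.get) for row in probabilities]
print(accuracy(actual, predicted), cohen_kappa(actual, predicted),
      log_loss(actual, probabilities))
```

Under this definition kappa is at most 1, and log loss is 0 only when every true class receives probability 1.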
17
Conclusion
• Deep Learning model: best fit for predicting hotel clusters
• On its home page, Expedia can show the hotels that a user is likely to book
• Hotel recommendations can also be included in emails sent to customers
18
Conclusion
Satisfied and Loyal customers!!
19
Thank you!!