1

Expedia Hotel Recommendation

Capstone Project – Spring 2016

Presented by: Gurpreet Dhillon

2

Introduction

Which hotel will the user choose?

3

Dataset

• Logs of customer behavior
• Source: Kaggle Competitions
• 37,670,293 observations of 24 attributes

4

Dataset

Column name | Description | Data type
date_time | Timestamp | string
site_name | ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...) | int
posa_continent | ID of continent associated with site_name | int
user_location_country | The ID of the country the customer is located | int
user_location_region | The ID of the region the customer is located | int
user_location_city | The ID of the city the customer is located | int
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double
user_id | ID of user | int
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int
channel | ID of a marketing channel | int
srch_ci | Checkin date | string
srch_co | Checkout date | string
srch_adults_cnt | The number of adults specified in the hotel room | int
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int
srch_rm_cnt | The number of hotel rooms specified in the search | int
srch_destination_id | ID of the destination where the hotel search was performed | int
srch_destination_type_id | Type of destination | int
hotel_continent | Hotel continent | int
hotel_country | Hotel country | int
hotel_market | Hotel market | int
is_booking | 1 if a booking, 0 if a click | tinyint
cnt | Number of similar events in the context of the same user session | bigint
hotel_cluster | ID of a hotel cluster | int

5

Approach

• Data Exploration and Cleaning
• Train and Test dataset creation
• Model creation
• Model selection

6

Data Exploration and Cleaning

• Use the data.table package in R
• Convert attributes to the correct data types
• Handle missing values
• Univariate Analysis: Find how the data is distributed
• Bivariate Analysis: Find correlations between variables and Hotel Clusters
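The cleaning steps above can be sketched in a few lines. The deck did this work in R with data.table; the version below is only an illustration in Python using the standard library. The function names, the three-row sample, and the choice to fill missing distances with the median are assumptions made for the example, since the slides do not say how missing values were handled.

```python
import csv
import io
from collections import Counter
from datetime import datetime
from statistics import median

# Tiny hypothetical sample using four of the dataset's column names.
SAMPLE_CSV = """date_time,orig_destination_distance,is_booking,hotel_cluster
2014-08-11 07:46:59,10.0,1,1
2014-08-11 08:22:12,,0,1
2014-08-09 18:05:16,30.0,0,21
"""


def clean_rows(csv_text):
    """Convert string fields to proper types and fill missing distances."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        distance = raw["orig_destination_distance"]
        rows.append({
            "date_time": datetime.strptime(raw["date_time"], "%Y-%m-%d %H:%M:%S"),
            # An empty field means the distance could not be calculated.
            "orig_destination_distance": float(distance) if distance else None,
            "is_booking": int(raw["is_booking"]),
            "hotel_cluster": int(raw["hotel_cluster"]),
        })
    # Assumption: impute missing distances with the median of the known ones.
    known = [r["orig_destination_distance"] for r in rows
             if r["orig_destination_distance"] is not None]
    fill = median(known) if known else None
    for r in rows:
        if r["orig_destination_distance"] is None:
            r["orig_destination_distance"] = fill
    return rows


def cluster_distribution(rows):
    """Univariate view: how many observations fall in each hotel cluster."""
    return Counter(r["hotel_cluster"] for r in rows)
```

Calling `cluster_distribution(clean_rows(SAMPLE_CSV))` gives the per-cluster counts that a univariate plot would be drawn from.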

7

Data Exploration

8

Hotel cluster distribution with hotel continent

9

Hotel Cluster correlation with Distance from Origin

10

Training and Testing Dataset creation

• Select 100 users with maximum number of observations as sample
• Verify that this is a representative sample
• Select 80% sample data for Training ML models
• Select 20% sample data for Testing ML models
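A minimal sketch of this sampling and splitting step, written in Python with the standard library rather than the R used in the project. It assumes each observation is a dict with a `user_id` key and that the 80/20 split is a random split over rows; the slides do not say whether the split was by row, by user, or by time, so treat that part as an assumption. The function names are hypothetical.

```python
import random
from collections import Counter


def top_users(rows, n_users):
    """Return the IDs of the n_users users with the most observations."""
    counts = Counter(r["user_id"] for r in rows)
    return {user for user, _ in counts.most_common(n_users)}


def sample_and_split(rows, n_users=100, train_fraction=0.8, seed=42):
    """Keep only the most active users, then split their rows into train/test."""
    keep = top_users(rows, n_users)
    sample = [r for r in rows if r["user_id"] in keep]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(sample)
    cut = int(len(sample) * train_fraction)
    return sample[:cut], sample[cut:]
```

Checking that the sample is representative could then be done by comparing, for example, the hotel cluster distribution of the sample against that of the full dataset.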

11

Model Creation

• Challenge: Large number of clusters
• Solution: H2O package in R
• H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”
• The R H2O package communicates with the H2O JVM over a REST API
• The data does not live in R: R only holds a pointer to it, an S4 object containing the IP address, port and key name of the data sitting in H2O

12

Model Creation

Random Forest

• Resample the data over and over
• For each sample train a new classifier
• Different classifiers overfit the data in a different way
• Average out differences through voting

Performance Metrics:
Accuracy 0.2066556
Logloss 3.379106
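The resample, train, and vote idea in the bullets above can be illustrated with a small bagging sketch in Python. This is not the H2O random forest used in the project: the base classifier here is a deliberately simple lookup table instead of a decision tree, there is no random feature selection, and all names are hypothetical. It only shows how bootstrap samples and majority voting fit together.

```python
import random
from collections import Counter


def train_lookup(sample):
    """Base classifier: map each feature value to its most common label."""
    by_feature = {}
    for feature, label in sample:
        by_feature.setdefault(feature, Counter())[label] += 1
    table = {f: c.most_common(1)[0][0] for f, c in by_feature.items()}
    # Fallback for feature values this bootstrap sample never saw.
    default = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda feature: table.get(feature, default)


def bagged_ensemble(data, n_classifiers=25, seed=0):
    """Train one classifier per bootstrap sample and predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_classifiers):
        # Bootstrap: draw len(data) rows with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_lookup(sample))

    def predict(feature):
        votes = Counter(model(feature) for model in models)
        return votes.most_common(1)[0][0]

    return predict
```

Each classifier sees a slightly different sample and so makes slightly different mistakes; the vote is what averages those differences out.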

13

Model Creation

GBM

• Boosting method which builds on weak classifiers
• Add a classifier at a time
• Next classifier is trained to improve the already trained ensemble

Performance Metrics:
Accuracy 0.2697774
Logloss 2.16233
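To make "the next classifier is trained to improve the already trained ensemble" concrete, here is a minimal gradient boosting sketch in Python. It is a simplification under stated assumptions: it solves a one-feature regression problem with squared error, where each new weak learner (a decision stump) is fitted to the residuals of the current ensemble. The project itself used H2O's GBM on a multi-class problem; the function names and the learning rate here are illustrative only.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    if best is None:  # only one distinct x value: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

On training data the error shrinks as rounds are added, which is the sense in which each stump improves the ensemble built so far.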

14

Model Creation

Deep Learning

15

Model Creation

Deep Learning

• Input Layer: Training observations are fed in here
• Hidden Layers: Intermediate layers which help the Neural Network learn the complicated relationships involved in the data
• Output Layer: Produces the final output from the values passed on by the preceding layers

Performance Metrics:
Accuracy 0.2837518
Log Loss 1.545127
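The three kinds of layers can be seen in a forward pass through a tiny feed-forward network. The sketch below is plain Python with untrained random weights, not the H2O deep learning model from the project; the layer sizes, the ReLU activation, and all function names are assumptions chosen for illustration. In the hotel setting, the output layer would have one probability per hotel cluster.

```python
import math
import random


def relu(values):
    """Hidden-layer activation: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs plus bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for a layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Input -> hidden layers (ReLU) -> output layer (softmax)."""
    hidden = x
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))
```

Training would adjust the weights so that the probability assigned to the correct cluster goes up, which is exactly what the log loss metric rewards.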

16

Results

Model/Metric | Random Forest | GBM | Deep Learning
Accuracy | 0.2066556 | 0.2697774 | 0.2837518
Kappa | 3.595714 | 3.616561 | 3.570305
R^2 | 0.9989192 | 0.9991958 | 0.9994364
Logloss | 3.379106 | 2.16233 | 1.545127

Accuracy: The fraction of instances that are correctly classified.
Kappa: Comparison of the overall accuracy to the expected random chance accuracy.
R^2: The proportion of the variance in the dependent variable (hotel_cluster) that is explained by the independent variables.
Log loss: Scores the predicted class probabilities, penalizing confident wrong classifications most heavily; lower is better.
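For reference, three of these metrics can be computed directly from their standard definitions. The sketch below is plain Python, not the H2O code that produced the numbers in the table; the function names are hypothetical, class labels are assumed to be integers, and `probs[i][k]` is assumed to be the predicted probability of class `k` for observation `i`.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement if labels and predictions were independent.
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)
```

A classifier that always assigns probability 0.5 to the true class has a log loss of ln 2, about 0.693, regardless of its accuracy, which is why log loss and accuracy can rank models differently.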

17

Conclusion

• Deep Learning model: best fit for predicting hotel clusters (highest accuracy and lowest log loss of the three models)

• On its home page, Expedia can show the hotels a user is most likely to book

• Hotel recommendations can also be included in emails sent to customers

18

Conclusion

Satisfied and Loyal customers!!

19

Thank you!!
