Expedia Hotel Recommendation: Capstone Project – Spring 2016. Presented by: Gurpreet Dhillon

Upload: gurpreet-dhillon

Post on 15-Apr-2017

Page 1: ExpediaHotelRecommendationPresentation

Expedia Hotel Recommendation

Capstone Project – Spring 2016

Presented by: Gurpreet Dhillon

Page 2: ExpediaHotelRecommendationPresentation

Introduction

Which hotel will the user choose?

Page 3: ExpediaHotelRecommendationPresentation

Dataset

• Logs of customer behavior
• Source: Kaggle Competitions
• 37,670,293 observations of 24 attributes

Page 4: ExpediaHotelRecommendationPresentation

Dataset

Column name | Description | Data type
date_time | Timestamp | string
site_name | ID of the Expedia point of sale (e.g. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...) | int
posa_continent | ID of the continent associated with site_name | int
user_location_country | ID of the country where the customer is located | int
user_location_region | ID of the region where the customer is located | int
user_location_city | ID of the city where the customer is located | int
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double
user_id | ID of the user | int
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint
is_package | 1 if the click/booking was generated as part of a package (i.e. combined with a flight), 0 otherwise | int
channel | ID of a marketing channel | int
srch_ci | Check-in date | string
srch_co | Check-out date | string
srch_adults_cnt | Number of adults specified in the hotel room | int
srch_children_cnt | Number of (extra occupancy) children specified in the hotel room | int
srch_rm_cnt | Number of hotel rooms specified in the search | int
srch_destination_id | ID of the destination where the hotel search was performed | int
srch_destination_type_id | Type of destination | int
hotel_continent | Hotel continent | int
hotel_country | Hotel country | int
hotel_market | Hotel market | int
is_booking | 1 if a booking, 0 if a click | tinyint
cnt | Number of similar events in the context of the same user session | bigint
hotel_cluster | ID of a hotel cluster | int

Page 5: ExpediaHotelRecommendationPresentation

Approach

• Data exploration and cleaning
• Train and test dataset creation
• Model creation
• Model selection

Page 6: ExpediaHotelRecommendationPresentation

Data Exploration and Cleaning

• Use the data.table package in R
• Convert attributes to the correct data types
• Handle missing values
• Univariate analysis: find how the data is distributed
• Bivariate analysis: find correlations between variables and hotel clusters
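The slides did this work with R's data.table; the following is only a minimal Python sketch of the same cleaning steps, using the standard library. The sample rows, the helper name `clean_row`, and the choice to keep a missing distance as `None` are assumptions made for illustration and do not come from the presentation.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Tiny made-up sample that reuses column names from the dataset slide.
SAMPLE = """date_time,is_mobile,orig_destination_distance,hotel_continent,hotel_cluster
2014-08-11 07:46:59,0,2234.2641,2,1
2014-08-11 08:22:12,1,,2,1
2014-08-12 10:01:05,0,913.1932,6,80
"""

def clean_row(row):
    """Convert string fields to typed values; keep a missing distance as None."""
    distance = row["orig_destination_distance"]
    return {
        "date_time": datetime.strptime(row["date_time"], "%Y-%m-%d %H:%M:%S"),
        "is_mobile": int(row["is_mobile"]),
        "orig_destination_distance": float(distance) if distance else None,
        "hotel_continent": int(row["hotel_continent"]),
        "hotel_cluster": int(row["hotel_cluster"]),
    }

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

# Univariate view: how often each hotel cluster occurs.
cluster_counts = Counter(r["hotel_cluster"] for r in rows)

# Bivariate view: cluster counts broken down by hotel continent.
by_continent = Counter((r["hotel_continent"], r["hotel_cluster"]) for r in rows)

# Count rows whose distance could not be calculated.
missing_distance = sum(r["orig_destination_distance"] is None for r in rows)
```

Whether missing distances should be dropped, imputed, or kept as a separate category is a modelling decision the slides do not specify.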

Page 7: ExpediaHotelRecommendationPresentation

Data Exploration

Page 8: ExpediaHotelRecommendationPresentation

Hotel cluster distribution with hotel continent

Page 9: ExpediaHotelRecommendationPresentation

Hotel Cluster correlation with Distance from Origin

Page 10: ExpediaHotelRecommendationPresentation

Training and Testing Dataset Creation

• Select the 100 users with the most observations as the sample
• Verify that this is a representative sample
• Use 80% of the sample data to train the ML models
• Use the remaining 20% of the sample data to test the ML models
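The sampling and splitting steps above can be sketched as follows. This is a Python illustration rather than the R code used for the project; the function names, the made-up log, and the use of a seeded random shuffle for the 80/20 split are assumptions, since the slides do not say how the split was drawn.

```python
import random
from collections import Counter

def top_users(rows, n):
    """Return the ids of the n users with the most observations."""
    counts = Counter(r["user_id"] for r in rows)
    return {user for user, _ in counts.most_common(n)}

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up log: user 1 has 6 rows, user 2 has 4, user 3 has 1.
log = [{"user_id": u, "hotel_cluster": c}
       for u, c in [(1, 5)] * 6 + [(2, 9)] * 4 + [(3, 7)]]

keep = top_users(log, n=2)  # the slides use n=100
sample = [r for r in log if r["user_id"] in keep]
train, test = train_test_split(sample)
```

A random row-level split lets the same user appear in both parts; splitting by user or by time would be an alternative design, depending on what the evaluation is meant to measure.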

Page 11: ExpediaHotelRecommendationPresentation

Model Creation

• Challenge: Large number of clusters
• Solution: H2O package in R
• H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”
• The R H2O package communicates with the H2O JVM over a REST API
• The data does not live in R; R holds only a pointer to it: an S4 object containing the IP address, port, and key name of the data sitting in H2O

Page 12: ExpediaHotelRecommendationPresentation

Model Creation

Random Forest
• Resample the data over and over
• For each sample, train a new classifier
• Different classifiers overfit the data in different ways
• Average out the differences through voting

Performance Metrics:
Accuracy 0.2066556
Logloss 3.379106
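The resample, train, and vote idea can be illustrated with a small Python sketch. It is not H2O's random forest: it uses plain bagging with 1-nearest-neighbour base classifiers on made-up one-feature data, and it omits the decision trees and random feature subsets that a real random forest relies on. All names below are hypothetical.

```python
import random
from collections import Counter

def fit_nearest_neighbor(sample):
    """'Train' a 1-nearest-neighbour classifier by memorising its sample."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def fit_bagged_ensemble(data, n_models=25, seed=0):
    """Train one classifier per bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_nearest_neighbor(resample))
    return models

def predict_by_vote(models, x):
    """Return the label predicted by the most classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Made-up one-feature data: (feature value, class label).
data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (9.0, "b"), (9.5, "b"), (8.7, "b")]
models = fit_bagged_ensemble(data)
```

Calling `predict_by_vote(models, 1.1)` asks all 25 classifiers for a label and returns the majority answer, so an individual classifier trained on an unlucky resample is outvoted by the rest.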

Page 13: ExpediaHotelRecommendationPresentation

Model Creation

GBM
• Boosting method which builds on weak classifiers
• Adds one classifier at a time
• Each new classifier is trained to improve the already trained ensemble

Performance Metrics:
Accuracy 0.2697774
Logloss 2.16233
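To make the "one weak learner at a time" idea concrete, here is a minimal Python sketch of gradient boosting. It is a simplification under stated assumptions: it solves a one-feature regression problem with squared loss (where each new learner is fitted to the current residuals) rather than the 100-class classification task in the slides, the weak learners are single-split stumps, and the data and function names are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given data."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 5.0]
one_round = fit_boosted(xs, ys, n_rounds=1)
many_rounds = fit_boosted(xs, ys, n_rounds=20)
```

Comparing `mse(one_round, xs, ys)` with `mse(many_rounds, xs, ys)` shows the training error shrinking as stumps are added; the learning rate controls how much each stump is allowed to correct.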

Page 14: ExpediaHotelRecommendationPresentation

Model Creation

Deep Learning

Page 15: ExpediaHotelRecommendationPresentation

Model Creation

Deep Learning
• Input Layer: Training observations are fed in here
• Hidden Layers: Intermediate layers which help the neural network learn the complicated relationships in the data
• Output Layer: The final output is computed from the previous two layers

Performance Metrics:
Accuracy 0.2837518
Logloss 1.545127
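The layer structure described above can be sketched as a single forward pass in Python. This is an illustration only: the weights are hypothetical rather than trained, the network has 2 inputs, 3 hidden units, and 2 output classes instead of one output per hotel cluster, and the ReLU and softmax activations are assumptions, since the slides do not name the activations used.

```python
import math

def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    hidden = relu(dense(inputs, hidden_w, hidden_b))    # hidden layer
    return softmax(dense(hidden, output_w, output_b))   # output layer

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 classes.
hidden_w = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_b = [0.0, 0.1, 0.0]
output_w = [[1.0, -1.0, 0.5], [-0.5, 1.0, 0.2]]
output_b = [0.0, 0.0]

probabilities = forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
predicted_class = probabilities.index(max(probabilities))
```

Because the output layer produces one probability per class, the same output feeds both metrics reported on the slide: the most probable class gives accuracy, and the probability assigned to the true class gives log loss.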

Page 16: ExpediaHotelRecommendationPresentation

Results

Model/Metric | Random Forest | GBM | Deep Learning
Accuracy | 0.2066556 | 0.2697774 | 0.2837518
Kappa | 3.595714 | 3.616561 | 3.570305
R^2 | 0.9989192 | 0.9991958 | 0.9994364
Logloss | 3.379106 | 2.16233 | 1.545127

Accuracy: The fraction of instances that are correctly classified.
Kappa: Comparison of the overall accuracy to the accuracy expected by random chance.
R^2: The proportion of the variance in the dependent variable (hotel_cluster) that is explained by the independent variables.
Logloss: Quantifies the accuracy of a classifier by penalizing low predicted probabilities for the true class.
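As a worked illustration of the definitions above, the following Python sketch computes accuracy, Cohen's kappa, and log loss on four made-up predictions. The function names and the sample data are assumptions; the figures in the results table came from H2O and may follow different conventions.

```python
import math
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def cohen_kappa(actual, predicted):
    """(observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[label] * predicted_counts[label]
                   for label in actual_counts) / (n * n)
    return (observed - expected) / (1 - expected)

def log_loss(actual, probabilities):
    """Mean negative log of the probability given to the true class."""
    return -sum(math.log(p[a]) for a, p in zip(actual, probabilities)) / len(actual)

# Made-up labels, hard predictions, and per-class probabilities.
actual = [0, 0, 1, 1]
predicted = [0, 1, 1, 1]
probabilities = [
    {0: 0.9, 1: 0.1},
    {0: 0.4, 1: 0.6},
    {0: 0.2, 1: 0.8},
    {0: 0.3, 1: 0.7},
]
```

On this sample, accuracy is 0.75 and chance agreement is 0.5, so kappa is 0.5. Under this definition kappa cannot exceed 1.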

Page 17: ExpediaHotelRecommendationPresentation

Conclusion

• The Deep Learning model is the best fit for predicting hotel clusters
• Expedia can show on its home page the hotels that a user is likely to book
• Hotel recommendations can also be included in emails sent to customers

Page 18: ExpediaHotelRecommendationPresentation

Conclusion

Satisfied and Loyal customers!!

Page 19: ExpediaHotelRecommendationPresentation

Thank you!!