Application of Data Mining for identifying and predicting room bookings
Catarina Caldeira Maçaroco
Window of Opportunity
Project Work report presented as partial requirement for obtaining the Master’s degree in Information Management and Business Intelligence.
NOVA Information Management School
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa
APPLICATION OF DATA MINING FOR IDENTIFYING AND PREDICTING WHICH EVENTS LEAD TO ROOM BOOKINGS – WINDOW OF OPPORTUNITY
by
Catarina Maçaroco
Project Work presented as partial requirement for obtaining the Master’s degree in Information Management, with a specialization in Knowledge Management and Business Intelligence
Advisor: Roberto Henriques, PhD
February 2019
ABSTRACT
This study investigates search patterns for predicting hotel bookings. Using Expedia’s search and purchase data, I identified the user’s booking window and which events have a higher effect on the booking likelihood.
The tourism industry has seen exponential growth over the last two decades, largely due to global socio-economic changes, globalization and internet massification. Portugal is finally reaping its share of profit and continues winning “best destination” prizes year after year. It is now part of a very competitive ecosystem where distribution plays a determinant role in whether the touristic product survives or not. Big players like Expedia and Booking.com have taken control of a big chunk of the market’s revenue because they understood that the large amounts of data they hold enable them to predict demand and hence offer highly competitive deals.
Only by understanding the booking drivers can one negotiate better distribution deals and lower commissions, by making better sales predictions and enhancing marketing and revenue strategies. Hence the purpose of this study is to use data mining to find patterns in room bookings, enabling this industry to become an even more important contributor to the country’s GDP.
Through the analysis of consumer behavior and booking times and the use of time series analysis and machine learning, it is possible to find patterns that can be applied not only across Hospitality but in other industries as well.
By looking for the connections and the relevance of each feature on the final predictions, a new window of opportunity will open for marketing and sales professionals.
KEYWORDS
Sales trends; Micro moments; Macro moments; Zero Moment of Truth; Booking Window
INDEX
1. Introduction
1.1. Background and problem identification
1.2. Study Relevance
1.3. Study Objectives
2. Literature review
3. Methodology
3.1. Problem Definition
3.2. Data Collection
3.3. Data Sampling and Cleaning
3.4. Software
3.5. Selecting and fitting models
3.6. Validating results
3.7. Feature engineering
3.8. Models
4. Results and discussion
4.1. Descriptive Analysis
4.2. Predictive Analysis
4.3. Prescriptive Analysis
5. Conclusions
6. Limitations and recommendations for future works
7. Bibliography
8. Appendix
8.1. Appendix 1 – Descriptive Analysis
8.2. Appendix 2 – Predictive Analysis
8.3. Appendix 3 - Automation
LIST OF FIGURES
Figure 1 - Workflow diagram
Figure 2 - Dataiku workflow
Figure 3 - Search window per week per continent
Figure 4 - Search window when booking with package or without package
Figure 5 - Search window if booking includes weekend or does not include weekend
Figure 6 - Search window when booking with children
Figure 7 - Number of bookings including children
Figure 8 - Number of bookings per continent
Figure 9 - Variables importance
Figure 10 - Booking likelihood
Figure 11 - Automation workflow
LIST OF TABLES
Table 1 - Hyperparameters
Table 2 - Cohorts’ results
Table 3 - Algorithm details
Table 4 - Quality metrics
Table 5 - Variables Explanation
Table 6 - Models' results
LIST OF ABBREVIATIONS AND ACRONYMS
API Application Programming Interface
ARIMA Autoregressive Integrated Moving Average
AHRESP Associação da Hotelaria, Restauração e Similares de Portugal
CPC Cost per click
CMS Content Management System
GDP Gross Domestic Product
LSTM Long short-term memory
MCC Matthews Correlation Coefficient
OTA Online travel agent
ROC AUC Receiver Operating Characteristic Area Under the Curve
ROI Return on investment
SETAR Self-Exciting Threshold Autoregressive
XGBOOST Extreme Gradient Boosting
ZMOT Zero moment of truth
1. INTRODUCTION
Competition for room bookings has never been this fierce. From hotels to destination management companies to OTAs, every company is eager to take its cut from the selling price (Kevin May, 2017).
This study aims to use popular machine learning models to predict the likelihood of booking and to act on that probability.
Through fixed commissions, cost per acquisition or fixed fees, lodging and accommodation businesses are completely dependent on third-party booking agents. Refusing to be part of the system would lead to complete isolation in a market that has been growing at a rate of 2,5% per year (INE, 2017) in terms of number of hotels, where investors are confident in continuously stronger revenues, and where tourism now represents 10% of the Portuguese GDP (Ferreira, 2017), currently the highest source.
Analysing customer behaviour and predicting the booking window is essential to gain a competitive advantage in a market where advertising spend is reaching monstrous levels, with companies like Booking.com spending $3,5 billion on pay-per-click last year (Kevin May, 2017).
XGBoost proved to be the model that performed best and provided the most accurate predictions. The results suggest that adding further relevant variables will increase the model’s predictive power. I was able to identify the users that are most likely and least likely to book, creating different segments which can be very powerful for generating effective strategies, impacting the right users with the right nudges that lead to more bookings, hence more revenue.
With the identification of the exact periods and variables that lead to a booking, advertising spend can be better allocated and revenue strategies can become more effective (Trefis Team, 2016).
1.1. BACKGROUND AND PROBLEM IDENTIFICATION
With the increasing importance of tourism both globally and especially for Portugal and its GDP, it is essential to take full advantage of the emerging opportunities.
The tourism and hospitality markets are widely diverse and complex, mainly because they deal with the human need for emotions and experiences (Wu, Mattila, & Hanks, 2015). When looking for accommodation for either business or leisure, the individual is moved by a sense of urgency, a need for relaxation or a sudden rush for something new, just to name a few drivers. These drivers are what marketers are striving to grasp, because doing so means that one reaches the revenue source first, hence avoiding or gaining commissions.
Events that lead to bookings range from mundane everyday events, such as feeling tired after a long day and in need of an escape, or a Sunday lunch with family leading to booking a trip, to extreme events like the Eurocup’s final, Christmas or New Year’s Eve. Still, forecasting for extreme events is extremely difficult due to their low frequency, something that Uber is invested in solving (Laptev, Smyl, & Shanmugam, 2017).
Companies are using online marketing to reach the audience and capture the users’ interest before others (Torres, Singh, & Robertson-Ring, 2015), yet the market is fierce and competitive. Knowing when customers book is a chance to use guerrilla marketing effectively, but it is extremely difficult and time consuming to place the right advert in front of the right consumer, at the right time (Hudson & Thal, 2013).
Considering that Booking.com is spending $3,5 billion in Cost Per Click (CPC), and is currently one of Google AdWords’ biggest clients (Kevin May, 2017), it has most certainly been conducting these studies for quite a while. Nevertheless, it is not in its interest to share this knowledge, which is why this problem was identified as one that can be tackled independently in this study.
Multiple approaches have been explored by revenue managers, through the exploration of historical data and relating this with past and future events taking place in the locations of buyer and seller (Constantino, Fernandes, & Teixeira, 2016), yet this is less an exact science than a hunch.
This study will try to go a step further and pull the data that delivers true insights and value, in this case, customer behaviour. Customer behaviour analytics is what pushed traditional marketing into the background and brought digital marketing into play. Contrary to traditional marketing, digital marketing’s core value is that everything can be measured, thus enabling a much more comprehensive view of the client and efficient strategies to generate value (Julie Cave, 2016).
Google and Facebook, for example, store gigantic amounts of data about every single individual (Bangwayo-Skeete & Skeete, 2015). This is their leverage for delivering the right advertising, to the right customer, at the right time, leading to conversions and to advertisers’ reliance on their effectiveness due to the high ROI results.
It is now up to each company to untangle the information it possesses (sales data), correlate it with the city’s open data and find the patterns that can lead to competitive advantage (Patrick Whyte, 2017).
The following variables will be considered in this study:
- Sales data
- Consumer behaviour
- Economic situation (individual, local and global)
- Socio-economic events
Applying time series and machine learning to sales forecasting enables finding the patterns that can be translated into booking windows (Bangwayo-Skeete & Skeete, 2015).
The relevance of this study and its main objectives are further described in the following sections.
1.2. STUDY RELEVANCE
The tourism industry will employ 10% of the population worldwide by the year 2027. In Portugal, this industry currently represents almost 1 million jobs, a number which is expected to grow to 1.034.000 jobs, representing 22,6% of the total employed population by 2027 (Joana Nunes Mateus, 2017).
With tourism representing 17% of the Portuguese GDP and being an industry on which almost a quarter of the Portuguese population relies, one can easily calculate that the commonly charged 10-25% commission on every sales transaction through intermediaries such as Booking.com and Expedia for distribution is revenue in a sense “lost”, which could be partially recovered if proper data exploration was done.
It is therefore of great importance that independent studies start being developed regarding this subject. Only through complete awareness of the sales drivers can one stay afloat, minimise advertising and commission expenditure and take back control of one’s own strategy definition, without being at the mercy of third-party vendors.
1.3. STUDY OBJECTIVES
The main question of this study is posed as: Is it possible to predict booking windows using search and booking data?
The analysis is made on historical sales data and search trends. It aims to answer the following specific questions:
- Which trends can be found in the data provided by Expedia through Kaggle?
Data from the Kaggle competition concerning the time of booking and the dates booked will be analysed to find patterns that will then be correlated with other variables.
- When is the best time to release marketing campaigns?
By crossing the booking times with specific variables, times of day and days of the week, it is possible to detect the specific booking drivers that enable a good prediction of the right time to impact the user.
- Which data improves the model’s predictive performance?
Various data features will be analysed. The model’s accuracy will vary depending on which features are used, and this will ultimately affect whether the study’s objective is reached. This can only be tested once various models are created, run and analysed.
- Which correlations can be found to augment the data’s value?
After collecting all the data necessary for this study, the correlation between the data features will be computed to see if the event did influence the room bookings.
- Which model better fits the data?
Several models will be applied to the data to detect the model with the best predictive outcome.
2. LITERATURE REVIEW
There are several published research papers on the topic of forecasting tourism demand and trends; their techniques and approaches were studied and tested to enable this study.
This study’s objective is to evaluate the forecasting performance of artificial neural networks relative to different time series models using Expedia’s search and booking data from Kaggle’s 2013 competition (Kaggle, 2015).
The tourism industry contributed US$7.6 trillion to the global economy, 10.2% of global GDP (Misrahi & Crotti, 2018). In Portugal the contribution of tourism to the country’s GDP is 7,1%, so there is a clear opportunity for growth when compared with the global landscape (INE, 2017). Tourism is one of today’s fastest growing economic activities, hence tourism demand forecasting is becoming essential to monitor and predict tourism influx, revenue forecasting and budget allocation. The researchers Song and Lee found it crucial to improve the accuracy and performance of analysis methods by experimenting with new approaches (Chan, Witt, Lee, & Song, 2009).
It is not possible to stock unfilled airline seats, unoccupied hotel rooms or unused concert hall seats. Due to the perishable nature of the tourism industry, the need for accurate forecasts is crucial (Law & Au, 1999).
A relevant article for this study is Forecasting tourism demand to Catalonia: Neural networks vs. time series models (Claveria & Torra, 2014), in which time series and artificial neural network (NN) models are used to extract patterns and predictive results from Catalonia’s tourism demand. There has been an increased interest in more advanced predictive techniques for tourism demand, which is tied to tourism becoming an increasingly stronger global industry. The use of Artificial Intelligence (AI) techniques for data analysis has been growing due to the need for more reliable and accurate forecasts of tourism demand that can deal with increasing complexity. This is mainly because AI models are better capable of dealing with nonlinear behavior, characteristic of travel data, in this case, bookings data. Still, when comparing the forecasting accuracy of the different models, Autoregressive Integrated Moving Average (ARIMA) outperformed Self-Exciting Threshold Autoregressive (SETAR) and ANN models, especially for shorter time spans. The pre-processing of the original dataset may be the reason for these results: information was lost when accounting for the presence of seasonality and eliminating outliers, which leads to a lower accuracy of neural network forecasts. Neural networks can be improved through structure optimization, adding layers and memory values, hence future research needs to consider whether the implementation of optimised neural networks and advances in dynamic networks do improve tourism demand forecasting.
In the past few years, due to new and more advanced forecasting techniques, such as neural networks and gradient boosting, and the need for more accurate metrics of tourism demand, the interest in Artificial Intelligence (AI) and experimentation with those same techniques grew, mainly because of their capability of handling nonlinear behaviour (Pai, Hung, & Lin, 2014).
The increasing availability of technology at lower costs also enables ever-wider adoption and experimentation. Technology companies, e.g. Google, Facebook, Amazon and Uber, are now providing open source software that enables developers and data scientists to further explore the capabilities of Artificial Intelligence (AI) and Machine Learning (ML), which leads to higher interest in creating better predictive solutions (Mukherjee & Lakshmanan, 2017). With cloud computing, it is possible to have efficient energy usage, ease of scaling and cost savings (Zhang, 2016). Upfront cost commitment is much lower now when compared to on-premises solutions.
When considering open data, one study analyzed Google Trends data and examined the usefulness of “hotels”, “flights” and “destination country” search indicators, measuring to what extent the search query data improved the AR and SARIMA methods in predicting overnight tourist arrivals. The twelve-month forecast results reveal that AR-MIDAS models gave superior predictions to AR and SARIMA time series models in terms of the Root Mean Squared Error (RMSE) and Mean Absolute Percent Error (MAPE) forecasting criteria.
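The two forecasting criteria named above can be illustrated with a minimal Python sketch. The arrival figures below are invented for illustration and are not taken from the cited study; the function names are likewise hypothetical.

```python
import math


def rmse(actual, forecast):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent; actual values must be non-zero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Invented monthly tourist arrivals and a forecast for them.
arrivals = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(round(rmse(arrivals, predicted), 3))
print(round(mape(arrivals, predicted), 3))
```

RMSE penalises large errors more strongly because the errors are squared, while MAPE expresses the error relative to the size of each observation, which makes it easier to compare across destinations of different scale.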
Google query search data can be used to accurately project future tourist arrivals over a year's horizon. This study contributed to the growing interest in forecasting using web traffic data, making it relevant to other industries that can benefit from the analysis of web search volume histories to predict useful trends (Bangwayo-Skeete & Skeete, 2015).
In 2014 a study looked into Google’s search data and examined the usefulness of search indicators such as “hotels” and “flights”, testing their impact on the simple Autoregressive method (AR) and the Seasonal Autoregressive Integrated Moving Average (SARIMA) (Bangwayo-Skeete & Skeete, 2015).
The article Tourism forecast combination using the CUSUM technique (Chan et al., 2009) demonstrated that no single method outperformed the others in forecasting accuracy; it was the combination of methods that produced better results.
This led to the increased interest in NNs, which performed better than time series methods, especially due to their capacity to deal with nonlinear relationships between predictors and predicted variables.
A 2014 study tested the accuracy of neural network ensemble prediction compared with traditional machine learning methods and traditional mathematical statistics methods for studying China’s inbound tourism market, in which the neural network ensemble had clearly better predictive capabilities (Bo & Shi-Ting, 2014).
The NN’s capacity to emulate the human brain to identify patterns in historical data and learn from experience to capture functional relationships among data when the underlying process is unknown (Claveria, Monte, & Torra, 2015) also leads to its shortcoming of poor comprehensibility as a “black box” model: it is helpful for predicting results but does not help in comprehending the nature of the problem in question, because it does not allow interpretation of the outcomes and coefficients.
In the 2015 study “Common trends in international tourism demand: Are they useful to improve tourism predictions?” (Claveria et al., 2015), researchers modelled tourism demand incorporating the common trends in international tourist arrivals from all visitor markets to a specific destination and analyzed whether the approach improved the forecasting performance of NN models. They used three NNs: the multi-layer perceptron network (MLP), the radial basis function network (RBF) and the Elman network.
In the study “Univariate versus multivariate time series forecasting: an application to international tourism demand” (du Preez & Witt, 2003) it was found that univariate time series models were not surpassed by multivariate time series models in forecasting accuracy.
Experimental results demonstrated that the forecasting efficiency of a neural network is superior to that of multiple regression, naive, moving average and exponential smoothing. This indicates the feasibility of applying a neural network model to practical international tourism demand forecasting (Law & Au, 1999).
Uber, a global car-hailing and mobility technology company based in San Francisco, United States, conducted a study to create a model for forecasting extreme events. They used thousands of time series to train a multi-module neural network based on Long Short-Term Memory (LSTM). They chose LSTM due to its capacity for modelling complex nonlinear feature interactions by working with large amounts of data across numerous dimensions, and its use of external variables and automatic feature extraction (Laptev et al., 2017).
AI can identify patterns or irregularities that would otherwise stay hidden. ML has the capacity to spot opportunities that can make the difference. Its value is exponentially increased when combined with human analysis, that is, when insight and vision enable data to take form and translate into actionable strategies.
3. METHODOLOGY
For this study I developed the following methodology and diagram, illustrated in Figure 1. It delivers a forecasting method that will enable accurate predictions based on past and future events affecting room bookings. The methodology is composed of the following eight stages:
- Objectives
- Data sources
- Descriptive Analysis
- Feature Engineering
- Modelling
- Models Fine Tuning (Results and Revalidation)
- Results Analysis
- Automation advice
Figure 1 - Workflow diagram
3.1. PROBLEM DEFINITION
To conduct effective forecasting, one needs to ensure the problem definition has been fully explored and defined. The problem definition described in section 1.1 serves as a compass for the following steps and for the successful evaluation of the methodology and the obtained results.
3.2. DATA SOURCES
This section provides an overview of the data, the pre-processing and cleaning steps taken, as well as feature selection and engineering.
Data was retrieved from Kaggle’s Expedia competition (Kaggle, 2015). The data covers users’ search and booking data from 2013 and 2014, a total of 620440 records. This includes both click and booking events.
The competition goal was to predict booking outcomes for a user event, based on their search patterns.
This data was selected due to the relevant features that enable the use of different models and proper prediction outcomes. The dataset contains features that provide general information about the user, such as ID, the user’s continent when booking and the destination of the booking. It also shows the user’s behaviour, with check-in and check-out dates and time of booking, as well as whether a promotion was being displayed at the time of booking. The explanation of the variables can be found in Appendix 2, Table 5.
3.3. DATA SAMPLING AND CLEANING
From the large set of data collected from Kaggle’s competition, only 8,4% were actual bookings. To overcome this issue I rebalanced the sample to increase the proportion of instances where there was a booking.
I removed the rows with missing values in origin-destination distance, which amounted to 36,8% of the data. After cleaning the missing values, I was left with 409602 rows of data, enough for this project. Missing values can complicate the analysis of the study. I chose to drop the data rows rather than impute values because the missing values accounted for 36,8% of the data, which is a high percentage of values to impute and could lead to wrong predictions (Kang, 2013).
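The cleaning and rebalancing steps could be sketched in plain Python as follows. This is an illustration only: the column names orig_destination_distance and is_booking are assumed from the Kaggle dataset, the sample rows are invented, and random oversampling of the booking rows is an assumption, since the rebalancing technique used in Dataiku is not stated here.

```python
import random


def drop_missing(rows, column):
    """Keep only the rows where `column` has a value."""
    return [row for row in rows if row.get(column) is not None]


def oversample_minority(rows, target, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    positives = [row for row in rows if row[target] == 1]
    negatives = [row for row in rows if row[target] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(rows)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Invented sample: one row has no origin-destination distance.
rows = [
    {"orig_destination_distance": 120.5, "is_booking": 0},
    {"orig_destination_distance": None, "is_booking": 1},
    {"orig_destination_distance": 35.0, "is_booking": 1},
    {"orig_destination_distance": 900.1, "is_booking": 0},
    {"orig_destination_distance": 12.3, "is_booking": 0},
]

clean = drop_missing(rows, "orig_destination_distance")
balanced = oversample_minority(clean, "is_booking")
print(len(clean), len(balanced), sum(r["is_booking"] for r in balanced))
```

Oversampling should only be applied to the training portion of the data, otherwise duplicated bookings may leak into the validation subset and inflate the measured performance.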
3.4. SOFTWARE
Dataiku version 4.1 was the software used in this study. It enables the use of Machine Learning in a clear manner. I was able to build and optimize the models in Python, which allows for seamless future integrations with external libraries through an API.
Dataiku enables the creation, training and deployment of advanced custom Machine Learning models through the use of Python or R. For this study I chose Python.
3.5. SELECTING AND FITTING MODELS
Different approaches were tested to be able to deal effectively with the different data types and dimensions. Flexibility and scalability are essential for this study and for model development, both of which are possible with the use of Dataiku.
In this study I trained Logistic Regression, Random Forest, XGBoost and Neural Network models on the data.
I chose these models due to their characteristics, as follows. Logistic Regression’s outcomes enable interpretation, which was essential to determine which variables are most relevant for the purpose of this study. Random Forest is an easy-to-use model which is capable of dealing with multiple variables in an effective way. It is widely used due to its simplicity and the fact that it can be used for classification and regression tasks. Neural Networks are complex, hard-to-interpret models that have gained traction in recent years due to their capacity to deal with multiple variables and provide very accurate predictions with very large datasets. XGBoost is designed for speed and performance and outperforms various models in its predictive capabilities. It has been used in several Kaggle competitions, producing great results and winning several of them. XGBoost was engineered to have a more efficient computing time and use memory resources in an optimal way. In recent years, data scientists have used these models to successfully predict outcomes similar to those of this study (Sunil Ray, 2017). A more in-depth explanation of each model can be found in section 3.8.
The dataset was divided into 3 cohorts: cohort 1, cohort 2 and a cohort with all the features. Each cohort has a different number of features; refer to Table 2 in Results and Discussion. I wanted to find out whether adding more features in the training would impact the models’ results.
3.6. VALIDATING RESULTS
Once the models were tested and results retrieved, these need to be revalidated to ensure that the developed models are accurate and can be tested with new sets of data, continuously producing quality results, that is, to see if the results can be generalized. I used a separate subset of the dataset to check whether the model had overfitted the data I trained it with. Refer to Table 1 for the models’ hyperparameters.
Table 1 - Hyperparameters
For this project the assessment metric was ROC AUC, since I wanted to have predictions optimised for true positives versus false positives. It shows how well the model performs at categorising outcomes.
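To make the metric concrete, ROC AUC can be read as the probability that a randomly chosen booking receives a higher predicted score than a randomly chosen non-booking. The sketch below computes it directly from that definition; the labels and scores are invented, and this is not the implementation used by Dataiku.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented example: 1 = booking, 0 = no booking.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
print(round(roc_auc(labels, scores), 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of bookings from non-bookings. Because the metric depends only on the ranking of the scores, it is not distorted by the low share of bookings in the data.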
3.7. FEATURE ENGINEERING
Some features combined with others can provide insights into the data they aim to represent.
Time of search and time of booking were parsed in order to provide a clearer understanding of search and booking times.
The date_time data was parsed and the date components were extracted into Year, Month, Day, Day of Week and Hour, in order to see if there is a trend in the month or day of the week.
The same process was applied to srch_ci and srch_co, the check-in and check-out dates. I computed the time difference between date_time and check-in and the difference between the check-in and check-out dates. I binned the search months into quarters and did the same for the check-in and check-out months.
I also binned the search hour into 4 bins of 6 hours each: midnight to 6am, 6am to 12pm, 12pm to 6pm and 6pm to midnight.
The target search month was extracted from the search date and the booking window. This feature can produce insights on how certain properties might be more desirable than others in certain months. This would enable confirming whether a property in, for example, Continent 3 would be more searched for in January for stays in September, thus enabling better predictions.
3.8. MODELS
The models used in Dataiku were: Random Forest, Logistic Regression, XGBoost and Artificial Neural Network.
Starting with Logistic Regression, it is a classification algorithm that uses a linear model, computing the log-odds of the target as a linear combination of the input features. It is prone to overfitting and sensitive to errors in the input dataset. Still, the use of a simple linear algorithm can be helpful in exploring the data and reaching insights regarding the data’s underlying structure.
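As a minimal sketch of how such a model turns features into a booking probability, the linear combination is passed through the logistic (sigmoid) function. The two features, the weights and the bias below are invented for illustration and are not the coefficients fitted in this study.

```python
import math


def predict_booking_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a linear combination of the inputs."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features: [is_package, booking_window_days].
weights = [0.8, -0.01]
bias = -1.0
probability = predict_booking_probability([1, 50], weights, bias)
print(round(probability, 4))
```

The sign and size of each weight show the direction and strength of a feature's effect on the log-odds of booking, which is what makes this model useful for interpretation.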
In addition to this linear model I selected 3 non-linear models to better capture the complexity of the data, which were:
Random Forest, composed of many decision trees, where each tree predicts an outcome that contributes to the final answer of the forest. It is an ensemble learning method for classification, regression and other tasks. By using Random Forest I avoided the overfitting of single decision trees, which is possible thanks to a random element that ensures that the trees in the forest are not identical. It lacks explainability but generally provides good results. It can deal with multivariate data and, by averaging the booking probabilities across trees, should help prevent over-fitting by individual trees (Niklas Donges, 2018).
XGBoost (eXtreme Gradient Boosting), an advanced gradient tree boosting algorithm which uses parallel processing, regularization (which helps prevent overfitting) and early stopping, making it a fast, scalable and accurate algorithm. Being an ensemble learning method, it combines the predictive power of multiple learners. The result is a single model which gives the aggregated output from several models. The models forming the ensemble can come from the same or from different learning algorithms. Still, they have mostly been used with decision trees. In boosting, each additive learner is fitted to the residual errors of the previous step, so the boosting learners make use of the patterns in the residual errors. At the stage where maximum accuracy is reached by boosting, the residuals are randomly distributed without any pattern (Ramya Bhaskar Sundaram, 2018).
It has been used in real-world production pipelines for ad click-through rate prediction and provided state-of-the-art results on various other problems such as store sales prediction, customer behaviour prediction and web text classification (Chen & Guestrin, n.d.).
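The idea of fitting each new learner to the residual errors of the previous step can be illustrated with a small sketch. Note that this is plain gradient boosting for squared error with one-feature decision stumps, not XGBoost itself, which adds regularization, parallel processing and other refinements. The toy data and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        step = left_value if x <= threshold else right_value
        total += model["learning_rate"] * step
    return total


def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a single numeric feature and a 0/1 outcome (invented values).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
one_round = boost(xs, ys, rounds=1)
twenty_rounds = boost(xs, ys, rounds=20)
print(mean_squared_error(one_round, xs, ys) > mean_squared_error(twenty_rounds, xs, ys))
```

Each round removes part of the remaining error, so the residuals shrink as more learners are added, which is the behaviour described above.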
Artificial Neural Networks are inspired by the functioning of neurons and consist of several hidden layers of neurons which receive inputs and transmit them to the following layer. They can deal with non-linearity, allowing for complex decision functions. They lack interpretability regarding the importance of each feature for the model’s predictive outcome.
Estimating feature importance and model interpretability in general is an area where Haldar et al. took a step back with the move to NNs. Estimating feature importance is crucial in prioritizing engineering effort and guiding model iterations. The strength of NNs is in figuring out nonlinear interactions between the features. This is also their weakness when it comes to understanding what role a particular feature is playing, as nonlinear interactions make it very difficult to study any feature in isolation (Haldar et al., n.d.).
The target variable was whether or not the user made a booking.
All features were input into the model except date_time, user_id, srch_ci and srch_co, since these are directly correlated with the engineered features, which ultimately provide higher predictive value and help avoid overfitting.
I optimized the models for ROC AUC, which summarises the performance of the models in terms of the true positive rate against the false positive rate.
The Dataiku flow shown in Figure 2 illustrates the steps taken in the data analysis and modelling stages of the study. The data was imported, cleaned and then divided into test and training sets. The models were trained on the training set with the various variable cohorts.
Finally the results were divided in two to ensure that there was no overfitting in the data.
Figure 2 - Dataiku workflow
4. RESULTSANDDISCUSSION
4.1. DESCRIPTIVEANALYSIS
By considering the Booking windows between continents, there are clear differences betweencontinents. Considering continents 4 and 1, the booking window has significantvariancethroughouttheyear,butamuchsoftervarianceforcontinents0,2and3.
For continent 4, week 3 has the highest average bookingwindowwith 108 days. And the lowestbookingwindowforweek26withbookingsbeingmade82daysinadvance.Refertoappendix1,figure3.
TheseresultscanprovidehighlyvaluableactionableinsightstoanycompanyadvertisingforHospitalityorHospitalityrelatedproductsandservices.
The average search window varies widely depending on whether or not the booking includes a package. With a package, the average booking window is 75.8 days, against 44.6 days without a package. Refer to Appendix 1, Figure 4.
There is not much difference in the search window whether or not the booking includes a weekend, ranging from 50 to 54 days in both scenarios. Refer to Appendix 1, Figure 5.
When booking with children, the booking window changes markedly with the number of children, and there is a clear pattern within the data. Without children, the average booking window is 50.66 days; it goes up to 54.87 days when the booking includes one child and to 65.43 days when the booking includes 3 children.
Still, these numbers may lack statistical support because of the scarce number of bookings with more children: 78.5% of bookings are made with 0 children, 11% with one child and 8.6% with 2, followed by a steep decline for bookings with 3 children, which represent only 1.4% of the total bookings in the dataset. Refer to Appendix 1, Figures 6 and 7.
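The comparison above, an average booking window per group alongside each group’s share of the bookings, can be sketched as below. The rows are invented toy data and the column names booking_window and children are hypothetical; the point is only that a group mean should be read together with the number of records behind it.

```python
from collections import defaultdict


def window_by_group(rows, key):
    """Mean booking window, record count and share of records per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["booking_window"])

    total = sum(len(values) for values in groups.values())
    return {
        group: {
            "mean_days": sum(values) / len(values),
            "n": len(values),
            "share": len(values) / total,
        }
        for group, values in sorted(groups.items())
    }


# Invented toy records, not taken from the Expedia dataset.
toy_rows = [
    {"children": 0, "booking_window": 40},
    {"children": 0, "booking_window": 60},
    {"children": 0, "booking_window": 50},
    {"children": 3, "booking_window": 65},
]
for group, stats in window_by_group(toy_rows, "children").items():
    print(group, stats)
```

In the toy data the 3-children group rests on a single record, so its mean can swing widely with one more booking, which mirrors the caution expressed above about the 1.4% of bookings with 3 children.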
Most bookings are made for continent 2, with 63.9% of bookings. I therefore assume that continent 2 is likely to be North America, due to the population volume and the high number of internal trips. Refer to Appendix 1, Figure 8.
These numbers also reflect the higher market penetration in continent 2 compared with the remaining continents.
4.2. PREDICTIVE ANALYSIS
I ran three different feature cohorts through the models, adding more features to each successive cohort. The cohort with all the features proved to have the best performance.
Table 2 - Cohorts’ results
I found that some models are consistently better than others at predicting the likelihood to book. XGBoost consistently outperforms the other models. Also, the more variables I added, the more accurate the models’ predictions became. All results can be found in Appendix 2, Table 6.
Please refer to Table 3 for the algorithm details.
Table 3 - Algorithm details
In terms of variable importance for predictive power, hotel cluster, hotel market, searched destination, whether or not the search included a package, search duration and origin-destination distance proved to be the most relevant. Please refer to Appendix 2, Figure 9.
It was possible to identify and bin users into those most likely, moderately likely and least likely to book. Refer to Appendix 2, Figure 10. This allows segmentation of new customers, which can lead to different strategies for each segment, enabling more precise targeting and better-performing campaigns and/or promotions.
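The binning of users by booking likelihood can be sketched as follows. This is an illustration only: the text does not state how the three bins were cut, so the thresholds 0.33 and 0.66, the segment labels and the user identifiers below are all hypothetical.

```python
def segment_users(probabilities, low=0.33, high=0.66):
    """Assign each user to one of three segments by predicted probability.

    `probabilities` maps a user identifier to the model's predicted
    probability of booking. The cut-offs are illustrative assumptions.
    """
    if not 0 <= low < high <= 1:
        raise ValueError("expected 0 <= low < high <= 1")

    segments = {}
    for user, probability in probabilities.items():
        if probability >= high:
            segments[user] = "most likely"
        elif probability >= low:
            segments[user] = "medium"
        else:
            segments[user] = "least likely"
    return segments


# Invented predictions for three hypothetical users.
print(segment_users({"u1": 0.91, "u2": 0.48, "u3": 0.07}))
```

Fixed cut-offs keep the segments comparable over time, whereas cutting at terciles of the predicted scores would instead guarantee three segments of equal size; either choice is compatible with the segmentation described above.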
As shown in Table 4, the model was general enough to obtain good predictions. The ROC AUC was 0.7833.
Table 4 - Quality metrics
4.3. PRESCRIPTIVE ANALYSIS
A next step would be to implement an automation workflow using integrations between Dataiku, the existing database and the Content Management System (CMS), as illustrated in Appendix 3, Figure 11.
These integrations are possible through Zapier, a web-based service that enables seamless automation workflows between different applications; it can be described as a translator between web APIs.
It is possible to apply the model developed with past data and feed it with new data to predict the likelihood of a customer booking.
This ultimately leads to real-time website updates showing the targeted promotions or information that lead the consumer to buy, and ultimately to better marketing strategies.
5. CONCLUSIONS
This project has outlined the process of building a model to predict hotel bookings. The analyzed data included search patterns and bookings from Expedia’s dataset during the period 2013-2014.
I found that it is possible to predict whether or not someone will book based on search patterns, and that variables such as the number of children or the location where the booking is being made have an impact on the booking window.
I was able to identify the users that are most likely and least likely to book, creating three different segments. These segments are very powerful for generating effective strategies that reach the right users with the right nudges, leading to more bookings and hence more revenue.
This ultimately allows marketers and professionals in hospitality, or in any activity related to hospitality, to steer their marketing efforts in more effective ways, creating the right promotions or marketing ‘nudges’ at the most relevant time in the search and booking funnel.
Feature selection is often extremely important for reaching good results, as including a full set of features can create too much noise and make it difficult to find the underlying patterns that would improve the model. This was not the case for this project, where including the whole set of variables produced the best model outcomes.
XGBoost proved to be the most effective predictive model, with the highest ROC AUC. It is hence the model most suited for this analysis and for training future feature cohorts on search and booking data.
The objectives of this study were different from the objectives established within the Kaggle competition for this dataset. This study’s target was to predict the likelihood of booking, whilst the competition was created to determine the likelihood of a user staying at one of the 100 hotel groups.
This dataset was selected because it has the relevant attributes for this study’s purpose. Kaggle’s winning models are very specific and tailored to the competition, and when presented with other sets of data they do not perform as well.
Hotels and hospitality professionals from multiple industries have complete access to this data. The computational power available today and the various free tools and datasets provide all the necessary instruments to conduct meaningful analysis that can push hospitality-related companies to the forefront. It all comes down to proper time allocation and one’s willingness to test and play with the available resources.
6. LIMITATIONS AND RECOMMENDATIONS FOR FUTURE WORKS
Future works could consider combining external variables and extreme events and crossing these with the data at hand, looking for correlations that could lead to better predictive power.
Automation of the model’s predictive outcomes should also be considered and further explained in a future project, allowing for data-oriented marketing strategies that could outperform traditional marketing efforts.
Because the data was anonymised, I could not see the actual continents and countries where the bookings were taking place. The prices were also unavailable. This means that the data had predictive power but lacked the interpretability that would enable acting upon the results.
In future works, a more complete dataset with concrete values would enable much better predictions and actionable outcomes. It would make it possible to see the effect of price on booking patterns and how small variations in price can influence the predictive power of the models.
It is also reasonable to consider that weather conditions and a location’s economic, political and/or environmental stability play an important role in determining the propensity to book a certain property in a certain location, leading to changes in tourism demand. These factors change dynamically and continuously, hence a major challenge would be providing an acceptable measurement for them in order to include them in future predictive modelling.
Future studies should also consider implementing psychographic traits and personas to provide better promotions to the consumers most likely to buy, offering different products and different packages depending on the users’ level of extraversion or consciousness, two traits that proved to be accurate predictors of users’ propensity to buy certain products (Rauschnabel, Brem, & Ivens, 2015).
Using recommender systems can also be highly valuable for accurately predicting users’ purchasing behaviour, hence they should be included in future prediction studies.
7. BIBLIOGRAPHY
Bangwayo-Skeete, P. F., & Skeete, R. W. (2015). Can Google data improve the forecasting performance of tourist arrivals? Mixed-data sampling approach. Tourism Management, 46, 454–464. https://doi.org/10.1016/J.TOURMAN.2014.07.014
Bo, X., & Shi-Ting, L. (2014). 2014 7th International Conference on Intelligent Computation Technology and Automation (ICICTA). https://doi.org/10.1109/ICICTA.2014.91
Chan, C. K., Witt, S. F., Lee, Y. C. E., & Song, H. (2009). Tourism forecast combination using the CUSUM technique. Tourism Management, 31(6), 891–897. https://doi.org/10.1016/j.tourman.2009.10.004
Chen, T., & Guestrin, C. (n.d.). XGBoost: A Scalable Tree Boosting System. Retrieved from https://github.com/dmlc/xgboost
Claveria, O., Monte, E., & Torra, S. (2015). Common trends in international tourism demand: Are they useful to improve tourism predictions? Tourism Management Perspectives, 16, 116–122. https://doi.org/10.1016/J.TMP.2015.07.013
Claveria, O., & Torra, S. (2014). Forecasting tourism demand to Catalonia: Neural networks vs. time series models. Economic Modelling, 36, 220–228. https://doi.org/10.1016/J.ECONMOD.2013.09.024
Constantino, H., Fernandes, P. O., & Teixeira, J. P. (2016). Modelação da Procura Turística para Moçambique. IV Congresso Internacional de Turismo da ESG/IPCA: Tourism for the 21st Century, (July).
du Preez, J., & Witt, S. F. (2003). Univariate versus multivariate time series forecasting: an application to international tourism demand. International Journal of Forecasting, 19(3), 435–451. https://doi.org/10.1016/S0169-2070(02)00057-2
Ferreira, A. (2017). Expresso | Turismo. Portugal é o 14.º mais competitivo do mundo. Retrieved June 13, 2017, from http://expresso.sapo.pt/economia/2017-04-06-Turismo.-Portugal-e-o-14.-mais-competitivo-do-mundo
Haldar, M., Abdool, M., Ramanathan, P., Xu, T., Yang, S., Duan, H., … Legrand, D. (n.d.). Applying Deep Learning To Airbnb Search. Retrieved from https://arxiv.org/pdf/1810.09591.pdf
Hudson, S., & Thal, K. (2013). The Impact of Social Media on the Consumer Decision Process: Implications for Tourism Marketing. Journal of Travel & Tourism Marketing. https://doi.org/10.1080/10548408.2013.751276
INE. (2017). Portal do Instituto Nacional de Estatística. Retrieved June 13, 2017, from https://www.ine.pt/xportal/xmain?xpgid=ine_main&xpid=INE
Joana Nunes Mateus. (2017). Um Quarto dos Portugueses Vai Trabalhar Para o Turismo | Expresso Emprego. Retrieved July 4, 2017, from http://portugalarecrutar.expressoemprego.pt/noticias/um-quarto-dos-portugueses-vai-trabalhar-para-o-turismo/4353
Julie Cave. (2016). Digital Marketing Vs. Traditional Marketing: Which One Is Better? - Digital Doughnut. Retrieved July 10, 2017, from https://www.digitaldoughnut.com/articles/2016/july/digital-marketing-vs-traditional-marketing
Kaggle. (2015). Expedia Hotel Recommendations. Retrieved from https://www.kaggle.com/c/expedia-hotel-recommendations
Kang, H. (2013). The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5), 402–6. https://doi.org/10.4097/kjae.2013.64.5.402
Kevin May. (2017). Google can rejoice: Priceline Group spent $3.5 billion on PPC in 2016. Retrieved June 1, 2017, from https://www.tnooz.com/article/priceline-group-3-5-billion-advertising-2016/
Laptev, N., Smyl, S., & Shanmugam, S. (2017). Engineering Extreme Event Forecasting at Uber with Recurrent Neural Networks - Uber Engineering Blog. Retrieved June 18, 2017, from https://eng.uber.com/neural-networks/
Law, R., & Au, N. (1999). A neural network model to forecast Japanese demand for travel to Hong Kong. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.4253&rep=rep1&type=pdf
Misrahi, T., & Crotti, R. (2018). The Travel & Tourism Competitiveness Report 2017: Paving the way for a more sustainable and inclusive future.
Mukherjee, S., & Lakshmanan, L. (2017). Google Cloud provides a unified, streamlined way to execute your ML strategy | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform. Retrieved February 2, 2018, from https://cloud.google.com/blog/big-data/2017/11/google-cloud-provides-a-unified-streamlined-way-to-execute-your-ml-strategy
Niklas Donges. (2018). The Random Forest Algorithm. Retrieved January 3, 2019, from https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
Pai, P.-F., Hung, K.-C., & Lin, K.-P. (2014). Tourism demand forecasting using novel hybrid system. Expert Systems with Applications, 41(8), 3691–3702. https://doi.org/10.1016/J.ESWA.2013.12.007
Patrick Whyte. (2017). Smart Cities Need Open Data and a Willingness to Test and Learn. Retrieved June 20, 2017, from https://skift.com/2017/06/15/smart-cities-need-open-data-and-a-willingness-to-test-and-learn/?utm_campaign=SkiftWeeklyReviewNewsletter&utm_source=hs_email&utm_medium=email&utm_content=53244701&_hsenc=p2ANqtz--3Fu3wJdXe_S4rGD8KhVQjb-vWPMvXMODGF9d-
Ramya Bhaskar Sundaram. (2018). Understanding the Math behind the XGBoost Algorithm. Retrieved January 6, 2019, from https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/
Rauschnabel, P. A., Brem, A., & Ivens, B. S. (2015). Who will buy smart glasses? Empirical results of two pre-market-entry studies on the role of personality in individual awareness and intended adoption of Google Glass wearables. Computers in Human Behavior, 49(May), 635–647. https://doi.org/10.1016/j.chb.2015.03.003
Sunil Ray. (2017). Essentials of Machine Learning Algorithms (with Python and R Codes). Retrieved January 6, 2019, from https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
Torres, E. N., Singh, D., & Robertson-Ring, A. (2015). Consumer reviews and the creation of booking transaction value: Lessons from the hotel industry. International Journal of Hospitality Management. https://doi.org/10.1016/j.ijhm.2015.07.012
Trefis Team. (2016). Here Are The Key Growth Areas For Priceline’s Booking.com - Nasdaq.com. Retrieved July 12, 2017, from http://www.nasdaq.com/g00/article/here-are-the-key-growth-areas-for-pricelines-bookingcom-cm723530?i10c.referrer=https%3A%2F%2Fwww.google.pt%2F
Wu, L., Mattila, A. S., & Hanks, L. (2015). Investigating the impact of surprise rewards on consumer responses. International Journal of Hospitality Management, 50, 27–35. https://doi.org/10.1016/j.ijhm.2015.07.004
Zhang, L. (2016). Price trends for cloud computing services. Wellesley College. Retrieved from http://repository.wellesley.edu/thesiscollection/386
8. APPENDIX
8.1. APPENDIX 1 – DESCRIPTIVE ANALYSIS
Figure 3 - Search window per week per continent
Figure 4 - Search window when booking with package or without package
Figure 5 - Search window if booking includes weekend or does not include weekend
Figure 6 - Search window when booking with children
Figure 7 - Number of bookings including children
Figure 8 - Number of bookings per continent
8.2. APPENDIX 2 – PREDICTIVE ANALYSIS
Table 5 - Variables Explanation
Table 6 - Models' results
Figure 9 - Variables importance
Figure 10 - Booking likelihood
8.3. APPENDIX 3 – AUTOMATION
Figure 11 - Automation workflow