IntelligentEventFocusedCrawling
MohamedMagdyGharibFarag
DissertationsubmittedtothefacultyoftheVirginiaPolytechnicInstituteandStateUniversityinpartialfulfillmentoftherequirementsforthedegreeof
DoctorofPhilosophy
inComputerScienceandApplications
EdwardA.Fox,Co-ChairRihamH.Mansour,Co-Chair
WeiguoFanScotlandC.LemanPadminiSrinivasan
August11,2016Blacksburg,VA
Keywords:FocusedCrawling,EventModeling,WebArchives,DigitalLibraries,
SourceImportance,SocialMedia,SeedGeneration
Copyright2016MohamedM.G.Farag
ii
IntelligentEventFocusedCrawling
MohamedMagdyGharibFarag
ABSTRACTThereisneedforanintegratedeventfocusedcrawlingsystemtocollectWebdataaboutkeyevents.Whenaneventoccurs,manyuserstrytolocatethemostup-to-dateinformationaboutthatevent.Yet,thereislittlesystematiccollectingandarchivinganywhereofinformationaboutevents.Weproposeintelligenteventfocusedcrawlingforautomaticeventtrackingandarchiving,aswellaseffectiveaccess.Weextendthetraditionalfocused(topical)crawlingtechniquesintwodirections,modelingandrepresenting:eventsandwebpagesourceimportance. Wedevelopedaneventmodelthatcancapturekeyeventinformation(topical,spatial,andtemporal).Weincorporatedthatmodelintothefocusedcrawleralgorithm.Forthefocusedcrawlertoleveragetheeventmodelinpredictingawebpage’srelevance,wedevelopedafunctionthatmeasuresthesimilaritybetweentwoeventrepresentations,basedontextualcontent.Althoughthetextualcontentprovidesarichsetoffeatures,weproposedanadditionalsourceofevidencethatallowsthefocusedcrawlertobetterestimatetheimportanceofawebpagebyconsideringitswebsite.Weestimatedwebpagesourceimportancebytheratioofnumberofrelevantwebpagestonon-relevantwebpagesfoundduringcrawlingawebsite.Wecombinedthetextualcontentinformationandsourceimportanceintoasinglerelevancescore.Forthefocusedcrawlertoworkwell,itneedsadiversesetofhighqualityseedURLs(URLsofrelevantwebpagesthatlinktootherrelevantwebpages).AlthoughmanualcurationofseedURLsguaranteesquality,itrequiresexhaustivemanuallabor.WeproposedanautomatedapproachforcuratingseedURLsusingsocialmediacontent.WeleveragedtherichnessofsocialmediacontentabouteventstoextractURLsthatcanbeusedasseedURLsforfurtherfocusedcrawling.Weevaluatedoursystemthroughfourseriesofexperiments,usingrecentevents:Orlandoshooting,Ecuadorearthquake,Panamapapers,Californiashooting,Brusselsattack,Parisattack,andOregonshooting.Inthefirstexperimentseriesourproposedeventmodelrepresentation,usedtopredictwebpagerelevance,outperformedthetopic-onlyapproach,showingbetterresultsinprecision,recall,
iii
andF1-score.Inthesecondseries,usingharvestratiotomeasureabilitytocollectrelevantwebpages,oureventmodel-basedfocusedcrawleroutperformedthestate-of-the-artfocusedcrawler(best-firstsearch).Thethirdseriesevaluatedtheeffectivenessofourproposedwebpagesourceimportanceforcollectingmorerelevantwebpages.Thefocusedcrawlerwithwebpagesourceimportancemanagedtocollectroughlythesamenumberofrelevantwebpagesasthefocusedcrawlerwithoutwebpagesourceimportance,butfromasmallersetofsources.ThefourthseriesprovidesguidancetoarchivistsregardingtheeffectivenessofcuratingseedURLsfromsocialmediacontent(tweets)usingdifferentmethodsofselection.
iv
AcknowledgmentsFirstIwouldliketothankmyadvisorsDr.EdwardA.FoxandDr.RihamMansourforallthecontinuoushelp,support,encouragement,patience,motivation,andguidancethattheygavemethroughmyPh.D.journey.IcouldnothaveimaginedhavingbetteradvisorsandmentorsformyPh.D.study. Besidesmyadvisors,Iwouldliketothanktherestofmythesiscommittee-Prof.PadminiSrinivasan,Prof.PatrickWeiguoFan,andProf.ScotlandLeman-fortheirinsightfulcommentsandencouragement,butalsoforthehardquestionswhichmotivatedmetowidenmyresearchfromavarietyofperspectives.MysincerethanksalsogoestoSethPeery,LukeWard,ShaneColeman,andAndiOgier,whoprovidedmeanopportunitytojointheirteamasintern,andwhohelpedmelearnnewtechnologies,extendmyskills,andapplymyresearchexperiencetopracticalproblems.Ithankmyfellowlabmates:SeungwonYang,SunshinLee,VenkatSrinivasan,TarekKanan,SungHeePark,MonicaAkbar,KiranChitturi,PrashantChandrasekar,EricFouh,andallotherlabmembersIforgottomention,forthestimulatingdiscussions,forthesleeplessnightswewereworkingtogetherbeforedeadlines,andforallthefunwehavehadinthelastfiveyears.IwouldliketothankmywifeSamarElSaadawy.Nowordsorthankswouldsufficeorgiveherwhatshedeserves.Shesufferedalotforme.ShesupportedmeinallthedifferentstagesofmyPh.D.withoutcomplaining.Withouther,Iwouldn’thavedoneanything.ThankyoumyheroSamarElSaadawy. Lastbutnotleast,Iwouldliketothankmyfamily-myparentsandmybrotherandsisters-forsupportingmespirituallythroughoutmyPh.D.andmylifeingeneral.ThanksgotoNSFforsupport,especiallythroughgrantsIIS-1619028,IIS-1619371,IIS-1319578,DUE-1141209,IIS-0916733,andIIS-0736055.ThanksalsogotoVirginiaTech’sDigitalLibraryResearchLaboratoryandDepartmentofComputerScience.
v
TableofContentsABSTRACT....................................................................................................................................................iiAcknowledgments....................................................................................................................................ivTableofContents.......................................................................................................................................vListofFigures............................................................................................................................................viiListofTables...............................................................................................................................................ix1 Introduction.......................................................................................................................................11.1 Motivation.................................................................................................................................11.2 Hypotheses................................................................................................................................51.3 ResearchQuestions...............................................................................................................61.4 5S...................................................................................................................................................61.5 ThesisOrganization..............................................................................................................8
2 RelatedWork.....................................................................................................................................92.1 WebCrawling...........................................................................................................................92.2 TheIDEALproject..................................................................................................................92.3 WebArchivingandArchive-Itservice.......................................................................112.4 FocusedCrawling................................................................................................................132.4.1 MachineLearning......................................................................................................132.4.2 SemanticSimilarity...................................................................................................142.4.3 ContentandLinkAnalysis.....................................................................................15
2.5 EventModeling....................................................................................................................162.6 SocialMediaandFocusedCrawlingSeedSelection.............................................182.7 Evaluatingtopical/focusedcrawlers..........................................................................19
3 FocusedCrawling..........................................................................................................................203.1 TopicRepresentation........................................................................................................203.2 CrawlerArchitecture.........................................................................................................223.3 LargeScaleDesignConsiderations..............................................................................24
4 EventFocusedCrawler...............................................................................................................264.1 EventModelandRepresentation.................................................................................264.1.1 EventModeling...........................................................................................................26
4.2 EventProcessing.................................................................................................................304.2.1 EventModel-basedWebpageScoring..............................................................304.2.2 CalculatingTheWeights.........................................................................................324.2.3 EventModel-basedURLScoring.........................................................................34
5 ExperimentalSetup......................................................................................................................375.1 Datasets...................................................................................................................................375.2 Experiments..........................................................................................................................395.3 EvaluationMetrics..............................................................................................................41
6 Results...............................................................................................................................................436.1 EventModel-basedvs.Topic-OnlyClassification..................................................436.1.1 ClassifyingURLsandwebpagesaboutCaliforniashooting....................43
6.2 EventModel-basedvs.Topic-onlyFocusedCrawler...........................................486.2.1 CaliforniaShooting...................................................................................................486.2.2 BrusselsAttack...........................................................................................................506.2.3 Oregonshooting.........................................................................................................53
vi
6.2.4 Egyptairplanecrash................................................................................................546.2.5 Panamapapers...........................................................................................................556.2.6 Orlandoshooting.......................................................................................................556.2.7 Parisattacks.................................................................................................................566.2.8 Ecuadorearthquake.................................................................................................57
7 WebpageSourceImportanceandSocialmedia-basedSeedSelection..................597.1 WebpageSourceImportance.........................................................................................597.2 SeedURLsforcrawling.....................................................................................................647.3 Semi-automatedSocialMedia-basedSeedURLGeneration.............................657.3.1 SelectingSeedURLs..................................................................................................667.3.2 SeedsURLDomain/SourceImportance..........................................................66
8 ConclusionandFutureWork...................................................................................................748.1 Contributions........................................................................................................................758.2 FutureWork..........................................................................................................................75
References.................................................................................................................................................77
vii
ListofFiguresFigure1OverviewofIDEALsystemandroleofeventfocusedcrawling.........................2Figure2WorkflowforcreatingWebarchivesfromsocialmedia(Twitter)................10Figure3SeedURLsmanualcurationusingArchive-Itservice...........................................13Figure4Architectureofbaselinefocusedcrawlerwithtopicrepresentationinthe
lowerbox,crawlingintheupperbox,andprocessingandrelevanceestimationinthemiddlebox...........................................................................................................................22
Figure5Baselinefocusedcrawleralgorithm.............................................................................24Figure6Stepsofbuildingeventmodelfromseedwebpages.............................................28Figure7Thestepsforcalculatingthescoreofawebpage...................................................33Figure8AnexamplewebpagewitharelevantURLanchortexthighlighted..............34Figure9ThestepsforcalculatingthescoreofaURL.............................................................35Figure10Designofourevaluationmethodoftheeffectivenessoftheeventmodel
forrelevanceestimation;thetwoboxeswithanasteriskindicatethetwoparametersoptimizedintheexperiment...........................................................................40
Figure11Thedesignoftheexperimentforevaluatingtheeffectivenessofeventmodelwithfocusedcrawlertoretrievemorerelevantwebpages.........................41
Figure12 CaliforniashootingURLsevaluationatdifferentthresholdvalues.............46Figure13Californiashootingwebpagesevaluationatdifferentthresholdvalues...47Figure14 Performanceevaluationofeventmodel-basedvs.topic-onlyfocused
crawlersforCaliforniashooting.............................................................................................49Figure15Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforCaliforniashooting(50Kwebpages).........................................50Figure16 Performanceevaluationofeventmodel-basedfocusedcrawlerfor
Brusselsattack...............................................................................................................................51Figure17Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforBrusselsattack(50Kwebpages).................................................52Figure18Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforOregonshooting(100Kwebpages)...........................................53Figure19Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforEgyptairplanecrash(10Kwebpages).....................................54Figure20Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforPanamaPapers(100Kwebpages).............................................55Figure21Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforOrlandoshooting(50Kwebpages)............................................56Figure22Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforParisattack(500Kwebpages).....................................................57Figure23Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforEcuadorearthquake(100Kwebpages)...................................58Figure24EffectofsourceimportanceoneventfocusedcrawlingforBrusselsattack
event...................................................................................................................................................62Figure25EffectofsourceimportanceoneventfocusedcrawlingforCalifornia
shootingevent................................................................................................................................62Figure26EffectofsourceimportanceoneventfocusedcrawlingforEcuador
earthquakeevent..........................................................................................................................63
viii
Figure27EffectofsourceimportanceoneventfocusedcrawlingforOrlandoshootingevent................................................................................................................................63
Figure28Workflowforextracting,expanding,andselectingURLsfromtweets......67Figure29Brusselsattacktweetslanguagedistribution.......................................................68Figure30Brusselsattacktweetscreationdatedistribution...............................................69Figure31BrusselsattacktweetswithURLsdistribution.....................................................69Figure32BrusselsattackseedURLsdomainsdistribution................................................70
ix
ListofTablesTable1Exampleeventsofdifferenteventtypes.........................................................................3Table2Listofeventsusedinthelarge-scalecrawlingexperiments...............................38Table3ValuesoftheparametersthatproducedthebestF1-score.Kisthesizeofthe
topicvectorandthresholdisthecutoffvaluefordeterminingrelevantornon-relevantlabelsbasedonthescore........................................................................................43
Table4 Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,DateevaluatedonthemanuallylabeledTRAININGURLsdatasetforCaliforniashootingevent................................................................................................................................44
Table5 Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,andDateevaluatedonthemanuallylabeledTRAININGwebpagesdatasetforCaliforniashootingevent...........................................................................................................45
Table6Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,andDateevaluatedonthemanuallylabeledTESTwebpagesdatasetforCaliforniashootingevent...........................................................................................................46
Table7 Californiashootingeventmodel......................................................................................48Table8 Brusselsattackeventmodel..............................................................................................51Table9DifferenttypesofseedURLs.............................................................................................65Table10Brusselsattacktweetcollectionstatistics................................................................68Table11Harvestratioforeventfocusedcrawlerusingtwomethodsofseed
selectionwithdifferentnumbersofseedsforBrusselsattack.................................71Table12Numberofdifferentdomainsintheoutputcollectionsofcrawling
experimentsforBrusselsattack.............................................................................................71Table13Harvestratioforeventfocusedcrawlerusingtwomethodsofseeds
selectionwithdifferentnumbersofseedsforOregonshooting..............................71Table14Numberofdifferentdomainsintheoutputcollectionsofcrawling
experimentsforOregonshooting..........................................................................................72Table15Harvestratioforeventfocusedcrawlerusingtwomethodsofseeds
selectionwithdifferentnumbersofseedsforCaliforniashooting.........................72Table16Numberofdifferentdomainsintheoutputcollectionsofcrawling
experimentsforCaliforniashooting.....................................................................................72Table17Harvestratioforeventfocusedcrawlerusingtwomethodsofseeds
selectionwithdifferentnumbersofseedsforOrlandoshooting............................73Table18Numberofdifferentdomainsintheoutputcollectionsofcrawling
experimentsforOrlandoshooting........................................................................................73
1
1 Introduction
1.1 MotivationThereisneedforanintegratedeventfocusedcrawlingsystem[1-3]tocollectWebdataaboutkeyevents.Eventsleadtoourmostpoignantmemories.Werememberbirthdays,graduations,holidays,weddings,andothereventsthatmarkstagesofourlife,aswellasthelivesoffamilyandfriends.Asasocietywerememberassassinations,naturaldisasters,man-madedisasters,politicaluprisings,terroristattacks,andwars--aswellaselections,heroicacts,sportingevents,andothereventsthatshapecommunity,national,andinternationalopinions.WebandTwittercontentdescribesmanyofthesesocietalevents.Inpart,Web2.0[4]isahighlyresponsivesensorofimportantoccurrencesintherealworld,sincepeoplefromacrosstheglobemeetvirtuallyandsharerelatedobservationsandstoriesonline.Wecanleveragethisstreamofdata,forautomaticcollectionofevents,totriggereventarchiving,andlatertoenableeventrelatedservicesthatsupportcommunities.Permanentstorageandaccesstobigdatacollectionsofeventrelateddigitalinformation,includingwebpages,tweets,images,videos,andsounds,couldleadtoanimportantinternationalasset.Regardingthatasset,thereisneedfordigitallibraries(DLs)providingimmediateandeffectiveaccess,andarchiveswithhistoricalcollectionsthataidscienceandeducation,aswellasstudiesrelatedtoeconomic,military,orpoliticaladvantage.Whensomethingnotableoccurs,manyuserstrytolocatethemostup-to-dateinformationaboutthatevent.Later,researchers,scholars,students,andothersseekinformationaboutsimilarevents,sometimesforcross-eventcomparisonsortrendanalyses.Yet,thereislittlesystematiccollectingandarchivinganywhereofinformationaboutevents,exceptwhennationalorstateeventsarecapturedaspartofgovernmentrelatedWebarchives.ThisistheneedaddressedbytheIntegratedDigitalEventArchiveandLibrary(IDEAL)project[5].ThoughtheInternetArchive[6]supportssomeevent-orientedarchiving,coverageislimited.Manyimportanteventsareignored,whileothersareonlycapturedinpart.Somegroupscollectdataonlyuntilthelastvictimisrescued,whileothersstartlateandmissearlyposts.Further,toolsforcapturearecomplex,andfewarchivists
2
mastertheirfeatures,soachievinghighrecallisexpensive.Therearefewmechanismstofilteroutnoiseincollections.Accesstotheresultingarchivesisawkwardandinefficientduetothefactthatmuchofthecontentcapturedisnon-relevant[7].WearguethatmanualcurationofseedURLsisnotscalableandnotfullyeffectiveforarchivingeventsthathavehighimpact.Thus,improvedtechnologyisneeded.TheIDEALprojectisdevelopingadigitallibrary/archivesupportingautomaticeventtracking,crawling,andarchiving,aswellaseffectiveaccess(inthesenseofaidinginthefindingandutilizationofrelevanthighqualityinformation).Figure1showsanoverviewoftheIDEALproject.Bytakinginputfromtweets,news,webpages,(micro)blogs,andqueries,oursystemwillcollectandarchiveeventrelateddigitalobjects,andprovideabroadrangeofhelpfulservices.
Figure1OverviewofIDEALsystemandroleofeventfocusedcrawling
ThisdissertationfocusesonthedatafrontendoftheIDEALproject,i.e.,collectingandarchivingdatausinganewtypeoffocusedcrawler.TheIDEALprojecthasaround11TBofwebpagearchives(WARCfiles)andover1.2billiontweetsacrosshundredsofdifferentevents[8].Earlyon,thewebpagearchiveswerecollectedusingtheInternetArchive’s[6]Archive-Itservice[9],whichusestheHeritrix[10]toolforarchivingwebpages.Originally,theIDEALprojectmanuallypreparedalistofURLsforevents,andfedittotheArchive-Itserviceforcrawling.Theproblemwiththisweaklycuratedapproachisthatweproducedcollectionswithlowprecision(i.e.,withfewrelevantandmanynon-relevantwebpages);HeritrixisageneralWebcrawleranddoesn’tanalyzethetextualcontentofthewebpagesbeforedownloadingthem.Toovercomethisproblem,theIDEALteamshiftedtoanotherapproach:weextractedURLsfromtweetarchivesthatwerebuiltaboutevents,and
3
downloadedonlythecorrespondingwebpages.Theresultingcollectionshavehighprecision(mostofthewebpagesarerelevant)butlowrecall(notalloftherelevantwebpagesarefound).Afocusedcrawlerwouldhelpsolveeachofthepreviousproblemsbycrawling(toincreaserecall)theWWWstartingfromtheURLsextractedfromthetweetarchivesbutthenfollowingonlytherelevantwebpages(toenhanceprecision)inordertofindandcollectasmuchrelevantinformationaspossible.However,inthepast,focusedcrawlingwasmainlyappliedtotopicalcrawling,i.e.,collectingwebpagesaboutacertaintopicordomain.Accordingly,sinceweareworkingwithevents,weextendedtheapproachpreviouslyusedbytheIDEALteamandadapted/changedthetraditionalfocusedcrawlerapproachtoaccommodateourneeds.
Table1Exampleeventsofdifferenteventtypes
EventTypes ExampleEvents
Bombing Bostonbombing
BuildingCollapse EastHarlembuildingcollapse
Community Flintwatercrisis,Lovewins(samesexmarriage),Worldcup
Earthquake Ecuador,Japan
Fire Californiawildfire,Brazilnightclubfire
Flood TexasFloods
Hurricane Joaquin,Sandy,Katrina
PlaneCrash Egyptair,Russian,germanwings
PoliticalConflict Brexit,Turkeycoup,GreeceBailoutreferendum
Protests/Riots Ferguson,Egyptianrevolution
Scandal PanamaPapers,SeppBlatter
Shooting Oregoncollegeshooting,Californiashooting,Orlandoclubshooting
TerroristAttack Paris,Brussels,Nicetruckattack
TrainDerailment Amtrak188,Quebec
4
Thereareseveraldefinitionsofaneventthatdifferaccordingtodiscipline(seeChapter4formoredetails).Inthisdissertationweareinterestedinunusualrealworldeventsthatcapturemuchattention,leadingtoahighvolumeofcontentgeneratedontheWWWabouttheevent.Thusweconsiderdifferenttypesofeventslike:airplanecrashes,buildingcollapses,earthquakes,floods,fires,hurricanes,politicalcrises,scandals,shootings,terroristattacks,andtraincrashes/derailments.Allthesetypesofeventssharethesamecharacteristics:somethinghappensataspecificphysicallocationduringaspecificperiodoftime.ExampleeventsforeacheventtypeareshowninTable1.Thisisnotanexclusivelistofeventsandtheirtypes,butratherarepresentativelistofwhatwehavebeenworkingoninthisdissertation.Ourresearchshouldgeneralizetoanyeventwiththesamecharacteristics(i.e.,somethingunusualhappensataspecificplaceonaspecificdate).Weproposedfivemajorchangestotraditionaltopicalfocusedcrawling:
1. Implementinganeventmodelandrepresentation, 2. Incorporatingtheeventmodelinformationextractedfromseedwebpages’
contentintofocusedcrawling, 3. Developingawebpagesourceimportancemodel, 4. Incorporatingthewebpagesourceimportancemodelintofocusedcrawling, 5. AutomatingtheprocessofseedURLsselection.
OurproposedapproachintelligentlycombinesthedifferentaspectsofaneventtoheuristicallyestimatetherelevancetotheeventofaURLand/orawebpage.Ourintelligenteventfocusedcrawlerdistinguisheson-topicURLs/webpagesvs.off-topic(likewiththebaselinetopic-onlyapproach),andalsodistinguisheson-eventURLs/webpagesvs.off-eventURLs/webpagesonwhichthebaselinetopic-onlyapproachfails.Forexample,considertheCaliforniashootingevent.Bothapproaches(ourintelligenteventfocusedcrawlerandthebaselinetopicalfocusedcrawler)successfullyidentifyon-topicURLs/webpages(i.e.,URLs/webpagesaboutashooting).However,oureventfocusedcrawleralsointelligentlydistinguishesbetweenURLs/webpagesthatareon-event(e.g.,relevanttotheCaliforniashooting)andURLs/webpagesthatareoff-event(butstillon-topic,i.e.,aboutashooting).Weconductedfourseriesofexperimentstoevaluateoursystemusingasetofrecentevents:Orlandoshooting,Ecuadorearthquake,Panamapapers,Californiashooting,Brusselsattack,Parisattacks,andOregonshooting.Thefirstexperimentseriesevaluatedtheeffectivenessofourproposedeventmodelrepresentationwhenassessingtherelevanceofwebpages.Oureventmodeloutperformedthetopic-onlyapproaches;itshowedbetterresultsinprecision,recall,andF1-score.Thesecond
5
experimentseriesevaluatedtheeffectivenessoftheeventmodel-basedfocusedcrawlerforcollectingrelevantwebpagesfromtheWWW.Oureventmodel-basedfocusedcrawleroutperformedthestate-of-the-artfocusedcrawler(best-firstsearch);itshowedbetterresultsinharvestratio.Thethirdexperimentevaluatedtheeffectivenessofourproposedwebpagesourceimportanceforcollectingmorerelevantwebpages.Thefocusedcrawlerwithwebpagesourceimportancemanagedtocollectroughlythesamenumberofrelevantwebpagesasthefocusedcrawlerwithoutwebpagesourceimportancebutfromasmallersetofsources.ThefourthexperimentprovidesguidancetoarchivistsregardingtheeffectivenessofcuratingseedURLsfromsocialmediacontent(tweets)usingdifferentmethodsofselection.Ourcontributionsfromthisresearchare:
1- Amodelandrepresentation forcapturing thedifferentaspectsofevents inwebpages(topic,location,anddate);
2- An extended focused crawler approach that uses our event model torepresentcontentandtoestimatetherelevanceofwebpages;
3- Anautomatedapproachforsocialmedia-basedseedURLselection;4- A method to estimate the value of webpages based on their source
importance;5- Anextendedfocusedcrawlerapproachthatintegrates,foreachwebpage,the
textual content information based on our event model, along with thewebpagesourceimportance,intoasinglerelevancescore.
Thefollowingsubsectionsintroducethehypothesestestedthroughthisdissertation,theresearchquestionsinvestigated,the5Sapproachusedtoguideourwork,andtheorganizationofthefollowingchapters.
1.2 HypothesesThisresearchconsidersthefrontendoftheIDEALproject,i.e.,collectingandarchivingdatausinganewtypeoffocusedcrawler.Wetesttwohypothesesaboutfocusedcrawling,thefirstofwhichhasthreeparts:
H1:Usingseveralsourcesofevidenceofwebpagerelevance,includingcontent-basedfeatures,willhelpidentifymoreofjusttherelevantwebpages.
H1.1Usingtemporalandspatialinformationaspartofmodelingandrepresentingeventsleadstobetterdescriptionsofeventsthanthetopic/keyword-basedapproach.
6
H1.2:Incorporatingeventinformationextractedfromawebpage’scontentintofocusedcrawlingwillincreaseeffectiveness,helpingidentifymoreoftherelevantwebpageswhilemaintaininghighlevelsofprecision.H1.3:Awebpagebelongingtoanimportantwebsite(e.g.,www.cnn.com)shouldleadtoalargernumberofrelevantwebpageslinkedtoit.Thesourcewebsite(e.g.,www.cnn.com)isactinglikeahubwhichleadstorelevantwebpages.
H2:Integratingeventinformationandwebpagesourceimportancewillimproverelevancepredictionpower,ensurebalanceintypesofcontentcrawled,andreducebias.
1.3 ResearchQuestionsThisdissertationaddressesfiveresearchquestions,listedbelow,thatmap,asshowninparentheticals,tothehypothesesgivenabove.R1:Howtomodelandrepresentanevent?(H1.1)R2:Howtocomparetwoeventrepresentations?(H1.1)R3:Whatistheeffectofintroducingeventhandlingontheperformanceoffocusedcrawling?(H1.2)R4:Howtomodelandrepresentwebpagesourceimportance?(H1.3)R5:Howtointegratehandlingofeventsandinformationsources?(H2)
1.4 5SOursystemisdesignedtakingintoconsiderationthe5Sframework[11-14]fordevelopingdigitallibraries.Wedescribeherethedifferentdimensionsofthe5Sframeworkandhowtheyareappliedinoursystem.Weareusingthe5S(societies,scenarios,spaces,structures,streams)digitallibraryframeworkfortwomainreasons.First,5Sprovidesachecklistanddesignguidelinesthathelpintheunfoldingofourresearch.Second,themaingoalofoursystemistobuildaneventdigitallibrary,whichisakeypartoftheIDEALproject,whichprovidesservices(byimplementingscenarios)foravarietyofusers(societies).Wedemonstratehowfocusedcrawlingaddscontent(digitalobjects)totheIDEALeventdigitallibrary.Thekeydigitalobjectsaretweetsandwebpages,thateachcanbeviewedasacombinationofstreamandstructure.Anotherofthemainobjectsinourfocusedcrawleristheeventobject,acombinationofstreamand
7
structureandspace.The5Smodelallowsustotreateventsasfirstclassobjects(i.e.,eventobjectscanbecreated,stored,sortedondifferentattributes,searched,andvisualized-basedonattributeslikelocation).Amaindesignconsiderationofourfocusedcrawleristheabilitytoreceiveaneventobjectasinput,andthenstartcrawlingtheWWWforwebpagesthatarerelevanttothateventobject,therebyimplementingavariantoftheWebcrawlingscenario.Morediscussionof5Sanditsconnectiontothisdissertationresearchfollows.Societies:Thesystemcanservedifferentstakeholderslike:historians,thoseinthegeneralpublic,researchers,analysts,responders,eventparticipants,anddecisionmakers.Inaddition,therearesoftwareagents;seemorebelowunderServices.Scenarios:Inadditiontoactivitiesofsoftware(seemorebelowunderServices),thefollowingwillbesupported:
• AUsercanstartbuildingacollectionusingtheeventfocusedcrawler.TheusershouldprovideasetofURLs,collecttheoutputofasearchengine,orsamplefromsocialmediacontentthatisprovidedasinputtooursystem.
• AUsercanvisualizeacertaineventorgroupofevents.Severalvisualizationschemesareprovided,includingmapsandtimelines.
• AUsercanbrowsethesystemcontentaccordingtofacetsorothermechanisms,considering:eventcategories,contenttype,entities,andarchives.
• AUsercansearchallsystemcontenttypes.• AUsercansearchacertaineventcollectionorseveralcollections.
Spaces:Objectsinoursystemcanbeviewedinseveralspaces:
• Webpagedocumentsarerepresentedusingthevectorspacemodel[15,16].• Eventsarerepresentedinaneventspacemodel(topic,location,anddate).• Two-orthree-dimensionalinterfacesaiduserinteraction.• Probabilityspacesaidwithcharacterizinginterdependenciesanddrawing
inferences.Structures:Organizationofobjectsinoursystemcanbeperformedaccordingto:
• Eventcategory(earthquake,flood,hurricane,shooting,planecrash,etc.),whichcanfitwithinataxonomy,ontology,orothertypeofstructure;
8
• Datastreamtype/schema(fortext,image,audio,video,eventrecord,archive);or
• Entity(location,date,personname,organizationname).Representationsinclude(metadata)recordsindatabases,graphs,trees,etc.Streams:Theseinclude:webpagetexts,Webarchives,crawllogs(datarecordinghowthefocusedcrawlercreatedthecollection),images,videos,audiofiles,eventsummariesordescriptions,andothersimilarentitiesassociatedwithevents.Weconsideredeventsasfirst-classobjectsandcreatedcharacterizationsforeachevent.Usingeventrelatedstreams,youcansearchforanevent,browseevents,visualizeevents,etc.Services(notoneofthe5Ss,butrelatedtoSocietiesandScenarios,andhelpfulindescribingaDL):Oursystemwillprovideservicesincluding:
• Creatingortransformingcollections/archives;• Analyzingcollections(extractingentities,providingsummaries,classifying,
clustering);• Searchingallkindsofdatastreams;• Browsingaccordingtoeventcategories,datastreamtypes,and/orentities;
and• Visualizingcontent.
1.5 ThesisOrganizationThisdissertationisorganizedasfollows.Chapter2discussesthearchitectureofthebaselinetopical/focusedcrawler.Chapter3reviewsdifferentapproachesforfocusedcrawling.InChapter4,weproposeourneweventmodelandrepresentation,andexplainhowitisintegratedwiththefocusedcrawlingapproach.Chapter5coversthedesignofexperimentsperformed,whileChapter6presentstheevaluationofoureventmodel-basedfocusedcrawlerandofthebaselinefocusedcrawler.Chapter7explainsthewebpagesourceimportancemodelanditseffectonourfocusedcrawler.Finally,Chapter8concludesanddiscussesissuesforfutureresearch.
9
2 RelatedWorkWediscussthedifferentfieldsrelatedtoourworkinatop-downmanner.FirstwediscussthebiggergeneraltopicofWebcrawling.ThenwediscusstheIDEALprojectandWebarchivingtechniques.Thenwediscussfocusedcrawlertechniques.Mostoftheworkdoneintraditionaltopicalfocusedcrawlingfallsintooneofthreecategories:machinelearning,semanticsimilarity,orcontentandlinkanalysis.Wediscussthemajorworkdoneinthesethreecategoriesinthenextsubsections.Alongtheway,wealsotouchonpublicationsrelatedtoeventmodeling,socialmediaintegrationwithfocusedcrawling,seedselection,andfinallyfocusedcrawlingevaluation.
2.1 WebCrawlingWebcrawlers[17]aresoftwareprogramsthattraversetheWWWfollowingthelinksonthewebpages.AcrawlermodelstheWWWasagraphwherenodesarewebpagesandedgesbetweennodesarethehyperlinksthatmanifestinthewebpages.Acrawlerstartsfromasetofwebpages,calledtheseedset,andfollowsthelinksonthosewebpages.Itdownloadsthecorrespondingwebpages,extractsthelinksinthem,andthenrepeatsthewholecycle.Acrawlerkeepstwodatastructuresthatfacilitatethecrawlingprocess[18],theURLqueue(frontier)andthevisitedURLslist.ThecrawlerusesthefrontiertokeeptheURLsthatareextractedfromwebpagesbutnotvisitedyet,andthevisitedURLslisttokeeptrackoftheURLsthatwerevisitedsoitwon’tvisitthemagain(ortocontrolthefrequencyofvisitingthemagain).Webcrawlershavebeenusedbysearchenginestocollectasmanywebpagesaspossible.Thesearchengineparsesthecollectedwebpages,extractsthetext,andbuildsasearchindex[18].Theindexisthemainelementusedtosupportthesearchingservice.
2.2 TheIDEALprojectTheIDEALproject[5]teamhasdevelopedover1000tweetarchivesaboutgeneraltopicsand/orevents,alongwithover66Webarchivesofman-madeandnaturaldisasters,thelatterusingtheInternetArchive’sgeneralcrawler[10].
10
Figure2WorkflowforcreatingWebarchivesfromsocialmedia(Twitter)
Eventarchivingisdifferentfromdomain/site-basedortopic-basedarchiving.Thefirstinvolvesarchivingaspecificdomain/websitewithallorsomeoftheunderlyingsubdomains/structure.Thesecondcoversagivennumberofwebpagesrelatedtoauser-definedtopic. TheIDEALprojectteamhasidentifiedandemployedthreeapproachesforarchivingwebpagesaboutevents:
1. Manualcurationbydomainexperts,librarians/archivists,andgovernmentagencies.(Highquality–timeconsuming).Seehttps://archive-it.org/explore?show=Collections&fc=meta_Subject:Spontaneousevents
2. Socialmedia-based(crowdsourcing)curationbyextracting,retrieving,andarchivingURLsfromtweetcollectionsaboutanevent(Lowquality–timesaving).Seehttp://www.eventsarchive.org/?q=node/42
3. CrawlingtheWebusingafocusedcrawlingapproachtailoredtoevents(withacceptablequalityandtime)
TheIDEALprojectteamhasusedthefirstapproachandcreatedaround66Webcollectionsaboutdifferentkindsofevents[19].TheymanuallycuratedseedURLsandfedthemintotheArchive-Itserviceforcrawling(seeFigure3).TheyhaveapplieddifferentsettingsoftheArchive-Itconfigurationparametersaccordingtotheimportanceoftheeventaboutwhichtheyarecollecting,andthetypeofURLstheycurated.Thetwomainconfigurationparametersarethefrequencyofcrawling
Event
Collect*Tweets
Tweet*Collection
ExtractURLs
Shortened*URLs
Expand Original*URLs
Fetch Webpages
Archive WARC
Index SOLR
Browse
Wayback
Search
Access
Keyword/Hashtag
Collect Archive/Organize/Analyze
11
andthescopeofcrawling.Thefirstonecontrolsthefrequencybywhichthecrawlershouldrevisitandre-crawlthewebpage,whilethesecondparametercontrolswhetherthecrawlershouldfollowtheoutgoinglinksinthewebpage.TheIDEALteamautomatedtheprocessofcuratingseedURLs[20]bycollectingtweetsabouttheeventsofinterestandthenextractingURLsfromtweetsandfeedingthemtotheArchive-Itserviceforcrawling.Socialmediaingeneral,andTwitterinparticular,provideaveryrichsourceforuser-generatedcontent,whichcontainsalargenumberofURLs.TheIDEALprojectteamhascreatedaround1000tweetcollectionsaboutgeneraltopicsandspecificevents[21].Thetweetcollectionssufferfromnoisycontentlikeporn,jobads,marketing,etc.TheIDEALprojectteamhasappliedseveralfilteringmethodstoensuretheresultingtweetcollectionscontainrelevantcontentonly.Figure2showstheworkflowforcuratingseedURLsfromsocialmediasources(e.g.,Twitter).TheresultingseedURLsarearchivedusingtheHeritrix[10]toolandthentheresultingWebarchivesareindexedbyasearchengineforprovidingaccess,searching,andbrowsingservicesforusers.Thelastapproachisaimedtomaintainabalancebetweenproducinghighqualityeventcollectionsandreducingthetime/resourcesneededforcollectionbuilding.TheIDEALprojectteamhasdevelopedtoolsforsemi-automaticallycollecting,curating,andarchivingwebpagecollections,leveragingmethodsforeventmodelingandfocusedcrawling.TheeventmodelingcoversespeciallyidentifyingandrepresentingeventsconsideringtheirWhat,Where,andWhenaspects.TheIDEALproject’sworkonfocusedcrawlingcouldbeofbenefitforWebarchivingby:
1. HelpingpreparelistsofURLstobearchived(i.e.,afocusedcrawlerrecommendingaseedlist);
2. Helpingextendacollectionautomatically(usingexistingcollectionsformachinelearningtypetrainingofafocusedcrawlertofindsimilarnewwebpages);and
3. Analyzingandsummarizingtheproducedeventcollectionsbyusingthedevelopedeventmodel.
2.3 WebArchivingandArchive-ItserviceCloselyrelatedworkhadbeendoneintheemergingandpromisingfieldofWebarchivinganddigitallibraries(WADL)[22-24].ThemostrelatedaspectofWebarchivingistheselectionofwebpagestobearchivedandhowtocollectthesewebpages,aprocesssometimesknownasWebcuration,typicallygovernedbya
12
selectionpolicy.GeneralWebcrawlersarethedominantmethodforWebarchiving.However,severalnewtechniqueshaveemergedwhichdonotdependoncrawlingtechnology,butratherdependonthetransactionalbehavior[23]oftheWWW(HTTPprotocol)todrivearchivingofwebpages.ManywebpagearchivesarecreatedbycuratorsusingtheInternetArchive’sservicecalledArchive-It[9],whichhelpswithharvesting,building,andpreservingcollectionsofdigitalcontent.TheservicetakesURLsasinputfromauser.Figure3showstheinterfaceforuserstomanuallyentertheURLsfromwhichthecrawlerwillstartcrawling.TheseURLsareusedbyArchive-IttocrawltheWeb,guidedbymanualconfigurationdetails(scopingofthedomainofwebpagestocrawl,typesoffilestocrawl,followingrobots.txtprotocol,etc.),andtheresultingwebpagesarecapturedandstoredinWARC[25]files.TheArchive-Itservicehasprovidedanautomatedwayfordifferentkindsofuserstoarchiveandsaveimportantwebpagesinwhichtheyhaveinterest.However,themethodologyusedinthecrawlertechnologybehindtheArchive-Itserviceisorientedtowardarchivinggeneralwebsites(likegovernment,state,universitylibraries,andfederalwebsites)wherethewholecontentofthewebsiteandthefrequentchanges/updatesofthewebsitearethemainscopeofthearchivingprocess.Forspontaneouseventsthisapproachisnotwellsuited.Mostoftheevent-relatedcontentinvolvesonlyspecificwebpageswithinawebsite,andthosewebpagesarenotfrequentlychanged/updated.Therefore,usingtheArchive-Itservicemayresultinanarchivewithmostofthecontentnotrelatedtotheeventofinterest.In[7]theauthorsanalyzedWebarchivesaboutschoolshootings.TheirresultsshowthatrepresentativeWebarchivesarenoisy,with2%-40%ofwebpagesreflectingrelevantcontent.
13
Figure3SeedURLsmanualcurationusingArchive-Itservice
2.4 FocusedCrawling
2.4.1 MachineLearningMachinelearningbasedfocusedcrawlerapproachesapplytextclassificationalgorithms[26-28]tolearnamodelfromtrainingdata.Thefocusedcrawlerthenusesthemodeltoestimatetherelevanceofunvisitedwebpages.Useofthemodelenhancestheperformanceoftheclassifierbyincorporatingdomainspecificknowledgeandonlinerelevancefeedback.Ourapproachlikewisecanbeconsideredasinvolvingaclassificationtask;werequiretrainingdataforcalculatingtheweightsofthedifferentaspectsoftheeventandweareusingthewebpagetextforbuildingtheeventmodel.RennieandBarto[29]usedreinforcementlearningforsolvingthefocusedcrawlingproblem.TheymodeledthefocusedcrawlingproblemasaMarkovdecisionprocesswithwebpagesasstates,URLsasactions,andon-topicwebpagesastherewards.Anotherreinforcementlearningalgorithm,temporaldifferencelearning,wasused
14
in[30].Theyusedastatevaluefunctiontoestimatetheimportanceofwebpagestoleadtofuturerelevantwebpages.Inourapproach,weusewebpagesourceimportancetoestimatethevalueofwebpagesforlinkingtootherrelevantwebpages.Anotherwork[31]usedthereinforcementlearningframeworkproposedin[29]andenhanceditsperformancebyapplyingincrementalonlinelearning.ForeachnewURL,theyestimateitscorrespondingclassanduseitsfeaturestoupdatetheclassfeaturesanditscorrespondingq-value.Thentheyretrainthesupervisedlearningalgorithmbasedonthenewtrainingdata(oldtrainingdataandthenewURLsseen).Thisapproacheliminatesthedatabiasthatappearsinthetestdata,whereunseenURLsmayappearfromnewdomainsthatwerenotfoundinthetrainingdata.InourapproachweaddressbiasthroughusingwebpagesourceimportancebothinselectingseedURLsandalsoduringcrawling.Infospidersisatopicalcrawlerbasedonadaptiveonlineagents[32]thatusegeneticprogrammingandreinforcementlearningapproachestoestimatetherelevanceofawebpage.Inourapproachwegobeyondjustusingtopics,anduseeventmodelingforestimatingtherelevanceofawebpage.In[33],afocusedcrawlerwasdevelopedforcollectingwebpagesthatcontainsemanticinformation(semanticannotationsexpressedasstructureddataembeddedintheHTMLofthewebpage).TheauthorsproposedanewmethodologyforcrawlingtheWebthatutilizesanonlineclassifierandareinforcementlearningbanditalgorithmforselection.Theonlineclassifierlearnstodetecttherelevantwebpagesduringcrawling,eliminatingtheneedfortrainingamodelbeforecrawling.ThebanditalgorithmprovidesaframeworkfororderingandselectingthenextURLtovisitduringcrawling.TheURLsinthecrawlerfrontieraregroupedbytheirWebhost/domainandeachhost/domainisscoredaccordingtoanimportancemeasure.Thecrawlerchoosesthehost/domainwiththehighestscoreandthenchoosesfromthathost/domaintheURLwiththehighestscore.
2.4.2 SemanticSimilaritySemanticsimilarity-basedtechniquesuseontologies[34,35]fordescribingthedomainofinterest.Thedomainontologycanbebuiltmanuallybydomainexpertsorautomatically,usingconceptextractionalgorithms.Oncetheontologyisbuilt,itcanbeusedforestimatingtherelevanceofunvisitedwebpagesbycomparingtheconceptsextractedfromthetargetwebpagewiththeconceptsthatexistintheontology.Theperformanceofsemanticfocusedcrawlingdependsonhowwelltheontologydescribesandcoversthedomainofinterest.Manyofoureventsare
15
disaster-relatedevents.Althoughwearenotusingdomainknowledgeoreventontologies,yetoureventmodelcouldbeeasilyextendedtomakeuseofdisasterdomainknowledgeforfindingdisaster-specifickeywords.Oureventmodelalsocouldbeeasilymappedtoportionsofadisaster-relatedeventontology[12,36],butfurtherresearchwouldbeneededtoseeifsuchwouldyieldimprovement.Further,usingthisapproach,inasystemthatisaimedtohandleanytypeofimportantevent,wouldrequireconsiderableknowledgeengineeringwork.Relatedtosemanticsissentiment.Sentimentanalysishasbeenintegratedintofocusedcrawlingintwoways:sentiment-focusedcrawling[37]andfocusedcrawlingleveragingsentimentinformation[38].Inbothworks,crawlersmadeuseofthesentimentinformationtobuildthetargetmodelandtoguide,focus,anddirectthecrawlerthroughtheWebgraphbyestimatingtherelevanceofunvisitedURLs.Sentimentorientedcrawlingisconsideredatypeoftopicalcrawling,wherethetargetistocrawlwebpagesthathaveagivensentiment.Sentimentclassescouldbeassimpleaspositivevs.negative,orcomplexbasedonaspecificdomain.Inourwork,eventshavemorecomplexstructurethansentiment.Wecanleveragesentimentinformationabouteventsbutwehavetofindtherelevantwebpagesabouteventsfirstandthenanalyzethesentimentinformationinthosepages.
2.4.3 ContentandLinkAnalysisTextandlinkanalysisalgorithmscombinetextanalysisschemes(e.g.,VectorSpaceModel(VSM)[15,16])andlinkanalysisalgorithmstoestimatetherelevanceandimportanceofwebpages[39-41].Linkanalysisapproachesintroducetheconceptofpopularwebpages.PopularityismeasuredbasedonthelinkstructureoftheWWW.ThisledtotheintroductionoftheconceptsofHubandAuthoritywebpages[42].Hubwebpageshavelinkstomanyauthoritywebpageswhileauthoritywebpagesarelinkedtomanyhubwebpages.Amongthelinkanalysisalgorithms,PageRank[43,44]isthemostused.Alternatively,contextgraphsareusedtorepresentthecontextofawebpageusingneighborhoodwebpagesthataremostsimilartoit.Anotherlineofresearchincorporatesthegenreofwebpagesintofocusedcrawling[45].Thegenreofthewebpagedefinesthetypeofthewebpage(e.g.,forum,tutorial,news,blog,course-syllabus,etc.).Thefocusedcrawlerusestwosetsofkeywords,onefordeterminingthegenreofthewebpageandtheothersetfordeterminingthetopic.Thetwosetsofkeywords(genreandtopic)aremanuallydeterminedbyexpertsandthenusedbythefocusedcrawlerforestimatingtherelevanceofthewebpage.
16
Relatedtowebpagesourceimportance,Pantetal.[46,47]describeanewWebcharacteristic:statuslocalityontheWeb.Awebpage’sstatusmeasurestheimportanceofthewebpagewithrespecttoitspopularity,andisapproximatedbythenumberoflinkspointingtoit.Pantetal.developedanalgorithmforestimatingthestatusofawebpagebasedonlocalcharacteristicsofthewebpageandalsodemonstratedthatthestatuspropertyhassomeofthesamecharacteristicsasthetopicalproperty.Ourapproachalsotriestopredicttheimportanceofawebpagebyexaminingtheimportanceofthedomaintowhichthewebpagebelongs.Theimportanceinourcaseisviewedwithrespecttothedomainoftheeventofinterest.Forexample,inanearthquakeevent,somewebsitesmaybemoredominant,andthusmoreimportant,thanotherwebsites,basedonthenumberofrelevantwebpagesfoundforthatdomain.Chen[48]developedahybridapproachforfocusedcrawlingusinggeneticprogrammingforexploitingdifferentfeaturesinawebpage’stext,andmetadatasearchforexploringdifferentsourcesontheWWW.Chenappliedthegeneticprogrammingapproachforcombiningdifferentrelevancesignalsfromthewebpagetext.HealsousedmetadatasearchforgatheringseveralseedURLsforthecrawlertostartfrom,thusexpandingthecrawler’scoverageoftheWWW.Inordertoovercomethebiasthatcanbefoundinonesearchengine,heusedmultiplesearchenginesandcombinedtheirresults.Ourapproachalsoaimstoimproverecallincrawling,butusesothertypesofevidence(webpagesourceimportance)todoso.
2.5 EventModelingEventmodelingrecentlyhasgainedpopularityindifferentfields,liketopicdetectionandtracking(TDT)[49],animaldiseaseoutbreakdetection[50],networkedmultimediaevents[51],anddocumentsimilarity[52].InTDT,aneventisdescribedasatopicthathappensatacertaintime,inaspecificlocation,andwithaparticularsetofparticipants.Inmultimediaapplications,aneventisdefinedasatupleofaspects:informational,spatial,temporal,structural,causal,andexperiential.TheinformationalaspectincludeseventID,eventtype,andanyotherattributesthatserveasidentificationoftheevent.Spatialandtemporalaspectsrepresentthelocationandtimeproperties,respectively.Thestructuralaspectincludesthesub-eventsbelongingtothecurrentevent.Thecausalaspectincludestheeventscausingthecurrentevent.Finally,theexperientialaspectincludesallmediaresourcesrelatedtothecurrentevent.Weincorporatedeventmodelingintoourcrawler,tobuildanevent-awarefocusedcrawler[3].WecameupwithaneventmodelthatintegratesideasfromTDT[49]
17
(webuiltoureventmodelusingthesamedefinition:somethingthathappenedatacertainplaceonaspecificdate)andworkdonein[50](theydefinedaneventasacombinationofdomain-relatedkeywords,locationentitiesanddateentities,anddiseasenameentitiesthatappearonthesentencelevel).Weusetopic,location,anddateinformation,butaggregateatthecollectionlevel.Aprobabilisticeventmodelhasbeendevelopedin[53]whereaneventismodeledasalatentvariablethatgenerateswebpages,i.e.,theobservations.Awebpageaboutaneventismodeledasatupleofthreeparts(topic,entities,anddate).Topicandentitiesaremodeledasvariableswithmultinomialdistributionoverwordsinwebpagecontent,anddateismodeledasanormaldistributionoverthepublishingdateofthewebpage.Accordingtothismodel,usingatimeanalysisofpublishingdatesofwebpages,aneventisrepresentedbyapeakinthenumberofwebpagespublishedaroundacertaindate.Apeakcouldrepresentoneeventormultipleeventsthathappenedonthesame/closedates.Thetopicandentitiespartsofthemodelhelpindiscriminatingbetweeneventssharingthesamepeak.In[54,55],theLDAtopicmodelingframeworkisusedtomodelevents.Aneventisrepresentedbyamixtureoftopicsoverasetofwebpages.Thetopicsconsideredarebackgroundtopic,topicforeachevent(extractedfromthesetofwebpagesbelongingtothatevent),andtopicforeachdocument.Thereasonfordividingtopicsinthismanneristocaptureandseparatethelanguageusedforgeneralpurposes(backgroundfamouswords),thelanguageusedforeacheventspecifically,andfinallythelanguageusedforeachwebpagespecifically.Eventshavebeenanalyzedinsocialmedia(likeTwitter)usingtailoredmethodsforextractingevent-relatedinformationfromthetweets[56-59].Eventsaremodeledasasetofwordswithspecificstructurelike“subjectverbobject”.Thepurposeoftheresearchisnotcollectinginformationrelatedtoevent,butrathergivenasetofshorttext(tweets),howtoextractevent-relatedinformation.Thistypeofworkmodelseventsatthefine-grainlevel,whereitlooksforspecificstructureinsentencesthatmightrepresentanevent.Inthisdissertation,eventsaremodeledatacoarse-grainlevelusingthecombinationofwordsatthewebpagelevel.In[60]eventsaremodeledastheco-occurrenceofspatialandtemporaltokensinonesentence.Theauthorsprovidedahierarchyformodelingspatialandtemporaltokens.Forspatialhierarchytheyusedcountry,state,city,andstreet,whilefortemporalhierarchytheyusedyear,month,andday.Usingthesehierarchies,the
18
authorsdevelopedasimilarityfunctionforestimatingthesimilaritybetweendocumentsusingeventmodelinstancesextractedfromthedocuments.In[61]theauthorsanalyzedthecharacteristicsofinformationsourcesduringnewsevents.Theyanalyzedthreetypesofinformationsources:mediaoutlets,socialmedia,andquerylogs.Theyanalyzedthreeevents:SanBrunopipeexplosion,NewYorkstorm,andAlaska3-wayelections.Theyanalyzedtheseeventsbecauseallthreesharedthesamecharacteristics:1)locationiscentralized(i.e.,nomultiplelocations),2)timespanislimited(i.e.,sotheyattractattentionandinterestforashortperiodoftime).Theirdefinitionofeventsissimilartoourdefinition:somethinghappenedatacertainplaceandspecifictime.Thus,thereareseveralwaysfordefininganeventdependingonthecontextandtheapplicationinwhichtheeventisused.Oureventmodelissimilartotheprobabilisticeventmodel[53],whichincorporatesthetopic,date,andentitiesaspects.Wedoidentifyadateforanevent.Wealsorecognizelocationentities.However,othertypesofentitiesfoundareincludedinthetopicpartasnormalkeywords.Oureventmodelcanbeeasilyextendedtoaddothertypesofentities(Persons,Organizations,etc.).Combiningseveraltypesofevidenceisanimportanttaskinfocusedcrawling.Initialworkhasusedcontentandlinkbasedinformation.Multimediainformationalsowasused[62],wheretextandimageswereanalyzedforestimatingwebpagerelevance.ThereaBayesiannetworkwasusedforintegratingevidencefromdifferentsourcesofinformation(textandimages).Likewise,wecombinethewebpagesourceimportancewiththewebpagerelevancescore,inparticularbymultiplyingbothtogethertoproduceafinalscore.
2.6 SocialMediaandFocusedCrawlingSeedSelectionIn[63],theauthorscombinedthefocusedcrawlertechniquewithsocialmediatoimprovethefreshnessofthecrawl.AfocusedcrawlerislimitedbythesetofseedURLsitstartsfrom.Socialmediaproducesahugeamountofusergeneratedcontent(e.g.,tweets)thatmaycontainURLs.Sincesocialmediacontentisproducedlive,theURLscontainedthereinwouldbefreshandpossiblymorerecentthantheURLsvisitedotherwisebythefocusedcrawler.InjectingURLsfromsocialmediaintothefocusedcrawler’sURLsqueue,shouldincreasethefreshnessoftheWebcollectionproduced.Theauthorscrawledabouttwoevents--EbolaandUkraineconflict--andusedakeyword-basedmodeltorepresentthetwoevents.Weusedsocialmediacontent(i.e.,Twitter)asasourceofseedURLs.WecollectedtweetsabouteventsandafterthecollectionprocessfinishesweextracttheURLsandfilterthemtoget
19
relevantURLs.In[64],theauthorsexaminedthetopicalqualityofexistingWebarchivesaboutevents.TheybuiltaframeworkthatassesseswhethertheseedURLsusedinbuildingtheWebarchiveareon-topicoroff-topicacrossthedifferenttimesitwascrawled.Theauthorsusedthevectorspacemodeltorepresentthedocumentsandappliedseveralsimilaritymeasurestocalculatethesimilarityscores.Theyevaluatedtheirmethodusingdifferentthresholdvaluestofindthevaluethatyieldsthebestperformance.WeusedthecosinesimilarityforscoringwebpagesandURLsabouteventswithdifferentthresholdvalues.Wehaveusedsocialmediacontent(i.e.,Twitter)toextractandselectseedURLsforfocusedcrawling.Unliketheworkin[58],weextractedtheURLsfromthetweetcollectionsandthenstartedcrawling.
2.7 Evaluatingtopical/focusedcrawlersSeveralmethodshavebeendevelopedforevaluatingfocusedcrawlers[65,66].In[67]ageneralevaluationframeworkhasbeendevelopedwhereanyfocused/topicalcrawlercanbeassessedaccordingtotheevaluationframework,independentlyfromthedomaintopic.Theauthorsusedthreemethodsforevaluatingdifferentfocusedcrawlers.Thefirstoneisusingaclassifiertoclassifytheresultingwebpagesasrelevantornon-relevant(on-topicoroff-topic).Thesecondmethodisusingaretrievalsystemwherethecollectedwebpagesareindexedinthesystemandspecificqueriesarerunagainstthecollection.Differentcrawlersareevaluatedbasedonthenumberofrelevantwebpagesretrievedforeachquery.Thethirdmethodisusingaveragesimilarityscores.Differentfocusedcrawlermethodsareevaluatedbasedontheaveragesimilarityscoresatdifferentstagesofthecrawl.Sincenoneofthesethreetechniquesfitswellwithourapproach,weusealternativeevaluationschemesintheworkthatfollows.Theabovementionedworkscovermuchofthebackgroundforourresearch.Whiletherehavebeenavarietyofrelatedstudies,ourinvestigationisunique,andimprovesuponthemethodswehaveuncoveredto-date.
20
3 FocusedCrawling3.1 TopicRepresentationOneoftheinputstoafocusedcrawlerisasetofURLs;togetherthesecanbeusedtodescribetheevent/topicofinterest.WerefertothemasmodelURLs;theycanbethesameordifferentfromtheseedURLs.InthisdissertationmodelURLsaresameasseedURLsforsmallscaleexperiments,whileinlargescaleexperiments(whereseedURLslistisverylarge).WeusedthemodelURLslistbecause,aswewilldescribelater,inChapter5,inthelarge-scaleexperimentsthenumberofseedURLsisverylarge.Usingallofthosetobuildamodelwouldbetimeconsuming,andcoulddelaythestartofacrawl.Accordingly,weuseasmallernumberofURLsasmodelURLstobuildtheeventmodel.Theseareselectedonthebasisofprovidinghighqualitytextualcontentabouttheevent/topic.ThefocusedcrawlerusesthissetofURLstobuilditsevent/topicmodel,andthenusesthemodeltoestimatetherelevance[65,68-73]oftheURLsandwebpagesitencountersduringcrawling.TheremainingseedURLsareaddedtothequeueforthecrawl,helpingensurebreadthofcoverageandreducingbias.Fromnowon,weusethetermseedURLsforbothseedURLsandmodelURLsforsimplicity.Weconsidertwowaystorepresentanevent.Intherestofthischapter,weconsiderthefirst,traditional,baselineapproach,whereaneventistreatedlikeatopic[74],characterizedbyasetofkeywords.InChapter4,wedescribeournewapproach,whereaneventisdescribedwitharichermodel.Wechoseabest-firstfocusedcrawlerasourbaselinemethodbecauseithasproventobethestate-of-theartmethodintopicalfocusedcrawling[65,75].Thebaselinebest-firstfocusedcrawlerusestheVectorSpaceModel(VSM)[76]approachtobuilditsevent/topicmodel:
1. UsingthemodelURLs,downloadcorrespondingwebpagesandextracttextfromthosewebpages.Eachwebpageistokenizedtoasetofwords,stopwordsremoved,andwordsstemmedandthenconvertedtoavector.Herethevectorrepresentstheuniquetermsinthewebpagesandtheirfrequencies(howmanytimestheyappearinthewebpage).
2. Thecrawlerthenbuildsavocabularyindexusingthewebpagevectors.Thevocabularyindexmapsthesetofuniquewordsinallthewebpagestoalistofthewordfrequenciesinthewebpages.
3. Usingthevocabularyindex,thecrawlercalculatesaweightforeachwordbysummingallitsfrequenciesinthewebpagesinwhichitappeared.ThiscorrespondstothewordcollectionfrequencyasopposedtothewordTermFrequency(TF).Wehaven’tusedInverseDocumentFrequency(IDF)aswe
21
areusingtheseedwebpagesonly,ratherthanalargegeneralcorpus.4. Thecrawlerselectsthetopkwordswithhighestweightsasamodelforthe
event/topicofinterest.Theweightsofthewordsarecalculatedbyusingthelogofthefrequencies,toeliminatetheriskoflongdocumentsdominatingshortdocuments.
Thebaselinecrawlerusestheevent/topicmodeltorepresenttheevent/topicofinterestandalsotomodeleachwebpageitvisitsduringcrawling.Sotheevent/topicvectorhasaslotforeachtermfoundinthevocabulary(orfeaturespace)thatarisesfromthewebpages.Morespecifically,aftergettingaURLwithhighestscorefromthequeue,thecrawlerdownloadsthewebpage,extractsthetext,tokenizesthetextintotokens(words),removesstopwords,appliesstemming,doesfrequencyanalysis,andconvertsthetextintoavectorofwordswiththeirfrequencies.Thefinalwebpagevectorrepresentationwillbeconstructedusingthewordsinthevocabularybuiltfromthemodelwebpagesandtheircorrespondingfrequenciesinthewebpages.Thewordfrequencyinthewebpageiscalledthetermfrequency(TF)intheinformationretrievalliterature[16]andispartoftheTermFrequency–InverseDocumentFrequency(TF-IDF)weightingscheme[15,16].Asmentionedintheprocedureaboveregardingstepnumber3,wehavenotusedtheInverseDocumentFrequency(IDF)becauseitusuallyiscalculatedforageneralcorpusofwebpageswhererelevantandnon-relevantwebpagesexist.Thefocusedcrawlerusesonlypresumedrelevantwebpagesfrommodel(orseed)URLsandthusalsoincludingtheIDFvaluemightleadtoremovingrelevantkeywords.Later,thecrawlerestimatestherelevanceofawebpagebycalculatingthecosinesimilaritybetweentheevent/topicvectorandthewebpagevector.Also,thecrawlerestimatesthescoresofalltheURLsinthatwebpage.ForeachURL,thecrawlercombinestheURLtokensandanchortext,convertsthemtoavectorofwordswiththeirfrequencies,andcalculatesthecosinesimilarityoftheresultingURLvectortotheevent/topicvector.ThentheURLisinsertedintothequeuewiththeestimatedscore.Thevocabulary(keywordsorfeatures)whichthecrawlerusestorepresentthewebpagevectorsandtheextractedURLvectorsisbuiltandextractedfromthesetofwebpagescorrespondingtotheseedURLs.Thewebpagevectorisusedtoestimatetherelevanceofthewebpageandproducearelevancescore.TheURLscoreiscalculatedasanaverageoftheURLvectorscore(calculatedascosinesimilaritybetweenevent/topicvectorandURLvector)andthescoreofthewebpageinwhichtheURLappeared.Sothecrawlerismakinguseofthreetypesoftextualinformation:webpagetext,URLanchortext,andURLaddresstokens.UsingwebpagetextandURLinformationwasprovedtobemoreefficientthanusingwebpagetextonly[75].Figure4showsthearchitectureofthebaselinebest-first
22
focusedcrawler,withthetopicrepresentationandrelevanceestimationprocesseshighlightedindashedboxes.
Figure4Architectureofbaselinefocusedcrawlerwithtopicrepresentationinthelowerbox,crawlingintheupperbox,andprocessingandrelevanceestimationin
themiddlebox.
3.2 CrawlerArchitectureAgeneralWebcrawler[17,18,77-82]consistsofwebpagefetcher(downloader)forretrievingwebpagecontents,URLsqueue(frontier)forstoringunvisitedURLs,andwebpageprocessorforextractingtextandURLsoutofawebpage’sHTML.CrawlersmodeltheWWWasagraphG(V,E)wherenodes(V)arewebpagesandedges(E)arelinksbetweenwebpages.So,twowebpages(nodes)willhaveanedgebetweenthemifonewebpagehasalinkpointingtotheotherwebpage.SimilartogeneralWebcrawling,afocusedcrawlerhasawebpagefetcher,URLsqueue,andwebpageprocessor.Inaddition,afocusedcrawlerhasatopicordomain-specificmodel,andamoduleforestimatingtherelevanceofURLsandwebpages.Typically,afocusedcrawlertakesasinput:1)thedesirednumberofpagestocollect,and2)seedURLstostartcrawlingfrom.Itoutputsthesetofwebpagesfound[26,41,66,75,83].
23
Oneoftheimportantaspectsof(focused)crawlersistheorderingoftheURLsinthequeue,whichspecifiestheorderofvisitingthenodesofthegraph.Inthefocusedcrawlerliterature[27],best-firstsearchisthemostcommonlyusedtechniqueandisconsideredthestate-of-the-artfocusedcrawler,takingintoconsiderationtheestimatedrelevanceoftheURLs/webpagesduringcrawling.AfocusedcrawlerstartsfromaseedURL.Itdownloadsthecorrespondingwebpageandextractsthetextofthatwebpage.Thefocusedcrawlerthenestimatestherelevanceofthewebpagetextualcontentwithregardtothetopic/eventofinterest.Inthenextstep,therearetwodesignoptions.Oneoptionisthatthefocusedcrawlerdecideswhetherthewebpageisrelevantornotbycomparingitsestimatedscoretoapre-definedthreshold.Ifthewebpageisconsideredrelevant,thenthefocusedcrawlerextractstheembeddedURLsfromthewebpageandinsertsthemintothequeue.TheotheroptionisthatthefocusedcrawlerextractsallembeddedURLsfromthewebpageandtheninsertsthoseintothequeue,notbeingconstrainedbythewebpagescore.Thesecondoptiontakesintoconsiderationthetunnelingphenomenaincrawling,whereanon-relevantwebpagelinkstorelevantwebpages,eitherdirectlyorthroughseveralsteps.WheninsertingtheextractedURLsintothequeue,thefocusedcrawlerhastomakeanotherdecision.OneoptionistoinsertallextractedURLs,alongwiththeestimatedscoreofthewebpagefromwhichtheywereextracted.AnotheroptionistoestimatetherelevanceofeachURLbasedonthetokensinboththeURLs’addressandanchortext,andinserttheURLanditsresultingestimatedrelevancescoreintothecorrectpositioninthepriorityqueue.WeadoptahybridapproachwhereweusetheaverageofaURL’sscoreandthescoreoftheparentwebpagefromwhichtheURLwasextracted[75].Next,thefocusedcrawlerpullsfromitsqueuetheURLwithhighestscore,andrepeatstheprocess.Figure5showsafocusedcrawleralgorithmthathandlestunneling(i.e.,extractstheURLsfromthewebpageregardlessofscore):estimatingthescoreofeachURLandinsertingitintothequeuewithitsestimatedscore.Weconsiderthisapproachasthefoundationforthebaselineforevaluationcomparisons.
24
Figure5Baselinefocusedcrawleralgorithm
3.3 LargeScaleDesignConsiderationsIdeally,thefocusedcrawlershouldscorealltheURLsitextractsfromawebpageandinsertthemintoitsfrontierbasedontheirscores.Whenthesituationissmallscale,thefrontiersizeismanageable,howeveriflargescale,thefrontiersizecouldgrowveryfast,andslowdowntheperformanceofthefocusedcrawler,duetomemoryconstraints.
!!!Algorithm*!Baseline!Focused!Crawler!!Input:!Seed!URLs,!pagesLimit,!pageScoreThreshold,*urlScoreThreshold!!Insert!seed!URLs!in!priority!queue!##*Topic*Representation*topicVector!=!Build!topic!representation!from!seed!pages!!##*Crawling*while*pagesCount!<!pagesLimit*and*priorityQueue!is!not!empty:*** URL!=pop!(priorityQueue)!! append!URL!to!visited!list!!!!!!!!!!!!!!!page!=!download!(URL)!! ##*Preprocessing*(Vector*Space*Model)*! pageVector!=!process!(page!text)!
##*Relevance*Estimation*(Cosine*Similarity)*pageScore!=!calculateScore(pageVector,!topicVector)!
!!!! pagesCount!+=!1*************if*(pageScore!>=!pageScoreThreshold):************** * page.getURLs!()!!! !!! relevantPagesCount!+=!1!! !!! save!page!to!eventNrelated!collection!
for*link!in!page.outgoingURLs:!********************* * URL!=!link.address!
!!!!!!!!!!!!!!!!!!!!! validate!(URL)!*************** ******* if*URL!not*in!visited!list!and*URL!not*in!priorityQueue:!
##*Preprocessing*(Vector*Space*Model)* * *urlVector!=!process(URL!text)**##*Relevance*Estimation*(Cosine*Similarity)*
***************************** urlScore!=!calculateScore!(urlVector,!topicVector)!*************** ******************* if*urlScore!>=!urlScoreThreshold:********** * * * * push!(URL,!priorityQueue)!!
25
Thefrontierisconstructedusingapriorityqueue,oftenimplementedusingamaxheap[84,85].Themaxheapdatastructureprovidespopandpushoperationslikeanormalqueue,andmaintainsthemaxheapproperty,i.e.,thattheelementwiththemaximumvalueisalwaysontopoftheheap.Thus,thepopoperationwillalwaysresultintheelementwithmaximumvalue.ThepopoperationisofO(1),whilethepushoperationwillhavetoinserttheelementinitscorrectpositiontomaintainthepropertythatthemaximumelementisalwaysonthetopoftheheap.ThepushoperationisofO(logn)wherenisthenumberofelementsintheheap.ThepushoperationisrepeatedforeveryURLextractedwhilethepopoperationisrepeatedforeveryURLvisited.Therefore,thefocusedcrawlerwouldspendaconsiderableamountoftimeinmaintainingthemaxheapproperty,especiallywhentheheapsizegrows.Thusduringlarge-scalecrawlingexperiments,welimitthenumberofURLsofthemaxheapbyinsertingonlyURLswithascorebiggerthanagiventhresholdanddiscardURLswithrelevancescorelowerthanthegiventhreshold.;wesetthethresholdto0.1empirically.Anothercomponentthatconsumesmemoryinlarge-scalesituationsisthevisitedURLslist.ThislistkeepstheURLsthatthefocusedcrawlerhasvisited(andfetchedtheircorrespondingwebpages)sothatthefocusedcrawlerdoesn’tvisitthemagain.ForeveryURLextractedorpoppedfromthefrontier,thefocusedcrawlerchecksiftheURLexistsinthevisitedURLslist.Usinganormallist,thisoperationwillbetimeconsumingwhenthelistsizeisverylarge.Forlarge-scaleexperiments,weimplementedthevisitedURLslistusingahashtable,wherethehashkeyistheURLaddress.Checkingifakeyexistsinahashtableismuchfasterthansearchingalist.Theabovementionedapproachcharacterizesthebaselinefocusedcrawlerusedinevaluationstudiesreportedbelow.Theeventfocusedcrawlerusesthesamegeneralapproach,butvariesduetoitsuseofaneventmodel(and,later,webpagesourceimportance).Thus,comparisonsallowdeterminingtheeffectsoftheeventmodelandwebpagesourceimportance.
26
4 EventFocusedCrawler4.1 EventModelandRepresentation
4.1.1 EventModelingInthefocusedcrawlerliterature[74],eventsgenerallyareconsideredastopicsandwererepresentedwithalistofkeywords.Althoughthisapproachmightworkwellforsomeevents,itworkslesswellforotherkindsofevents.Forexample,representingeventsasalistofkeywordswouldworkincaseswherethetopicpartoftheeventismostdominantandimportant,whilethelocationandtimepartsaren’timportantordon’tplayasignificantroleintheevent.TheoutbreakofEbolaisagoodexampleofaneventwherethetopic(spreadingofEbola)isthemostimportantaspect.Thelocationanddatearepartofthedetails,butarenotthatimportant,i.e.,thetopicpartislargelysufficienttoclearlydescribetheevent.Ontheotherhand,shootingevents,forexample,can’tbedescribedwiththetopicpartonly.Sincetherearemanyshootingeventsindifferentplacesandatdifferenttimes,weneedthelocationanddatepartstoclearlydescribeaparticularshootingevent.Inthisdissertation,wearefocusingonunusualreal-worldnewseventswhichcauseorcreateanimpulseinthepeople’sinterestandthemediacoverageabouttheevent[61].Goodexamplesofthistypeofeventarenaturaldisasters,elections,shootings,terroristattacks,andincidents(pipeexplosion,craneorbuildingcollapse,etc.).Thistypeofeventischaracterizedbytwomainthings:1)theyhaveacentralizedphysicallocationand2)theyhavealimitedtimeperiodinwhichtheireffectappears(impulseinmediacoverageandpeopleinterest).Evenforelections,weareconcernednotwiththegeneralaspectoftheelectionbutwithaspecificincidentthatcapturespeople’sinterestorcausesanimpulseinmediacoverage.Forexample,theAlaskaelectionsonNovember2,2010capturedpopularattentionbecauseofa3-wayracewhichanindependentcandidatewon[61,86].Thisincontrasttomanyothereventdefinitionsindifferentfields.Inthecomputationallinguisticsfield,aneventcanbedefinedas“asituationthatoccurs”whileinthemultimediafielditisdefinedasachangeinthestateofanobjectinavideoorphotostream.Intheinformationextractionfield,thefocusisonextractingreal-worldlocaleventslikeconcerts,theaterperformances,birthdayparties,etc.;informationcomesfromtextualcontentandyieldsrecordsofeventsinwhichpeoplehaveinterest[87].
27
EventModel:Beforeconsideringcomplexeventmodels,asimpleschemeshouldbetestedfirst.Thus,wedefineaneventassomething(e.g.,adisaster),whichhappenedinacertainplace,andatacertaintime.Thus,aneventEisatuple<T,L,D>.Thethreepartsreflectwhat,where,andwhen.Thus,Tisthetopicoftheevent,Lisitslocation,andDisitsdate.Theseareexplainedbelowinmoredetail.Topic:UsingasetofseedURLs,wecreateaneventvocabulary(asetofuniquekeywordsthatappearfrequentlyinthewebpagesassociatedwiththoseseeds).Werepresentaneventwithareferencevectorcreatedbytakingthetopkeywordsfromtheeventvocabulary.Date:Theeventdateisgivenbyauserorisextractedautomaticallyfromthesetofseedwebpages.Theeventdaterepresentsthestartingdatewhentheeventfirstoccurred.Theeventalsocouldhaveanendingdate,orevenasetofperiodsintime.Location:Asmallsetoflocationentitiesislikelytoappearfrequentlyinmostoftheseedwebpages,representingplacesrelatedtotheevent.Theselocationentitiesareextracted(asdescribednext)fromseedwebpages’texts;weperformafrequencyanalysistohelpfindthemostimportantlocationentitiesmentionedintheseedwebpages.Forexample,wemodeltheshootingthathappenedinSanBernardino,CaliforniaonDecember2,2015asfollows: Topic:shooting,shooter,… Location:SanBernardino,California Date:12/02/2015Similarly,wemodeltheattackthathappenedinBrussels,BelgiumonMarch22,2016asfollows: Topic:terror,attack,explosion,… Location:Brussels,Belgium Date:3/22/2016Oureventmodel(combiningtopic,location,anddate)candefaulttothetopic-onlymodelincaseofEbolabyignoringthedateandlocationpart(bysettingtheweightsofthelocationanddatepartstozero).WenotethatthiswouldbethecasealsoregardingtheZikavirusdiseaseoutbreak.Wecanaddapartinoursystemthatiftheeventtypeisdiseaseoutbreak(manuallyenteredbytheuser),thesystemautomaticallydefaultstothetopic-onlymodel.Alternatively,otherdefaults,likegivingasmallweighttothelocationand/orthedatepart,couldbeinstituted.Thus,oursystemisflexible,andcanbeusedincaseswhereitisdifficulttodetermine,foragivenevent,whichmodelismoreefficient.Butiftheeventisa
28
news/worldeventwhichisphysicallylocalized(hasaclearcenter)andtemporallylimited(withanimpulseinnumberofarticlespublishedinarelativelyshortperiod)thenwebelieveoureventmodel(combiningtopic,location,anddate)should performmoreefficientlythanthetopic-onlymodel.Thus,ifauserisinterestedintheoutburstofZika/Eboladiseaseinacertainplaceandatcertaintime,thenoureventmodel(combiningtopic,location,anddate)shouldperformbetterthanthetopic-onlymodel. Figure6showsthestepsofbuildinganeventmodelfromseedwebpages.WestarttheprocessofbuildinganeventmodelbydownloadingthewebpagescorrespondingtotheseedURLs.WethenextractdatesfromtheseedURLsandtheseedwebpages.Todothis,wefirsttrytoextractthepublicationdatefromtheseedURLsusingapre-definedregularexpression.Ifthatfails,weextractthepublicationdatebyparsingapre-definedsetoftagsfromtheHTMLofthewebpages.
Figure6Stepsofbuildingeventmodelfromseedwebpages
Forthedateextraction,wehaveuseda library1forextractingpublishingdateofawebpageusingheuristics.The first step is to extract thepublishingdate from theURL using regular expressions, if applicable. For example, the URLhttp://www.cnn.com/2016/07/10/us/black-lives-matter-protests/index.html hasapublishingdateofJuly10,2016.IftheURLdoesn’tcontaindateinformation,thenthenextstepistolookforspecifictagsintheheaderportionofthecorrespondingwebpageHTMLtags.Anexampletagthatcontainspublishingdatelookslike:
1https://github.com/Webhose/article-date-extractor
29
<metaname="pubdate"content="2015-11-26T07:11:02Z">
ThisappearsintheheadtagoftheHTMLcontentofawebpage.Therearemultiplemetatagsthatmightcontainpublishingdate;hencethelibraryhasanextensivelistofpossiblemetatagsthatarefrequentlyusedindifferentwebsites.Thefinalstepistocheckinthebodyofthewebpage,ifnopublishingdateisfoundintheheadtag.Asbefore,alistoffrequentlyusedbodytagsisusedtoguidefindingthepublishingdate.Anexampleofsuchatagcontainingpublishingdateis:
<pclass=”pubdate”>Sept3,2011</p>Iftherearemultipledatesfoundinthewebpage,thelibraryreturnsthefirstoneonly.ThelibrarytriestoextractthepublishingdatefromtheURL,thenfromtheheadtagoftheHTML,andthenfromthebodytagoftheHTML.TheorderisimportantbecauseitfollowstheaccuracyoftheextracteddatewheredateextractedfromURLisexpectedtobemoreaccuratethanfromtheheadthanfromthebody.Anextrastepthatcouldbedoneistousenaturallanguageprocessingtechniquestoextractnamedentities(dates)fromthetextualcontentofthewebpage.Usingextractednamedentitiesdates,wecanfigureoutthepublishingdateofthewebpage.However,wehavenotusedthisapproach,becausewiththelibraryweusedwemanagedtoextractpublishingdatesfrommostofthewebpagesandbecauseoftheoverheadofcallingandusingthenamedentityrecognizer.Fortheeventmodellocationsvector,wesegmentthetextofthewebpagesoftheseedURLsintosentencesandapplytheStanfordNamedEntityRecognizer(SNER)2oneachsentencetoextractlocationentities.Wethenperformfrequencyanalysisontheextractedlocationentitiesandconstructthelocationsvector.Itincludestheuniquelocationsextracted,alongwiththeirfrequencyofoccurrenceinallsentencesinallseedwebpages(i.e.,theweightofeachlocationisthecumulativefrequencyinallseedwebpages).TheresultinglocationsvectorwillincludethelocationsfrequentlymentionedinthesetofwebpagescorrespondingtotheseedURLs,whichshouldbethelocationoftheeventofinterest,assumingtheseedwebpagesarerelevantandofhighquality(withregardtocontainingenoughinformationaboutthedifferenteventaspects,namelytopic,location,anddate).TheSNERcouldextractlocationentitiesnotrelatedtotheevent from some of the seedwebpages, as a webpagemay include references to
2http://nlp.stanford.edu/software/CRF-NER.shtml
30
multiple locations.This shouldnot affect themodel, however, as the frequencyofthoselocationentitiesshouldbeverysmall(sincetheytypicallyappearinfewoftheseed webpages). On the other hand, if the event occurs in multiple locations, asuitablelistoflocationsshouldbefoundthroughtheabovementionedprocessingofseedwebpages (i.e., theseedwebpagesshouldcover thedifferent locationsof theeventandnotbeconcentratedononelocationonly).Forthetopicvector,weperformthesameprocessingasforthebaselineVSM.WetokenizethetextofthewebpagesoftheseedURLsintowords,removestopwords,stemwords,performfrequencyanalysis,andconstructthetopicvectorasthesetofuniquetermsalongwiththeirfrequencyofoccurrence.
4.2 EventProcessing
4.2.1 EventModel-basedWebpageScoringInthissection,weshowhowtheeventfocusedcrawlerusestheeventmodeltocalculateascoreforeachwebpageitvisitsandfortheURLsextractedfromthatwebpage.Focusedcrawlersassigneachdownloadedwebpageascore,whichestimatestherelevanceofthewebpage.Inthecaseofevents,eventaspectsareconsideredduringtherelevanceestimationprocess.Thuswescoretherelevanceofthewebpagewithrespecttoeachaspectoftheevent,andthencombinethatinformationtocomputeafinalscore.Accordingtooureventmodel,therearethreeattributeswhichtogetherfullydescribeanevent.Awebpagecanhavesomeoralloftheattributesofanevent.Awebpageisconsideredrelevant(i.e.,talksaboutthetargetevent)ifitsatisfiesthefollowingconditions:
• Ithasanon-emptysubsetofthekeywordsthatrepresentthetopicattributeofthetargetevent(i.e.,istopicallyrelevant).
• Itspublicationdateisclosetotheeventdate.• Ithasanon-emptysubsetofthekeywordsofthelocationattributeofthe
targetevent(i.e.,thelocationentitiesextractedfromthewebpagearesimilartoeventlocationentities).
Awebpagethatsatisfiestheseconditionsshouldbeconsideredrelevantandwillbeaddedtotheoutputcollection.Theeventfocusedcrawlerfirsttakesthefollowingstepswithregardtoawebpage:
1. Extractthetextofthewebpage.
31
2. Extractthepublicationdateofthewebpage.3. ExtractlocationentitiesfromthetextofthewebpageusingNamedEntity
Recognition(NER).Wehavedevelopedafunctiontomeasurethesimilaritybetweenthetargeteventmodelandthewebpagemodel.Thesimilarityfunctionproducesascorethatestimatestherelevanceofthewebpagetothetargetevent.Givenatargeteventmodelandawebpageeventmodel:e1=(T1,L1,D1)ande2=(T2,L2,D2),whereT1istheeventtopicreferencevector,L1isthelistoflocationentitiesextractedfromseedwebpagesusingNER,D1istheeventdate,T2isthebag-of-wordsvectorrepresentationofthewebpagetext,L2isthelistoflocationentitiesextractedfromthewebpagetextusingNER,D2isthepublicationdateofthewebpage,e1isthetargeteventmodel,ande2isthewebpageeventmodel.Thesimilarityfunctionsim(e1,e2)isdefinedas:
!"# $%, $' = *×!,-.$ /%, /' + 1×!,-.$ 2%, 2' + ,×!,-.$(4%, 4') (1),
where
!,-.$ /%, /' = 6(78)×6(79):;<8∩<9
>8 × >9 (2),
i.e.,thecosinesimilaritybetweentheT1andT2vectors,andw(ti)istheweightoftermtindocumenti,and
!,-.$ 2%, 2' = 6(?8)×6(?9)@;A8∩A9
B × B9 (3),
i.e.,thecosinesimilaritybetweentheL1andL2vectors,andw(ti)istheweightoflocationlindocumenti,and
!,-.$ 4%, 4' = 1 −E8FE9
GHI_KLMN (4),
32
wherenum_daysisthenumberofdaysinayear.Thisparametercanbeconfiguredaccordingtotheeventcharacteristics.Ifthetwodatesaremorethanthevalueofthisparameter,thescorewillbezero.Thefinalscoreofthewebpageiscalculatedbyusingaweightedaverageofthescoresofthetopic,location,anddatevectors,whereconstantsa,b,andcaretheweightsofthetopic,location,anddatescores,respectively.Theseaddtoone:a+b+c=1.
4.2.2 CalculatingTheWeightsTheweightsa,b,andccouldbesetmanuallybyanexpertwhowouldtakeintoconsiderationthetypeoftheevent(shooting,hurricane,bombing,earthquake,etc.)andthecharacteristicsoftheevent(timedurationandlocationarea,e.g.,specificlocationandpointintimefora“sharp”event,versusmultiplelocationsandlongtimeperiodsforcomplexevents).Toautomaticallycalculatetheweights,weuseeachaspectoftheeventmodel(topic,location,anddate)separatelytoscoreasampleoflabeledwebpages.Weevaluateeachaspect’sperformanceagainstdifferentthresholdvalues,andwechoosethethresholdvaluethatproducesthebestclassificationperformanceaccordingtoagivenevaluationmetric.Inparticular,weusedtheF1-scoreastheclassificationevaluationmetric.F1-scoreisaninformationretrievalmetricthatcalculatesthegeometricmeanoftheprecisionandrecall[88].WealsousedtheF1-scoretoassigntheweightofeachaspectoftheeventmodel(topic,location,anddate),whichindicatestheimportanceofthataspectincalculatingthefinalscore.Wecalculatetheweightastheratiooftheaspect’sF1-scoretothesumoftheF1-scoresofallaspects.Figure7showsthestepsforcalculatingthescoreofawebpage.For example, assume we have 100 webpages, 50 relevant and 50 non-relevant,relative to a specific event (e.g.,Orlando shooting). Assumealso thatwehave thetarget event model (which could be extracted from another set of relevantwebpages or entered manually by the user). For each of the 100 webpages, weextract the topic vector, locations vector, and publication date. Thenwe calculatethree scores (topic score, location score, and date score) for eachwebpage usingequations1,2,and3,respectively.Afterthisprocessweendupwithamatrixof100rows(webpages)and3columns(topicscore,locationscore,anddatescore).Next,we use each of the scores (topic, location, and date) separately to predict a label
33
(relevant or non-relevant) for eachwebpage.Weproduce the label by comparingthescoretoathresholdvalue(callitK);wegenerate“relevant”ifthescoreislargerthanthethresholdand“non-relevant”ifitissmaller.
Figure7Thestepsforcalculatingthescoreofawebpage
Afterthisprocessweendupwithamatrixof100rows(webpages)and3columns(labelbasedontopicscore,labelbasedonlocationscore,andlabelbasedondatescore).Thenweevaluatetheeffectivenessofeachaspectoftheevent(topic,location,anddate)bycomparingtheactuallabelsandpredictedlabelsforeachofthethreeaspects(topic,location,anddate).WeusetheF1-scoreasthemetricforevaluation.Hereweendupwith3F1-scores(oneforeachofthetopic,location,anddate)forthethresholdvalueK.Werepeatthepreviousprocessfordifferentvalues(nvalues)ofthethresholdparameter.Thenweendupwithamatrixofnrows(differentvaluesofthethreshold)and3columns(topic,location,anddate).Finally,wechoosethemaxF1-scoreforeachaspect(topic,location,anddate).TheweightofeachaspectwillbetheratioofitsF1-scoretothesumofthreeaspects’F1-scores.Inthismanner,theweightofeachaspectcorrespondstohowmuchitcontributestotheoverallperformance.Theweightsofeachaspectoftheeventmodel(topic,location,anddate)arelearntbeforethecrawlingtimebyapplyingthepreviousprocedureonagivensetofURLs
34
andwebpagesthatarelabeledasrelevantornon-relevant.Theweightslearntareusedduringcrawlingandarenotmodified.
4.2.3 EventModel-basedURLScoringAsimilarprocedureisimplementedforestimatingascoreforeachURLextractedfromthewebpage.AURLisconvertedintotokensbyremovingnon-alphabeticcharacters(like‘/’,‘#’,’?’)andalsoremovingURL-specifickeywords(like‘http’,‘com’,’www’).URLtokensarecombinedwithtokensfromassociatedanchortexts.Theresultingtokensarethenconvertedtoabag-of-wordsbasedvectorrepresentation.WeextractthelocationentitiesfromURLanchortextusingSNERand extractthepublicationdatefromtheURLusingregularexpressions3(ifapplicable).
Figure8AnexamplewebpagewitharelevantURLanchortexthighlighted
Figure8showsanexamplewebpageabouttheBrusselsattackeventwitharelevantURLhighlighted.TheanchortextoftheURLis:“ParisandBrusselsterrorsuspecttofacechargesinFrance”.TheaddressthattheURLpointstois:3https://github.com/Webhose/article-date-extractor
35
https://www.theguardian.com/world/2016/jun/09/mohamed-abrini-paris-brussels-terror-suspect-france-man-in-the-hat.WecanseethattheURL’saddresscontainsthepublicationdateofthecorrespondingwebpage(June9,2016).TheURLvectoraftertokenization,stopwordremoval,andstemmingwouldbe:[‘theguardian’,‘world’,’2016’,’jun’,’mohamed’,’abrini’,’paris’,’brussels’,’terror’,’suspect’,’france’,‘man’,’hat’,’charges’].WenoteherethatSNERwillcapturethelocationentitiesfromtheURL’sanchortextonly,nottheURLaddress,becausetheSNERtokenizesitsinputtosentencesandtriestoextractentitiesfromthesesentences;thiscanbedonefortheanchortext(whichincludesmeaningfultext)butnotfortheURL(sinceURLtokensdon’tformmeaningfulsentences).IftheseedURLsarefromonedomain(e.g.,www.theguardian.com),thismayaffectthequalityofinformationextractedtobuildoureventmodel.Thecoveragefromasingledomaincouldbebiasedorlimitedinscope,whilethatislesslikelyiftherearemultipledomains.TheseedURLsshouldbefromdifferentdomains,whichwillensurethatallrequiredinformationispresentintheeventmodelandthusthereisnobiastowardaparticulardomainwebsite.Asasummary, Figure9showsthestepsforcalculatingthescoreofaURL.
Figure9ThestepsforcalculatingthescoreofaURL
36
Inthischapter,wehaveaddressedhypothesis1.1andresearchquestions1and2,namely“Howtomodelandrepresentanevent?”and“Howtocomparetwoeventrepresentations?”.Weexplainedoureventmodelandrepresentation,andhowtheeventfocusedcrawlercanuseittoestimatetherelevanceoftheURLsandwebpagesitvisits.
37
5 ExperimentalSetupInthischapterwedescribethedatasetsusedforourexperiments,thedifferentexperimentsperformed,andtheevaluationmetrics.Weperformedfourseriesofexperiments.Thegoalofthefirstseries(seeresultsinSection6.1)wastovalidatethatoureventmodelcaneffectivelyclassifywebpagesasrelevantornon-relevanttotheeventofinterest.Wecomparedourapproachagainstthebaseline,i.e.,usingthetraditionalvectorspacemodel(VSM)topic-onlyapproach.Inthesecondseriesofexperiments(seeresultsinSection6.2),weaimedtovalidatethattheeventmodelcaneffectivelyestimatethescoresoftheURLsandwebpagesitvisitsandconsequentlyguidethecrawlingprocesstowebpagesrelevanttotheeventofinterest.Inthethirdexperiment(seeresultsinSection7.1),weevaluatedtheeffectivenessofourproposedwebpagesourceimportancemodelforcollectingmorerelevantwebpages.Inthefourthexperiment(seeresultsinSection7.2),weevaluatedtheeffectivenessofcuratingseedURLsfromsocialmediacontent(tweets)usingdifferentmethodsofselection.
5.1 DatasetsForthefirstseriesofexperiments,aboutclassification,wedevisedtwodatasets:1)asetofrelevantwebpagesforthetraining/learningmodelphase,and2)asetofrelevantandnon-relevantwebpagesforthetesting(classification)phase.First,weneedasetofrelevantwebpagesthatthetwomodels(ourevent-basedmodelandthetopic-onlymodel)willusetolearn/buildtheirmodelinthetrainingphase.Accordingly,wemanuallycuratedasetof38URLsandfetchedtheircorrespondingwebpages.Fortheclassificationphase,therewasnoexistingdataset(labeledrelevantandnon-relevantsamples)aboutthatshootingevent.Wedecidedtobuildourowngroundtruthdatasetof1000URLsandwebpages.Wecouldhavemanuallylabeledasetof1000URLsandwebpages,but(tosavetimeandeffort)weusedakeyword-basedcrawlertofetch1000webpagesusingthesetof38URLs(usedinthetrainingphase)asseeds.Weusedthetwowords“California”and“shooting”askeywordsforthecrawler.Afterthecrawlerfinishedcrawling,wemanuallylabeledtheresultingwebpagesintotwoclasses(relevantandnon-relevant).Therewere725webpages
38
labeledasrelevantand275labeledasnon-relevant.WefollowedtheproceduredescribedinSection4.3forcalculatingtheweightsandperformingtheevaluation.ThemanuallylabeledsetofURLsandwebpagesaregivenasinputtobothmodelsandtheresultingcalculatedscoresandgiventhresholdparameterareusedtoproducethepredictedlabels.Thepredictedlabelsarethencomparedtothelabelsdeterminedmanually,toproducetheevaluationresults.After completing the first series of experiments, for the focused crawlingexperiments,weconsideredasetofrecentevents.Table2showsthelistofeventsused in our crawling experiments. For each event we summarize the type of theevent,thelocationanddateoftheevent,andfinallyhowmanyURLswereextractedfromthecorrespondingtweetcollectionandwereusedasseedURLsforcrawling.The number of seed URLs varies across the events because they were extractedfromtheevent’scorrespondingtweetcollections.Thenumberofdesiredwebpageswassetto50,000foreventswithlessthan10,000seedURLsand100,000foreventswithmorethan10,000seedURLs,exceptforEgyptairplanecrasheventwherethenumber of desiredwebpageswas set to 10,000URLs only andParis attack eventwhere thenumber of desiredwebpageswas set to 500,000. For the classificationexperiments,however,itseemedsufficienttojustconsidertheCaliforniashooting.
Table2Listofeventsusedinthelarge-scalecrawlingexperiments
Event Type Location Date #ofSeedURLs
#ofdesiredwebpages
CaliforniaShooting
Shooting SanBernardino,California,USA
December2,2015
4,161 50,000
BrusselsAttack
TerroristAttack
Brussels,Belgium
March22,2016
4,691 50,000
OregonShooting
Shooting Roseburg,Oregon,USA
October1,2015
22,354 100,000
EgyptairPlaneCrash
PlaneCrash MediterraneanSea,
Alexandria,Egypt
May19,2016
1,211 10,000
PanamaPapersLeak
DocumentLeak
Panama April3,2016
18,260 100,000
OrlandoShooting
Shooting Orlando,Florida,USA
June12,2016
1,988 50,000
ParisAttack TerroristAttack
Paris,France November13,2015
88,835 500,000
EcuadorEarthquake
Earthquake Ecuador April16,2016
11,348 100,000
39
Sincethe IDEALproject isworkingwitha largeamountofevent-relateddata,andsincewewantedourresultstobeassessedinthecontextofsuchtypesofdata,wehadampleopportunitytoutilizedatafromeventslikethosementionedabove.
5.2 ExperimentsThegoalofthefirstserieswastovalidatethatoureventmodelcaneffectivelyclassifywebpageswithregardtorelevancetotheeventofinterest.Wecomparedtheperformance,forthetaskofclassification,oftheevent-modelvs.topic-onlymodel.WeusedthemanuallycuratedseedURLsandthestaticdatasetof1000webpagesabouttheCaliforniashootingfortheevaluation.Bothmodelsusecosinesimilarityasascoringfunction,andproduceascorefortheirinputtext(URLorwebpage)thatestimatestherelevanceoftheinputtothegivenevent.Weuseeachofthemodelsasaclassifier,i.e.,bycomparingthecosinesimilarityscoretoagiventhresholdparameter.Ifthecosinesimilarityscoreisbiggerthanthethreshold,thentheoutputlabelisrelevant,butisnon-relevantotherwise.Bothmodels(event-basedandtopic-only)aretrainedusingasetofURLs(positiveonlysamplesastheydon’trequirenegativesamplesfortraining).Thetopic-onlymodelusesthesetofURLstobuildatopicreferencevector,whiletheevent-basedmodelusesthesetofURLstobuildtheeventmodel(topic,location,anddate).ThenextstepistousethemodelstoclassifyURLsandwebpages.Weusethetwomodelsbuilt(event-basedmodelandtopic-onlymodel)toclassifythemanuallylabeledURLsandwebpages(i.e.,ourgoldenstandardtestset)todeterminetheirpredictedlabels.Thelaststepistocomparethepredictedlabels(fromboththeevent-basedmodelandthetopic-onlymodel)totheactual(manuallyproduced)labelsandevaluatetheperformanceofeachofthemodels.Weevaluatedtheperformancebyvaryingtwoparameters:
1. k,thenumberofkeywordsusedinconstructingthetopicvectorinoureventmodelandthetopicreferencevectorforVSM,and
2. threshold,thevalueofthethresholdusedforconvertingthescorestolabels(relevantifthescoreislargerthanthethreshold,otherwisenon-relevant).
Wealsorantheexperimentswithseveralvariationsofoureventmodel.Wethenranthesameexperimentwiththetwopairsoffeaturetypes:a)combinationofthetopicandlocationonly,andb)combinationofthetopicanddateonly.Figure10showsthedesignoftheexperimentsforevaluatingtheeffectivenessofclassificationusingtheeventmodel.
40
Figure10Designofourevaluationmethodoftheeffectivenessoftheeventmodel
forrelevanceestimation;thetwoboxeswithanasteriskindicatethetwoparametersoptimizedintheexperiment
Inthesecondseriesofexperiments,weaimedtovalidatethattheeventmodelcaneffectively estimate the scores of the URLs and webpages to be visited, andconsequentlyguidethefocusedcrawlingprocesstowebpagesrelevanttotheeventofinterest.Intheseexperimentsweusedasetofrecentevents.Intheliterature,themostuseddataset forevaluating topical focusedcrawler is theDMOZdataset.Butsince we are evaluating focused crawlers for events, the DMOZ dataset is notsuitableinourcase.Inthecrawlingexperiments,weneedasetofrelevantURLsforeacheventthatwillbeusedasseedsforstartingthecrawling.WemanuallycuratedthesetofseedURLsfortwoevents(38forCaliforniashootingand23forBrusselsattack).ThesetwosetsofmanuallycuratedURLsareusedforrunningsmall-scalecrawlsonly.Forlarge-scalecrawls,weneedtostartfromalargersetofseedURLs,whichisverydifficulttobuildmanually.Fortherestoftheevents,weextractedthesetofseedURLsfromacorrespondingcollectionoftweets.ThetweetcollectionswerecollectedusingtheTwitterstreamingAPI.TwitterisarichsourceofURLs,asmostofthetweetspostedlinktowebpagesthatcontainmoredetailedinformation.ThenumberofURLsextracteddependsonhowbigtheeventis(highimpacteventsattractmorepeopleandthereforemoretweetsarepostedabouttheevent).
41
Figure11summarizesthestepsforcalculatingtheharvestratioafterthecrawlingprocessfinishes.Theresultingwebpagesofeachcrawl,theircalculatedscores(basedoneachmodel),andagiventhresholdareusedtoproducethepredictedlabelsandthecorrespondingharvestratio.
Figure11Thedesignoftheexperimentforevaluatingtheeffectivenessofevent
modelwithfocusedcrawlertoretrievemorerelevantwebpages
5.3 EvaluationMetricsWeusedtheprecision,recall,andF1-scoremetricstoevaluatetheclassificationperformanceofoureventmodelversusthebaselinetopic-onlyapproachinthefirstseriesofexperiments.Theprecision,recall,andF1-scorearecalculatedusingtheconfusionmatrix,whichcontainsthenumberoftruepositive,truenegative,falsenegative,andfalsepositivesamples(sampleshererefertowebpages).Thetruepositivesamplesaretheonesthatwerepredictedrelevantandwereactuallyrelevant,whilethetruenegativesamplesaretheonesthatwerepredictednon-relevantandwereactuallynon-relevant.Thefalsepositivesamplesaretheonesthatwerepredictedrelevantandwereactuallynon-relevant,whilethefalsenegativesamplesaretheonesthatwerepredictednon-relevantandwereactuallyrelevant.Theprecisionisdefinedasthepercentageofretrievedsamplesthatarerelevant.Itiscalculatedastheratiooftruepositivetothesumofthetrueandfalsepositivesamples.Therecallisdefinedasthepercentageofrelevantsamplesthatareretrieved.Itiscalculatedastheratioofthetruepositivetothesumofthetruepositiveandfalsenegativesamples.TheF1-scoremeasureisthegeometricmeanoftheprecisionandrecallmeasures.Togetaperfectprecisionweusuallymusthave
42
lowrecallandviceversa(perfectrecallcomeswithlowprecision),soF1-scoreisawaytocombinebothmeasures(precisionandrecall).Inotherwords,ifyougetahighvalueforprecision,thatdoesn’tensureyouhavegoodperformance,asyoumighthavealowvalueforrecall.ButifyouhaveahighvaluefortheF1-score,thismeansyouhaveahighvalueforbothprecisionandrecall.Forevaluatingtheperformanceofthefocusedcrawlersinthesecond,third,andfourthseriesofexperiments(i.e.,theabilitytocollectmorerelevantwebpages),weusedtheharvestratiometric[27-29].Theharvestratioisthepercentageofcrawledwebpagesthatarerelevant.Theharvestratiomeasurestheabilityofthecrawlertofindandcollectmorerelevantcontentthannon-relevantones.Ifthecrawlervisitsmanynon-relevantwebpagesinordertofindrelevantones,thenthismeansthecrawlerisusinganinefficientmethod.AhighlyefficientrelevanceestimationmethodwoulddirectthecrawlercorrectlyandensureitfocusesontherelevantpartoftheWebonly,thusproducingahigherharvestratio.ItisworthnotingherethatthecriticalpointforrelevancejudgmentistheabilitytoestimaterelevanceofbothURLsandwebpages.EstimatingtherelevanceofwebpagesiseasierthanforURLsduetothefactthatwebpagescontainrichertextualcontentthanURLs.Apoorrelevanceestimationmethodwouldgivehighscoresfornon-relevantURLs,whichleadstonon-relevantwebpagesandthuslowharvestratio.
43
6 Results6.1 EventModel-basedvs.Topic-OnlyClassificationInthissection, aboutthefirstseriesofexperiments,weshowtheresultsofclassifyingthe1000webpagesabouttheCaliforniashootingusingthetopic-onlyvectorspacemodelversusthreevariantsofoureventmodel,namelytopic+location,topic+date,andtopic+location+date(oureventmodel).Usingthe38seedwebpagesabouttheCaliforniashootingevent,wecreatedavocabularyof1365keywordsthatappearedon5ormorewebpages.Toextractthemostrepresentativekeywords(features)fromthevocabulary,wesortedthevocabularykeywordsbasedontheircumulativenormalizedoccurrencesinalloftheseedwebpages.Wechosethetopkkeywordsfromthesortedvocabulary.Eachseedwebpageisthenrepresentedasavectorofthetopkkeywordsandtheirfrequencyofoccurrenceinthewebpage.Wecreatedthetopicreferencevectorasthecentroidvectorofallseedwebpagevectors.
6.1.1 ClassifyingURLsandwebpagesaboutCaliforniashootingThe1000URLs/webpagesdatasetabouttheCaliforniashootingconsistsofURLaddressesandtheiranchortexts,aswellasthecorrespondingwebpages.WeranexperimentstoclassifyURLsandwebpages,separately.TheURLsandwebpagesweremanuallylabeledastorelevantvs.non-relevant.Werefertothisdatasetasthelabeleddataset.Forthetopic-onlymodelandoureventmodel,wevariedtheparameterk(thenumberofkeywordsinthetopicvector)from5to1365wordswithincrementof10andthethresholdparameterfrom0to1withincrementof0.05.Table3ValuesoftheparametersthatproducedthebestF1-score.Kisthesizeofthe
topicvectorandthresholdisthecutoffvaluefordeterminingrelevantornon-relevantlabelsbasedonthescore
URLs WebpagesK Threshold K Threshold
Topic-only(baseline) 1310 0.25 10 0.45Topic,Location,andDate(Eventmodel) 1310 0.15 10 0.4
AscanbeseeninTable3,forthetopic-onlyapproach,inthecaseofURLs,thevaluesoftheparametersk(thenumberofkeywordsoftopicvector)andthresholdthatgavethebestF1scoreonthelabeleddatasetwere1310and0.25,respectively.Inthe
44
caseofwebpages,theywere10and0.45,respectively.Theweightsoftopic,date,andlocationpartswere0.36,0.22,and0.42,respectively.Theweightswerecalculatedusingequations1-4asdescribedinSection4.2.2.Fortheeventmodelapproach,inthecaseofURLs,thevaluesoftheparametersk(numberofkeywordsoftopicvector)andthresholdthatgavethebestF1scoreonthelabeleddatawere1310and0.15,respectively.Inthecaseofwebpages,theywere10and0.4,respectively.Theweightsoftopic,date,andlocationpartswere0.3,0.355,and0.345,respectively.Table3summarizestheparametervaluesforallsettings.Toexaminetheeffectofthedateandlocationseparately,weranourevaluationusingtopic+locationandtopic+date.Fortopic+location,thebestthresholdvaluewas0.2andtheweightsoftopicandlocationpartswere0.64and0.36,respectively.Fortopic+date,thebestthresholdvaluewas0.2andtheweightsoftopicandlocationpartswere0.47and0.53,respectively.Tables4and5showtheprecision,recall,andF1scoreforthefourexperimentalsettings(topic-only,topic+location,topic+date,andoureventmodel,withtopic,location,anddate)usingthebestvaluesfortheparametersforboththeURLandwebpageclassificationtasks (asshowninTable3).Table4 Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,DateevaluatedonthemanuallylabeledTRAININGURLsdatasetforCalifornia
shootingevent
Precision Recall F1-scoreTopic 0.728 0.723 0.725Topic+Date 0.852 0.855 0.853Topic+Location 0.764 0.73 0.74Topic+Location+Date 0.863 0.867 0.862
AchievinghigherF1-scoremeansbetterclassificationperformance(i.e.,betterabilitytoidentifyanddifferentiatebetweenrelevantandnon-relevantwebpages).Theresultsshowthataddingdateand/orlocationinformationtothetopicenhancestheperformance.Oureventmodel(combiningtopic,location,anddate)achievesthebestperformance(highestF1-score).Thetopic-onlymodelperformedworst(lowestF1-score). Ourexaminationofthedataconfirmedthatthetopic-onlymodeldidnotdifferentiatewellbetweenwebpagestalkingaboutdifferentshootingeventsandourevent(Californiashooting),asallaretopicallyrelated(shooting).Ontheotherhand,thetopic+datemodelperformedbetterthantopic-only,becauseitmanagedtousethepublishingtimeofthewebpagestofilteroutwebpagestalking
45
aboutshootingeventsthathappenedbeforetheCaliforniashootingevent.Thetopic+locationmodelperformedbetterthanthetopic-onlymodelbecauseitfilteredoutwebpagestalkingaboutshootingeventsthathappenedatotherlocationsthanCalifornia.Table5 Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,
andDateevaluatedonthemanuallylabeledTRAININGwebpagesdatasetforCaliforniashootingevent
Precision Recall F1-scoreTopic 0.738 0.734 0.736Topic+Date 0.842 0.846 0.843Topic+Location 0.856 0.859 0.857Topic+Location+Date 0.88 0.884 0.881
Wealsoexaminedtheperformanceofoureventmodel(combiningtopic,location,anddate)versusthetopic-onlyapproachacrossthedifferentvaluesofthethresholdvariablesandthebestvalueforthekparameter.Weplottedtheprecision-recallcurvesforthedifferentvaluesofthethresholdparameter.Figure12andFigure13showthecurvesforthefourdifferentsettings.Thefiguresconfirmtheresultsdescribedabove:addinglocationand/ordateinformationenhancestheperformanceofclassification.Itisalsoshownthattheeffectofaddingdateinformationismuchstrongerthanaddinglocationinformation,inthecaseofURLs.WeinvestigatedthisbehaviorandfoundthatmostoftheURLsinourlabeleddataincludedateinformationthatcanbeextractedeasily.TherewaslesslocationinformationintheURLscomparedtodateinformation.Further,someofthelocationinformationwasnotinastandardformatasexpectedbySNER(whichassumeslocationinformationexistsaspartofavalidsentence;seeSection4.2.3).WehaveusedthemanuallycuratedseedURLsforlearningoureventmodelandthebaselinetopic-onlymodel.Bothmodelshavetwoparameters:K(numberoffeatures,i.e.,words,inthetopicvector)andthreshold(valuefordeterminingthelabels:relevantornon-relevant).Wetunedthevaluesofthetwoparametersandreportedthebestvaluesofthoseparameters(seeTable3)andtheperformanceofthetwomodels(usingthebestvalueofthetwoparameters;seeTables4and5)onthemanuallylabeledtrainingdataset.Finally,wetestedtheperformanceofthetwomodelsonthemanuallylabeledtestdataset(sincethetwomodelshaven’tseenthewebpagesinthisdataset)toseehowwellthetwomodelswillgeneralizetounseenwebpages.Table6showstheperformanceofthebaselinetopic-onlymodel,oureventmodel(topic+location+date),andtwovariantsofoureventmodel(topic+locationandtopic+date).Oureventmodeloutperformsthebaselinetopic-onlymodelbyachievinganF1-scoreof0.894comparedto0.688forthebaselinetopic-onlymodel.Also,addingthedateorlocationinformationachievesbetterperformancethanthebaselinetopic-only
46
model.Addingthelocationinformationismoreeffectivethanaddingthedateinformation.Thiscanbeattributedtotherichnessoflocationinformationinthewebpagescomparedtotheexistenceofpublicationdateinwebpages.Table6Precision,Recall,andF1scoreforthefourcombinationsofTopic,Location,andDateevaluatedonthemanuallylabeledTESTwebpagesdatasetforCalifornia
shootingevent
Precision Recall F1-scoreTopic 0.695 0.706 0.688Topic+Date 0.779 0.783 0.776Topic+Location 0.858 0.86 0.858Topic+Location+Date 0.899 0.896 0.894
Figure12 CaliforniashootingURLsevaluationatdifferentthresholdvalues
47
Figure13Californiashootingwebpagesevaluationatdifferentthresholdvalues
48
6.2 EventModel-basedvs.Topic-onlyFocusedCrawlerInthissection,aboutthesecondseriesofexperiments,wereporttheeffectofusingtheeventmodelwiththefocusedcrawler.
6.2.1 CaliforniaShootingInthisexperimentweusedthe38URLsmanuallycurated(seeSection5.2)asseedsforthetwofocusedcrawlers(oureventmodel-basedandthetopic-onlybaseline).TheeventmodelbuiltfromtheseedsisillustratedinTable7.Thefirstrowinthetablegivesthetopicvectorkeywordsandtheirnormalizedcumulativetermfrequenciesinalloftheseedwebpages.Thesameisdoneforthelocationanddate.Weranthetwofocusedcrawlerstocollect1000webpages.Weplotthepercentageofcrawledwebpagesthatarerelevant(harvestratio)atdifferentstagesofthecrawl,i.e.,forthefirst100,200,300,…crawledwebpages.Figure14showstheperformanceofthetwocrawlersinthesmall-scalesetting(1000webpagesonlyarecrawled)duringthedifferentstagesofthecrawlingprocess.Oureventmodel-basedfocusedcrawlercollectedmorerelevantwebpagesduringandattheendofthecrawlingprocessthanthebaselinetopic-onlyfocusedcrawler.Oureventmodel-basedfocusedcrawlerachievedapproximatelyaharvestratioof0.85whilethebaselinetopic-onlyfocusedcrawlerachievedapproximatelyaharvestratioof0.68.
Table7 Californiashootingeventmodel
Keywords Weight
Topic
shoot 0.93 san 0.513 bernardino 0.465 said 0.357 wa 0.323 2015 0.321 peopl 0.31 california 0.305 polic 0.258 suspect 0.177
Location San Bernardino 1 California 0.51 Calif. 0.44
Date 2015-12-02
49
Figure14 Performanceevaluationofeventmodel-basedvs.topic-onlyfocused
crawlersforCaliforniashooting
Inthelarge-scalesetting,weranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)for50Kwebpages.Thetwocrawlersstartedfromasetof~4000seedURLs.Figure15showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.6(average)whilethebaselinetopic-onlyachievedaharvestratioof0.3(average).
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500 600 700 800 900 1000
Percentage4of4crawled4webpages4that4are4relevant4
(Harvest4Ratio)
Total4number4of4webpages4crawled
Performance4Evaluation4of4Event4modelHbased4 Focused4Crawler4vs4Baseline4Focused4
Crawler4for4California4 Shooting4Event
Baseline4Focused4Crawler4H Topic4Only Event4ModelHbased4Focused4Crawler4H Topic4+4Loc4+4Date
50
Figure15Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforCaliforniashooting(50Kwebpages)
6.2.2 BrusselsAttackInthisexperimentweusedthe23URLsmanuallycurated(seeSection5.2)asseedsforthetwofocusedcrawlers(oureventmodel-basedandthetopic-onlybaseline).TheeventmodelbuiltfromtheseedsisillustratedinTable8.Thefirstrowinthetablegivesthetopicvectorkeywordsandtheirnormalizedcumulativetermfrequenciesinalloftheseedwebpages.Thesameisdoneforthelocationanddate.Weranthetwofocusedcrawlerstocollect1000webpages.Weplotthepercentageofcrawledwebpagesthatarerelevant(harvestratio)atdifferentstagesofthecrawl,i.e.,forthefirst100,200,300,…crawledwebpages.Figure16showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.Oureventmodel-basedfocusedcrawlercollectedmorerelevantwebpagesduringthecrawlingprocessthanthetopic-onlybaselinefocusedcrawlerinthesmall-scalesetting(1000webpagesonly).
51
Table8 Brusselsattackeventmodel
Keywords Weight
Topic
brussel 0.881 attack 0.541 airport 0.539 explos 0.381 wa 0.31 peopl 0.273 station 0.254 belgium 0.242 metro 0.197 terror 0.159
Location
Brussels 1 Belgium 0.37 Brussels Airport 0.174 Zaventem 0.174 Paris 0.123
Date 2016-03-22
Figure16 Performanceevaluationofeventmodel-basedfocusedcrawlerfor
Brusselsattack
Inthelarge-scalesetting,weranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect50Kwebpages.Thetwocrawlersstartedfromasetof~4000seedURLs.Figure17showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.Oureventmodel-based
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500 600 700 800 900 1000
Percentage4of4crawled4webpage4that4are4relevant4
(Harvest4Ratio)
Total4nubmer4of4crawled4webpages
Performance4Evaluation4of4Event4modelHbased4 Focused4Crawler4vs4Baseline4Focused4
Crawler4for4Brussels4Attack4Event
Baseline4Focused4Crawler4H Topic4Only Event4modelHbased4Focused4Crawler4H Topic4+4Loc4+4Date
52
focusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.7whilethebaselinetopic-onlyachievedaharvestratioof0.5.
Figure17Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforBrusselsattack(50Kwebpages)
Intheprevioustwoexperiments(CaliforniashootingandBrusselsattack),weperformedsmall-scaleandlarge-scalecrawls.Thereasonforthesmall-scaleexperimentswastoprovethepointthattheeventmodelefficientlyguidedthefocusedcrawlertofocusontherelevantpartoftheWeb,andthecrawlerasaresultretrievedmorerelevantwebpagesthanthetopic-onlybaselinefocusedcrawler.Thepurposeofthelarge-scaleexperimentsistoshowthattheperformanceofoureventmodel-basedfocusedcrawlerremainsbetterthanthetopic-onlybaselinefocusedcrawlerevenforlargenumbersofwebpages,i.e.,isscalable.Intheremainingexperiments(otherevents),weranonlythelarge-scalecrawlingexperiments.Wealreadyshowedtheeffectivenessofourapproachinthesmall-scaleexperimentsandweneedtovalidatethatthesameperformancepersistsat
53
largescale.Wedidn’tmanuallypreparesetsofseedURLsfortheremainingevents;weextractedthemfromthetweetcollectionsforeachevent.
6.2.3 OregonshootingWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect100Kwebpages.Thetwocrawlersstartedfromasetof~22KseedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.6(average)whilethebaselinetopic-onlyachievedaharvestratioof0.25(average).Figure18showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.
Figure18Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforOregonshooting(100Kwebpages)
Inthisexperiment,wehadahighernumberofseedURLsthanintheprevioustwoexperiments(22Kcomparedto4K).WedidnotcontrolthenumberofseedURLs.WeextractedandfilteredtheseedURLsfromtheevent’stweetcollection.ThenumberofseedURLsdependsonthesizeofthecorrespondingtweetcollection,whichdependsontheimpact/coverageandsizeoftheevent(bigorsmall).EventswithbigimpactwillleadtolargetweetcollectionsandthereforemoreURLsextracted.Thedefinitionofimpacthereinourcontextisrelatedtocoverage.The
54
Oregonshootingevent’stweetcollectionhadmoretweetsthantheprevioustwoeventsandthereforethereweremoreURLsextractedthantheprevioustwoevents.HavingmorestartingURLsmeanswehavemoreaccess/pointerstotherelevantpartoftheWebgraph,sowerantheexperimentsfor100Kwebpagesratherthan50K(likeinthefirsttwoexperiments).
6.2.4 EgyptairplanecrashWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect10Kwebpages.Thetwocrawlersstartedfromasetof~1100seedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.5whilethebaselinetopic-onlyachievedaharvestratioof0.4.Figure19showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.
Figure19Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforEgyptairplanecrash(10Kwebpages)
Aneventofbigimpactcouldhavesmallsizetweetcollection(asthecaseforthisevent)becausetherewasnotenoughcoveragefortheeventonTwitter,ortheeventdidn’tattractmuchattentionfromTwitterusers.Anotherpossiblereasonistherewereothermoreattracting/trendingtopicsthatattracted/drewattentionawayfromthatevent.
55
6.2.5 PanamapapersWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect100Kwebpages.Thetwocrawlersstartedfromasetof~18KseedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.6whilethebaselinetopic-onlyachievedaharvestratioof0.4.Figure20showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.
Figure20Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforPanamaPapers(100Kwebpages)
6.2.6 OrlandoshootingWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect50Kwebpages.Thetwocrawlersstartedfromasetof~2000seedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.7whilethebaselinetopic-onlyachievedaharvestratioof0.4.Figure21showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.
56
Figure21Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforOrlandoshooting(50Kwebpages)
6.2.7 ParisattacksWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect500Kwebpages.Thetwocrawlersstartedfromasetof~88KseedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedhigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.25whilethebaselinetopic-onlyachievedaharvestratioof0.18.Figure22showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.TheParisattackeventwasabigeventwithhugeimpactlocallyinFranceandinternationally.Wekeptcollectingtweetssincethestartoftheeventandforseveraldaysaftertheevent,whichisnotthecasefortheotherevents.Usuallythetweetcollectingprocessstopsonthesamedayorone/twodaysaftertheevent.ThestoppingpointdependsonthetimeTwitterusersstoppostingabouttheevent,whichtypicallypeaksonthedayoftheeventanddecreasesafterthat.Wenoteherethatthisexperiment(andallourexperiments)startedfromtheEnglishseedURLsonly(thesameappliesduringthecrawlingprocess;weare
57
workingontheEnglishlanguageonly).WeexcludedalltweetsnotinEnglish.EvenwhenlimitingtoEnglishonlytweets,westillhadaround88KseedURLstostartfrom.Theperformanceofthetwocrawlersdegradedattheendofthecrawlasexpected,because(asknowninthetopicalcrawlerliterature[17,26,28,40,41,46,47,65,66,74,75,81,83,89,90])aswegetfarfromtheseedURLs,wefindfewerrelevantwebpages.TherelevantcontentisconcentratedaroundtheseedURLs,sothefurtherawaywegofromtheseedURLs,thegreaterthechancethatwehitnon-relevantcontent.
Figure22Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforParisattack(500Kwebpages)
6.2.8 EcuadorearthquakeWeranthetwocrawlers(baselinetopic-onlycrawler,andoureventmodel-basedcrawler)tocollect100Kwebpages.Thetwocrawlersstartedfromasetof~11KseedURLs.Oureventmodel-basedfocusedcrawleroutperformedthebaselinetopic-onlyfocusedcrawlerandachievedahigherharvestratioduringandattheendofthecrawlingprocess.Oureventmodel-basedcrawlerachievedaharvestratioof0.75whilethebaselinetopic-onlyachievedaharvestratioof0.4.Figure23showstheperformanceofthetwocrawlersduringthedifferentstagesofthecrawlingprocess.
58
Figure23Largescaleperformanceevaluationofeventmodel-basedvs.topic-only
focusedcrawlersforEcuadorearthquake(100Kwebpages)
Inthischapterweaddressedresearchquestion3,whichistheeffectofusingoureventmodelontheperformanceofthefocusedcrawler.Weshowedthatusingtheeventmodelwithfocusedcrawlingleadstoachievingahigherharvestratio,thuscollectingmoreoftherelevantwebpagesthanthetraditionalfocusedcrawler.Thebetterperformancewasshowninsmallandlarge-scalecrawls.
59
7 WebpageSourceImportanceandSocialmedia-basedSeedSelection
Sofarwehaveexploitedcontent-basedfeaturesforfindingrelevantwebpages.Inthissection,weexploretwootherparametersthataffecttheabilityofthefocusedcrawlertofindmorerelevantwebpages:webpagesourceimportanceandseedURLs.
7.1 WebpageSourceImportanceAwebpagesourceisthewebsitethewebpagebelongsto,forexample,thewebpagewiththeURLhttp://www.cnn.com/2014/04/18/world/asia/malaysia-airlines-plane/index.htmlbelongstothesourcewebsitehttp://www.cnn.com.Theimportanceofawebpagesourcewillbedeterminedbasedonthenumberofrelevantwebpagesthatbelongtothewebpagesource.Thisheuristicexhibitstwocharacteristicsthatshouldensureagoodestimateofthewebpagesourceimportance:
1) Followsthelikelihoodpropertiesofthedata,2) Changesdynamicallywithnewdataobservedduringcrawling.
Thefirstcharacteristicensuresthatthemeasureweareusingisrealisticanddescribestheactualdatabeingcollected.Thesecondcharacteristicshowshowthemeasureadaptstochangesinthedatabeingobserved,andalsofollowschangesinthecontentbeingpublishedontheWWW.ThereareseveralreasonsforchoosingourmethodofestimatingsourceimportanceandnotconsideringPageRankandHubandAuthority[42,43,77,78]methods(asanexampleofsourcepopularitymeasures):
1- Dynamicvs.static(fixed),PageRank,hubandauthority,andout-degreemeasuresareallstaticorfixedmeasures.Theyneedtobecalculatedoffline(i.e.,requirethewholedatasetorpartofittocalculatethevaluesandthenareusedafterthatduringcrawling).Thisissimilartoonlineandofflinelearningmethods.Offlinemethodsusethetrainingdatatobuildthemodelandthenuseit.Onlinemethodsdon’tuse/requiretrainingdata;theylearnthemodelanduseitonline/duringcrawling.So,themodelisupdatedduringcrawling.
2- PageRankandhubandauthoritymethodsaretimeconsumingandcomputationallyintensive,soitwillbetimeconsumingtoadaptthemforanonlineversion.
3- PageRankandhubandauthorityaremethodsformeasuringpopularityandqualityofwebpages.Wecanimaginethatusingthemforestimatingthe
60
importanceofasourcewithrespecttoaspecificeventislikeusinggeneralWebcrawlersforcrawlingwebpagesaboutaspecificevent.Weexpectthatsuchpopularitymeasuresaretoogeneraltobeconsideredameasureforsourceimportance.Forexample,anunpopularwebsite(source)couldbeveryrelevanttoanevent,duetoitscontentorhavinglinkstootherrelevantwebpagesabouttheevent.Alsousingtopic-orientedPageRankandhubandauthoritymethodsislikeusingtopicalcrawlersforcrawlingaboutevents,whichisnotefficient;thatisakeypointofthisdissertation.
Soweneedadynamic,simplycalculated,andevent-specificmethodforestimatingthelikelihoodoffindinganewrelevantwebpagefromasourcebyusingthelikelihoodofthesourceinthecurrentlycrawledwebpagesorthediscoveredbutnotyetvisitedURLsinthefrontier.Thisissimilartoagraph-basedalgorithmwhichlearnsfromdifferentpathsinthegraphwhetheracertainpathwillleadtorelevantwebpagesevenifweencounternon-relevantonesinthemiddle.Thus,ifawebpageisnotrelevantbutitssourcehashighprobabilityofhavingevent-relevantwebpagesthenthereisahighprobabilitythatthecurrentlow-score(non-relevant)webpagewilllinktoarelevantwebpage.Thisapproachactslikeagreedyalgorithmwherethecrawlerwillcrawlmorefromthesourcewiththehighestimportance.Thecrawlerwillkeepcrawlingrelevantwebpagefromthemostimportantsourceuntilitnolongerfindsrelevantwebpages,andswitchestoutilizeanotherimportantsource.Wecouldexperimentwithseveralcandidatemethodsforestimatingsourceimportance,usingtheinformationaboutcurrentlycrawledwebpages,liketheonesin[1],namely(thefollowinglististakenfromthepaperin[33]withchanges):
a. NegativeAbsoluteBadfunction,wherethescoreofasourceisthenegativenumberofalreadycrawlednon-relevantwebpages;
b. BestScorefunction,wherethescoreofasourceisthemaximalscoreofoneofthediscoveredbutnotyetvisitedURLsthatbelongstothesource;
c. SuccessRatefunction,wherethescoreofasourceistheratiobetweenthenumberofrelevantwebpagescrawledandthenon-relevant;theratioisinitializedwithpriorparametersαandβwhichwesetto1:score(source)=(#relevant(source)+α)/(#non-relevant(source)+β);
d. ThompsonSamplingfunction,wherethescoreofasourceisarandomnumber,drawnfromabeta-distributionwithpriorparametersαand
61
β;inthiscasewetakeasthescoretherandomvalue:score(source)=Beta(#relevant(source)+α,#non-relevant(source)+β);weinitializedthepriorsαandβwith1;
e. AbsoluteGood*BestScorefunction,wherethescoreofasourceistheproductoftheabsolutenumberofalreadycrawledrelevantwebpagesandthebestscorefunctiondescribedin(b);
f. ThompsonSampling*BestScorefunction,wherethescoreofasourceistheproductoftheThompsonsamplingfunction(d)andthebestscorefunction(b);and
g. SuccessRate*BestScorefunction,wherethescoreofasourceistheproductofthesuccessratefunction(c)andthebestscorefunction(b).
Theresultsshownin[33]indicatethatthesuccessratefunction(c)isthebestscoringfunctionwithregardtocrawlingthelargestnumberofrelevantwebpages(i.e.,thehighestharvestratio).Thuswechosetousethesuccessratefunctionasourmethodforestimatingwebpagesourceimportance.Werananexperimentwiththeeventmodel-basedfocusedcrawlerandsourceimportance.Wecombinedthewebpagesourceimportancescorewiththeeventmodel-basedrelevancescoretoproducethefinalscoreofURLs.Onepossiblemethodofcombinationismultiplyingbothscorestogether,soURLswithhighwebpageimportancescoreandhighrelevancescoreswillgetahigherfinalscore.Wenoteherethatthewebpageimportancescoreiscalculatedduringthecrawlingprocessandisnotafixedvalue,butratheradynamicvaluethatchangesduringthecrawlingprocess.Atthebeginningofthecrawl,allsourceshavethesameinitialimportancescore.Whenanewwebpageisretrievedandfoundrelevant,thenthecorrespondingwebpagesourceimportancescoreisupdated.Inthiswaythemorewefindrelevantwebpagesfromasource,themoretheimportancescoreofthissourceincreases.Figure24showstheperformanceofeventmodel-basedfocusedcrawlingwithandwithoutsourceimportanceforBrusselsattackevent.WenoticefromFigure24thatbothcrawlersachievealmostthesameperformance,withthecrawlerwithsourceimportancestrugglinginthefirsthalfbecauseofthedynamicvalueofthesourceimportance.Wefurtherexaminedtheresultsofthetwocrawlersandfoundthatthecollectionproducedfromthecrawlerwithsourceimportancehadonly21uniquewebsites(webpagessources)whilethecrawlerwithnosourceimportancehad81uniquewebsites.Thecrawlerwithsourceimportancesucceededincollectingthesamenumberofrelevantwebpagesasthecrawlerwithnosourceimportance,but
62
fromfarfewerwebsites.Werantheexperimentalsoon3moreevents:Californiashooting(seeFigure25),Ecuadorearthquake(seeFigure26),andOrlandoshooting(seedFigure27).
Figure24EffectofsourceimportanceoneventfocusedcrawlingforBrusselsattack
event
Figure25EffectofsourceimportanceoneventfocusedcrawlingforCalifornia
shootingevent
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500 600 700 800 900 1000Percentage4of4craw
led4webpages4that4are4relevant
(Harvest4Ratio)
Total4number4of4crawled4webpages
Effect4of4Source4Importance4on4Event4Focused4Crawling
Event4Focused4Crawler4with4Source4Importance
Event4Focused4Crawler
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Percentage)of)crawled)pages)that)are)relevant
Number)of)pages)crawled
Effect)of)Source)Importance)on)event)focused)crawling)for)California)shooting)event
Event1Focused1Crawler Event1Focused1Crawler1 with1Source1Importance
63
Figure26EffectofsourceimportanceoneventfocusedcrawlingforEcuador
earthquakeevent
Figure27EffectofsourceimportanceoneventfocusedcrawlingforOrlando
shootingevent
0
0.2
0.4
0.6
0.8
1
1.2
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Perce
ntage4of4craw
led4webpages4that4a
re4re
levant
Number4 of4webpages4crawled
Effect4of4source4importance4on4event4focused4crawling4for4Ecuador4earthquake4event
Event4Focused4Crawler Event4Focused4Crawler4with4Source4Importance
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Percentage4of4crawled4webpages4that4a
re4re
levant
Number4 of4webpages4crawled
Effect4of4source4importance4on4event4focused4crawling4for4Orlando4 shooting4event
Event4Focused4Crawler Event4Focused4Crawler4with4Source4Importance
64
WecanseefromtheresultsinFigures24-27,thataddingwebpagesourceimportancetoeventmodel-basedfocusedcrawlingenhancestheperformanceandachievesabetterharvestratio.Theeventmodel-basedfocusedcrawlerwithsourceimportancestrugglesinthebeginningofthecrawlandthenmanagestoenhancetheperformanceattheendofthecrawl.Thereasonforthebadperformanceinthebeginningisthatthefocusedcrawleristryingtofindthegoodsources;onceitsettlesongoodonesitexploitstheirimportancetoreachmorerelevantwebpages.Inthissectionwecoveredresearchquestions4and5,whichaddresshowtomodelwebpagesourceimportanceandhowtointegratethemwitheventmodel-basedfocusedcrawlers.Weshowedthatusingwebpagesourceimportancehelpsthefocusedcrawlercollectrelevantwebpageswhilefocusingonimportantsources.Regardinghypothesis2,First,wehaveshownthatifweknowthebiasofdifferentwebsitesaccordingtosomecriteria,wecanincludeotherwebsitesinordertoreducebias.Second,ourexperimentalresultsshowthatintegratingeventinformationandsourceimportanceleadstoimprovedestimatesofrelevance.Additionaldemonstrationsofthesecapabilitiesaregiveninthefollowingsubsections.
7.2 SeedURLsforcrawlingThefactorsthataffectanycrawlingexperimentsarethenumberofdesiredwebpagestobecrawledandthenumberofseedURLsfedtothefocusedcrawler.SuccessfullycollectingthenumberofdesiredwebpagesdependsonthequalityandthenumberofseedURLswestartfrom.WeshouldstartfromURLsthatwilllink/leadtothelargestnumberofrelevantwebpages.TherearedifferenttypesofseedURLs.WeclassifytheseedURLswithregardtorelevanceandlinkingasfollows:relevantandlinkingtorelevantURLs,relevantandnotlinking,non-relevantandlinking,non-relevantandnotlinking.Wedon’twantthelasttype(totallyuselessURLs).Allothertypesareeithergoodintheirownright,orarepointingtoothergoodURLs.Table9summarizesthedifferenttypesofseedURLs.WedeterminewhetheraURLlinkstootherrelevantURLsornotbydownloadingthecorrespondingwebpageandextractingthelinksfromthewebpagecontent.WeestimatetherelevanceofaURLbyclassifyingtheURLtokensasrelevantornon-relevanttoanevent.URLtokensarethesetoftokensextractedfromtheURLaddressandtheURLanchortextthatappearsontheparentwebpage.
65
Table9DifferenttypesofseedURLs
LinkingtorelevantURLs NotlinkingtorelevantURLs
Relevant Hubwebpage Deadend,authoritywebpage
Non-Relevant Tunneling Reject;ignorethatpath
7.3 Semi-automatedSocialMedia-basedSeedURLGenerationSocialmediahasproventobeanimportantandrichassetforcollectingwebpagesaboutevents.EnsuringfullWebarchivecoverageofaneventisnotaneasytask,forseveralreasons.First,eventsdifferinimpactandimportance.Bigeventstendtolastforalongtime,impactmultipleplaces,andevensparkarangeofdebatesaboutdiversetopics.Second,tobuildaWebcollectionthatfullycoversaneventrequiressamplinganunbiasedsetofwebpagesfromtheWWW(whichishuge,heterogeneous,anddynamicallychanging).ThesizeoftheWWWmakesitdifficulttocollect,curate,andsampleanunbiasedsetofwebpagesusingmanualtechniques.Fortunately,focusedcrawlershavebeenproveneffective[2,26,46,66,74,75,89]inautomatingandacceleratingtheprocessofcollectingwebpages,startingfromasetofseedURLs.However,theabilityofthefocusedcrawlertofindrelevantanddiversewebpagesdependsonthequality(contentqualityandlinkingstructurequality)andthebroadcoverage(seedURLsfromdifferentwebpagesources)oftheseedURLs.Wehavebeenresearchingbuildingwebpage/tweetcollectionsaboutevents.Weidentifiedthreemainapproaches:1)theInternetArchive’sArchive-Itserviceforcollectingandarchivingwebpages,2)apairofarchivingtoolsforcollectingtweets,and3)eventmodel-basedfocusedcrawlingofwebpages.WeproposedahybridapproachforbuildingunbiasedcollectionsofwebpageswithhighcoverageusingseedURLsgeneratedfromsocialmediacontent(tweets),togetherwitheventmodel-basedfocusedcrawlers.Thetweetcollectionprocesses[1,2,74,91]ensurealargesampleofseedURLswithbroadandheterogeneousgenresofwebpages(horizontal/exploringaspect)whiletheeventmodel-basedfocusedcrawlerensureshighqualityandrelevantwebpages(vertical/exploitingaspect).
66
7.3.1 SelectingSeedURLsWeapplythefollowingstepsforselecting/curatingthesetofseedURLs:
• GrouplongURLsbysources/domains/hosts• CountthenumberoflongURLspersource(sourceimportance)• Sortsources(descending)accordingtonumberofURLsineachsource• PicktopKsourcesandthenchooseoneURLfromeachoftheKsources
ChoosingKuniquesourcesensuresdiversityoftheseedURLsandchoosingthetopKaccordingtosourceimportancemeasureensuresbroadcoverageandhighquality.AlthoughtweetcollectionsabouteventsareaveryrichsourceofseedURLs,theycontainalotofnoise(porn,jobmarketing,otherspam,andvariedothertypesofnon-relevanttweetsorURLs).OneimportantsteprequiredbeforeusingURLsextractedfromtweetsisanimportanceanalysisofeachURL/webpagesource,e.g.,consideringthedomainnameoftheURL.ThesetofseedURLsshouldbenormalized.Considerthelistbelow.URL1isthenormalizedversionofURL2.BothURLspointstothesamewebpage.
• URL1=www.cnn.com• URL2=www.cnn.com?utm_source=feedburner&utm_medium=twitter
TheURLsmentionedintweetsarenotallrelevant.Weneedtofilterthem.Weusedakeyword-basedfilteringmethod,whichincludesonlyURLsthathaveatleastoneofapre-definedsetofkeywords.Thesetofkeywordsiscreatedmanuallyforeachevent.Suchafilteringprocessiscloseinaccuracytoclassification,butmuchfaster.
7.3.2 SeedsURLDomain/SourceImportanceWedefinethewebpagesourceimportanceastheprobabilityoffindingmorerelevantwebpageswhenstartingwithaseedURLfromthatsource.Weassumethatthetopical/eventlocalitypropertyholdswherewebpagesaboutaneventlinktootherwebpagesaboutthesameevent.AlsoweassumethatURLs/webpagesfromasamesourceareconnected(i.e.,ifyoustartfromoneofthemyoucanreachtheothers).WeestimatethewebpagesourceimportancebycalculatingthenumberofURLsfromthesamesourceextractedfromthetweetcollection.Figure28showstheworkflowforextractingURLsfromtweetcollections.WeapplyourmethodsontheextractedURLstocalculatethesourceimportance.
67
Figure28Workflowforextracting,expanding,andselectingURLsfromtweets
WerananexperimentabouttheBrusselsattackevent.Weranoureventfocusedcrawlertocollect1000webpagesstartingfromdifferentsetsofseedURLs.ThesetsofseedURLstestedinourexperimentsdifferintwoaspects:thenumber(K)ofURLsandtheuniquenessoftheURLswithrespecttotheirwebsites.WeselectedtheseedURLsfromapoolofURLsextractedfromasetoftweetscollectedusingtheTwitterstreamingAPI.Table10summarizesthestatisticsfortheBrusselsattacktweetcollection.Figure29showsthelanguagedistributionofthetweets.MostofthetweetsareintheEnglishlanguage,whichisthelanguageweareworkedoninourresearch.Figures30and31showthenumberoftweets(withoutandwithURLs),andtheirdistributionacrosstime.Mostofthetweetswerepostedonthefirstdayoftheevent;theirnumberdecreaseswithtime.Figure32showsthedistributionofthesourcesaccordingtooursourceimportancemeasure.Weusedtheharvestratiomeasuretoevaluatetheoutputofthefocusedcrawlers.
68
Table10Brusselsattacktweetcollectionstatistics
Category Number
Alltweets 2,227,706
TweetsinEnglish(lang=en) 1,838,276
Tweetcreationdatedistribution:3/22/2016 1,253,152
TweetswithURLs 937,009
TweetswithURLcreationdatedistribution:3/22/2016
462,154
UniqueshortURLsextracted(lang=en) 113,402
UniquelongURLs 85,991(twitter.com=38,168)
Uniquedomains/sources 8,082(2980>=2,596>=10)
De-duplicatedURLs 74,698
URLswithkeywords“brussels,attack” 16,187
Figure29Brusselsattacktweetslanguagedistribution
0
200,000
400,000
600,000
800,000
1,000,000
1,200,000
1,400,000
1,600,000
1,800,000
2,000,000
en
und fr tr es nl
de ar el it th hi
ru in pl
ja svpt
daht tl fa fi
uk et
no csro ur lv hu sl
cyko iw zhbg sr
eu lt
mr is ta ne
gu
bn ka vi
te si
pa
am knhy
ml
ckb ps
or
NumberAofATweets
Language
LanguageADistribution
69
Figure30Brusselsattacktweetscreationdatedistribution
Figure31BrusselsattacktweetswithURLsdistribution
0
200,000
400,000
600,000
800,000
1,000,000
1,200,000
3/22/16 3/23/16 3/24/16 3/25/16 3/26/16 3/27/16 3/28/16
Number2of2Tweets
Tweet2Creation2 Date
Tweets2Date2Distribution
0
50,000
100,000
150,000
200,000
250,000
300,000
350,000
400,000
450,000
500,000
3/22/16 3/23/16 3/24/16 3/25/16 3/26/16 3/27/16 3/28/16
Number2of2Tweets2with2URLs
Tweets2Creation2 Date
Tweets2w/2URLs2Date2Distribution
70
Figure32BrusselsattackseedURLsdomainsdistribution
Table11summarizestheresultsofthedifferentsettingsregardingtheseedURLsfortheBrusselsattackevent.Thefirstrowisforthecaseofhavingkuniquedomains.WechooseoneURLfromeachdomain.Thesecondrowisforthecaseofhavingk/2uniquedomains;wechoose2URLsfromeachdomain.ThedomainsfromwhichwechoosetheURLsaresortedaccordingtohowmanyURLsbelongtothem.SothetopkdomainsaretheonesthathavethelargestnumbersofURLsbelongingtothem.ThecolumnsrepresentthedifferentvaluesforK(thedesirednumberofURLswewanttoselectasseeds).Astheresultsshows,asweincreasethenumberofseeds,thefocusedcrawlerfindsmorerelevantwebpages(leadingtoahighharvestratio).Also,distributingtheseedURLsacrossseveralwebsitesincreasestheabilityofthefocusedcrawlertofindmorerelevantwebpages.Table12showsthewebpagesourcedistributionintheresultingcrawledwebpages.Astheresultsshow,usingURLswithmoreuniquesourcesproducescollectionswithamorediversesetofsources,andthisincreasesbyincreasingthenumberofseeds.Forexample,startingfrom10seedURLsfromuniquesourcesledtoacollectionof1000webpagesfrom68uniquesourcesincontrasttotheoriginal31uniquesources.
0
50
100
150
200
250
300
350
400
450
500Nu
mber-of-URLs
Domains
Domains-Distribution
71
Table11HarvestratioforeventfocusedcrawlerusingtwomethodsofseedselectionwithdifferentnumbersofseedsforBrusselsattack
K=10 K=50 K=100TopKFrequentuniquewebsites 0.685 0.752 0.817TopK/2FrequentwebsiteswithmultipleURLsfromsamesource
0.645 0.763 0.775
Table12NumberofdifferentdomainsintheoutputcollectionsofcrawlingexperimentsforBrusselsattack
K=10 K=50 K=100TopKfrequentuniquewebsites 68 74 117
TopKfrequentwebsiteswithmultipleURLsfromsamesource
31 58 78
Tables13and14showtheresultsfortheOregonshootingevent,Tables15and16showtheresults fortheCaliforniashootingevent,andTables17and18showtheresults for theOrlandoshootingevent.For theOregonshootingevent,wesee thesameeffectasfortheBrusselsattackevent,havingseedURLsfromuniquedomains(rather thanhavingmultipleURLs from the samedomain) increases theabilityofthefocusedcrawlertofindmorerelevantwebpages. However, inthecaseK=100,therewasnoperformanceimprovement.Bothmethodsachievedthesameharvestratio(0.741)butthemethodusingURLsfromuniquedomainsproducedacollectionwith 135 unique domains while the method using multiple URLs from the samedomainproducedacollectionwithonly96uniquedomains.AnotherproblemwiththeOregonshootingcollectionisthatweranourexperiments10monthsaftertheeventhappened.Most of the seeds andwebpages about the eventno longer exist(give404errorasHTTPResponse).Thisaffectstheperformanceofbothcrawlers;addingmoreseedsdoesn’thelpinfindingrelevantwebpages.
Table13HarvestratioforeventfocusedcrawlerusingtwomethodsofseedsselectionwithdifferentnumbersofseedsforOregonshooting
K=10 K=50 K=100TopKFrequentuniquewebsites 0.717 0.744 0.741TopKFrequentwebsiteswithmultipleURLsfromsamesource
0.715 0.669 0.741
72
Table14NumberofdifferentdomainsintheoutputcollectionsofcrawlingexperimentsforOregonshooting
K=10 K=50 K=100TopKfrequentuniquewebsites 65 93 135
TopKfrequentwebsiteswithmultipleURLsfromsamesource
59 84 96
WealsoseethesamebehaviorintheCaliforniashootingeventasshowninTables15and16.CrawlingwithURLsfromuniquedomainsachievesbetterperformancethancrawlingwithmultipleURLsfromthesamedomain.IncreasingthenumberofseedURLsincreasestheperformanceofthefocusedcrawler.Also,inthecaseK=100,bothmethodsachieveaharvestratioofaround0.9,butthemethodusingURLsfromuniquedomainsproducedacollectionwith129uniquedomainswhilethemethodusingmultipleURLsfromsamedomainproducedacollectionwithonly101uniquedomains.AbetterwayistochooseseedURLsfromtype4(i.e.,HubURLs)(seeTable9)wheretheURLpointstoawebpageofrelevantcontentandthewebpagecontainsURLstootherrelevantwebpages.SinceweareautomatingtheprocessofselectingtheseedURLsfromthepoolofURLsextractedfromsocialmedia(i.e.,Twitter),wethinkweachievedareasonableperformancewithminimalornomanualwork.ThiscontraststothesituationwhenseedURLsarecuratedwithmanualwork.
Table15HarvestratioforeventfocusedcrawlerusingtwomethodsofseedsselectionwithdifferentnumbersofseedsforCaliforniashooting
K=10 K=50 K=100TopKFrequentuniquewebsites 0.7 0.7604 0.7818TopKFrequentwebsiteswithmultipleURLsfromsamesource
0.66 0.7218 0.7548
Table16Numberofdifferentdomainsintheoutputcollectionsofcrawling
experimentsforCaliforniashooting
K=10 K=50 K=100TopKfrequentuniquewebsites 60 97 129
TopKfrequentwebsiteswithmultipleURLsfromsamesource
54 64 101
WeexaminedtheseedURLsproducedinthecasek=100inboththeOregonandCaliforniashootingevents.WenoticedthatalthoughweselectedtheseedURLsfromimportantdomains,thetypeofthewebpagestheseURLslinkingtoareauthority/deadend,wherethecontentofthewebpagesisrelevantbuttheyarenot
73
linkingtootherrelevantwebpages.ThusaddingmoreoftheseURLsdidn’thelpthefocusedcrawlerfindorreachmorerelevantwebpages.Finally,theOrlandoshootingeventleadstothesamebehaviorasinthepreviousevents.AddingmoreURLsfromuniquedomainshelpsthefocusedcrawlerfindmorerelevantwebpages,asisshowninTable16.Inthecasek=100,addingmoreuniquedomainsachievedalmostthesameperformanceasthemethodofaddingmoreURLsfromthesamedomain(likeforthepreviouslymentionedevents).AreasonablejustificationforthatbehavioristhatwhenwepickURLsfromthetop100domains,thedomaindistributiongetswider,andweincludedomainswithalownumberofURLsfromthem(thetailofthedistribution),whencomparedtothemostfrequentdomainsatthetopofthelist.ThesedomainsdonotlinktomorerelevantURLsandthusdon’thelpthefocusedcrawlerreachmorerelevantpartsoftheWWW.Weverifiedthat,byexaminingthelast10domainsinthetop100domainsintheOrlandoshootingevent.WeexaminedthenumberofURLsthatarefromthelast10domainsinthecrawledwebpagesproducedattheendofthecrawl.Wefoundthat8outofthe10domainshad1URLonlyintheresultingcollectionofwebpages.AnoptimumselectionofseedURLsisatrade-offbetweenaddingmoreuniquedomainsandhavingseedURLsfromtype4seedURLs,whicharehighlyrelevantontheirown,andalsolinktootherrelevantURLs.WeusedthenumberofURLsfromadomaintoestimatetheprobabilitythataURLfromthatdomainwilllinktootherrelevantURLs.Abetterway(butrathercomputationallyexpensive)istobuildtheWebgraphoutoftheseedURLsandselectonlytheonesthathavehigherout-degreeorthatoptimizethecoverageanddiversitytrade-off.
Table17HarvestratioforeventfocusedcrawlerusingtwomethodsofseedsselectionwithdifferentnumbersofseedsforOrlandoshooting
K=10 K=50 K=100TopKfrequentuniquewebsites 0.6 0.709 0.722
TopKfrequentwebsiteswithmultipleURLsfromsamesource
0.5 0.6856 0.7158
Table18NumberofdifferentdomainsintheoutputcollectionsofcrawlingexperimentsforOrlandoshooting
K=10 K=50 K=100TopKfrequentuniquewebsites 157 181 220
TopKfrequentwebsiteswithmultipleURLsfromsamesource
13 157 181
74
8 ConclusionandFutureWorkWeproposedamodelandrepresentationforevents.Weshowedhowtorepresentaneventusingourmodel.Wecalculatedtheweightsofthethreeattributesofoureventmodelbyjointlyoptimizingtwoparameters-thenumberofkeywordsandthethresholdvalue-toyieldthebestF1-scoreevaluationmetriconamanuallylabeled(relevantandnon-relevant)datasetofURLsandwebpagesabouttheCaliforniashooting.Theresultsshowedthattheeventmodel,withtheseweightsemployed,caneffectivelyclassifyURLsandwebpagesastotheirrelevancetotheeventofinterest.Weincorporatedoureventmodelintofocusedcrawlingandshowedthatoureventmodel-basedfocusedcrawlerbuiltanevent-relatedWebcollectionmoreeffectivelythanthestate-of-the-artbest-firsttopic-onlyfocusedcrawlerontwodifferentevents:CaliforniashootingandBrusselsattack.Theresultsforsmall-scaleexperiments(collecting1000webpagesfrom38and23seedURLs,respectively)showedthatourevent-modelbasedfocusedcrawleroutperformedthetopic-onlyfocusedcrawlerbycollectingmorerelevantwebpagesaboutthetwoevents(i.e.,achievinghigherharvestratio).Weranexperimentsforlarge-scalecrawling(rangingfrom50K–500Kwebpages)on7differentevents:Californiashooting,Brusselsattack,Oregonshooting,Egyptairplanecrash,Panamapapers,Parisattack,andEcuadorearthquake.Weleveragedsocialmedia(i.e.,Twitter)toextractandselectseedURLsforcrawling.Oureventmodel-basedfocusedcrawleroutperformedthetopic-onlyfocusedcrawlerbycollectingmorerelevantwebpages.Weproposedandincorporatedwebpages’sourceimportanceintoourfocusedcrawler.Theresultsshowedthatusingwebpagesourceimportanceledtoanequivalentqualityeventrelatedcollection,relativetothebaseline,butrequiredfewersources.Finally,weshowedtheeffectoftheseedURLsonthequalityoftheresultingwebpagescollections.WedemonstrateduseofthesourceimportancemeasuretocurateandselecthighqualityseedURLsfromURLsextractedfromsocialmediacontent(tweets).
75
Wehavecoveredthefiveresearchquestionsandtherelatedhypotheses.Researchquestions1and2(andhypothesis1.1)werecoveredinChapter4:buildingtheeventmodelandrepresentationandusingitwiththefocusedcrawler.Wecoveredresearchquestion3(andhypothesis1.2)inChapter6,evaluatingtheeffectivenessofoureventmodelinfocusedcrawling.Finally,wecoveredresearchquestions4and5(andhypotheses1.3and2)inChapter7,modelingwebpagesourceimportanceandintegratingitwitheventmodel-basedfocusedcrawler.
8.1 ContributionsOurcontributionsinthisdissertationresearchare:
1. Designinganeventmodelthatcapturestheinformationneededforrepresentingevents(withdisastereventsasacasestudy).
2. Developinganevent-awarefocusedcrawlerthatusestheeventmodelforthetargeteventandforwebpagerepresentation,aswellasfordevelopinganewsimilarityfunctionthathelpsinwebpagerelevanceestimation.
3. Designingandincorporatingwebpagesourceimportancemodelintoafocusedcrawlersystem.
4. Developinganewmethodologyforsemi-automatedseedURLgenerationfromsocialmediacontent.
5. Buildinganeventdigitallibraryofevent-relatedobjects(text,metadatarecords,archives,andentities).
8.2 FutureWorkOureventmodelhascapturedthreeattributesforanevent(topic,location,anddate).Weplantoextendoureventmodelbyextractingandaddingorganizationsandparticipants;thatinformationwillrepresentthe‘Who’partinthe‘WhodidWhat,WhereandWhen’eventmodel.Thiswillenrichoureventmodelandconsequentlyshouldincreasetheeventmodel-basedfocusedcrawler’spowertoestimateandretrievemorerelevantwebpages.Further,withregardtofocusedcrawlingforlargeevents,weareintegratingourtweetcollectionefforts,thatalreadyhaveresultedinover1.2billiontweetsspreadacrossabout1000collection,withfollow-upfocusedcrawlingthatstartswithseedsthatcomefromtheURLsfoundinthosetweets.Ontheapplicationside,wealsoplantouseoureventmodeltoanalyzeandsummarizeacollectionofwebpages;thiscanworkforanycollectionaboutaparticularevent(e.g.,preparedthroughmanualcuration,orusingoureventfocusedcrawler[92]).Usingoureventmodel,wewillgeneratealistofindicativesentences,
76
andextractentitiestorepresentandsummarizeanevent.Therearemultiplealgorithmsandsoftwareimplementationsfortextsummarization,butwebelievethisconceptofcorpus/eventsummarizationisnewandworthinvestigation.Ourpreliminarystudyofsuchsummarizationsuggeststhatresultswillhavehighqualityandutility[92].Further,weplantorunmoreexperimentsondifferentkindsofeventsandtotestotherheuristicsforcombiningwebpagesourceimportancewitheventmodel-basedrelevancescores.Finally,wewillbuildaknowledgebaseofsources,foreachtypeofevent.TheknowledgebasewillincludealistofURLs,extractedfromsocialmediacontentaboutdifferenttypesofevents.ThelistofextractedURLswillbeusedtobuildalistofpairs:sourcesandtheirimportancescore(howmanyrelevantURLsarefromasource).Thislistcouldbeusedtocomputepriorsforthesourceimportancemodelduringcrawling.
77
References1. Farag,M.M.G.andE.A.Fox,Buildingandarchivingeventwebcollections:A
focusedcrawlerapproach,inBulletinofIEEETechnicalCommitteeonDigital
Libraries.2015.p.1-2.2. Farag,M.andE.A.Fox,FocusedCrawlingForEvents.InternationalJournalof
DigitalLibraries,SpecialIssueofWebArchiving-Inreview,2016.3. Magdy,M.andE.A.Fox,IntelligentEventFocusedCrawling,inThe11th
InternationalConferenceonInformationSystemsforCrisisResponseand
Management(ISCRAM)-Poster.2014:UniversityPark,Pennsilvenya,USA.4. O'Reilly,T.,WhatisWeb2.0:Designpatternsandbusinessmodelsforthenext
generationofsoftware.Communications&strategies,2007.1(1):p.17.5. IDEAL.IntegratedDigitalEventArchiveandLibrary.2016[cited2016April26];
Availablefrom:http://eventsarchive.org/.6. Internet_Archive.InternetArchive,Adigitallibraryoffreecontentandwayback
machine.2016[cited2016April26];Availablefrom:https://archive.org/.7. Farag,M.,P.Nakate,andE.A.Fox,BigDataProcessingofSchoolShooting
Archives,inProceedingsofthe16thACM/IEEE-CSonJointConferenceonDigital
Libraries.2016,ACM:Newark,NewJersey,USA.p.271-272.8. IDEAL_Collections.IDEALWebCollectionsandTweetArchives.2016[cited2016
April26];Availablefrom:http://eventsarchive.org/eventstable.9. Archive-It.Webarchivingservicesforlibrariesandarchives.2016[cited2016
April26];Availablefrom:https://archive-it.org/.10. G.Mohr,etal.IntroductiontoHeritrix,anArchivalQualityWebCrawler.in
Proceedingsofthe4thInternationalWebArchivingWorkshop(IWAW’04).2004.11. Fox,E.A.andJ.P.Leidig,DigitalLibraryApplications:CBIR,Education,Social
Networks,eScience/Simulation,andGIS.2014:Morgan&ClaypoolPublishers.12. Fox,E.A.andR.d.S.Torres,DigitalLibraryTechnologies:ComplexObjects,
Annotation,Ontologies,Classification,Extraction,andSecurity.2014:Morgan&ClaypoolPublishers.
13. Shen,R.,M.A.Goncalves,andE.A.Fox,KeyIssuesRegardingDigitalLibraries:EvaluationandIntegration.2013:Morgan&ClaypoolPublishers.
14. Fox,E.A.,M.A.Goncalves,andR.Shen,TheoreticalFoundationsforDigitalLibraries:The5S(Societies,Scenarios,Spaces,Structures,Streams)Approach.2012:Morgan&ClaypoolPublishers.
15. Salton,G.andC.Buckley,Term-weightingapproachesinautomatictextretrieval.InformationProcessing&Management,1988.24(5):p.513-523.
16. Salton,G.andM.J.McGill,IntroductiontoModernInformationRetrieval.1986:McGraw-Hill,Inc.
17. Pant,G.,P.Srinivasan,andF.Menczer,Crawlingtheweb,inWebDynamics.2004,Springer.p.153-177.
18. Manning,C.D.,etal.,IntroductiontoInformationRetrieval.2008:CambridgeUniversityPress.496.
78
19. Archive-It.Archive-ItCollections,SpontaneousEvents.2016[cited2016July];Availablefrom:https://archive-it.org/explore?show=Collections&fc=meta_Subject%3ASpontaneousevents.
20. Yang,S.,etal.AstudyofautomationfromseedURLgenerationtofocusedweb
archivedevelopment:theCTRnetcontext.inProceedingsofthe12thACM/IEEE-
CSjointconferenceonDigitalLibraries.2012.ACM.21. IDEAL.IDEALTweetCollections.2016[cited2016August24];Availablefrom:
http://hadoop.dlib.vt.edu/.22. Fox,E.A.andM.M.Farag,Reportontheworkshoponwebarchivinganddigital
libraries(WADL2013).SIGIRForum,2013.47(2):p.128-133.23. Fox,E.A.,Z.Xie,andM.Klein,WADL2016:ThirdInternationalWorkshoponWeb
ArchivingandDigitalLibraries,inProceedingsofthe16thACM/IEEE-CSonJoint
ConferenceonDigitalLibraries.2016,ACM:Newark,NewJersey,USA.p.293-294.
24. Fox,E.A.andZ.Xie,WebArchivingandDigitalLibraries(WADL),inProceedingsofthe15thACM/IEEE-CSJointConferenceonDigitalLibraries.2015,ACM:Knoxville,Tennessee,USA.p.303-303.
25. WARC.Informationanddocumentation--WARCfileformat-ISO28500:2009.2016[cited2016August24];Availablefrom:http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717.
26. Chakrabarti,S.,M.VandenBerg,andB.Dom,Focusedcrawling:anewapproachtotopic-specificWebresourcediscovery.ComputerNetworks,1999.31(11):p.1623-1640.
27. Batsakis,S.,E.G.Petrakis,andE.Milios,Improvingtheperformanceoffocused
webcrawlers.Data&KnowledgeEngineering,2009.68(10):p.1001-1013.28. Pant,G.andP.Srinivasan,Learningtocrawl:Comparingclassificationschemes.
ACMTransactionsonInformationSystems(TOIS),2005.23(4):p.430-462.29. Rennie,J.andA.McCallum.Efficientwebspideringwithreinforcementlearning.
inProceedingsoftheInternationalConferenceonMachineLearning.1999.Citeseer.
30. Grigoriadis,A.andG.Paliouras,Focusedcrawlingusingtemporaldifference-
learning,inMethodsandApplicationsofArtificialIntelligence.2004,Springer.p.142-153.
31. Singh,N.,etal.LargeScaleURL-basedClassificationUsingOnlineIncremental
Learning.inMachineLearningandApplications(ICMLA),201211thInternational
Conferenceon.2012.IEEE.32. Menczer,F.andA.E.Monge,Scalablewebsearchbyadaptiveonlineagents:An
InfoSpiderscasestudy,inIntelligentInformationAgents.1999,Springer.p.323-347.
33. Meusel,R.,P.Mika,andR.Blanco,FocusedCrawlingforStructuredData,inProceedingsofthe23rdACMInternationalConferenceonConferenceon
InformationandKnowledgeManagement.2014,ACM:Shanghai,China.p.1039-1048.
79
34. Ehrig,M.andA.Maedche.Ontology-focusedcrawlingofWebdocuments.inProceedingsofthe2003ACMsymposiumonAppliedcomputing.2003.ACM.
35. Dong,H.,F.K.Hussain,andE.Chang.Asurveyinsemanticwebtechnologies-
inspiredfocusedcrawlers.inDigitalInformationManagement,2008.ICDIM
2008.ThirdInternationalConferenceon.2008.IEEE.36. Yang,S.,etal.,CTRnetDLfordisasterinformationservices,inProceedingsofthe
11thannualinternationalACM/IEEEjointconferenceonDigitallibraries.2011,ACM:Ottawa,Ontario,Canada.p.437-438.
37. Vural,A.G.,B.B.Cambazoglu,andP.Karagoz,Sentiment-focusedwebcrawling.ACMTransactionsontheWeb(TWEB),2014.8(4):p.22.
38. Fu,T.,etal.,Sentimentalspidering:leveragingopinioninformationinfocused
crawlers.ACMTransactionsonInformationSystems(TOIS),2012.30(4):p.24.39. Almpanidis,G.,C.Kotropoulos,andI.Pitas,Combiningtextandlinkanalysisfor
focusedcrawling—Anapplicationforverticalsearchengines.InformationSystems,2007.32(6):p.886-908.
40. Diligenti,M.,etal.FocusedCrawlingUsingContextGraphs.inVLDB.2000.41. Pant,G.andP.Srinivasan,Linkcontextsinclassifier-guidedtopicalcrawlers.
KnowledgeandDataEngineering,IEEETransactionson,2006.18(1):p.107-122.42. Kleinberg,J.M.,etal.Thewebasagraph:measurements,models,andmethods.
inInternationalComputingandCombinatoricsConference.1999.Springer.43. Brin,S.andL.Page,Reprintof:Theanatomyofalarge-scalehypertextualweb
searchengine.Computernetworks,2012.56(18):p.3825-3833.44. Page,L.,etal.,ThePageRankcitationranking:bringingordertotheweb.1999:
TechnicalReport.StanfordInfoLab.45. DeAssis,G.T.,etal.Exploitinggenreinfocusedcrawling.inStringProcessingand
InformationRetrieval.2007.Springer.46. Pant,G.andP.Srinivasan,Predictingwebpagestatus.InformationSystems
Research,2010.21(2):p.345-364.47. Pant,G.andP.Srinivasan,StatusLocalityontheWeb:ImplicationsforBuilding
FocusedCollections.InformationSystemsResearch,2013.24(3):p.802-821.48. Chen,Y.,Anovelhybridfocusedcrawlingalgorithmtobuilddomain-specific
collections.2007,VirginiaPolytechnicInstituteandStateUniversity.49. Allan,J.,Introductiontotopicdetectionandtracking,inTopicdetectionand
tracking.2002,Springer.p.1-16.50. Volkova,S.,etal.,Animaldiseaseeventrecognitionandclassification.UsingWeb
DataintheMedicalDomain,2010:p.54.51. Westermann,U.andR.Jain,Towardacommoneventmodelformultimedia
applications.IEEEMultiMedia,2007.14(1):p.19-29.52. Strötgen,J.,M.Gertz,andC.Junghans.Anevent-centricmodelformultilingual
documentsimilarity.inProceedingsofthe34thinternationalACMSIGIR
conferenceonResearchanddevelopmentinInformationRetrieval.2011.ACM.53. Li,Z.,etal.,Aprobabilisticmodelforretrospectivenewseventdetection,in
Proceedingsofthe28thannualinternationalACMSIGIRconferenceonResearch
80
anddevelopmentininformationretrieval.2005,ACM:Salvador,Brazil.p.106-113.
54. Ha-Thuc,V.,etal.Newseventmodelingandtrackinginthesocialwebwith
ontologicalguidance.inSemanticComputing(ICSC),2010IEEEFourth
InternationalConferenceon.2010.IEEE.55. Ha-Thuc,V.,etal.Arelevance-basedtopicmodelfornewseventtracking.in
Proceedingsofthe32ndinternationalACMSIGIRconferenceonResearchand
developmentininformationretrieval.2009.ACM.56. Becker,H.,M.Naaman,andL.Gravano,Beyondtrendingtopics:Real-world
eventidentificationonTwitter,inFifthInternationalAAAIConferenceonWeblogsandSocialMedia.2011:Barcelona,Spain.
57. Parikh,R.andK.Karlapalem,ET:eventsfromtweets,inProceedingsofthe22ndInternationalConferenceonWorldWideWeb.2013,ACM:RiodeJaneiro,Brazil.p.613-620.
58. Ritter,A.,etal.,OpendomaineventextractionfromTwitter,inProceedingsofthe18thACMSIGKDDinternationalconferenceonKnowledgediscoveryanddata
mining.2012,ACM:Beijing,China.p.1104-1112.59. Ritter,A.,etal.,WeaklySupervisedExtractionofComputerSecurityEventsfrom
Twitter,inProceedingsofthe24thInternationalConferenceonWorldWideWeb.2015,ACM:Florence,Italy.p.896-905.
60. Strotgen,J.,M.Gertz,andC.Junghans,Anevent-centricmodelformultilingual
documentsimilarity,inProceedingsofthe34thinternationalACMSIGIR
conferenceonResearchanddevelopmentinInformationRetrieval.2011,ACM:Beijing,China.p.953-962.
61. Yom-Tov,E.andF.Diaz,Locationandtimelinessofinformationsourcesduring
newsevents,inProceedingsofthe34thinternationalACMSIGIRconferenceon
ResearchanddevelopmentinInformationRetrieval.2011,ACM:Beijing,China.p.1105-1106.
62. Lakka,C.,etal.,ABayesiannetworkmodelingapproachforcrossmediaanalysis.SignalProcessing:ImageCommunication,2011.26(3):p.175-193.
63. Gossen,G.,E.Demidova,andT.Risse.iCrawl:ImprovingtheFreshnessofWeb
CollectionsbyIntegratingSocialWebandFocusedWebCrawling.inProceedingsofthe15thACM/IEEE-CSJointConferenceonDigitalLibraries.2015.Knoxville,Tennessee,USA.
64. AlNoamany,Y.,M.C.Weigle,andM.L.Nelson,Detectingoff-topicpagesinwebarchives.ResearchandAdvancedTechnologyforDigitalLibraries,Springer,2015:p.225-237.
65. Menczer,F.,etal.Evaluatingtopic-drivenwebcrawlers.inProceedingsofthe24thannualinternationalACMSIGIRconferenceonResearchanddevelopment
ininformationretrieval.2001.ACM.66. Menczer,F.,G.Pant,andP.Srinivasan,Topicalwebcrawlers:Evaluating
adaptivealgorithms.ACMTransactionsonInternetTechnology(TOIT),2004.4(4):p.378-419.
81
67. Srinivasan,P.,F.Menczer,andG.Pant,Ageneralevaluationframeworkfor
topicalcrawlers.InformationRetrieval,2005.8(3):p.417-447.68. Borlund,P.,TheconceptofrelevanceinIR.JournaloftheAmericanSocietyfor
informationScienceandTechnology,2003.54(10):p.913-925.69. Hjørland,B.,Thefoundationoftheconceptofrelevance.JournaloftheAmerican
SocietyforInformationScienceandTechnology,2010.61(2):p.217-237.70. Schamber,L.,RelevanceandInformationBehavior.Annualreviewofinformation
scienceandtechnology(ARIST),1994.29:p.3-48.71. Saracevic,T.,Relevance:Areviewoftheliteratureandaframeworkforthinking
onthenotionininformationscience.PartIII:Behaviorandeffectsofrelevance.JournaloftheAmericanSocietyforInformationScienceandTechnology,2007.58(13):p.2126-2144.
72. Mizzaro,S.,Relevance:Thewholehistory.JASIS,1997.48(9):p.810-832.73. Voorhees,E.M.,Variationsinrelevancejudgmentsandthemeasurementof
retrievaleffectiveness.Informationprocessing&management,2000.36(5):p.697-716.
74. Gossen,G.,E.Demidova,andT.Risse.iCrawl:ImprovingtheFreshnessofWeb
CollectionsbyIntegratingSocialWebandFocusedWebCrawling.inProceedingsofthe15thACM/IEEE-CEonJointConferenceonDigitalLibraries.2015.ACM.
75. Batsakis,S.,E.G.M.Petrakis,andE.Milios,Improvingtheperformanceoffocused
webcrawlers.DataKnowl.Eng.,2009.68(10):p.1001-1013.76. Salton,G.,A.Wong,andC.S.Yang,AVectorSpaceModelforAutomaticIndexing.
CommunicationsoftheACM,1975.18(11):p.613-620.77. Brin,S.andL.Page,Theanatomyofalarge-scalehypertextualWebsearch
engine.ComputernetworksandISDNsystems,1998.30(1-7):p.107-117.78. Cho,J.,H.Garcia-Molina,andL.Page,EfficientCrawlingThroughURLOrdering,
inSeventhInternationalWorld-WideWebConference(WWW1998).1998:Brisbane,Australia.
79. Heydon,A.andM.Najork,Mercator:Ascalable,extensiblewebcrawler.WorldWideWeb,1999.2(4):p.219-229.
80. Kobayashi,M.andK.Takeda,Informationretrievalontheweb.ACMComput.Surv.,2000.32(2):p.144-173.
81. Chakrabarti,S.,MiningtheWeb:DiscoveringKnowledgefromHyperTextData.2002:ScienceandTechnologyBooks.350.
82. Castillo,C.,Effectivewebcrawling.SIGIRForum,2005.39(1):p.55-56.83. Chakrabarti,S.,K.Punera,andM.Subramanyam,Acceleratedfocusedcrawling
throughonlinerelevancefeedback,inProceedingsofthe11thinternationalconferenceonWorldWideWeb.2002,ACM:Honolulu,Hawaii,USA.p.148-159.
84. Atkinson,M.D.,etal.,Min-maxheapsandgeneralizedpriorityqueues.CommunicationsoftheACM,1986.29(10):p.996-1000.
85. Min-max_heap.Min-maxheap.2016[cited2016August24];Availablefrom:https://en.wikipedia.org/wiki/Min-max_heap.
86. Yom-Tov,E.andF.Diaz,Outofsight,notoutofmind:ontheeffectofsocialand
physicaldetachmentoninformationneed,inProceedingsofthe34th
82
internationalACMSIGIRconferenceonResearchanddevelopmentinInformation
Retrieval.2011,ACM:Beijing,China.p.385-394.87. Foley,J.,M.Bendersky,andV.Josifovski,LearningtoExtractLocalEventsfrom
theWeb,inProceedingsofthe38thInternationalACMSIGIRConferenceon
ResearchandDevelopmentinInformationRetrieval.2015,ACM:Santiago,Chile.p.423-432.
88. Baeza-Yates,R.andB.Ribeiro-Neto,Moderninformationretrieval.Vol.463.1999:ACMpress,NewYork.
89. Aggarwal,C.C.,F.Al-Garawi,andP.S.Yu.IntelligentcrawlingontheWorldWide
Webwitharbitrarypredicates.inProceedingsofthe10thinternationalconferenceonWorldWideWeb.2001.ACM.
90. Liu,H.,J.Janssen,andE.Milios,UsingHMMtolearnuserbrowsingpatternsfor
focusedwebcrawling.Data&KnowledgeEngineering,2006.59(2):p.270-291.91. Farag,M.andE.A.Fox,Whichwebpageshouldwecrawlfirst?Socialmedia-
basedwebpagesourceimportanceguidance,inWorkshoponWebArchivingand
DigitalLibraries(WADL2016)-JointConferenceonDigitalLibraries(JCDL2016).2016:Newark,NJ,USA.
92. Farag,M.andE.A.Fox,Webarchivecontentanalysis.2015:PresentedatInternationalInternetPresentationConsortiumGeneralAssemblyIIPC2015,California,USA.