CS6604 Digital Libraries
Global Events Team Final Presentation
Presenters: Liuqing Li, Islam Harb, Andrej Galad
{liuqing, iharb, agalad}@vt.edu
Instructor: Dr. Edward A. Fox
Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061
April 27, 2017
Outline
• Background
• Implementation
  • Data Collection
  • Data Processing
  • Data Visualization
• Future Work
• Acknowledgement
Background
• GETAR*
  • Global Event and Trend Archive Research
  • Architecture
* Edward A. Fox, Donald Shoemaker, Chandan Reddy, Andrea Kavanaugh, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR), NSF grant IIS-1619028, 2017-2019. http://eventsarchive.org
Implementation – Architecture
[Architecture diagram. Stages: Data Collection, Data Processing, Data Visualization. Components: Event Focused Crawler (EFC), WARC Files, CDX Writer, CDX Files, ArchiveSpark, Apache Spark, Stanford NER, Regular Expression, Score Function, Entity-based Results, Standalone HBase, Web Application]
Events of Interest
School Shooting Events                     Year
Virginia Tech Shooting                     2007
Northern Illinois University Shooting      2008
Dunbar High School Shooting                2009
University of Alabama Shooting             2010
Worthing High School Shooting              2011
Sandy Hook Elementary School Shooting      2012
Sparks Middle School Shooting              2013
Reynolds High School Shooting              2014
Umpqua Community College Shooting          2015
Townville Elementary School Shooting       2016
Focused Crawler – Collecting / Archiving
[Flowchart. Steps: START; Manually Curate Seeds; URLs Queue; Download Page; Process Page & Convert into WARC Format; Extract URLs; Calculate Relevancy. Decisions: Relevant? (Yes / No); All URLs? (Yes / No). Outcomes: Append Result to warc.gz Event File; Discard; END]
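The crawl loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the team's crawler: the in-memory FAKE_WEB, the fetch stand-in, the keyword-overlap relevancy score, the 0.5 threshold, and the choice to follow links only from relevant pages are all assumptions made for the example.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://seed.example/a": ("Virginia Tech shooting campus victims",
                              ["http://seed.example/b", "http://seed.example/c"]),
    "http://seed.example/b": ("Recipe for apple pie", []),
    "http://seed.example/c": ("Shooting suspect identified on campus",
                              ["http://seed.example/a"]),
}


def fetch(url):
    """Stand-in for the Download Page step; a real crawler would issue an HTTP request."""
    return FAKE_WEB.get(url, ("", []))


def relevancy(text, keywords):
    """Assumed scoring rule: fraction of event keywords present in the page text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & keywords) / len(keywords) if keywords else 0.0


def focused_crawl(seeds, keywords, threshold=0.5, max_pages=100):
    """Breadth-first crawl that keeps relevant pages and discards the rest."""
    queue = deque(seeds)          # URLs Queue, initialised with curated seeds
    seen = set(seeds)
    archived, discarded = [], []
    while queue and len(archived) + len(discarded) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        if relevancy(text, keywords) >= threshold:
            archived.append(url)  # a real crawler would append a WARC record here
            for link in links:    # assumption: only relevant pages contribute new URLs
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        else:
            discarded.append(url)
    return archived, discarded
```

Calling focused_crawl(["http://seed.example/a"], {"shooting", "campus"}) visits the three fake pages once each and separates the two event-related pages from the unrelated one.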
WARC Libraries
• Wget (Version 1.14 or later)
• Wpull
• WARCIO: WARC (and ARC) Streaming Library
  • Python 2.7+ and 3.3+
  • Post-Processing: Read/Write WARC format
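To make the WARC format these libraries read and write more concrete, here is a hand-rolled sketch that serializes a single WARC/1.0 response record using only the Python standard library. It does not use the WARCIO, Wget, or Wpull APIs; the function names build_warc_record and append_record are hypothetical, and a production archive should rely on one of the libraries above.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, http_payload, record_date=None):
    """Serialize one WARC/1.0 'response' record (header, blank line, block) as bytes."""
    record_date = record_date or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", record_date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers) + "\r\n"
    # Every record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + http_payload + b"\r\n\r\n"


def append_record(fileobj, record_bytes):
    """Append a record as its own gzip member, the usual layout of .warc.gz files."""
    fileobj.write(gzip.compress(record_bytes))
```

Compressing each record as a separate gzip member is what lets an index point at a byte offset in the .warc.gz file and decompress just that record.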
Ten Events Collections
• Naming Convention
  • [location]_[year].warc.gz
Tools for Data Processing
• ArchiveSpark
  • Apache Spark framework for Web Archives
  • Easy data extraction
  • Input: WARC and CDX files
• CDX Writer
  • Python script to create CDX files of WARC files
  • Format: CDX N b a m s k r M S V g
    • e.g., edu,vt,cnre)/ 20170422005601 http://cnre.vt.edu text/html 200 BT3ILJXROIILHBKQPNYDUCUVZRDKG3OA - - 947820104749 data/Virginia-Tech-Shooting_20070416.warc.gz
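As an illustration of the 11-field CDX format named above, the sketch below splits one index line into named fields. The readable field names are my own labels for the letters N b a m s k r M S V g, and the sample line is invented for the example (its size and offset values in particular are made up); it is not the CDX Writer script itself.

```python
from collections import namedtuple

# Assumed readable names for the CDX field letters N b a m s k r M S V g.
CDX_FIELDS = [
    "massaged_url",     # N
    "timestamp",        # b
    "original_url",     # a
    "mime_type",        # m
    "status",           # s
    "digest",           # k
    "redirect",         # r
    "meta_tags",        # M
    "compressed_size",  # S
    "offset",           # V
    "filename",         # g
]
CdxEntry = namedtuple("CdxEntry", CDX_FIELDS)

# Hypothetical sample line; not taken from the project's data.
SAMPLE = ("com,example)/ 20170422005601 http://example.com/ text/html 200 "
          "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 data/example_2017.warc.gz")


def _int_or_none(value):
    """CDX uses '-' for missing values, so only convert pure digit strings."""
    return int(value) if value.isdigit() else None


def parse_cdx_line(line):
    """Split a space-delimited CDX line into a CdxEntry with numeric fields converted."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(CDX_FIELDS), len(parts)))
    entry = CdxEntry(*parts)
    return entry._replace(
        status=_int_or_none(entry.status),
        compressed_size=_int_or_none(entry.compressed_size),
        offset=_int_or_none(entry.offset),
    )
```

The offset and compressed_size fields are what allow a reader to seek directly to one record inside the named .warc.gz file.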
Data Preprocessing
• Webpage Cleaning
  • Extract Raw Text
    • payload.string.html.body.text
  • Remove jQuery & JavaScript
    • {WPGroHo.syncProfileData(hash,id);}, …
  • Remove tags
    • <br>, <p>, …
  • Remove markers
    • *, |, +, …
  • Remove stop words
    • a, about, the, …
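The cleaning steps listed above could look roughly like this in plain Python. The regular expressions, the tiny stop-word set, and the function name clean_page are assumptions for illustration; the team's pipeline ran on Spark with ArchiveSpark and may have cleaned pages differently.

```python
import re

# Tiny illustrative subset; a real pipeline would use a full stop-word list.
STOP_WORDS = {"a", "about", "the", "and", "of", "in"}


def clean_page(html):
    """Strip scripts, leftover JavaScript snippets, tags, markers, and stop words."""
    # Remove whole <script> and <style> elements, including their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Remove brace-delimited JavaScript residue such as {WPGroHo.syncProfileData(hash,id);}.
    text = re.sub(r"\{[^{}]*\}", " ", text)
    # Remove remaining tags such as <br> and <p>.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Remove marker characters.
    text = re.sub(r"[*|+]", " ", text)
    # Drop stop words, comparing case-insensitively.
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

Script elements are removed before generic tags so that their JavaScript bodies do not survive as plain text.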
Data Processing
• Entity Extraction
  • Basic Parsing
    • event name and date
  • Stanford NER (Integrated model)
    • entities, shooter name
  • Regular Expression
    • event date
    • shooter name and age
    • number of victims
    • weapon list
• Score Function
  • tf * df
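A rough sketch of the regular-expression and score-function steps follows. The patterns are guesses shaped by the example values shown elsewhere in the deck ("23-year-old", "32 victims"), and the definitions of tf (total occurrences across documents) and df (number of documents containing the term) are assumptions, since the slide only gives the product tf * df.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical patterns; the project's actual expressions are not shown in the slides.
AGE_RE = re.compile(r"\b\d{1,3}-year-old\b")
VICTIMS_RE = re.compile(r"\b\d+\s+victims\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b(?:%s)\s+\d{1,2},\s+\d{4}\b" % MONTHS)


def extract_features(text):
    """Return the first match for each feature, or None when nothing matches."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(0) if match else None

    return {
        "shooter_age": first(AGE_RE),
        "shooting_victims": first(VICTIMS_RE),
        "event_date": first(DATE_RE),
    }


def score_terms(documents):
    """Score each term as tf * df under the assumed definitions above."""
    tf, df = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        tf.update(tokens)       # every occurrence counts toward tf
        df.update(set(tokens))  # each document counts once toward df
    return {term: tf[term] * df[term] for term in tf}
```

Unlike tf-idf, multiplying by df rewards terms that recur across many pages of the same event collection, which suits picking representative terms for a single event.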
HBase
• Built-in ImportTsv Utility
  • Import Data into HBase
Table Name       globalevents
Row Key          Event_Date + Event Hash Value (e.g., 20070416217787922)
Column Family    event
Columns
  event:name               Virginia Tech Shooting
  event:date               20070416
  event:shooter_age        23-year-old
  event:shooting_victims   32 victims
  event:entities           Virginia;Tech;VA;University;…
  event:entities_count     146900;62415;13940;7732;…
  event:entities_url       url1,url2,url3,url4,url6;url2,url3,url4,url5;url1,url3,url4,url5,url6;…
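The row layout above can be turned into a tab-separated line for ImportTsv along these lines. The slides do not say which hash produces the "Event Hash Value", so zlib.crc32 is used purely as a stand-in, and the column subset and helper names are hypothetical.

```python
import zlib

# Subset of the columns from the table above, in the order they appear in the TSV.
COLUMNS = ["event:name", "event:date", "event:shooter_age", "event:shooting_victims"]


def make_row_key(event_date, event_name):
    """Event_Date + hash of the event name (crc32 is an assumed stand-in hash)."""
    return "%s%d" % (event_date, zlib.crc32(event_name.encode("utf-8")))


def to_tsv_line(event):
    """Build one TSV line: row key first, then the column values in COLUMNS order."""
    values = [make_row_key(event["event:date"], event["event:name"])]
    values += [event.get(col, "") for col in COLUMNS]
    # Tabs and newlines inside values would break the TSV layout, so flatten them.
    return "\t".join(v.replace("\t", " ").replace("\n", " ") for v in values)


# ImportTsv is told the column order with an option of the form
#   -Dimporttsv.columns=HBASE_ROW_KEY,event:name,event:date,...
# followed by the table name (globalevents) and the input path.
```

Putting the date first in the row key keeps events sorted chronologically in HBase, which is convenient for range scans over a time window.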
Data Processing – Demo
• Key Stages
  • Initialization
    • Create Spark Session
    • Create NLP Core
    • Create Storage
  • Processing
    • Extract Event Name / Date / URL
    • Extract Named Entities
    • Extract Other Event Features
  • Export and Import
    • Generate TSV file
    • Import TSV file into HBase
Global Events Viewer
• Efficient visualization of long-term global events
  • Show representative terms -> link to corresponding URLs
  • Visualize events' trends over time (time series)
• Java 7 Spring Boot Web application
  • Build system - Gradle
  • Embedded Tomcat Web server
  • Backend - HBase, in-memory
  • Frontend - D3.js, Bootstrap
https://github.com/dedocibula/global-events-viewer
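To show how stored entity data could feed the "representative terms -> URLs" view, here is a small Python sketch that decodes the three delimited entity columns into word-cloud entries. The viewer itself is a Java application and its code is not reproduced here; the delimiters are inferred from the example values in the HBase table, and the "text"/"size"/"urls" keys are an assumed output shape.

```python
def parse_entities(entities, counts, urls):
    """Zip the ';'-separated entity columns into a list sorted by descending count."""
    terms = entities.split(";")
    freqs = [int(c) for c in counts.split(";")]
    # Each term has its own ','-separated group of URLs.
    url_groups = [group.split(",") for group in urls.split(";")]
    if not (len(terms) == len(freqs) == len(url_groups)):
        raise ValueError("entity columns have mismatched lengths")
    cloud = [
        {"text": term, "size": freq, "urls": group}
        for term, freq, group in zip(terms, freqs, url_groups)
    ]
    return sorted(cloud, key=lambda entry: entry["size"], reverse=True)
```

Each entry carries its own URL list, so selecting a term in the word cloud can populate the URL list without a second lookup.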
Global Events Viewer – Demo
• Key Components
  • Word Cloud, Range Selection, URL List, Trends
Problems Faced
Data Collection
• Encoding problems (UTF-8, ASCII, and others)
• Getting more relevant seeds for old events
Data Processing
• Lack of documentation (ArchiveSpark)
• Version conflicts (CDX Writer, Kernel in Jupyter)
• JVM issue (Spark)
Data Visualization
• Spring Boot IntelliJ setup
• jQuery UI
Lessons Learned
Data Collection
• WARCIO
• Focused Crawler
Data Processing
• ArchiveSpark
• Spark & Scala (Map/Reduce Process)
Data Visualization
• D3 Word Cloud
• D3 Dynamic Line Charts
Future Work
Data Collection
• Wayback Machine
• Automatic Routine for Focused Crawler
• Event Extension (Sources, Time, Space)
Data Processing
• Standalone Mode -> Cluster Mode
• Named Entity Recognizer
• Automatic Processing (CDX Writer and HBase)
Data Visualization
• Localization – Datamaps
• Weapons
Acknowledgement
Projects
• NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
• NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
Organizations
• Internet Archive
• L3S Research Center
Persons
• Instructor: Dr. Edward A. Fox
• Alumnus: Dr. Mohamed Magdy Farag
• Labmates: Prashant Chandrasekar, Xuan Zhang
Thank you!
Questions?