
Post on 29-Sep-2020


Page 1: Web Archive Analysis - Rutgers Universitywp.comminfo.rutgers.edu/.../07/Vinay_WebResearchIA_June2014_Vinay … · 21 Archive Analysis Workshop Generate derivatives: CDX, WAT, Parsed

Web Archive Analysis

Vinay Goel

Senior Data Engineer

Internet Archive

[email protected]

Access Web Archive Data

Wayback Machine

Search

Enable Research & Analysis

Analysis Workflow

Data: WARC

Data written by web crawlers

Web ARchive Container File - WARC (ISO standard)

Revision of the ARC file format

Each file contains a series of concatenated records

– Full HTTP request/response records

– Metadata records (links, crawler path, encoding etc.)

– Records to store duplicate detection events

– Records to support segmentation and conversion
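
To make "a series of concatenated records" concrete, here is a minimal sketch of walking through the records in a WARC stream. It assumes an uncompressed file in which each record is a version line, a block of named headers, a blank line, and a payload whose size is given by Content-Length. Real archives are usually gzip-compressed and carry more headers than this sample; read_warc_records and make_record are hypothetical helper names, not part of any toolkit mentioned in the slides.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        # Skip the blank lines that separate one record from the next.
        while version in (b"\r\n", b"\n"):
            version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        headers = {}
        for line in iter(stream.readline, b""):
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, target_uri, body):
    """Build a simplified record for the demo (real records carry more headers)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two made-up records: an HTTP response and a metadata record.
sample = make_record(
    "response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello"
) + make_record("metadata", "http://example.org/", b"outlink: http://example.org/a")

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

Reading the payload by length rather than by delimiter is what lets records of different types sit back to back in one file.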

Derived Data: CDX

Index for Wayback Machine

Space delimited text file

Contains only essential fields needed by Wayback

– URL, Timestamp, Content Digest

– MIME type, HTTP Status Code

– Redirect URL, meta tags, size

– WARC filename and file offset of record
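
As an illustration of the fields listed above, the sketch below splits one space-delimited CDX line into named fields. The field order is an assumption (it follows a common 11-column layout; a given CDX file declares its own order in its header line), and the sample line, including its digest and filename, is made up.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    # Assumed column order; check the header line of the actual CDX file.
    url_key: str
    timestamp: str
    original_url: str
    mime_type: str
    status_code: str
    digest: str
    redirect_url: str
    meta_tags: str
    size: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line into named fields."""
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(fields)}")
    return CdxRecord(*fields)


# Made-up example line; "-" marks an empty field.
line = (
    "org,example)/ 20140601120000 http://example.org/ text/html 200 "
    "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1043 5120 example-00001.warc.gz"
)
record = parse_cdx_line(line)
print(record.original_url, record.timestamp, record.filename, record.offset)
```

The filename and offset pair is what lets a replay tool jump straight to the matching record inside a WARC file instead of scanning it.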

Wayback Machine

Growth of content

Growth of content

Rate of duplication

Breakdown by Year-First-Crawled

Log Analysis (Hive/Pig/Giraph)

CDX Warehouse

Crawl Log Warehouse

Distribution of HTTP status codes, MIME types

Find timeout errors, duplicate content, crawler traps, robots exclusions

Trace path of the crawler
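
The slide names Hive, Pig and Giraph for this work; the sketch below only illustrates the same aggregations on a handful of in-memory rows using plain Python. The rows, their keys and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up rows standing in for records from a CDX warehouse.
rows = [
    {"url": "http://example.org/", "status": "200", "mime": "text/html", "digest": "AAA"},
    {"url": "http://example.org/home", "status": "200", "mime": "text/html", "digest": "AAA"},
    {"url": "http://example.org/logo.png", "status": "200", "mime": "image/png", "digest": "BBB"},
    {"url": "http://example.org/old", "status": "404", "mime": "text/html", "digest": "CCC"},
]


def field_distribution(rows, field):
    """Count how often each value of `field` occurs."""
    return Counter(row[field] for row in rows)


def duplicate_digests(rows):
    """Map each content digest seen under more than one URL to those URLs."""
    urls_by_digest = defaultdict(list)
    for row in rows:
        urls_by_digest[row["digest"]].append(row["url"])
    return {d: urls for d, urls in urls_by_digest.items() if len(urls) > 1}


status_counts = field_distribution(rows, "status")
mime_counts = field_distribution(rows, "mime")
duplicates = duplicate_digests(rows)
print(status_counts, mime_counts, duplicates)
```

At archive scale the same group-and-count logic would be expressed as a query over the warehouse tables rather than run in memory.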

Derived Data: Parsed Text

Input to build text indexes for Search

Text is extracted from (W)ARC files

HTML boilerplate is stripped out

Also contains metadata for each record

– URL, Timestamp, Content Digest, Record Length

– MIME type, HTTP status code

– Title, description and meta keywords

– Links with anchor text
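
A rough sketch of how the title, description, keywords and links with anchor text might be pulled out of one HTML page, using only Python's standard html.parser. It does not attempt boilerplate removal, and PageSummaryParser is a hypothetical name rather than the tool used to produce the Parsed Text files.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collect title, selected meta tags, and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False
        self._href = None
        self._anchor_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._anchor_parts = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._anchor_parts).split())
            self.links.append((self._href, text))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_parts.append(data)


html = """<html><head><title>Example Page</title>
<meta name="description" content="A sample page">
<meta name="keywords" content="web, archive">
</head><body><p>Read the <a href="/about">about   page</a>.</p></body></html>"""

parser = PageSummaryParser()
parser.feed(html)
print(parser.title, parser.meta, parser.links)
```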

Derived Data: WAT

Extensible Metadata format

Essential metadata for many types of analyses

Avoids barriers to data exchange: copyright, privacy

Less data than WARC, more than CDX

WAT records are WARC metadata records

Contains for every HTML page in the WARC,

– Title, description and meta keywords

– Embeds and outgoing links with alt / anchor text
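
The slides do not show the internal layout of a WAT record. The sketch below assumes the payload is a JSON document with the nesting commonly seen in WAT files (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata); treat those key names and the sample record as assumptions to verify against real data.

```python
import json


def html_metadata(wat_json: str) -> dict:
    """Return the HTML metadata block of a WAT payload, or {} if absent."""
    envelope = json.loads(wat_json).get("Envelope", {})
    response = envelope.get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {})
    return response.get("HTML-Metadata", {})


def page_title(wat_json: str) -> str:
    return html_metadata(wat_json).get("Head", {}).get("Title", "")


def outgoing_links(wat_json: str):
    """Return (url, anchor or alt text) pairs for embeds and outgoing links."""
    links = html_metadata(wat_json).get("Links", [])
    return [(l["url"], l.get("text") or l.get("alt", "")) for l in links if "url" in l]


# Made-up payload following the assumed layout.
sample = json.dumps({
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Example Page"},
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About"},
                        {"path": "IMG@/src", "url": "/logo.png", "alt": "Logo"},
                    ],
                }
            }
        }
    }
})

title = page_title(sample)
links = outgoing_links(sample)
print(title, links)
```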

Text Analysis (Pig/Mahout)

Text extracted from WARC / Parsed Text / WAT files

Use curated collections to train classifiers

Cluster documents in collections

Topic Modeling

– Discover topics

– Study how topics evolve over time

Compare how a page describes itself (meta text) vs. how other pages linking to it describe it (anchor text)
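
One simple way to compare meta text with anchor text is term overlap. The sketch below uses Jaccard similarity over lowercase word sets; this measure is chosen here for illustration and is not stated in the slides, and the sample texts are made up.

```python
import re


def terms(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Share of distinct terms that the two texts have in common."""
    a, b = terms(text_a), terms(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# How the page describes itself.
meta_text = "Digital library of archived web pages"
# How pages linking to it describe it.
anchor_texts = ["web archive", "archived web pages"]

score = jaccard(meta_text, " ".join(anchor_texts))
print(round(score, 3))
```

A low score suggests that the page's self-description and the way others refer to it diverge, which may be worth a closer look.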

Link Analysis (Pig/Giraph)

Links extracted from crawl logs / WARC metadata records / Parsed Text / WAT files

Indegree and Outdegree information

Inter-host and Intra-host link information

Study how linking behavior changes over time

Rank resources by PageRank

– Identify important resources

– Prioritize crawling of missing resources

Find possible spam pages by running biased PageRank algorithms
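
For readers unfamiliar with PageRank, here is a small power-iteration sketch over an in-memory adjacency list. It uses a uniform teleport vector (so it is the plain, not the biased, variant), spreads the rank of pages without outlinks evenly, and runs for a fixed number of iterations. The damping factor 0.85 and the toy graph are assumptions; the slides run this kind of job with Pig and Giraph at scale.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to a list of the nodes it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outlinks is shared evenly among all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: three pages link to "c", and "c" links back to "a".
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for node, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(value, 4))
```

A biased variant would replace the uniform (1 - damping) / n term with a distribution concentrated on a chosen seed set.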

Completeness

PageRank over Time

PageRank over Time

PageRank over Time

Archive Analysis Workshop

Generate derivatives: CDX, WAT, Parsed Text

Set up CDX Warehouse using Hive

Extract links from WARCs / WAT / Parsed Text

Extract text from WARCs / WAT / Parsed Text

Generate Archival web graphs

– Assign integer / fingerprint ID to URLs

– Represent graph as an adjacency list using these IDs and the timestamp info

Generate host and domain graphs
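
The graph-building steps above can be sketched on a few link records as follows. This version assigns sequential integer IDs (a fingerprint hash would be the alternative the slide mentions), keys the adjacency list by source ID and capture timestamp, and derives a host graph by counting links between distinct hosts. The input tuples and function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Made-up link records: (capture timestamp, source URL, target URL).
links = [
    ("20130101000000", "http://a.example/", "http://a.example/about"),
    ("20130101000000", "http://a.example/", "http://b.example/"),
    ("20140101000000", "http://a.example/", "http://b.example/news"),
]


def build_graph(links):
    """Return (url -> integer ID, (source ID, timestamp) -> sorted target IDs)."""
    ids = {}
    adjacency = defaultdict(set)
    for timestamp, source, target in links:
        src_id = ids.setdefault(source, len(ids))
        dst_id = ids.setdefault(target, len(ids))
        adjacency[(src_id, timestamp)].add(dst_id)
    return ids, {key: sorted(targets) for key, targets in adjacency.items()}


def host_graph(links):
    """Count links between distinct hosts, ignoring intra-host links."""
    edges = defaultdict(int)
    for _, source, target in links:
        src_host = urlsplit(source).hostname
        dst_host = urlsplit(target).hostname
        if src_host and dst_host and src_host != dst_host:
            edges[(src_host, dst_host)] += 1
    return dict(edges)


ids, adjacency = build_graph(links)
hosts = host_graph(links)
print(ids)
print(adjacency)
print(hosts)
```

Keeping the timestamp in the adjacency key is what makes it possible to compare the link structure of one capture with that of a later one.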

Archive Analysis Workshop

Text Analysis using Pig / Mahout

– Extract top terms using TF-IDF

– Prepare text for analysis with Mahout

Link Analysis with Pig / Giraph

– Degree Analysis

– PageRank

– Find common links between entities

Data extraction

– Repackage subset of data into new (W)ARCs
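
As a small illustration of the "top terms using TF-IDF" item above, the sketch below scores terms in a toy collection with term frequency times log inverse document frequency and returns the highest-scoring terms per document. The workshop does this with Pig and Mahout; the tokenizer, the exact weighting variant and the sample documents here are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms(documents, k=3):
    """Return the k highest TF-IDF terms for each document ID."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized)
    result = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[doc_id] = [term for term, _ in ranked[:k]]
    return result


# Made-up documents.
documents = {
    "d1": "web archive crawl crawl crawl",
    "d2": "web archive search index",
    "d3": "web graph pagerank",
}
top = top_terms(documents, k=2)
print(top)
```

Terms that appear in every document, such as "web" here, score zero and drop to the bottom of each ranking.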

Archive Analysis Workshop

https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop