TPDL 2015 - Profiling Web Archives

Profiling Web Archives Sawood Alam and Michael L. Nelson Computer Science Department, Old Dominion University Norfolk, Virginia - 23529 Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar Los Alamos National Laboratory, Los Alamos, NM David S. H. Rosenthal Stanford University Libraries, Stanford, CA Supported in part by the International Internet Preservation Consortium (IIPC)

Upload: sawood-alam

Post on 16-Feb-2017

TRANSCRIPT

Page 1: TPDL 2015 - Profiling Web Archives

Profiling Web Archives

Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University

Norfolk, Virginia - 23529

Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM

David S. H. Rosenthal
Stanford University Libraries, Stanford, CA

Supported in part by the International Internet Preservation Consortium (IIPC)

Page 2: TPDL 2015 - Profiling Web Archives

Memento Aggregator

Page 8: TPDL 2015 - Profiling Web Archives

Long Tail of Archives

Page 9: TPDL 2015 - Profiling Web Archives

Long Tail of Archives

● 400B+ web pages at IA do not cover everything

● Top three archives after IA produce full TimeMap 52% of the time (AlSum et al., TPDL 2013)

● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives

Page 10: TPDL 2015 - Profiling Web Archives

Archive Profile

● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things

Page 11: TPDL 2015 - Profiling Web Archives

Available Profiling Resources

● Client request
● Archive response
● Archive index (CDX files)

Page 12: TPDL 2015 - Profiling Web Archives

A Client Request

Page 13: TPDL 2015 - Profiling Web Archives

An Archive Response

Page 14: TPDL 2015 - Profiling Web Archives

A CDX Snippet

Page 15: TPDL 2015 - Profiling Web Archives

Profiling Strategies

● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
  ○ bbc.co.uk/images/logo.png?w=90
  ○ cnn.com/2014/03/15/?id=128734

● TLD-only Profiling (1 TLD = 1 Profile Key)
  ○ com)/
  ○ uk)/

● Middle Ground
  ○ uk,co)/
  ○ uk,co,bbc)/images
  ○ uk,co,bbc)/0/2/1
  ○ com,cnn)/ 201309 ar
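The three strategies differ only in how much of the URI survives in the profile key. The following is a minimal Python sketch, assuming SURT-style keys (host segments reversed and comma-joined, followed by `)/` and the path). The helper `uri_key` and its `max_host` / `max_path` limits are hypothetical names chosen for illustration. The sketch skips the canonicalization a real SURT library performs, drops the path when the host is truncated (an assumption), and is not the archive_profiler implementation.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_host=None, max_path=None):
    """Return a SURT-like profile key for a URI.

    max_host / max_path cap the number of host and path segments kept;
    None means "keep everything". Hypothetical helper for illustration.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host_segs = (parts.hostname or "").lower().split(".")[::-1]
    path_segs = [seg for seg in parts.path.split("/") if seg]

    if max_host is not None and len(host_segs) > max_host:
        # Assumption: a truncated host makes the path meaningless, so drop it.
        host_segs, path_segs = host_segs[:max_host], []
    elif max_path is not None:
        path_segs = path_segs[:max_path]

    key = ",".join(host_segs) + ")/" + "/".join(path_segs)
    # Only the complete URI-R strategy keeps the query string.
    if max_host is None and max_path is None and parts.query:
        key += "?" + parts.query
    return key


uri = "bbc.co.uk/images/logo.png?w=90"
print(uri_key(uri))                          # complete URI-R key
print(uri_key(uri, max_host=1, max_path=0))  # TLD-only key
print(uri_key(uri, max_host=3, max_path=1))  # a middle-ground key
```

With these limits the TLD-only call yields `uk)/` and the middle-ground call yields `uk,co,bbc)/images`, matching the key shapes listed on the slide.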

Page 16: TPDL 2015 - Profiling Web Archives

Frequency Measurements

Page 17: TPDL 2015 - Profiling Web Archives

CDXJ Serialization

Page 18: TPDL 2015 - Profiling Web Archives

URI-Key Generation

Page 19: TPDL 2015 - Profiling Web Archives

Profile Merging

Base profile

New profile

Merged profile
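The slide shows only the three stages, not the merge rule itself. As a sketch under stated assumptions, suppose a profile maps each URI-Key to a frequency count and that the base and new profiles describe disjoint sets of captures, so their counts can simply be added. The function name `merge_profiles` and the sample keys and counts are hypothetical.

```python
from collections import Counter


def merge_profiles(base, new):
    """Merge two {URI-Key: count} profiles by summing counts per key.

    Assumes the two profiles were built from disjoint captures; otherwise
    summing would double-count.
    """
    merged = Counter(base)
    merged.update(new)  # adds counts for shared keys, inserts new keys
    return dict(merged)


base_profile = {"uk,co)/": 10, "com)/": 3}   # made-up counts
new_profile = {"uk,co)/": 5, "org)/": 2}     # made-up counts
print(merge_profiles(base_profile, new_profile))
```

Because the merge is a per-key addition, a profile can be patched with a small delta built from newly crawled CDX files instead of being regenerated from scratch.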

Page 20: TPDL 2015 - Profiling Web Archives

Dataset

● Three archives
● Four sample query sets
● 23 profiles for each archive and sample set

Page 21: TPDL 2015 - Profiling Web Archives

Archives

Archive       URI-Rs   URI-Ms   Size
Archive-It    1.9B     5.3B     1.8TB
UKWA          0.7B     1.7B     0.5TB
Stanford      12M      25M      8.3GB

Page 22: TPDL 2015 - Profiling Web Archives

Sample Query Sets

Sample          In Archive-It   In UKWA   In Stanford
DMOZ            4.097%          1.912%    0.034%
MementoProxy    4.182%          0.179%    0.046%
IAWayback       3.716%          0.231%    0.039%
UKWayback       0.108%          0.034%    0.002%

Sample Size: 1M URIs Each

Page 23: TPDL 2015 - Profiling Web Archives

Evaluation

● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Precision vs. Relative Cost

Page 24: TPDL 2015 - Profiling Web Archives

CDX Size vs URI-M (UKWA 10 Years)

Alpha: 175 bytes per CDX line

Page 25: TPDL 2015 - Profiling Web Archives

URI-M vs URI-R (UKWA 10 Years)

Gamma: 2.46
K: 2.686
Beta: 0.911
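The slide gives the fitted constants but not the formula. A hedged sketch, assuming K and Beta are the parameters of a Heaps'-law-style power fit, URI-R ≈ K · URI-M^Beta, and that Gamma is the plain ratio URI-M / URI-R. Both readings are assumptions, and the function names are hypothetical.

```python
K = 2.686     # from the slide; assumed to be the power-law coefficient
BETA = 0.911  # from the slide; assumed to be the power-law exponent
GAMMA = 2.46  # from the slide; assumed to be the URI-M / URI-R ratio


def estimate_uri_r(uri_m):
    """Estimate unique URI-Rs from a URI-M count with the power-law fit."""
    return K * uri_m ** BETA


def estimate_uri_r_linear(uri_m):
    """Cruder estimate that divides by the average URI-M / URI-R ratio."""
    return uri_m / GAMMA


uri_m = 181_000_000  # URI-M count quoted for the UKWA 7-year subset
print(f"power-law estimate: {estimate_uri_r(uri_m):,.0f}")
print(f"ratio estimate:     {estimate_uri_r_linear(uri_m):,.0f}")
```

Under these assumptions the power-law estimate for 181M URI-Ms lands near 90M, the same order of magnitude as the 96M URI-Rs the Time Cost slide reports for that subset. The constant-ratio estimate comes out lower.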

Page 26: TPDL 2015 - Profiling Web Archives

Space Cost (UKWA 7 Years)

Phi: 8.5e-07 -- 0.70583

Page 27: TPDL 2015 - Profiling Web Archives

Time Cost (UKWA 7 Years)

Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours

Page 28: TPDL 2015 - Profiling Web Archives

Resource Requirement

Page 29: TPDL 2015 - Profiling Web Archives

Archive-It

Page 30: TPDL 2015 - Profiling Web Archives

UKWA

Page 31: TPDL 2015 - Profiling Web Archives

Stanford

Page 32: TPDL 2015 - Profiling Web Archives

Cost vs Precision

Group                                 Cost                  Precision
G1 (H1P0/TLD)                         Bound by # of TLDs    < 0.05
G2 (H3P0, DDom, DSub, DPth, DQry)     < 0.01                ≈ 2 * G1
G3 (DIni)                             ≈ 2 * G2              ≈ (3--4) * G1
G4 (HxP1)                             ≈ 5 * G3              ≈ (5--7) * G1
G5 (Higher HmPn)                      0.4 -- 0.7            Not Explored
G6 (URIR)                             1.0                   1.0

Page 33: TPDL 2015 - Profiling Web Archives

Future Work

● Generating sample URI sets
● Profiling via sampling
● Language profiles
● Evaluation of combination profiles, such as URI-Key along with Datetime
● Profiles for uses other than Memento routing, such as profiles based on site classification (e.g., news, wiki, social media, blog)

Page 34: TPDL 2015 - Profiling Web Archives

Conclusions

● Generated profiles with different policies for two archives

● Examined cost-precision tradeoffs of various policies

● Related CDX Size, URI-M, URI-R, and URI-Key

● Gained up to 22% routing precision with <5% relative cost without any false negatives

● Code @ GitHub:/oduwsdl/archive_profiler