Profiling Web Archives

Michael L. Nelson, Ahmed AlSum, Michele C. Weigle, Herbert Van de Sompel, David Rosenthal

IIPC General Assembly, Paris, France, May 21, 2014



TRANSCRIPT

Page 1: Profiling Web Archives

Profiling Web Archives

Michael L. Nelson

Ahmed AlSum, Michele C. Weigle

Herbert Van de Sompel, David Rosenthal

IIPC General Assembly
Paris, France, May 21, 2014

Page 2: Profiling Web Archives
Page 3: Profiling Web Archives
Page 4: Profiling Web Archives
Page 5: Profiling Web Archives
Page 6: Profiling Web Archives

Where's that issue with the Afghan girl?

Page 7: Profiling Web Archives


Page 8: Profiling Web Archives


Page 9: Profiling Web Archives


Page 10: Profiling Web Archives

Prior IIPC Memento Aggregator Project

• Ten IIPC archives, led by LANL

• Conceived at 2011 IIPC meeting

• Results reported at 2012 IIPC meeting
  o http://netpreserve.org/sites/default/files/resources/Sanderson.pdf

• Two highlights:

Page 11: Profiling Web Archives
Page 12: Profiling Web Archives
Page 13: Profiling Web Archives

Stop and Rethink…

• LANL's processing was informative from a "big data" perspective, but was neither scalable nor sustainable
  o "send us your CDX" == hard for both parties
  o there are lots of URIs in the world

• Will only get worse with:
  o more archives…
  o …doing more archiving

Page 14: Profiling Web Archives

Leverage Memento Aggregators

• Memento aggregators currently broadcast URI lookups to all known archives

• New approach:
  1. Build profiles based on sampling from URI lookups (optionally supplement with CDX files when available)
  2. Use archive profiles to inform Memento aggregator "query routing" decisions
  3. Share serialized profiles with other IIPC partners

http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.bnf.fr/
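The query-routing idea on this slide can be sketched in a few lines. This is only an illustration, assuming a profile is reduced to a per-TLD weight for each archive. The archive names and weights below are invented, and `route` is a hypothetical helper, not part of any aggregator.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: archive name -> {TLD: share of sampled holdings}.
# All names and numbers are invented for illustration.
PROFILES = {
    "archive-a": {"com": 0.50, "fr": 0.05},
    "archive-b": {"fr": 0.70},
    "archive-c": {"tw": 0.60, "cn": 0.08},
}


def tld_of(uri):
    """Return the last label of the URI's host name, e.g. 'fr'."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1]


def route(uri, profiles, k=3):
    """Pick up to k archives whose profile gives the URI's TLD a nonzero weight."""
    tld = tld_of(uri)
    scored = [(weights.get(tld, 0.0), name) for name, weights in profiles.items()]
    # Highest weight first; ties broken by name so the result is deterministic.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for weight, name in ranked[:k] if weight > 0]


print(route("http://www.bnf.fr/", PROFILES, k=2))
```

An empty result means no profile mentions the TLD. A real aggregator would presumably fall back to broadcasting in that case.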

Page 15: Profiling Web Archives

Profiling Studies

• TPDL 2013
  o 12 archives, March 2013, public web archives used but techniques apply generally
  o sampling only, no CDX access

• IJDL 2014 (to appear)
  o 15 archives (+4, -1), October 2013
  o slightly larger sample URI dataset
  o results similar

Page 16: Profiling Web Archives

URI Lookup = Limited Information

GET /aggr/timegate/http://www.bnf.fr/ HTTP/1.1
Host: mementoproxy.lanl.gov
Accept-Datetime: Sun, 29 May 2005 02:46:53 GMT
Accept-Language: fr; q=1.0, en; q=0.5
…

1. Original URI
2. Memento-Datetime
3. Preferred URI
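A TimeGate lookup like the request on this slide could be assembled with the Python standard library as sketched below. The helper names are hypothetical. The base URL and header values are copied from the slide, and nothing is sent over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# TimeGate prefix taken from the request shown on the slide.
TIMEGATE_BASE = "http://mementoproxy.lanl.gov/aggr/timegate/"


def timegate_headers(when, languages="fr; q=1.0, en; q=0.5"):
    """Build negotiation headers; `when` must be a timezone-aware datetime."""
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date, "Accept-Language": languages}


def timegate_request(original_uri, when):
    """Return an unsent request: the original URI is appended to the TimeGate prefix."""
    return Request(TIMEGATE_BASE + original_uri, headers=timegate_headers(when))


when = datetime(2005, 5, 29, 2, 46, 53, tzinfo=timezone.utc)
request = timegate_request("http://www.bnf.fr/", when)
print(request.full_url)
print(timegate_headers(when)["Accept-Datetime"])
```

Passing `request` to `urllib.request.urlopen` would perform the lookup. That step is left out because the endpoint may no longer be available.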

Page 17: Profiling Web Archives

Where to find Mementos for …

http://www.japantimes.co.jp/

Page 18: Profiling Web Archives

Where to find Mementos for …

http://www.japantimes.co.jp/

Page 19: Profiling Web Archives

Where to find Mementos for …

http://www.bnf.fr

Page 20: Profiling Web Archives

Where to find Mementos for …

http://www.bnf.fr

Page 21: Profiling Web Archives

Research Question

Problem
• Profile public web archives according to the following dimensions:
  o Top-level domains
  o Languages
  o Growth rate
  o Archival date

Motivation
• Determine who is archiving what
• Optimize query routing for a Memento Aggregator

Page 22: Profiling Web Archives

Web Archives in this Experiment

Archive                        Full text   URI-lookup
Internet Archive                           √
Library of Congress                        √
Icelandic Web Archive                      √
Library and Archives Canada    √           √
British Library                √           √
UK National Library            √           √
Portuguese Web Archive         √           √
Web Archive of Catalonia       √           √
Croatian Web Archive           √           √
Archive of the Czech Web       √           √
National Taiwan University     √           √
Archive It                     √           √

Page 23: Profiling Web Archives

Experiment Set Up

• Sample URIs from seven different sources

• Retrieve the TimeMap for each URI from all archives
  o A TimeMap lists all Mementos for a given URI
  o A Memento is an archived version of a resource

• Analyze who has holdings for which URIs
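To make the TimeMap step concrete, here is a minimal sketch that pulls the Mementos out of a link-format TimeMap. The sample TimeMap and the `archive.example` host are made up. The parser assumes every parameter value is quoted, which is a simplification of the link format.

```python
import re

# Invented TimeMap in link format; archive.example is a placeholder host.
SAMPLE_TIMEMAP = """\
<http://www.bnf.fr/>; rel="original",
<http://archive.example/timemap/http://www.bnf.fr/>; rel="self"; type="application/link-format",
<http://archive.example/20050529024653/http://www.bnf.fr/>; rel="first memento"; datetime="Sun, 29 May 2005 02:46:53 GMT",
<http://archive.example/20130801000000/http://www.bnf.fr/>; rel="last memento"; datetime="Thu, 01 Aug 2013 00:00:00 GMT"
"""

# One entry: <uri> followed by zero or more ;name="value" parameters.
ENTRY = re.compile(r'<([^>]+)>\s*((?:;\s*\w+="[^"]*"\s*)*)')
PARAM = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, params) pairs, one per link entry."""
    return [(uri, dict(PARAM.findall(raw))) for uri, raw in ENTRY.findall(text)]


def mementos(text):
    """Keep only entries whose rel value includes the token 'memento'."""
    return [
        (uri, params.get("datetime"))
        for uri, params in parse_timemap(text)
        if "memento" in params.get("rel", "").split()
    ]


for uri, when in mementos(SAMPLE_TIMEMAP):
    print(when, uri)
```

Counting the entries returned by `mementos` for each archive is one way to decide whether that archive has holdings for a sampled URI.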

Page 24: Profiling Web Archives

Sampling URIs - DMOZ

1. DMOZ:Random
  o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs).

2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs, whichever is greater
  o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), (100 URIs each for [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw])

3. DMOZ:Languages - 100 URIs for each language
  o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
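The DMOZ:TLD rule ("2% or 100 URIs, whichever is greater") is simple arithmetic, sketched below. Rounding up and capping at the number of available URIs are assumptions, since the slide does not specify either. The function names and the example totals are hypothetical.

```python
import random


def tld_sample_size(total_uris, rate_percent=2, floor=100):
    """Sample size for one TLD: rate_percent of the TLD or `floor`, whichever is greater."""
    by_rate = (total_uris * rate_percent + 99) // 100  # integer ceiling (assumed rounding)
    # Cannot sample more URIs than the TLD actually has (assumed cap).
    return min(total_uris, max(floor, by_rate))


def sample_tld(uris, seed=None):
    """Draw the sample for one TLD without replacement; `uris` must be a list."""
    return random.Random(seed).sample(uris, tld_sample_size(len(uris)))


print(tld_sample_size(250_000))  # large TLD: the 2% rule applies
print(tld_sample_size(3_000))    # small TLD: the 100-URI floor applies
```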

Page 25: Profiling Web Archives

Sampling URIs – Web Archives Full Text

• Query the fulltext search interface of select web archives with two sets of query terms.

4. Top 1-Gram from Bing
  o Most are English

5. Top 1000 query terms from Yahoo in 9 languages
  o Excluding general keywords such as: Obama, Facebook.

Page 26: Profiling Web Archives

Sampling URIs – Web Archives Full Text

Page 27: Profiling Web Archives

Sampling URIs – Web Archives Full Text

Page 28: Profiling Web Archives

Sampling URIs – User Requests

• Sampling from user requests for archived web resources

6. Sample from IA Wayback Machine Log files
  o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.

7. Sample from Memento Aggregator log files
  o 100 URIs randomly sampled from LANL Memento Aggregator between 2011 and 2013.

Page 29: Profiling Web Archives

Archive Coverage per Sample

[Figure: archive coverage per sample, shown for the entire sample; annotated values 100% and 35%]

Page 30: Profiling Web Archives

TLD Coverage across Archives (1)

Entire Sample

Page 31: Profiling Web Archives

TLD Coverage across Archives (2)

Entire Sample

Page 32: Profiling Web Archives

TLD Distribution per Archive

DMOZ:TLD Sample

Page 33: Profiling Web Archives

TLD Distribution per Archive

Web Archives Full Text Sample

Page 34: Profiling Web Archives

Language Coverage per Archive

DMOZ Sample

Page 35: Profiling Web Archives

Archive Growth Rate

Entire Sample

Page 36: Profiling Web Archives

Query Routing Evaluation

Page 37: Profiling Web Archives

Study Results

• Introduced sampling to profile web archives using available infrastructure, no privileged access

• Coverage:
  o Internet Archive provides broad coverage
  o National archives have good coverage for their domains
  o Surprising coverage by certain archives

• Query Routing:
  o In 84% of the cases, all existing Mementos for a TLD can be found by using IA and two additional top archives for a TLD
  o In 55% of the cases, all existing Mementos for a TLD can be found by using the top 3 archives for a TLD, excluding IA

Page 38: Profiling Web Archives

Next Steps With the IIPC

• Finding the right granularity
  o too fine: http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html
  o too coarse: .fr
  o just right?: bnf.fr, www.bnf.fr, gallica.bnf.fr, www.bnf.fr/fr/

• Generating profiles
  o what are desirable / representative sample sets: domains, languages, regions, etc. -- what's missing?
  o local CDX analysis tools (can help with cold start problem)

• Profile format
  o community input (yet another metadata format)
  o github (or other tools) for exchange & integration
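The granularity question can be explored by deriving several candidate profile keys from one URI, ordered from coarse to fine. The sketch below is hypothetical. In particular, treating the last two host labels as "the domain" is a naive assumption that fails for suffixes such as co.jp or co.uk.

```python
from urllib.parse import urlsplit


def candidate_keys(uri):
    """Return profile keys for a URI, from coarsest (TLD) to finest (full URI)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    labels = host.split(".")
    keys = [labels[-1]]                      # TLD, e.g. "fr"
    if len(labels) >= 2:
        keys.append(".".join(labels[-2:]))   # naive domain, e.g. "bnf.fr"
    if len(labels) >= 3:
        keys.append(host)                    # full host, e.g. "www.bnf.fr"
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2:
        # First path segment is a directory only if something follows it.
        keys.append(f"{host}/{segments[0]}/")
    keys.append(uri)                         # the full URI, the finest key
    return keys


for key in candidate_keys("http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html"):
    print(key)
```

A profile could then record holdings under whichever of these keys turns out to be "just right" for a given archive.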

Page 39: Profiling Web Archives

A Possible Serialization

{"Profile":{
  "Name":"Taiwan Web Archive",
  "URI":"http://webarchive.lib.ntu.edu.tw",
  "TimeGate":"http://mementoproxy.cs.odu.edu/tw/timegate/",
  "Code":"TW",
  "Age":"Tue, 15 Jul 1997 00:00:00 GMT",
  "TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04},
         {"eg":0.04},{"gov":0.04},{"my":0.04},
         {"jp":0.04},{"kr":0.02}],
  "Language":[{"zh-TW":0.5},{"zh-CN":0.25},
              {"id":0.08},{"ar":0.08}],
  "GrowthRate":[
    {"199707":[4,4]},{"200202":[1,1]},
    {"200607":[30,62]},{"200608":[20,80]},
    {"200609":[5,9]},{"200612":[77,129]},
    ... // other values truncated
    {"201308":[7,94]},{"201309":[2,94]}]
}}
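A consumer of this serialization would probably want the TLD and Language lists as plain lookup tables. The sketch below loads an excerpt of the profile from the slide. GrowthRate is left out because the slide truncates it, and `flatten` and `load_profile` are hypothetical helper names.

```python
import json

# Excerpt of the profile on the slide (GrowthRate omitted, it is truncated there).
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}, {"eg": 0.04},
          {"gov": 0.04}, {"my": 0.04}, {"jp": 0.04}, {"kr": 0.02}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}, {"id": 0.08}, {"ar": 0.08}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects, e.g. [{"tw": 0.6}, ...], into one dict."""
    flat = {}
    for pair in pairs:
        flat.update(pair)
    return flat


def load_profile(text):
    """Parse the serialization and return name, code and flattened weight tables."""
    profile = json.loads(text)["Profile"]
    return {
        "name": profile["Name"],
        "code": profile["Code"],
        "tld": flatten(profile.get("TLD", [])),
        "language": flatten(profile.get("Language", [])),
    }


profile = load_profile(PROFILE_JSON)
print(profile["code"], profile["tld"].get("tw", 0.0), profile["tld"].get("fr", 0.0))
```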

Page 40: Profiling Web Archives
Page 41: Profiling Web Archives
Page 42: Profiling Web Archives

{Light, Dim, Dark} Archives

• Work to date has assumed light archives because our focus has been on sampling archives we don't control

• Applicable to a continuum of archives:
  o download/fork and run "dark-sample.py"
  o it accesses sample URIs from IIPC github
  o issues URI lookups to local archive
  o write/update your archive profile in IIPC github with machine readable IP restrictions
  o all profiles -- light/dim/dark -- now available to Memento aggregators and other IIPC analysis tools
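The workflow above (take shared sample URIs, look each one up locally, write a profile) could look roughly like the sketch below. This is not the "dark-sample.py" script named on the slide, whose contents are not shown. `build_tld_profile` and the `has_mementos` callback are hypothetical, and normalizing weights over the URIs the archive holds is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_tld_profile(sample_uris, has_mementos):
    """Return a TLD list shaped like [{"tw": 0.75}, ...] from local URI lookups.

    `has_mementos` is a caller-supplied function that queries the local
    archive and returns True when it holds at least one Memento for the URI.
    """
    hits = Counter()
    for uri in sample_uris:
        if has_mementos(uri):
            host = urlsplit(uri).hostname or ""
            hits[host.rsplit(".", 1)[-1]] += 1
    total = sum(hits.values())
    if total == 0:
        return []
    # Weight = share of held sample URIs per TLD (assumed definition).
    return [{tld: round(count / total, 2)} for tld, count in hits.most_common()]


# Stand-in for a local archive: a set of URIs it is pretending to hold.
held = {"http://a.example.tw/", "http://b.example.tw/", "http://c.example.tw/", "http://d.example.cn/"}
sample = sorted(held) + ["http://e.example.fr/"]
print(build_tld_profile(sample, held.__contains__))
```

Because the lookups run inside the archive, only the aggregated weights would need to be shared, which fits dim and dark archives as well as light ones.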

Page 43: Profiling Web Archives

Profiles = Easy Discovery, Sharing

http://netpreserve.org/aggr/timemap/link/1/http://www.bnf.fr/