An Evaluation of Caching Policies for Memento TimeMaps

Justin F. Brunelle and Michael L. Nelson
Old Dominion University
{jbrunelle, mln}@cs.odu.edu
JCDL 2013, Indianapolis, Indiana, 07/2013

Upload: justin-brunelle

Post on 28-Nov-2014

DESCRIPTION

JCDL2013 presentation by Justin F. Brunelle

TRANSCRIPT

Page 1: An Evaluation of Caching Policies for Memento TimeMaps

An Evaluation of Caching Policies for Memento TimeMaps

Justin F. Brunelle and Michael L. Nelson
Old Dominion University

{jbrunelle, mln}@cs.odu.edu

JCDL 2013
Indianapolis, Indiana

07/2013

Page 2: An Evaluation of Caching Policies for Memento TimeMaps

Discovering Archived nasa.gov Pages

• Archived pages => mementos. Mementos are identified by URI-M.
• Live pages => resources. Resources are identified by URI-R.

Page 3: An Evaluation of Caching Policies for Memento TimeMaps


Page 4: An Evaluation of Caching Policies for Memento TimeMaps

TimeMaps: Lists of mementos

<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",

<http://www.nasa.gov/>;rel="original",

<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",

<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",

<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",

<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",

<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",

<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",

<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",

<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",

<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",

<http://api.wayback.archive.org/memento/20080903053412/http://www.nasa.gov/>;rel="memento";datetime="Wed, 03 Sep 2008 05:34:12 GMT",

<http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 05:57:42 GMT",

<http://webarchive.nationalarchives.gov.uk/20080906134025/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080906143204/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 14:32:04 GMT",

<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",

<http://api.wayback.archive.org/memento/20080907160232/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 16:02:32 GMT",

<http://webarchive.nationalarchives.gov.uk/20120809003120/http://www.nasa.gov/>;rel="memento";datetime="Thu, 09 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120814175606/http://www.nasa.gov/>;rel="memento";datetime="Tue, 14 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120819212348/http://www.nasa.gov/>;rel="memento";datetime="Sun, 19 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120826185010/http://www.nasa.gov/>;rel="memento";datetime="Sun, 26 Aug 2012 00:00:00 GMT",

<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"

<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT"

<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",
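The link-format entries above can be read with a few lines of Python. This is a minimal sketch, not the aggregator's actual parser: the names `parse_timemap` and `mementos` are hypothetical, and the regular expressions assume every attribute value is double-quoted, as in the slide.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI> followed by zero or more ;name="value" attributes.
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def parse_timemap(text):
    """Return a list of dicts with keys 'uri', 'rel' and, when present, 'datetime'."""
    entries = []
    for match in ENTRY.finditer(text):
        attrs = dict(ATTR.findall(match.group(2)))
        entry = {"uri": match.group(1), "rel": attrs.get("rel", "").split()}
        if "datetime" in attrs:
            # The datetime attribute uses the HTTP date format.
            entry["datetime"] = parsedate_to_datetime(attrs["datetime"])
        entries.append(entry)
    return entries

def mementos(entries):
    """Keep only entries whose rel value includes 'memento'."""
    return [e for e in entries if "memento" in e["rel"]]

SAMPLE = '''
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT"
'''

parsed = parse_timemap(SAMPLE)
found = mementos(parsed)
```

Here `parsed` holds all four links, while `found` holds only the two mementos; the TimeGate and original links are filtered out because their rel values do not include "memento".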

Page 5: An Evaluation of Caching Policies for Memento TimeMaps

Aggregating TimeMaps

• Multiple archives
• Expensive
• Caching reduces load on archives
• Write-through cache

[Diagram: Aggregator, Sort, IA TM, AIT TM, HTTP Cache]
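The aggregation step on this slide can be sketched as follows. This is an illustration under assumptions, not the Memento aggregator's code: `CachingAggregator`, `aggregate`, and the fetcher callables are hypothetical, each fetcher stands in for one archive's TimeMap (for example IA or AIT), and the cache is a plain dictionary that is written as soon as a fresh aggregate is built.

```python
from datetime import datetime

def aggregate(timemaps):
    """Merge per-archive lists of (datetime, uri_m) pairs and sort them by datetime."""
    merged = [memento for timemap in timemaps for memento in timemap]
    return sorted(merged, key=lambda memento: memento[0])

class CachingAggregator:
    def __init__(self, fetchers):
        # Hypothetical callables: uri_r -> list of (datetime, uri_m) pairs.
        self.fetchers = fetchers
        self.cache = {}
        self.archive_queries = 0

    def timemap(self, uri_r):
        if uri_r not in self.cache:
            results = []
            for fetch in self.fetchers:
                self.archive_queries += 1  # one request per archive
                results.append(fetch(uri_r))
            # The aggregate is stored in the cache as it is served.
            self.cache[uri_r] = aggregate(results)
        return self.cache[uri_r]

# Two stand-in archives with made-up memento identifiers.
ia = lambda uri_r: [(datetime(1996, 12, 31), "ia-1"), (datetime(2008, 9, 3), "ia-2")]
ait = lambda uri_r: [(datetime(2001, 1, 1), "ait-1")]

aggregator = CachingAggregator([ia, ait])
first = aggregator.timemap("http://www.nasa.gov/")
second = aggregator.timemap("http://www.nasa.gov/")
```

The second call is answered from the cache, so the archives are queried only once each.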

Page 6: An Evaluation of Caching Policies for Memento TimeMaps

Aggregator Cache

• TimeMaps change
• Only want to cache better TimeMaps
– Bigger is better
• Ideally monotonically increasing
• Two extremes:
– Never cache (TTL=0)
– Never update in cache (TTL=92)

Page 7: An Evaluation of Caching Policies for Memento TimeMaps

Agenda


Page 8: An Evaluation of Caching Policies for Memento TimeMaps

Cache content measures

• |a| => # of archives
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",

• |m| => # of mementos
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
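Both measures can be computed from a list of URI-Ms. The sketch below assumes that an archive is identified by the host name of the URI-M, which is an assumption for illustration (two host names run by the same archive would count twice); the function name `cardinalities` is hypothetical.

```python
from urllib.parse import urlsplit

def cardinalities(uri_ms):
    """Return (|a|, |m|): distinct archive hosts and number of mementos."""
    archives = {urlsplit(uri).netloc for uri in uri_ms}
    return len(archives), len(uri_ms)

uri_ms = [
    "http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/",
    "http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/",
    "http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/",
]
a, m = cardinalities(uri_ms)
```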

Page 9: An Evaluation of Caching Policies for Memento TimeMaps

Same TimeMap

• |a| == |a'|
• |m| == |m'|

All archives have reported the same mementos.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 3

Page 10: An Evaluation of Caching Policies for Memento TimeMaps

Gained Archives, Gained Mementos

• |a| < |a'|
• |m| < |m'|

A new archive (WebCite) has just indexed and reported a memento for the first time.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 3; |m| = 4

Page 11: An Evaluation of Caching Policies for Memento TimeMaps

Same Archives, Gained Mementos

• |a| == |a'|
• |m| < |m'|

The Internet Archive has released a set of new mementos.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 4

Page 12: An Evaluation of Caching Policies for Memento TimeMaps

Lost Archives, Same Mementos

• |a| > |a'|
• |m| == |m'|

A redaction of 1 memento took place in the Internet Archive, which now does not report mementos for this resource. The UK Web Archive has released 1 new memento for this resource.

TimeMap T: |a| = 3; |m| = 3
TimeMap T': |a| = 2; |m| = 3

Page 13: An Evaluation of Caching Policies for Memento TimeMaps

Lost Archives, Gained Mementos

• |a| > |a'|
• |m| < |m'|

A redaction of 2 mementos took place in the Internet Archive, which now does not report mementos for this resource. The UK Government Web Archive has released 3 new mementos for this resource.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 4

Page 14: An Evaluation of Caching Policies for Memento TimeMaps

Lost Archives, Lost Mementos

• |a| > |a'|
• |m| > |m'|

Archive-It has removed a collection, and no longer reports those mementos. No other archives have new mementos of those resources.

TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 1

Page 15: An Evaluation of Caching Policies for Memento TimeMaps

Gained Archives, Lost Mementos

• |a| < |a'|
• |m| > |m'|

A new archive (WebCite) has just indexed and reported 1 memento for the first time. A server error at the Internet Archive caused an omission of 2 mementos.

TimeMap T: |a| = 2; |m| = 4
TimeMap T': |a| = 3; |m| = 3

Page 16: An Evaluation of Caching Policies for Memento TimeMaps

Agenda


Page 17: An Evaluation of Caching Policies for Memento TimeMaps

Experiment Design

• Eliminate caching from local Memento proxies
• Daily observations of 4,000 TimeMaps for 92 days in 2013
• TimeMaps analyzed for changes & cardinality
• Investigated caching policies
• Outages observed from Memento/archives/department

Page 18: An Evaluation of Caching Policies for Memento TimeMaps

Observations

Occurrence   Description                        Action
77.4%        Unchanged TimeMap                  Do not update cache
19.7%        Lost archives, lost mementos       Do not update cache
2.4%         Gained archives, gained mementos   Update cache
0.4%         Same archives, gained mementos     Update cache
0.1%         Gained archives, lost mementos     Do not update cache
0.01%        Lost archives, same mementos       Update cache
0.01%        Lost archives, gained mementos     Update cache
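The table above maps each observed change to a cache action, which can be written as a lookup. This is a sketch only: `compare`, `classify`, and `ACTIONS` are hypothetical names, and combinations that the table does not list (for example same archives with lost mementos) return `None` rather than a guessed action.

```python
def compare(old, new):
    """Describe how a count changed between TimeMap T and TimeMap T'."""
    if new > old:
        return "Gained"
    if new < old:
        return "Lost"
    return "Same"

# (archives, mementos) -> action, copied from the Observations table.
ACTIONS = {
    ("Same", "Same"): "Do not update cache",  # unchanged TimeMap
    ("Lost", "Lost"): "Do not update cache",
    ("Gained", "Gained"): "Update cache",
    ("Same", "Gained"): "Update cache",
    ("Gained", "Lost"): "Do not update cache",
    ("Lost", "Same"): "Update cache",
    ("Lost", "Gained"): "Update cache",
}

def classify(a, m, a_new, m_new):
    """Return the change category and the table's action (None if unlisted)."""
    key = (compare(a, a_new), compare(m, m_new))
    return key, ACTIONS.get(key)

gained = classify(2, 3, 3, 4)
lost = classify(2, 3, 1, 1)
unlisted = classify(2, 3, 2, 2)
```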

Page 19: An Evaluation of Caching Policies for Memento TimeMaps

Impact of Change in TimeMaps

• Caching transient errors
– Not returned or not archived?

Page 20: An Evaluation of Caching Policies for Memento TimeMaps

Cardinality of TimeMaps

<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",

|TM| ?

Page 21: An Evaluation of Caching Policies for Memento TimeMaps

Strict vs. Loose Matching

• Different archive, URI-M, datetime
– Strict: 2, Loose: 2
<http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/>;rel="memento";datetime="Fri, 09 May 2008 12:56:59 GMT",
<http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/>;rel="memento";datetime="Mon, 08 Sep 2008 00:00:00 GMT",

• Same archive, datetime, different URI-M
– Strict: 3, Loose: 1
<http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",

• Same archive, different URI-M, bad datetime
– Strict: 2, Loose: 2
<http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
<http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
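The slide does not spell out the matching rules, so the sketch below is one possible reading that reproduces the three counts shown. It assumes that strict matching counts distinct URI-M strings, and that loose matching counts distinct pairs of archive host and the 14-digit timestamp embedded in the URI-M. Both function names and the timestamp rule are assumptions, not the authors' definitions.

```python
import re
from urllib.parse import urlsplit

# Assumed convention: the capture time appears as /YYYYMMDDhhmmss/ in the URI-M.
TIMESTAMP = re.compile(r"/(\d{14})/")

def strict_cardinality(uri_ms):
    """Count distinct URI-M strings."""
    return len(set(uri_ms))

def loose_cardinality(uri_ms):
    """Count distinct (archive host, embedded timestamp) pairs."""
    keys = set()
    for uri in uri_ms:
        stamp = TIMESTAMP.search(uri)
        # Fall back to the whole URI-M when no timestamp is found.
        keys.add((urlsplit(uri).netloc, stamp.group(1) if stamp else uri))
    return len(keys)

same_capture = [
    "http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/",
]
bad_datetime = [
    "http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/",
    "http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/",
]
```

Under these assumptions the three aarp.org URI-Ms collapse into one loose memento, while the two Archive-It URI-Ms stay distinct because their embedded timestamps differ even though the reported datetime values are identical.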

Page 22: An Evaluation of Caching Policies for Memento TimeMaps

Strict vs. Loose: translate.google.com


Page 23: An Evaluation of Caching Policies for Memento TimeMaps

Agenda


Page 24: An Evaluation of Caching Policies for Memento TimeMaps

Testing

• TTLs [0, 92]
– 0: Thrashed cache, best freshness
– 92: First TimeMap cached, no replacement
• Policies
– Unconditional
  • Cardinality ignored
– Conditional
  • Replacements occur when cardinality is better
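The two policies can be compared with a small simulation over daily observations. This is a sketch under assumptions rather than the experiment's code: `simulate` is a hypothetical name, each day is represented only by the cardinality of the TimeMap the archives would return, the TTL is counted in days, and "better" is read as strictly larger cardinality.

```python
def simulate(daily_cardinality, ttl, conditional):
    """Return (queries to the archives, cardinality served on each day)."""
    cached = None
    age = 0
    queries = 0
    served = []
    for fresh in daily_cardinality:
        if cached is None or age >= ttl:
            queries += 1  # the cached entry expired, so ask the archives
            if cached is None or not conditional or fresh > cached:
                cached = fresh  # replace the cached TimeMap
            age = 0
        served.append(cached)
        age += 1
    return queries, served

observed = [3, 3, 1, 4, 4, 2]  # made-up daily cardinalities
unconditional = simulate(observed, ttl=2, conditional=False)
conditional = simulate(observed, ttl=2, conditional=True)
thrashed = simulate(observed, ttl=0, conditional=False)
```

With these made-up numbers both policies issue the same number of queries, but the conditional policy keeps serving the larger cached TimeMap on the days when the archives report only 1 memento.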

Page 25: An Evaluation of Caching Policies for Memento TimeMaps

Evaluation

• Minimize cost values:
– Q – Queries to the archives
– MemDays – number of missed mementos/day
• Calculated MemDays: mementos missed/day

[Figure: axes Q and MemDays, with points labeled TTL: 0 and TTL: ∞]

Page 26: An Evaluation of Caching Policies for Memento TimeMaps

MemDays

[Figure: example with labels 6, |TM|=10, MemDay=8]
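One way to compute a MemDays-style cost is to sum, over all days, the mementos that exist in the fresh TimeMap but are missing from the TimeMap served from the cache. The exact formula is not recoverable from the slides, so `mem_days` below is an assumed definition with illustrative numbers.

```python
def mem_days(fresh, served):
    """Sum of mementos missed per day; days where the cache is larger count as 0."""
    return sum(max(0, f - s) for f, s in zip(fresh, served))

# Illustrative values: a cached TimeMap of 6 mementos is served for two days
# while the archives would return 10.
example = mem_days([10, 10], [6, 6])
```

Under this reading, lowering the TTL reduces MemDays at the price of more queries Q, which is the cost pair the evaluation tries to minimize.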

Page 27: An Evaluation of Caching Policies for Memento TimeMaps

Optimal TTL

• Unconditional: optimal TTL = 9
• Conditional: optimal TTL = 15

Page 28: An Evaluation of Caching Policies for Memento TimeMaps

Agenda


Page 29: An Evaluation of Caching Policies for Memento TimeMaps

Conclusion & Future Work

• 3-month observation of 4,000 TimeMaps
• Change patterns studied
– 80.2% of TimeMaps monotonically increase
– Others decrease
• Optimal TTL = 15 days
• Cache improvements:
– Saves requests to the archives
• Worth reinvestigating
– Changed Memento landscape

Page 30: An Evaluation of Caching Policies for Memento TimeMaps

Backups


Page 31: An Evaluation of Caching Policies for Memento TimeMaps

www.nasa.gov 1996 - 2012


Page 32: An Evaluation of Caching Policies for Memento TimeMaps

Memento

Integrates the past and present web

[Figure: labels Now, Always Current; years 2008, 2006, 2001, 2008, 2010]

Page 33: An Evaluation of Caching Policies for Memento TimeMaps


Page 34: An Evaluation of Caching Policies for Memento TimeMaps

Cardinality

• Size of a TimeMap
– # Archives?
– # Date times?
• TimeMaps:
• Cardinality:
• Monotonic Increase: