An Evaluation of Caching Policies for Memento TimeMaps
JCDL 2013 presentation by Justin F. Brunelle
Justin F. Brunelle and Michael L. Nelson
Old Dominion University
{jbrunelle, mln}@cs.odu.edu
JCDL 2013, Indianapolis, Indiana
07/2013
Discovering Archived nasa.gov Pages
Archived Pages => mementos
Mementos identified by URI-M
Live Pages => resources
Resources identified by URI-R
TimeMaps: Lists of mementos
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
<http://api.wayback.archive.org/memento/20080903053412/http://www.nasa.gov/>;rel="memento";datetime="Wed, 03 Sep 2008 05:34:12 GMT",
<http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 05:57:42 GMT",
<http://webarchive.nationalarchives.gov.uk/20080906134025/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080906143204/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 14:32:04 GMT",
<http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",
<http://api.wayback.archive.org/memento/20080907160232/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 16:02:32 GMT",
<http://webarchive.nationalarchives.gov.uk/20120809003120/http://www.nasa.gov/>;rel="memento";datetime="Thu, 09 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120814175606/http://www.nasa.gov/>;rel="memento";datetime="Tue, 14 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120819212348/http://www.nasa.gov/>;rel="memento";datetime="Sun, 19 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120826185010/http://www.nasa.gov/>;rel="memento";datetime="Sun, 26 Aug 2012 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"
Aggregating TimeMaps
• Multiple archives
• Expensive
• Caching reduces load on archives
• Write-through cache
[Figure: aggregator diagram; labels: Aggregator, Sort, IA TM, AIT TM, …, HTTP Cache]
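The aggregation step on this slide can be sketched as follows. This is a minimal illustration and not the Memento aggregator's actual code: the function name `aggregate_timemaps` and the `(datetime, URI-M)` tuple representation are assumptions made for the example, and the caching layer is left out.

```python
from datetime import datetime, timezone
import heapq


def aggregate_timemaps(*timemaps):
    """Merge per-archive TimeMaps into one list sorted by memento datetime.

    Each input is assumed (hypothetically) to be a list of
    (datetime, uri_m) tuples that is already sorted by datetime.
    """
    return list(heapq.merge(*timemaps, key=lambda entry: entry[0]))


# Toy per-archive TimeMaps built from mementos in the nasa.gov example.
ia_tm = [
    (datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc),
     "http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/"),
    (datetime(2008, 9, 4, 5, 57, 42, tzinfo=timezone.utc),
     "http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/"),
]
uk_tm = [
    (datetime(2008, 9, 4, 0, 0, 0, tzinfo=timezone.utc),
     "http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/"),
]

merged = aggregate_timemaps(ia_tm, uk_tm)
```

Every aggregation requires one request per archive, which is the cost that a cache in front of the aggregator is meant to reduce.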
Aggregator Cache
• TimeMaps change
• Only want to cache better TimeMaps
– Bigger is better
• Ideally monotonically increasing
• Two extremes:
– Never cache (TTL=0)
– Never update in cache (TTL=92)
Agenda
Cache content measures
• |a| => # of archives
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
• |m| => # of mementos
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
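As a rough sketch, both measures can be read off a link-format TimeMap. The code below assumes that the host part of each URI-M identifies the archive, which is a simplification chosen for this example. The names `MEMENTO_RE` and `cardinalities` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches link-format entries whose rel value contains the word "memento",
# e.g. rel="memento", rel="first memento", rel="last memento".
# rel="timegate" and rel="original" are skipped.
MEMENTO_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*\bmemento\b[^"]*)"')


def cardinalities(timemap_text):
    """Return (|a|, |m|): distinct archive hosts and number of mementos."""
    uri_ms = [match.group(1) for match in MEMENTO_RE.finditer(timemap_text)]
    # Assumption: one host corresponds to one archive.
    archives = {urlsplit(uri_m).netloc for uri_m in uri_ms}
    return len(archives), len(uri_ms)


sample = '''
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",
<http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"
'''

a, m = cardinalities(sample)
```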
Same TimeMap
• |a| == |a'|
• |m| == |m'|
All archives have reported the same mementos.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 3
Gained Archives, Gained Mementos
• |a| < |a'|
• |m| < |m'|
A new archive (WebCite) has just indexed and reported a memento for the first time.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 3; |m| = 4
Same Archives, Gained Mementos
• |a| == |a'|
• |m| < |m'|
The Internet Archive has released a set of new mementos.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 2; |m| = 4
Lost Archives, Same Mementos
• |a| > |a'|
• |m| == |m'|
A redaction of 1 memento took place in the Internet Archive, which now does not report mementos for this resource. The UK Web Archive has released 1 new memento for this resource.
TimeMap T: |a| = 3; |m| = 3
TimeMap T': |a| = 2; |m| = 3
Lost Archives, Gained Mementos
• |a| > |a'|
• |m| < |m'|
A redaction of 2 mementos took place in the Internet Archive, which now does not report mementos for this resource. The UK Government Web Archive has released 3 new mementos for this resource.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 4
Lost Archives, Lost Mementos
• |a| > |a'|
• |m| > |m'|
Archive-It has removed a collection, and no longer reports those mementos. No other archives have new mementos of those resources.
TimeMap T: |a| = 2; |m| = 3
TimeMap T': |a| = 1; |m| = 1
Gained Archives, Lost Mementos
• |a| < |a'|
• |m| > |m'|
A new archive (WebCite) has just indexed and reported 1 memento for the first time. A server error at the Internet Archive caused an omission of 2 mementos.
TimeMap T: |a| = 2; |m| = 4
TimeMap T': |a| = 3; |m| = 3
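The seven cases above can be told apart by comparing the two cardinalities of T and T'. The sketch below is only an illustration of that comparison; the function name `classify_change` and the returned labels are assumptions made for this example. It works on counts alone, so two different TimeMaps with equal counts would also be reported as unchanged.

```python
def classify_change(a, m, a_new, m_new):
    """Label the change from TimeMap T (a, m) to TimeMap T' (a_new, m_new).

    a / a_new are archive counts, m / m_new are memento counts.
    Only counts are compared, not the mementos themselves.
    """
    def trend(old, new):
        if new > old:
            return "gained"
        if new < old:
            return "lost"
        return "same"

    if (a, m) == (a_new, m_new):
        return "same TimeMap"
    return f"{trend(a, a_new)} archives, {trend(m, m_new)} mementos"


# Example cardinalities taken from the slides above.
label = classify_change(2, 4, 3, 3)
```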
Experiment Design
• Eliminate caching from local Memento proxies
• Daily observations of 4,000 TimeMaps for 92 days in 2013
• TimeMaps analyzed for changes & cardinality
• Investigated caching policies
• Outages observed from Memento/archives/department
Observations
Occurrence  Description                        Action
77.4%       Unchanged TimeMap                  Do not update cache
19.7%       Lost archives, lost mementos       Do not update cache
2.4%        Gained archives, gained mementos   Update cache
0.4%        Same archives, gained mementos     Update cache
0.1%        Gained archives, lost mementos     Do not update cache
0.01%       Lost archives, same mementos       Update cache
0.01%       Lost archives, gained mementos     Update cache
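One rule that reproduces the Action column is: replace the cached TimeMap when it has changed and has not lost mementos. The sketch below encodes that reading. It is inferred from the seven rows of the table rather than taken from the authors' code, and the name `should_update_cache` is hypothetical. The table does not list a "gained archives, same mementos" case, so the rule's answer for it (update) is an assumption.

```python
def should_update_cache(a, m, a_new, m_new):
    """Decide whether a newly fetched TimeMap should replace the cached one.

    (a, m) are the archive and memento counts of the cached TimeMap,
    (a_new, m_new) those of the fresh one.
    """
    changed = (a, m) != (a_new, m_new)
    # Assumed rule: update only if something changed and no mementos were lost.
    return changed and m_new >= m


# "Lost archives, same mementos" is listed as "Update cache" in the table.
decision = should_update_cache(3, 3, 2, 3)
```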
Impact of Change in TimeMaps
• Caching transient errors
– Not returned or not archived?
Cardinality of TimeMaps
<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
<http://www.nasa.gov/>;rel="original",
<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
<http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
<http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
<http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
<http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
<http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
<http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
<http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
<http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
…
|TM| ?
Strict vs. Loose Matching
• Different archive, URI-M, datetime
– Strict: 2, Loose: 2
<http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/>;rel="memento";datetime="Fri, 09 May 2008 12:56:59 GMT",
<http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/>;rel="memento";datetime="Mon, 08 Sep 2008 00:00:00 GMT",
• Same archive, datetime, different URI-M
– Strict: 3, Loose: 1
<http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
<http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/>;rel="memento";datetime="Mon, 01 Nov 2010 06:02:04 GMT",
• Same archive, different URI-M, bad datetime
– Strict: 2, Loose: 2
<http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
<http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
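The slide does not spell out the matching rules, so the following is only one reading that happens to reproduce the three counts shown. It assumes strict matching counts distinct URI-Ms, while loose matching treats two mementos as the same when they share an archive host and the 14-digit timestamp embedded in the URI-M. The function names and the regular expression are hypothetical.

```python
import re
from urllib.parse import urlsplit

# 14-digit capture timestamp as it appears inside the URI-Ms on the slide.
TIMESTAMP_RE = re.compile(r"/(\d{14})/")


def strict_cardinality(uri_ms):
    """Count mementos as distinct URI-Ms."""
    return len(set(uri_ms))


def loose_cardinality(uri_ms):
    """Count mementos as distinct (archive host, URI-M timestamp) pairs."""
    keys = set()
    for uri_m in uri_ms:
        match = TIMESTAMP_RE.search(uri_m)
        # Fall back to the whole URI-M when no timestamp is found.
        stamp = match.group(1) if match else uri_m
        keys.add((urlsplit(uri_m).netloc, stamp))
    return len(keys)


aarp = [
    "http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/",
]
apple = [
    "http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/",
    "http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/",
]
flare = [
    "http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/",
    "http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/",
]
```

Under this reading the reported datetime is ignored on purpose, because the Archive-It example shows two distinct captures that report the same rounded datetime.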
Strict vs. Loose: translate.google.com
Testing
• TTLs [0, 92]
– 0: Thrashed cache, best freshness
– 92: First TimeMap cached, no replacement
• Policies
– Unconditional
• Cardinality ignored
– Conditional
• Replacements occur when cardinality is better
Evaluation
• Minimize cost values:
– Q – Queries to the archives
– MemDays – number of missed mementos/day
• Calculated MemDays: mementos missed/day
[Figure: Q versus MemDays, with TTL: 0 and TTL: ∞ marked]
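To make the two cost values concrete, here is a small simulation of a TTL cache over a series of daily TimeMap cardinalities. It is a sketch under stated assumptions, not the evaluation code behind the talk. One request per day is assumed. Q counts refetches. MemDays is approximated as the daily gap between the largest cardinality seen so far and the cardinality served from the cache. The name `simulate` and the toy series are hypothetical.

```python
def simulate(daily_cards, ttl, conditional):
    """Return (Q, MemDays) for one TimeMap observed once per day.

    daily_cards: cardinality the archives would report on each day.
    ttl:         days a cached TimeMap stays valid (0 = refetch every day).
    conditional: if True, replace the cached copy only when the fresh
                 TimeMap has a larger cardinality; if False, always replace.
    """
    cached = None      # cardinality of the cached TimeMap
    age = 0            # days since the last fetch
    best_seen = 0      # assumed reference for "missed" mementos
    queries = 0
    mem_days = 0
    for fresh in daily_cards:
        best_seen = max(best_seen, fresh)
        if cached is None or age >= ttl:
            queries += 1
            if cached is None or not conditional or fresh > cached:
                cached = fresh
            age = 0
        mem_days += max(0, best_seen - cached)
        age += 1
    return queries, mem_days


# Toy series with a transient drop on the third day (made-up numbers).
series = [3, 3, 1, 4, 4, 4]
costs = {
    (ttl, conditional): simulate(series, ttl, conditional)
    for ttl in (0, 2, len(series))
    for conditional in (False, True)
}
```

With daily observations a TTL of 0 and a TTL of 1 behave identically in this sketch, since both refetch on every request. Sweeping `ttl` over the full range and comparing the resulting (Q, MemDays) pairs is the kind of trade-off the slide describes.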
MemDays
[Figure: MemDays example; labels: 6, |TM|=10, MemDay=8]
Optimal TTL
[Figure: results for the Unconditional and Conditional policies; marked values: Optimal TTL = 9 and Optimal TTL = 15]
Conclusion & Future Work
• 3-month observation of 4,000 TimeMaps
• Change patterns studied
– 80.2% of TimeMaps monotonically increase
– Others decrease
• Optimal TTL = 15 days
• Cache Improvements:
– Saves requests to the archives
• Worth reinvestigating
– Changed Memento landscape
Backups
www.nasa.gov 1996 - 2012
Memento
Integrates the past and present web
[Figure: timeline; labels: Now, Always Current, 2008, 2006, 2001, 2008, 2010]
Cardinality
• Size of a TimeMap
– # Archives?
– # Date times?
• TimeMaps:
• Cardinality:
• Monotonic Increase: