TRANSCRIPT
Memento Update
CNI Task Force Meeting, Spring 2011
Memento
http://mementoweb.org/
Herbert Van de Sompel, Robert Sanderson, Michael L. Nelson
Giant Leaps Towards Seamless Navigation of the Web of the Past
Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
Overview of Memento Framework
Memento wants to make it easy to access the Web of the Past.
[Figure: Tate Online, today; select date: March 16, 2008; Tate Online, March 16, 2008, from National Archives]
[Figure: the same two Tate Online screenshots, labeled "Dynamic" and "Static"]
Memento achieves this by introducing a uniform version access capability to integrate the present and past Web.
Content Management Systems:
• Designed to be aware of all versions of a resource
• Self-contained
• Variety of proprietary version mechanisms
• Versions interlinked using proprietary mechanisms
• Dynamism is managed
World Wide Web:
• Designed to forget about prior versions of a resource
• Distributed
• Dynamism from a management perspective is ignored
There are resource versions on the Web:
• Content management systems
• Web archives
• Transactional archives
• Search engine caches
But the Web architecture has a hard time dealing with them:
• Cannot talk about a resource as it used to exist
• Cannot access a prior version knowing the current one
• Cannot access the current version knowing a prior one
Current approaches are ad hoc and localized
Memento:
• Regards the Web as a big Content Management System
• Introduces a uniform capability to access versions on the Web
• Does not build new archives but leverages all systems that host versions: Web archives, Content Management Systems, Software Version Systems, etc.
Memento’s version access approach:
• Is distributed: versions may exist on several servers
• Uses time as a global version indicator
• Is based on the primitives of the Web: resource, resource state, representation, content negotiation, link
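To make "time as a global version indicator" plus content negotiation concrete: the Memento Internet Draft negotiates in the datetime dimension with an Accept-Datetime request header. The sketch below only builds such a request with Python's standard library and sends nothing. The TimeGate URI and the function name are hypothetical placeholders, not part of the presentation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate_uri, when):
    """Build (but do not send) a request asking for the version near `when`."""
    # Accept-Datetime carries an HTTP-style date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_uri, headers={"Accept-Datetime": http_date})


# Hypothetical TimeGate URI, for illustration only.
req = build_timegate_request(
    "http://archive.example.org/timegate/http://www.example.org/",
    datetime(2008, 3, 16, tzinfo=timezone.utc),
)
# urllib stores header names capitalized as "Accept-datetime".
print(req.get_header("Accept-datetime"))  # Sun, 16 Mar 2008 00:00:00 GMT
```

A server that understands the header would answer with, or redirect to, the version it holds that is closest to the requested datetime.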
Original Resource and Versions
Bridge from Present to Past
Bridge from Past to Present
Memento Framework
Multiple Archives
Memento Client-Server Interaction
Deployment Progress
Significant progress has been made towards seamless navigation of the Web of the Past.
Standardization
• Standardization process started via the IETF
• Interest from IETF and W3C
• Encouraged by major Web architects, including: Tim Berners-Lee, Mark Nottingham, Michael Hausenblas
https://datatracker.ietf.org/doc/draft-vandesompel-memento/
Memento Clients
• Several client tools developed by us and others
• Add-ons for Firefox (operational) and Internet Explorer (experimental)
• Applications for Android (operational) and iPhone/iPad (in development)
• Paper in next issue of Code4Lib Journal
http://www.mementoweb.org/tools/
Memento Server Support (1)
• Memento-compliant Wayback software:
• Used by Internet Archive
• Available to Web archives, worldwide
• Please have your favorite Web Archive install this new version 1.6!
http://www.mementoweb.org/tools/
Memento Server Support (2)
• Plug-in for MediaWiki (operational)
• Used on W3C’s main wiki
• Please install it for your MediaWiki!
http://www.mementoweb.org/tools/
Memento Server Validator
• Server-side client:
• Attempts to perform all Memento actions against a given URI
• Reports success/failure of the interactions and warnings for optional aspects
• Kept up to date with IETF Internet Draft
http://www.mementoweb.org/tools/
Memento Proxy Support
• Several systems that host Mementos made Memento-compliant “by proxy”:
• All major Web Archives that do not yet run Memento-compliant Wayback software
• 3,000+ MediaWiki systems, including Wikipedia
• We want all of these to become natively Memento compliant!
Memento Website
• Ongoing effort to add materials that support understanding and adoption:
• Introduction to Memento
• How to recognize Mementos, TimeGates, Original Resources?
• Guidelines for servers that host Mementos (Web Archives, CMS, snapshot archives, etc.)
http://www.mementoweb.org/guide/
Funding
• 2007-2010: US $250K grant from Library of Congress
• Approx. 50K on Memento
• 2010-2011: US $1 Million follow-up grant from Library of Congress
• For: Specification, outreach, tool development, further research
Memento and Data
Memento Time Travel is really powerful.
Time-Series Data via HTTP follow-your-nose.
Memento Framework
Original Resource: http://lanlsource.lanl.gov/pics/picoftheday.png
Time Series for Humans
Data collected through HTTP Navigation
Time Travel across versions of a Picture of the Day
Thanks Christine!
[Figure: Data, Process, and Reproducibility, each plotted as change over time]
But if we had static, discoverable snapshots of the data and the process…
Original Resource: http://dbpedia.org/resource/France
Time Series for Machines
Data collected through HTTP Navigation
Paper at http://arxiv.org/abs/1003.3661
Time Travel across versions of DBpedia
Memento and Discovery
Very few Web sites provide a “timegate” link.
Need additional mechanisms to support Discovery.
Batch discovery of Mementos: TimeMaps
A TimeMap minimally lists:
• URI and datetime of Mementos known to an archive
• URI of Original Resource
TimeMaps can be aggregated across systems that host Mementos
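A minimal sketch of reading a TimeMap, assuming a link-format serialization along the lines of the Memento Internet Draft (one `<URI>;rel="…";datetime="…"` entry per resource). The sample document, the URIs, and the function name are invented for illustration, and the regular expressions handle only this simple shape, not every legal link-format document.

```python
import re
from email.utils import parsedate_to_datetime

# Invented sample: one Original Resource and two Mementos.
SAMPLE_TIMEMAP = (
    '<http://a.example.org/>;rel="original",\n'
    '<http://archive.example.net/20100620/http://a.example.org/>'
    ';rel="memento";datetime="Sun, 20 Jun 2010 18:02:59 GMT",\n'
    '<http://archive.example.net/20080316/http://a.example.org/>'
    ';rel="memento";datetime="Sun, 16 Mar 2008 00:00:00 GMT"'
)

LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (original_uri, [(datetime, memento_uri), ...]) sorted by time."""
    original, mementos = None, []
    for uri, raw_params in LINK.findall(text):
        params = dict(PARAM.findall(raw_params))
        rels = params.get("rel", "").split()
        if "original" in rels:
            original = uri
        elif "memento" in rels and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return original, sorted(mementos)


original, mementos = parse_timemap(SAMPLE_TIMEMAP)
for when, uri in mementos:
    print(when.isoformat(), uri)
```

Because each entry carries its own datetime, aggregating TimeMaps from several archives amounts to concatenating the parsed lists and sorting them again.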
Batch discovery of Mementos: Feed of TimeMaps
• A system that hosts Mementos exposes a Feed (e.g. Atom) of TimeMaps to allow applications to remain in sync with its evolving Memento collection:
• One Atom entry per Original Resource for which the system hosts Mementos
• The entry provides a “timemap” link to a TimeMap for the Original Resource
• The datetime value of the updated field of the entry changes when an additional Memento for the Original Resource becomes available (i.e. the TimeMap changes)
• The ID of the entry is a tag URI based on the URI of the Original Resource
Will be proposed to IIPC
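One way such a feed entry might be assembled, as a sketch only: the slide does not say how the tag URI is derived from the Original Resource's URI, so the `tag:<authority>:<original URI>` pattern, the authority string, the entry title, and all URIs below are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def timemap_entry(original_uri, timemap_uri, updated,
                  authority="archive.example.net,2011"):
    """Build one Atom entry describing the TimeMap of an Original Resource."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    # Assumed tag URI pattern; the proposal itself may define another.
    ET.SubElement(entry, f"{{{ATOM}}}id").text = f"tag:{authority}:{original_uri}"
    ET.SubElement(entry, f"{{{ATOM}}}title").text = f"TimeMap for {original_uri}"
    # `updated` changes whenever a new Memento is added to the TimeMap.
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM}}}link", rel="timemap", href=timemap_uri)
    return entry


entry = timemap_entry(
    "http://a.example.org/",
    "http://archive.example.net/timemap/http://a.example.org/",
    "2011-04-04T12:00:00Z",
)
print(ET.tostring(entry, encoding="unicode"))
```

A consumer that remembers the last `updated` value it saw per entry ID only needs to re-fetch the TimeMaps whose value has changed.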
Batch discovery of Mementos: robots.txt
• robots.txt file is used by Web servers to convey crawling policies
• Add a directive to support discovery of Mementos known to the server:
• Pointer to a single Memento can suffice as the robot can crawl on from there
• Mementos allow for discovery of TimeMaps via HTTP links
• e.g. jcdl.org hosts snapshot archives of prior JCDL conferences and adds the following to its robots.txt:
Memento: jcdl.org/archive/2002/index.html
Will be promoted via Internet Draft
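A crawler could pick up the proposed directive with a few lines of parsing. This is a sketch under the assumption that `Memento:` follows the usual `field: value` shape of robots.txt lines; the directive is a proposal, and the function name and the sample file (apart from the jcdl.org line quoted above) are made up.

```python
def memento_pointers(robots_txt):
    """Return the values of all `Memento:` lines found in a robots.txt body."""
    pointers = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "memento" and value.strip():
            pointers.append(value.strip())
    return pointers


SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Memento: jcdl.org/archive/2002/index.html
"""

print(memento_pointers(SAMPLE_ROBOTS_TXT))
# ['jcdl.org/archive/2002/index.html']
```

From the single Memento found this way, a robot can follow HTTP links to the TimeMap and enumerate the remaining Mementos.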
Memento and Branding
Memento can recreate pages using resources from different archives.
This poses a branding challenge.
Current Branding Practice for Web Archives
Page and embedded resources from same Web Archive
Branding for page and embedded resources
Branding for Web Archives in Memento Mode
Will be researched
Page and embedded resources from various Web Archives
Page branding
No branding
No branding
Alternative Web Archiving Strategies
Crawl-based Archives host distinct observations.
Transactional Archives never miss an update.
Crawl-Based Web Archives
Observations
For example: Heritrix crawler for Internet Archive
Server-Side Transactional Web Archives
Change History
For example: TTApache, PageVault, Vignette Web Capture
Development of Transactional Web Archive Software
Submit:
• Java-Grizzly-Jersey submission interface application
• Berkeley DB metadata store
• File system (FS) store for body and headers
Capture:
• Apache connection filter module (mod_ta) captures URI, headers, body
• Module POSTs in real-time to transactional archive’s Submit URI
Development of Transactional Web Archive Software
Development timeline:
• Ongoing development (LANL) and testing (ODU)
• Submit/Access finalized; development focus on collection management
• Expected release as open source, 3rd Quarter 2011
Access:
• Transactional archive natively supports Memento
• Immediate availability of archived content
• Export of WARC, e.g. for long-term archiving in other environment
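To illustrate why a change history lends itself to datetime-based access, here is a small in-memory sketch. It is not the LANL software described above: the class, its methods, and the sample data are invented for illustration. It keeps every captured response for one URI and answers "what was being served at time T?" with a binary search.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class VersionHistory:
    """Hypothetical stand-in for the captured responses of a single URI."""

    def __init__(self):
        self._times = []   # capture datetimes, kept sorted
        self._bodies = []  # response bodies, parallel to _times

    def capture(self, when, body):
        """Record a response body as served at `when`."""
        i = bisect_right(self._times, when)
        self._times.insert(i, when)
        self._bodies.insert(i, body)

    def as_of(self, when):
        """Return (capture time, body) in effect at `when`, or None."""
        i = bisect_right(self._times, when)
        if i == 0:
            return None  # nothing had been captured yet
        return self._times[i - 1], self._bodies[i - 1]


history = VersionHistory()
history.capture(datetime(2011, 1, 10, tzinfo=timezone.utc), b"version 1")
history.capture(datetime(2011, 3, 2, tzinfo=timezone.utc), b"version 2")
print(history.as_of(datetime(2011, 2, 1, tzinfo=timezone.utc)))
```

Since a transactional archive records every change as it is served, the version returned is the one that was actually live at the requested time, which a crawl-based archive can only approximate from its nearest observation.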