link persistence, website persistence

Post on 08-May-2015

DESCRIPTION

Presentation on the discrepancy between measurements of link persistence and website persistence and why it matters.

TRANSCRIPT

Link Persistence, Website Persistence

Nicholas Taylor
@nullhandle

May 28, 2013

“Forward” by Flickr user Hitchster under CC BY 2.0

why preserve the web?

variable (Sanderson, Phillips, and Van de Sompel, 2011)
• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years

“Digital documents last forever—or five years, whichever comes first.”

(Jeff Rothenberg, 1997)

“Out of books sprout... plants” by DeviantArt user quinn.anya under CC BY-SA 2.0

The Art and Science of LINK CHECKING

“http Blue Background” by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0

http response codes

• 404: “Not Found”
• 200: “OK”
• 301: “Moved Permanently”
• 500: “Internal Server Error”
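A link checker usually groups response codes by their first digit rather than handling each one separately. The sketch below is a minimal illustration of that idea; the function name `classify_status` and the outcome labels are hypothetical and are not taken from the presentation.

```python
from http import HTTPStatus

def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse link-checking outcome (illustrative labels)."""
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirected"
    if 400 <= code < 500:
        return "broken (client error)"
    if 500 <= code < 600:
        return "broken (server error)"
    return "unknown"

# The four codes listed on the slide, with their standard reason phrases.
for code in (404, 200, 301, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

A status code alone says whether something answered at the URL, not whether the answer still comes from the same website.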

automated link checker

“La Machine @ Yokohama” by Flickr user chidorian under CC BY-SA 2.0

possible scenarios

• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
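The scenarios above can be read as the outcomes of three yes/no checks. The sketch below encodes that reading; the function `classify_link` and its parameter names are hypothetical, and the three booleans are assumed to come from some earlier automated or manual check.

```python
def classify_link(link_works: bool, same_website: bool, website_exists: bool) -> str:
    """Return the scenario label for one link, given the results of three checks."""
    if link_works and same_website:
        return "link works; same website"
    if link_works:
        # The link resolves, but to something else; the original site may live elsewhere.
        suffix = "website still exists" if website_exists else "website no longer exists"
        return f"link works; different website ({suffix})"
    if website_exists:
        return "link doesn't work; website still exists"
    return "link doesn't work; website no longer exists"

print(classify_link(True, True, True))
print(classify_link(True, False, True))
print(classify_link(False, False, True))
print(classify_link(False, False, False))
```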

link works; same website

http://www.fair.org/ (2002)

http://www.fair.org/ (2013)

link works; different website…

http://www.fb.com/ (2002)

http://www.fb.com/ (2013)

…but website still exists

http://www.fb.org/ (2013)

link doesn’t work…

http://www.state.mo.us/ (2002)

http://www.state.mo.us/ (2013)

…but website still exists

http://www.sos.mo.gov/ (2013)

link doesn’t work; website no longer exists

assumptions

• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists

research questions

• how much are we overestimating website persistence?
– some working links point to different websites
• how much are we underestimating website persistence?
– websites may still exist even though links don’t work or do work but point to different websites

A Study Using WEB ARCHIVES

Library of Congress U.S. Election 2002 Web Archive

preparing the list of links

• exclude links corresponding to electoral candidate websites

• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs

methodology

automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects

manual
• manually check each link
• same website behind working link?
• does website still exist?
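The automated half of the methodology comes down to recording a status code and any redirect target for each hop. The sketch below is not Heritrix (a Java crawler) and is not the study's code. It only illustrates the logging step, assuming a `fetch` callable that returns a status code and an optional Location header. The function names, the lookup table, and the example.* URLs are all hypothetical, and the table stands in for real network requests so that the sketch runs offline.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

Response = tuple[int, Optional[str]]  # (status code, Location header or None)

def trace_redirects(url: str, fetch: Callable[[str], Response],
                    max_hops: int = 10) -> list[tuple[str, int]]:
    """Follow redirects starting at url and return a log of (url, status) per hop."""
    log = []
    seen = set()
    for _ in range(max_hops):
        if url in seen:  # stop on a redirect loop
            break
        seen.add(url)
        status, location = fetch(url)
        log.append((url, status))
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # Location may be a relative reference
        else:
            break
    return log

# Hypothetical recorded responses standing in for real requests.
FAKE_RESPONSES = {
    "http://old.example/": (301, "http://new.example/"),
    "http://new.example/": (302, "/home"),
    "http://new.example/home": (200, None),
}

def fake_fetch(url: str) -> Response:
    return FAKE_RESPONSES.get(url, (404, None))

for hop_url, status in trace_redirects("http://old.example/", fake_fetch):
    print(status, hop_url)
```

A log like this shows where a link ends up, but deciding whether the final page is still the same website remains the manual step.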

working link?

• working: 91%
• non-working: 9%

same website?

• working link; same site: 83%
• non-working link: 9%
• working link; different site: 8%

non-working link; website still exists?

• working: 91%
• still exists: 8%
• doesn't exist: 2%

website still exists?

• still exists: 94%
• doesn't exist: 6%

summary of results

• how much are we overestimating website persistence?
– 8% of working links point to different websites
• how much are we underestimating website persistence?
– 82% of websites associated with non-working links still exist
– 48% of websites whose links now point to different websites still exist

what does it mean?

• websites are (much more) persistent than links

• websites are surprisingly durable?

“Golden Spider Silk” by Flickr user amandabhslater under CC BY-SA 2.0

Beyond Link Checking, WEBSITE CHECKING?

“Check” by Flickr user ex.libris under CC BY-NC-ND 2.0

building a website checker

1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists

“Most web archiving problems are problems of scale.”

(Kris Carpenter Negulescu, 2012)

“chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0

…but checksums are limited
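One way to see the limitation: a checksum only answers "are these bytes identical?", so a trivially updated page and an entirely different page look equally different. The sketch below uses SHA-256 from Python's standard library; the three HTML strings are made-up samples, not real archived pages.

```python
import hashlib

def checksum(html: str) -> str:
    """Return the SHA-256 hex digest of a page's text."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# Made-up samples: an archived page, the same page with a new date, and an unrelated page.
archived = "<html><body><h1>Example Org</h1><p>Updated 2002-11-05</p></body></html>"
current = "<html><body><h1>Example Org</h1><p>Updated 2013-05-28</p></body></html>"
unrelated = "<html><body><h1>Domain for sale</h1></body></html>"

print(checksum(archived) == checksum(archived))   # identical bytes match
print(checksum(archived) == checksum(current))    # one changed date breaks the match
print(checksum(archived) == checksum(unrelated))  # a different site also fails to match
```

The last two comparisons give the same answer, which is why a checksum cannot tell "same website, edited" apart from "different website".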

“Hashing Emily” by Flickr user wlef70 under CC BY-NC-SA 3.0

visual analysis of page changes

Pehlivan, Ben-Saad, and Gançarski: “Vi-DIFF: Understanding Web Pages Changes”

lexical signature of archived page

Ware, Klein, and Nelson: “An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages”
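A lexical signature is a small set of terms that characterizes a page. The sketch below is a simplified stand-in: it ranks terms by raw frequency after dropping a few stopwords, whereas published approaches typically weight terms with TF-IDF, which needs corpus statistics omitted here. The function name, the stopword list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "by", "with"}

def lexical_signature(text: str, k: int = 5) -> list[str]:
    """Return the k most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Made-up page text standing in for an archived page.
page = ("Election results and election news: candidate profiles, "
        "election coverage, and candidate debates.")
print(lexical_signature(page, k=3))
```

A signature taken from the archived copy could then be used as a search query to look for a website that has moved to a new address.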

find archived pages w/ Memento

• http protocol enhancement

• enables discovery of archived resources in distributed web archives
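In Memento (RFC 7089), a client asks a TimeGate for the version of a resource closest to a given date by sending an `Accept-Datetime` request header. The sketch below only builds such a request and does not send it. The TimeGate address is a hypothetical placeholder, and appending the original URL to it is an assumption about how a particular TimeGate is addressed, not something the protocol requires.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint; real archives and aggregators publish their own.
TIMEGATE = "http://timegate.example/timegate/"

def timegate_request(original_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for a Memento datetime-negotiation request."""
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": http_date}

url, headers = timegate_request(
    "http://www.state.mo.us/",
    datetime(2002, 11, 5, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

A TimeGate that holds a matching capture would typically answer with a redirect to the archived copy, which is the starting point for computing a signature of the page as it used to be.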

lexical signatures of backlink pages

“The future is already here; it’s just not very evenly distributed.”

(William Gibson, 1999)

“Time Travel” by Flickr user xcalibr under CC BY-NC-ND 2.0
