Internet Content as Research Data Australian National University August 2012, Canberra Monica Omodei

Upload: roxanne-missingham

Post on 18-Nov-2014


Page 1: Slides anu talkwebarchivingaug2012

Internet Content as Research Data

Australian National University August 2012, Canberra

Monica Omodei

Page 2: Slides anu talkwebarchivingaug2012

Research Examples

• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history

Data-driven science is migrating from the natural sciences to humanities and social science

Page 3: Slides anu talkwebarchivingaug2012

Talk Structure

• Existing web archives
• Web archive use cases
• Bringing archives together
• Creating your own archive
• It’s getting harder – challenges
• Web data mining & analysis

Page 4: Slides anu talkwebarchivingaug2012

Existing web archives

• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research, University Library archives

Page 5: Slides anu talkwebarchivingaug2012

Common Collection Strategies

• Crawl Scope & Focus
  1) Thematic/Topical (elections, events, global warming…)
  2) Resource-specific (video, pdf, etc.)
  3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end of life, closure crawls, national domains)
  5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia, Twitter

Page 6: Slides anu talkwebarchivingaug2012

Internet Archive’s Web Archive

Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by fair use and fast take-down response
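The time-based URL search and API access listed above can be illustrated with a small sketch. It assumes the public Wayback Machine replay pattern `web.archive.org/web/<timestamp>/<url>` and the `archive.org/wayback/available` lookup endpoint; the helper names are hypothetical, and the code only builds the URLs without making any network request.

```python
from urllib.parse import urlencode

# Assumed public endpoints of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_url(url: str, timestamp: str) -> str:
    """Build a time-based replay URL; timestamp is YYYYMMDDhhmmss."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


def availability_query(url: str, timestamp: str) -> str:
    """Build a lookup URL asking for the capture closest to the timestamp."""
    query = urlencode({"url": url, "timestamp": timestamp})
    return f"{AVAILABILITY_API}?{query}"


if __name__ == "__main__":
    print(wayback_url("http://www.anu.edu.au/", "20120801000000"))
    print(availability_query("www.anu.edu.au", "20120801000000"))
```

Fetching either URL (for example with `urllib.request`) is left out so the sketch stays runnable offline.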

Page 7: Slides anu talkwebarchivingaug2012

Internet Archive’s Web Archive

Negatives
– Because of size can’t search by keyword
– Because of size crawling is fully automated – ergo QA is not possible

Page 8: Slides anu talkwebarchivingaug2012
Page 9: Slides anu talkwebarchivingaug2012
Page 10: Slides anu talkwebarchivingaug2012
Page 11: Slides anu talkwebarchivingaug2012

Common Crawl

• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon’s S3
• Accessible via MapReduce processing in Amazon’s EC2 compute cloud
• Wholesale extraction, transformation, and analysis of web data cheap and easy

Page 12: Slides anu talkwebarchivingaug2012

Common Crawl

Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API

Page 13: Slides anu talkwebarchivingaug2012

Pandora Archive

• Positives
  – Quality checked
  – Targeted Australian content with selection policy
  – Historical – started 1996
  – Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
  – Keyword search
  – Publicly accessible
  – You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html

Page 14: Slides anu talkwebarchivingaug2012
Page 15: Slides anu talkwebarchivingaug2012

Pandora Archive

• Negatives
  – Labour intensive, thus quite small
  – Significant content missed because permission to copy refused
• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, hence reducing labour costs for checking/fixing crawls

Page 16: Slides anu talkwebarchivingaug2012

Pandora Archive Stats

• Size – 6.32 TB
• Number of files > 140 million
• Number of ‘titles’ > 30.5K
• Number of title instances > 73.5K

Page 17: Slides anu talkwebarchivingaug2012
Page 18: Slides anu talkwebarchivingaug2012
Page 19: Slides anu talkwebarchivingaug2012
Page 20: Slides anu talkwebarchivingaug2012
Page 21: Slides anu talkwebarchivingaug2012

Which archived sites are popular?

• Measure: filtered, aggregated web access log data which counts access to “titles”
• Examined top 30 archived titles (# of accesses) for each year 2009 to 2012
• Selected some to examine and speculate as to why they might be popular
• Selected those with consistently high ranking, and ones that were very variable between years
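The measure described above (counting accesses per title, taking the top 30 for each year, then picking out titles that rank highly every year) can be sketched as follows. This is an illustration only: the input shape of `(year, title_id)` pairs and both function names are assumptions, not the NLA's actual log-processing code.

```python
from collections import Counter, defaultdict


def top_titles_by_year(accesses, n=30):
    """accesses: iterable of (year, title_id), one pair per filtered log hit."""
    counts = defaultdict(Counter)
    for year, title_id in accesses:
        counts[year][title_id] += 1
    # Keep only the n most accessed titles for each year.
    return {year: [t for t, _ in c.most_common(n)] for year, c in counts.items()}


def consistently_ranked(top_by_year):
    """Titles that appear in the top list of every year examined."""
    lists = [set(titles) for titles in top_by_year.values()]
    return set.intersection(*lists) if lists else set()


if __name__ == "__main__":
    # Hypothetical sample data, not real Pandora access figures.
    sample = [(2009, "title-a"), (2009, "title-a"), (2009, "title-b"),
              (2010, "title-a"), (2010, "title-c"), (2010, "title-c")]
    top = top_titles_by_year(sample, n=2)
    print(top)
    print(consistently_ranked(top))
```

Titles that appear in some years' lists but not others would be the "very variable" candidates.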

Page 22: Slides anu talkwebarchivingaug2012

Reasons for popularity of archived version

• Were once popular and are now decommissioned, particularly if domain name continues to exist and redirects to the archive
• May not be that popular as live sites but their live site links prominently to Pandora as an archive for their content
• Popular referencing sources cite the archive as well as the live site (if it still exists)

Page 23: Slides anu talkwebarchivingaug2012
Page 24: Slides anu talkwebarchivingaug2012
Page 25: Slides anu talkwebarchivingaug2012
Page 26: Slides anu talkwebarchivingaug2012

Improving visibility and usage of Pandora archive

• Articles about interesting content on the Australia Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
• More effort to identify archived sites that are no longer ‘live’
• Market automatic redirect services to web site owners/managers
• Allow Google to index archive content for ‘non-live’ sites (problematic)
• Install Twittervane - draws site nominations for archiving based on trending Twitter topics

Page 27: Slides anu talkwebarchivingaug2012

.au Domain Annual Snapshots

• Annual crawls since 2005 commissioned from Internet Archive
• Includes sites on servers located in Australia as well as .au domain
• Robots.txt respected except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full text search – suited to searching archives
• Separate .gov crawl publicly accessible soon
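The crawl policy in the list above (respect robots.txt, except for inline images and stylesheets) might be approximated as in the sketch below. It uses Python's standard `urllib.robotparser`; the extension list, the user-agent string, and the function name are assumptions, and deciding "inline" by file extension is a simplification of what a real crawler would do.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed stand-in for "inline images and stylesheets".
EMBED_EXTENSIONS = (".css", ".png", ".gif", ".jpg", ".jpeg")


def may_capture(robots_txt: str, url: str, user_agent: str = "example-crawler") -> bool:
    """Return True if the URL may be captured under the described policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return True
    # Exception: embedded images and stylesheets are captured even when
    # robots.txt disallows them, so archived pages render properly.
    return urlparse(url).path.lower().endswith(EMBED_EXTENSIONS)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    for candidate in ("http://example.org/private/page.html",
                      "http://example.org/private/style.css"):
        print(candidate, may_capture(rules, candidate))
```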

Page 28: Slides anu talkwebarchivingaug2012

Australian web domain crawls

Year            2005         2006         2007         2008         2009         2011
Files           185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled   811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TB)       6.69         19.04        18.47        34.55        24.29        30.71
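As a rough worked example, the figures in the table give an average stored size per file for each crawl. The sketch assumes decimal terabytes (1 TB = 10^12 bytes) and treats the rounded file counts (for instance "1 billion") as exact, so the results are indicative only.

```python
# (files, size in TB) per crawl year, copied from the table above.
CRAWLS = {
    2005: (185e6, 6.69),
    2006: (596e6, 19.04),
    2007: (516e6, 18.47),
    2008: (1e9, 34.55),
    2009: (765e6, 24.29),
    2011: (660e6, 30.71),
}


def mean_file_size_kb(files: float, size_tb: float) -> float:
    """Average bytes per file, in kilobytes, assuming 1 TB = 1e12 bytes."""
    return size_tb * 1e12 / files / 1e3


if __name__ == "__main__":
    for year, (files, size_tb) in sorted(CRAWLS.items()):
        print(year, round(mean_file_size_kb(files, size_tb), 1), "KB per file")
```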

Page 29: Slides anu talkwebarchivingaug2012

Internet Memory Foundation

• Number of European partners
• LiWA – Living Web Archives: next generation Web archiving methods and tools
• LAWA – Longitudinal Analytics of Web Archive Data: experimental testbed for large-scale data analytics
• ARCOMEM (Collect-All ARchives to COmmunity MEMories) leveraging social media for Intelligent Preservation
• SCAPE – Scalable Preservation Environments

Page 30: Slides anu talkwebarchivingaug2012
Page 31: Slides anu talkwebarchivingaug2012

Other National Archives

• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole domain archives, some are selective archives, many are both
• Some have public access, others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)

Page 32: Slides anu talkwebarchivingaug2012

Research Archives

• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas …. and many more
• WebCITE - webcitation.org (citation service archive)

Page 33: Slides anu talkwebarchivingaug2012

Example: Columbia University

• Member of the IIPC
• They use the Archive-It service
• A research library that sees web archiving as fundamental to their collecting
• They complement and coordinate with other web archives
• Their collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
• They also archive web content as part of personal and organisational archives (cf. manuscript collections)
• Archive their own web site regularly

Page 34: Slides anu talkwebarchivingaug2012
Page 35: Slides anu talkwebarchivingaug2012

Bringing Archives Together

• Common standards and APIs
• Memento project – adding time to the web
  – Aggregates CDX files (URL index) from multiple archives
  – Has a Firefox plug-in which allows time-based browsing
  – Initiative of Los Alamos Laboratories
  – See http://www.mementoweb.org/demo/
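Memento's "adding time to the web" works by datetime negotiation: a client asks a TimeGate for a resource as it existed at a given moment by sending an `Accept-Datetime` request header in HTTP date format. The sketch below only builds that header; the function name is hypothetical, and no TimeGate address is assumed or contacted.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime header a Memento client would send."""
    # The header value is an HTTP date expressed in GMT.
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2012, 8, 1, tzinfo=timezone.utc)
    print(accept_datetime_header(wanted))
```

A client would attach this header to an ordinary GET request and follow the redirect to the capture closest to the requested time.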

 

Page 36: Slides anu talkwebarchivingaug2012
Page 37: Slides anu talkwebarchivingaug2012

Common Use Cases for a web archive

• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Fall-back for link-rot
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis

Page 38: Slides anu talkwebarchivingaug2012

Create your own Archive

• Use a subscription service
• Build your own web archiving infrastructure with open source software (i.e. Heritrix and Wayback)
• Use web citation services that create archive copies as you bookmark pages

Page 39: Slides anu talkwebarchivingaug2012

Subscription Services

• archive-it.org (service operated by non-profit Internet Archive since 2006)
• archivethe.net (service operated by non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
• OCLC Harvester Service - oclc.org/webharvester/overview/default.htm

Page 40: Slides anu talkwebarchivingaug2012
Page 41: Slides anu talkwebarchivingaug2012

Install web archiving system locally

• Easy-to-deploy web archiving toolkit not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – needs IT systems engineers to set up though
• Archives can be deposited with the NLA for long-term preservation

Page 42: Slides anu talkwebarchivingaug2012

Personal Web Archiving

• WARCreate – recently released free tool which creates Wayback-consumable WARC files from any web page
• Google Chrome extension
• Enables preservation by users from their desktop
• Can target content unreachable by crawlers
• Brings WARC to personal digital archiving
• What you do with the WARC files is up to you
• Install suite provided to set up local Wayback instance and Memento TimeGate

Page 43: Slides anu talkwebarchivingaug2012

Current challenges

• Database-driven features and functions
• Complex and varying URI formats and non-standard link implementations, e.g. Twitter
• Dynamically generated ever-changing URIs
  – For serving the same resources
• Rich media – e.g. streamed media with custom apps and anti-collection measures
• Scripted incremental display and page-loading
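One common way to cope with ever-changing URIs that serve the same resource is to compare captures by content digest rather than by address. The sketch below groups URIs by the SHA-1 of their payload; it is an illustration under that assumption, with a hypothetical function name, and is not a description of how any particular crawler handles the problem.

```python
import hashlib
from collections import defaultdict


def group_by_digest(captures):
    """captures: iterable of (uri, payload_bytes); returns digest -> list of URIs."""
    groups = defaultdict(list)
    for uri, payload in captures:
        groups[hashlib.sha1(payload).hexdigest()].append(uri)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical captures: two session-specific URIs serving identical bytes.
    sample = [
        ("http://example.org/doc?session=111", b"<html>same page</html>"),
        ("http://example.org/doc?session=222", b"<html>same page</html>"),
        ("http://example.org/other", b"<html>different</html>"),
    ]
    for digest, uris in group_by_digest(sample).items():
        print(digest[:8], uris)
```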

Page 44: Slides anu talkwebarchivingaug2012

… more…

• Scripted HTML forms
• Multi-sourced embedded material
• Dynamic authentication, e.g. captchas, cross-site authentication, user-sensitive embeds
• Alternate display based on browser or device, or other parameter
• Site architecture designed to inhibit crawling and indexing – but if poorly done even ‘polite’ harvesters like Heritrix may crash their server

Page 45: Slides anu talkwebarchivingaug2012

.. but wait, there’s more …

• Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code – not a new problem but now affects 80+% of online resources
• HTML 5 web sockets – effectively codifies incremental updates without page reloads
• Mobile publishing

Page 46: Slides anu talkwebarchivingaug2012

Transactional Web Archiving

• Useful for institutional archiving
  – Best for record-keeping purposes - when challenged in court about content on web site
  – Can be used to ensure URL persistence, e.g. when site has a make-over – can intercept 404s
  – No ‘gaps’ cf. crawl approach – every change in accessed content is archived
  – However, requires code snippet to be installed on web server
  – Open source software being developed by Los Alamos Labs
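The idea of a server-side snippet that archives every change in accessed content can be sketched as WSGI middleware. This is a hypothetical illustration, not the Los Alamos software mentioned above: the class name, the in-memory store, and the use of SHA-1 to detect change are all assumptions.

```python
import hashlib
from datetime import datetime, timezone


class TransactionalArchiver:
    """WSGI middleware that keeps a copy of a response whenever it changes."""

    def __init__(self, app):
        self.app = app
        self.store = {}  # path -> list of (timestamp, digest, body)

    def __call__(self, environ, start_response):
        # Let the wrapped application produce the response as usual.
        body = b"".join(self.app(environ, start_response))
        digest = hashlib.sha1(body).hexdigest()
        versions = self.store.setdefault(environ.get("PATH_INFO", "/"), [])
        # Archive only when the served content differs from the last copy.
        if not versions or versions[-1][1] != digest:
            versions.append((datetime.now(timezone.utc), digest, body))
        return [body]


if __name__ == "__main__":
    def demo_app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello"]

    archiver = TransactionalArchiver(demo_app)
    archiver({"PATH_INFO": "/"}, lambda status, headers: None)
    print(len(archiver.store["/"]), "version(s) archived")
```

Because copies are taken at serving time, a page that changes between two crawler visits is still recorded, provided someone requested it.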

Page 47: Slides anu talkwebarchivingaug2012

Web Data Mining & Analysis – What is it? Why Do It?

• Innovation is increasingly driven by large-scale data analysis
• Need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• Need to surface the information amongst all that data…

Page 48: Slides anu talkwebarchivingaug2012

Platform & Toolkit: Overview

• Software
  – Apache Hadoop
  – Apache Pig
• Data/File format
  – WARC
  – CDX
  – WAT (new!)

Page 49: Slides anu talkwebarchivingaug2012

Apache Hadoop

• HDFS
  – Distributed storage
  – Durable, default 3x replication
  – Scalable: Yahoo! 60+PB HDFS
• MapReduce
  – Distributed computation
  – You write Java functions
  – Hadoop distributes work across cluster
  – Tolerates failures
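The MapReduce model above can be illustrated without a cluster. The sketch below is plain Python running in a single process, not Hadoop's Java API: it counts captured URLs per host with a map step that emits `(host, 1)` pairs, a grouping step standing in for Hadoop's shuffle, and a reduce step that sums each group. All function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_capture(url):
    """Map step: emit one (host, 1) pair for a captured URL."""
    yield urlparse(url).netloc.lower(), 1


def reduce_counts(host, values):
    """Reduce step: total the values emitted for one host."""
    return host, sum(values)


def run_job(urls):
    """Single-process stand-in for the work Hadoop spreads across a cluster."""
    shuffled = defaultdict(list)
    for url in urls:
        for key, value in map_capture(url):
            shuffled[key].append(value)  # group values by key ("shuffle")
    return dict(reduce_counts(k, v) for k, v in shuffled.items())


if __name__ == "__main__":
    print(run_job(["http://a.example.au/1", "http://a.example.au/2",
                   "http://b.example.au/"]))
```

On a real cluster the map and reduce functions would run on many machines at once, with the framework handling distribution and retries.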

Page 50: Slides anu talkwebarchivingaug2012

File formats and data: WARC

Page 51: Slides anu talkwebarchivingaug2012

File formats and data: CDX

• Index used to browse WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback
  – URL
  – Content Digest
  – Capture Timestamp
  – Content-Type
  – HTTP response code
  – etc.
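A CDX line can be read with a few lines of code because it is space-delimited. The sketch below assumes one commonly seen 11-column layout (URL key, timestamp, original URL, MIME type, status code, digest, redirect, meta tags, length, offset, file name); real CDX files declare their own column order in a header line, so this field list and the sample record are assumptions for illustration.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One index entry; the column order here is an assumed 11-field layout."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    redirect: str
    metatags: str
    length: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line into named fields."""
    parts = line.split()
    if len(parts) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(parts)}")
    return CdxRecord(*parts)


if __name__ == "__main__":
    # Hypothetical record, not taken from a real index.
    sample = ("au,gov,nla)/ 20120801000000 http://www.nla.gov.au/ text/html 200 "
              "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 example.warc.gz")
    record = parse_cdx_line(sample)
    print(record.original, record.timestamp, record.statuscode)
```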

Page 52: Slides anu talkwebarchivingaug2012

File formats and data: WAT

• Yet Another Metadata Format! ☺ ☹
• Not preservation format
• Data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work-in-progress: we want your feedback

Page 53: Slides anu talkwebarchivingaug2012

File formats and data: WAT

• WAT is WARC ☺
  – WAT records are WARC metadata records
  – WARC-Refers-To header identifies original WARC record
• WAT payload is JSON
  – Compact
  – Hierarchical
  – Supported by every programming environment

File formats & data:
• CDX: 53 MB
• WAT: 443 MB
• WARC: 8,651 MB
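The three sizes above make the "less than full WARC, more than CDX" point concrete. Assuming the figures describe the same set of captures (the slide does not say so explicitly), the relative sizes can be worked out as follows.

```python
# Sizes in MB as listed on the slide.
SIZES_MB = {"CDX": 53, "WAT": 443, "WARC": 8651}


def relative_size(fmt: str, baseline: str = "WARC") -> float:
    """Size of one format as a fraction of the baseline format."""
    return SIZES_MB[fmt] / SIZES_MB[baseline]


if __name__ == "__main__":
    for fmt in ("CDX", "WAT"):
        print(f"{fmt} is {relative_size(fmt):.1%} of the WARC size")
    print(f"WAT is {relative_size('WAT', 'CDX'):.1f}x the CDX size")
```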

Page 54: Slides anu talkwebarchivingaug2012

Some References

• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
• http://matkelly.com/warcreate/
• Common Crawl: http://commoncrawl.org/data/accessing-the-data/

Page 55: Slides anu talkwebarchivingaug2012

Contacts

• Webarchive @ nla.gov.au
• Secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about Archive-It service: http://www.archive-it.org/contact-us

momodei @ nla.gov.au (until 31 Aug 2012) or monica.omodei @ gmail.com