Harvesting and showing complicated sites using archive-it – status for some of our tests from October 2014 – January 2015
January 2015
By Tue Hejlskov Larsen, netarchive.dk
Archive-it (AIT) setup, January 2015:
Heritrix 3.3.0 snapshot
Umbra: all seed URLs in AIT are crawled using Umbra and Heritrix.
<<When a crawl is run using Umbra, designated seeds are sent by Heritrix to a separate process that mimics the way a browser would access the seed URLs. This allows client-side script to be executed so that previously unavailable URLs can be detected for Heritrix to crawl. Umbra also gives Heritrix a flexible way to imitate human interactions with Web sites that were previously not possible, such as executing JavaScript through clicking or hovering the mouse over different Web page elements and scrolling down a page>>
Harvesting using “Only one page” from October 2014 to January 2015, following the help instructions here:
https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=3113092
(Apologies if I have missed some of the instructions; AIT updates them from time to time.)
Browser used for Wayback in proxy mode: Internet Explorer 9
http://www.b.dk/nationalt/danske-universiteter-dumper-internationalt
AIT can harvest JavaScript includes together with the articles.
https://www.youtube.com/watch?v=iHNBl2aSJ9g
AIT video player
No comments
Missing some images
https://www.youtube.com/watch?v=iHNBl2aSJ9g
With video playback in place, but only with Firefox in proxy mode
http://twitter.com/Spolitik/
With tweets, images, video links
https://twitter.com/Spolitik/
No mouse-down paging
https://twitter.com/Spolitik/
Tiny URLs OK, e.g. http://t.co/dJ0BmbSV9E
https://twitter.com/Spolitik/
Using AIT free-text search, I found posts/comments older than those shown on the page; there are some locale problems…
https://twitter.com/Spolitik/
With linked videos, not in place
https://www.facebook.com/socialdemokraterne
Images, posts and some comments
Posts to page in full view
History (mouse down)
No “view comments”
No view of previous comments
Using free-text search, I found comments which could not be shown on the page
https://www.facebook.com/socialdemokraterne
https://wayback.archive-it.org/4897/20141027123826/https://www.facebook.com/socialdemokraterne/posts/10152451814408030#
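The capture URL above appears to be made up of a collection number, a 14-digit capture timestamp and the original URL. As a minimal sketch, assuming that layout holds for other AIT Wayback links (it is inferred from this one example, not from AIT documentation), such a link could be split into its parts like this. The names `Capture` and `parse_capture_url` are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Capture(NamedTuple):
    """Parts of an AIT Wayback link (hypothetical helper type)."""
    collection: str       # e.g. "4897"
    timestamp: datetime   # capture time as encoded in the URL
    original_url: str     # the harvested page


# Assumed layout: https://wayback.archive-it.org/<collection>/<YYYYMMDDhhmmss>/<original URL>
_CAPTURE_RE = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")


def parse_capture_url(url: str) -> Capture:
    """Split a Wayback capture link into collection, timestamp and original URL."""
    match = _CAPTURE_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised capture URL: {url}")
    collection, stamp, original = match.groups()
    return Capture(collection, datetime.strptime(stamp, "%Y%m%d%H%M%S"), original)


capture = parse_capture_url(
    "https://wayback.archive-it.org/4897/20141027123826/"
    "https://www.facebook.com/socialdemokraterne/posts/10152451814408030#"
)
print(capture.collection, capture.timestamp.isoformat(), capture.original_url)
```

Splitting links this way makes it easier to compare the capture time with the dates of the posts and comments shown on the archived page.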
http://instagram.com/socialdemokraterne
Images
2 times mouse-down paging
No provenance top bar
No full image
No “show more” button
http://www.tumblr.com/search/socialdemokratiet/
Posts and images
With big images
No notes
http://vimeo.com/77382505
With video - not in place
http://vimeo.com/77382505
https://www.google.com/culturalinstitute/collection/statens-museum-for-kunst?projectId=art-project
Images not in place
No zoom
No streetview
Comparison of display capabilities between Archive-it Wayback and NAS Wayback in proxy mode (AIT/NAS)
Summary of the tests – performed from October 2014 to January 2015
* AIT free-text search and AIT GUI views were also used to test what can be shown out of the box with AIT tools.
Harvesting frequency currently available: AIT twice daily, NAS every 15 minutes.
Values are given as AIT/NAS; ? = not yet tested.
Sites: b.dk and dr.dk, YouTube, Twitter, Facebook, Instagram, Tumblr, Vimeo, Google art projects
Articles/posts/tweets: Yes/No, Yes/No, Yes/No, Yes/No, Yes/No, Yes/No
Comments/notes/retweets: No/No, No/No, Yes some*/No, No/No, No/No, ?/No
Likes: Yes/No, Yes/No, Yes/No
Timeline/history: No/No, Yes some*/No, Yes/No, Yes some*/No
Images in (location) / out (site in AIT GUI/NAS GUI): Yes/some missing, Yes/No, Yes/No, Yes/No, Yes/No, Yes out/No
Ads in (location) / out (site in AIT GUI/NAS GUI): Yes some out*/No, No/No, ?/No, ?/No, ?/No
Tiny links: Yes/No, ?/No
Video in (location) / out (site in AIT GUI/NAS GUI): ?/No, Yes in/No, Yes out/No, Yes out/No, ?/No, ?/No, Yes out/No
Image “zoom”: No/No, Yes/No, No/No
Streetview: No/No