Archive-It Training, University of Maryland, July 12, 2007
![Page 1: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/1.jpg)
1
Archive-It Training
University of Maryland
July 12, 2007
![Page 2: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/2.jpg)
2
Archive-It Mission
Help memory institutions preserve the Web
• Provide web based archiving and storage capabilities
• No technical infrastructure required
• User-friendly application
![Page 3: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/3.jpg)
3
Archive-It Application Open Source Components
• Heritrix: web crawler
• Arc File: archival record format (ISO work item)
• Wayback Machine: access tool for viewing archived websites (Arc files)
• Nutchwax: bundling of Nutch (an open source search engine) used to make archived sites full text searchable
• All developed by Internet Archive: http://archive-access.sourceforge.net/
![Page 4: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/4.jpg)
4
Web Archiving Definitions
• Host: a single machine or set of networked machines, designated by its Internet hostname (ex. archive.org)
• Scope: rules for where a crawler can go
• Sub-domains: divisions of a larger site named to the left of the host name (ex. crawler.archive.org)
![Page 5: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/5.jpg)
5
Web Archiving Definitions
• Seed: starting point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are in scope.
• Document: any file with a distinct URL (image, pdf, html, etc).
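The host and sub-domain definitions above can be sketched in a few lines of Python. This is only an illustration of the terminology, not part of the Archive-It application; the helper names `host_of` and `is_subdomain` are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, adding a scheme if it is missing."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only recognizes the host after a scheme
    return (urlsplit(url).hostname or "").lower()


def is_subdomain(candidate: str, host: str) -> bool:
    """True if candidate is named to the left of host (ex. crawler.archive.org)."""
    candidate, host = candidate.lower(), host.lower()
    return candidate != host and candidate.endswith("." + host)


print(host_of("www.archive.org/about/"))
print(is_subdomain("crawler.archive.org", "archive.org"))
```

The leading dot in the suffix check keeps an unrelated host such as notarchive.org from being mistaken for a sub-domain of archive.org.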
![Page 6: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/6.jpg)
6
General Crawling Limitations
Some web content cannot be archived:
• Javascript: can be difficult to capture and even more difficult to display
• Streaming Media
• Password protected sites
• Form driven content: if you have to interact with the site to get content, it cannot be captured.
• Robots.txt: The crawler respects all robots.txt files (go to yourseed.com/robots.txt to see if our crawler is blocked)
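Checking a robots.txt file by hand can also be done with Python's standard `urllib.robotparser` module. The sketch below is an illustration only: the robots.txt content and the user-agent names `example_bot` and `other_bot` are hypothetical, and the real crawler's user-agent string is not given here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might appear at yourseed.com/robots.txt
ROBOTS_TXT = """\
User-agent: example_bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return parser.can_fetch(user_agent, url)


# example_bot is blocked from the whole site; other agents only from /private/
print(is_allowed(ROBOTS_TXT, "example_bot", "http://yourseed.com/index.html"))
print(is_allowed(ROBOTS_TXT, "other_bot", "http://yourseed.com/index.html"))
```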
![Page 7: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/7.jpg)
7
Archive-It Crawling Scope
• Heritrix will follow links within your seed site to capture pages
• Links are in scope if the seed is included in the root of their URL
• All embedded content on seed pages is captured
• Sub-domains are NOT automatically crawled
• Can specify path (i.e. limit crawler to single directory* of host) - ex: www.archive.org/about/
*Always end seed directories with a ‘/’
![Page 8: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/8.jpg)
8
Seed and Scope Examples
Example seed www.archive.org
• Link: www.archive.org/about.html is in scope
• Link: www.yahoo.com is NOT in scope
• Embedded PDF: www.rlg.org/studies/metadata.pdf is in scope
• Embedded image: www.rlg.org/logo.jpg is in scope
• Link: crawler.archive.org is NOT in scope
Example seed www.archive.org/about/
• Link: www.archive.org/webarchive.html is NOT in scope
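The seed and scope examples above can be expressed as a small check. This is a simplified sketch of the default rule as described on these slides, not Heritrix's actual scoping code; the function name `in_scope` and the `embedded` flag are assumptions made for illustration, and the sketch only handles seeds that are a host or a directory ending in '/'.

```python
from urllib.parse import urlsplit


def _host_and_path(url: str):
    """Split a URL into (lowercase host, path), adding a scheme if it is missing."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path or "/"


def in_scope(seed: str, url: str, embedded: bool = False) -> bool:
    """Default scope: same host as the seed and a path under the seed's path."""
    if embedded:
        return True  # embedded content on captured pages is captured from any host
    seed_host, seed_path = _host_and_path(seed)
    host, path = _host_and_path(url)
    return host == seed_host and path.startswith(seed_path)


print(in_scope("www.archive.org", "www.archive.org/about.html"))
print(in_scope("www.archive.org", "crawler.archive.org"))
print(in_scope("www.archive.org/about/", "www.archive.org/webarchive.html"))
```

Because the comparison is on the exact host, crawler.archive.org falls out of scope for the seed www.archive.org, which matches the rule that sub-domains are not crawled automatically.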
![Page 9: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/9.jpg)
9
Changing Crawl Scope
• Expand crawl scope to automatically include sub-domains using Scope Rules on the ‘edit’ seed page
• Use ‘crawl settings’ to constrain your crawl: limit the overall number of documents archived, or block or limit specific hosts by document count or regular expression.
![Page 10: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/10.jpg)
10
Access
• Archived pages are accessible in the Wayback Machine 1 hour after crawl is complete (sooner for larger crawls)
• Text searchable 7 days after crawl is complete
• Public can see your archives through text search on www.archive-it.org, Archive-It template web pages (hosted on archive-it.org), or partner-made portals.
![Page 11: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/11.jpg)
11
![Page 12: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/12.jpg)
12
Creating Collections
![Page 13: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/13.jpg)
13
Creating Collections
Your collection needs:
• A name chosen by your institution
• A unique collection identifier: this is an abbreviated version of your collection name
• Seeds: these are the starting point URLs where the crawler will begin its captures
• Crawl frequency: how often your collection will be crawled (you can change this at the seed level once the collection is created)
• Metadata: adding metadata is optional for your collection, except for the collection description, which will appear on the public Archive-It site
![Page 14: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/14.jpg)
14
Crawl Frequency Options
• Daily crawls last 24 hours; all other crawls last 72 hours.
• Seed URLs within the same collection can be set to different frequencies.
• The Test frequency allows you to crawl seeds without gathering any data, so the crawl will not count against your total budget. In a test crawl all regular reports are generated. Test crawls only run for 72 hours and will crawl up to 1 million documents.
• Test crawls must be started manually (from Crawls menu).
![Page 15: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/15.jpg)
15
Managing Collections
![Page 16: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/16.jpg)
16
Editing Seeds
![Page 17: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/17.jpg)
17
• Enabled: Scheduled for crawling (limited to 3)
• Disabled: publicly accessible, not scheduled for crawling (unlimited)
• Dormant: publicly accessible, not scheduled for crawling (unlimited)
![Page 18: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/18.jpg)
18
Crawl Settings
• Advanced crawl controls: crawl and host constraints
• All controls found under crawl settings link
![Page 19: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/19.jpg)
19
Crawl Constraints
• Limit the number of documents captured per crawl instance (by frequency)
• Captured URL totals could be up to 30 documents over the limit, due to URLs already in the crawler queue at the time the limit is reached
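One way to picture why a document limit can be overshot is a toy model in which the limit is only checked between batches of URLs that have already been handed out for capture. The sketch below is an assumption made for illustration, not a description of how Heritrix or Archive-It enforces limits; the function name `crawl_with_limit` and the `in_flight` batch size are hypothetical.

```python
def crawl_with_limit(queued_urls, document_limit, in_flight=30):
    """Toy model: capture URLs in batches, checking the limit between batches."""
    captured = 0
    position = 0
    while position < len(queued_urls) and captured < document_limit:
        batch = queued_urls[position:position + in_flight]
        captured += len(batch)  # URLs already handed out are still captured
        position += len(batch)
    return captured


urls = [f"http://www.archive.org/page{i}.html" for i in range(1000)]
total = crawl_with_limit(urls, document_limit=100)
print(total)  # may exceed 100, but by fewer than in_flight documents
```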
![Page 20: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/20.jpg)
20
![Page 21: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/21.jpg)
21
Host Constraints
• Block or limit specified hosts from being crawled
• Blocks/limits apply to all named sub-domains of a host
• Using Regular Expressions here is OPTIONAL
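The host constraint rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `HostConstraint`, its fields, and the function `may_capture` are hypothetical names, and treating the optional regular expression as a filter on which URLs of the host the constraint covers is an assumption, not the application's documented behavior.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostConstraint:
    host: str                          # ex. "archive.org"
    document_limit: int = 0            # 0 blocks the host entirely
    url_pattern: Optional[str] = None  # optional regular expression

    def covers(self, host: str, url: str) -> bool:
        """True if the constraint applies to this URL on this host."""
        host = host.lower()
        # Blocks/limits apply to the host itself and to its named sub-domains
        on_host = host == self.host or host.endswith("." + self.host)
        if not on_host:
            return False
        return self.url_pattern is None or re.search(self.url_pattern, url) is not None


def may_capture(constraint: HostConstraint, host: str, url: str,
                captured_from_host: int) -> bool:
    """Allow the capture unless the constraint covers the URL and its limit is used up."""
    if not constraint.covers(host, url):
        return True
    return captured_from_host < constraint.document_limit


block = HostConstraint("archive.org")
print(may_capture(block, "crawler.archive.org", "http://crawler.archive.org/", 0))
```

Leaving `url_pattern` unset mirrors the point that regular expressions are optional: the constraint then applies to every URL on the host.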
![Page 22: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/22.jpg)
22
![Page 23: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/23.jpg)
23
Monitoring Crawls
![Page 24: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/24.jpg)
24
Monitoring Crawls
![Page 25: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/25.jpg)
25
Manually Starting a Crawl
• Select the crawl frequency you want to start
• Using this feature will change your future crawl schedule
• Should always be used to start test crawls
• Crawl should begin within 5 minutes of being started manually
![Page 26: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/26.jpg)
26
Reports
![Page 27: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/27.jpg)
27
Reports are available by crawl instance
![Page 28: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/28.jpg)
28
Archive-It provides 4 downloadable, post-crawl reports
• Top 20 Hosts: lists the top 20 hosts archived
• Seed Status: reports whether the seed was crawled, whether the seed redirected to a different URL, and whether a robots.txt file blocked the crawler
• Seed Source: shows how many documents and which hosts were archived per seed
• MIME type: lists all the different types of files archived
![Page 29: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/29.jpg)
29
Reports can be opened in Excel
Above is a portion of the seed source report
![Page 30: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/30.jpg)
30
Offsite Hosts in Reports
• Embedded content on a website can have a different originating host than the main site address
– www.archive.org can contain content from www.rlg.org in the form of a logo or any other embedded element on a www.archive.org page
– When seed www.archive.org is crawled, rlg.org will show up in the host reports even though it was not a seed
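The effect described above can be reproduced by tallying the hosts of a list of captured URLs. This is only a sketch of the idea behind a host report, not the format of the actual Archive-It reports; the URL list and the function names `host_counts` and `top_hosts` are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical captures from a crawl of the single seed www.archive.org
captured = [
    "http://www.archive.org/index.html",
    "http://www.archive.org/about.html",
    "http://www.rlg.org/logo.jpg",  # embedded image hosted offsite
]


def host_counts(urls):
    """Count captured documents per host."""
    return Counter(urlsplit(u).hostname for u in urls)


def top_hosts(urls, n=20):
    """Return the n hosts with the most captured documents."""
    return host_counts(urls).most_common(n)


for host, count in top_hosts(captured):
    print(host, count)
```

Here www.rlg.org appears in the tally even though it was never a seed, because one embedded element was served from that host.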
![Page 31: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/31.jpg)
31
Search
![Page 32: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/32.jpg)
32
Search results include hits from any seed metadata entered.
![Page 33: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/33.jpg)
33
![Page 34: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/34.jpg)
34
Wayback Machine
• Displays page as it was on the date of capture
• The date of capture is displayed in the archival URL, and breaks down as yyyymmddhhmmss
http://wayback.archive-it.org/270/20060801211637/http://sfpl.lib.ca.us/
was captured on August 1, 2006 at 21:16:37 GMT
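The yyyymmddhhmmss breakdown can be checked with Python's standard `datetime` module. This is a small sketch, not an official client: the function name `parse_archival_url` is hypothetical, and reading the number before the timestamp (270 in the example) as a collection identifier is an assumption.

```python
import re
from datetime import datetime, timezone

# Pattern for URLs shaped like the example above: /<number>/<14 digits>/<original URL>
ARCHIVAL_URL = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")


def parse_archival_url(url: str):
    """Split an archival URL into (assumed collection id, capture time, original URL)."""
    match = ARCHIVAL_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognized archival URL: {url}")
    collection_id, stamp, original = match.groups()
    # The timestamp is yyyymmddhhmmss in GMT
    captured_at = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return collection_id, captured_at, original


collection_id, captured_at, original = parse_archival_url(
    "http://wayback.archive-it.org/270/20060801211637/http://sfpl.lib.ca.us/"
)
print(captured_at.isoformat(), original)
```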
![Page 35: 1 Archive-It Training University of Maryland July 12, 2007](https://reader034.vdocuments.mx/reader034/viewer/2022051619/56649db65503460f94aa88cd/html5/thumbnails/35.jpg)
35
Archive-It Help
• Online help wiki (link within application)
• Partner Specialist for support (including technical)
• Listserv: [email protected]
• Report all technical bugs, issues and questions to [email protected]