TRANSCRIPT
1
News and media websites harvesting
2
A daily crawl since December 2010
• The selective crawl contains 92 websites
  • National daily newspapers (http://www.lemonde.fr)
  • Regional daily newspapers (http://www.charentelibre.fr)
  • News agencies (http://fr.reuters.com)
  • Buzz websites (http://www.buzzactus.com)
  • News portals (http://actu.orange.fr)
3
A specific profile “News”, based on “Page + 1 click”
• The crawl is stopped after 23 hours:

<newObject name="RuntimeLimitEnforcer"
    class="org.archive.crawler.prefetch.RuntimeLimitEnforcer">
  <boolean name="enabled">true</boolean>
  <newObject name="RuntimeLimitEnforcer#decide-rules"
      class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules"/>
  </newObject>
  <long name="runtime-sec">82800</long>
  <string name="end-operation">Terminate job</string>
</newObject>
• The scope of the crawl
  • max-hops = 1 (for the other crawls, we use 20)
  • max-trans-hops = 2 (for the other crawls, we use 3)
• Delay between each request to the server
  • max-retries = 10 (for the other crawls, we use 30) and retry-delay-seconds = 60 (we use 900)
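In a Heritrix 1.x profile, these limits would typically appear as settings in the order.xml; the fragment below is only an illustration of where such values live (the exact element names and their positions vary between Heritrix versions, so treat this as a sketch, not a drop-in configuration):

```xml
<!-- Sketch: scope limits for the "News" profile (in the crawl scope object) -->
<integer name="max-link-hops">1</integer>      <!-- max-hops = 1; other crawls use 20 -->
<integer name="max-trans-hops">2</integer>     <!-- max-trans-hops = 2; other crawls use 3 -->

<!-- Sketch: retry behaviour (in the crawl controller / frontier settings) -->
<integer name="max-retries">10</integer>       <!-- other crawls use 30 -->
<long name="retry-delay-seconds">60</long>     <!-- other crawls use 900 -->
```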
4
A few key statistics…
• For the first 3 quarters:
  – 81,672,059 URLs collected
  – 511.86 GB (compressed)
• In one year, it will represent about:
  – 109,000,000 URLs collected = 18% of our annual budget
  – 700 GB (compressed) = 2.7% of our annual budget
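The one-year figures above are a simple linear extrapolation of the three-quarter totals (multiply by 4/3), which a quick calculation confirms:

```python
# Linear extrapolation of the 3-quarter crawl totals to a full year (x 4/3).
urls_3_quarters = 81_672_059      # URLs collected in the first 3 quarters
size_3_quarters_gb = 511.86       # compressed size in GB

urls_per_year = urls_3_quarters * 4 / 3
size_per_year_gb = size_3_quarters_gb * 4 / 3

print(f"{urls_per_year:,.0f} URLs/year")   # about 108.9 million, i.e. "about 109,000,000"
print(f"{size_per_year_gb:.0f} GB/year")   # about 682 GB, consistent with the 700 GB estimate
```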
5
Crawl quality
• The crawl finishes in about 8 hours
• The quality of the archives is quite good
• But the archives have their limits:
  – Some news articles are split across 2 pages on the live website (http://fr.reuters.com)
  – The architecture of the website (http://www.lemonde.fr)
  – Slow page loading in the Wayback Machine
  – Compressed code (http://www.francesoir.fr/)
6
Regional daily newspapers
Example: Ouest-France
• It is the biggest title: 47 editions
• In the past, we tested the deposit of PDF files, without success
• Online, the PDF version of the newspaper is not free
  – A password is required to access the publication after subscription
• We added the password to the Heritrix profile, but:
  – The login/password is valid for 3 months only
  – The crawler often gets disconnected
• A large part of the site is programmed in JavaScript
• Heritrix extracts many false URLs from the JavaScript
• Each false URL causes a disconnect and leads back to the login page
• But Heritrix enters the password only once per job (the page is then marked as “already seen” and is not collected again)
  – We have crawled the articles but not the full PDF versions
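The missing behaviour described above is re-authentication: after its single initial login, Heritrix does not re-enter the password when a disconnect bounces the crawler back to the login page. A minimal sketch of that re-login logic, with a hypothetical login URL and with `session_fetch`/`login` callables standing in for the real crawler plumbing:

```python
# Hypothetical sketch: retry a fetch once after re-authenticating, if the
# request ended up on the login page (i.e. the session was disconnected).
LOGIN_URL = "https://www.ouest-france.fr/login"   # hypothetical URL

def needs_relogin(final_url: str) -> bool:
    """A fetch that lands on the login page means the session has expired."""
    return final_url.startswith(LOGIN_URL)

def fetch_with_relogin(session_fetch, login, url):
    """session_fetch(url) -> (status, final_url, body); login() re-submits credentials."""
    status, final_url, body = session_fetch(url)
    if needs_relogin(final_url):
        login()   # re-enter the password (Heritrix only does this once per job)
        status, final_url, body = session_fetch(url)
    return status, body
```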
7
8
Today…
• Do you crawl paid newspapers?
  – Do you use passwords to crawl some publications?
  – Or do you rely only on IP addresses?
  – How do you store the passwords in NAS?
• What about their access?
  – Is it necessary to save the passwords in WB?
  – How do you communicate the passwords to the researchers?