
1

News and media websites harvesting

2

A daily crawl since December 2010

• The selective crawl contains 92 websites:
– National daily newspapers (http://www.lemonde.fr)
– Regional daily newspapers (http://www.charentelibre.fr)
– News agencies (http://fr.reuters.com)
– Buzz websites (http://www.buzzactus.com)
– News portals (http://actu.orange.fr)
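
For such a selective crawl, the Heritrix seed list simply enumerates the entry pages of the 92 websites. A minimal sketch of a seeds.txt excerpt, using only the example sites named above (the full list is not reproduced in these slides):

http://www.lemonde.fr/
http://www.charentelibre.fr/
http://fr.reuters.com/
http://www.buzzactus.com/
http://actu.orange.fr/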

3

A specific profile “News”, based on “Page + 1 click”

• The crawl is stopped after 23 hours:

<newObject name="RuntimeLimitEnforcer"
           class="org.archive.crawler.prefetch.RuntimeLimitEnforcer">
  <boolean name="enabled">true</boolean>
  <newObject name="RuntimeLimitEnforcer#decide-rules"
             class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules"/>
  </newObject>
  <long name="runtime-sec">82800</long>
  <string name="end-operation">Terminate job</string>
</newObject>

• The scope of the crawl (see the profile sketch below):
– max-hops = 1 (for the other crawls, we use 20)
– max-trans-hops = 2 (for the other crawls, we use 3)
• A delay between each request to the server:
– max-retries = 10 (for the other crawls, we use 30)
– retry-delay-seconds = 60 (for the other crawls, we use 900)
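
As an illustration, these values could sit in the Heritrix 1.x order.xml profile roughly as follows. This is a minimal sketch, assuming a classic broad scope and the BDB frontier; the Heritrix setting names are max-link-hops, max-trans-hops, max-retries and retry-delay-seconds, and the rest of the profile is omitted:

<!-- Scope: "Page + 1 click" = one link hop from each seed -->
<newObject name="scope" class="org.archive.crawler.scope.BroadScope">
  <integer name="max-link-hops">1</integer>
  <integer name="max-trans-hops">2</integer>
</newObject>

<!-- Frontier: a less patient retry policy than our other crawls -->
<newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <integer name="max-retries">10</integer>
  <long name="retry-delay-seconds">60</long>
</newObject>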

4

A few key statistics…

• For the first 3 quarters:
– 81,672,059 URLs collected
– 511.86 GB (compressed)
• Over a full year, this will represent about:
– 109,000,000 URLs collected = 18% of our annual budget
– 700 GB (compressed) = 2.7% of our annual budget

5

Crawl quality

• The crawl finishes in about 8 hours
• The quality of the archives is quite good
• But the archives have their limits:
– Some news articles are split across 2 pages on the live website (http://fr.reuters.com)
– The architecture of the website (http://www.lemonde.fr)
– Slow page loading in the Wayback Machine
– Compressed code (http://www.francesoir.fr/)

6

Regional daily newspapers
Example: Ouest-France

• It is the biggest title: 47 editions
• In the past, we tested the deposit of PDF files, without success
• Online, the PDF edition of the newspaper isn't free:
– A password is required to access the publication, after subscription
• We added the password to the Heritrix profile (see the sketch after this list), but:
– The login/password is valid for 3 months only
– The crawler often gets disconnected
• A large part of the site is written in JavaScript
• Heritrix extracts a lot of false URLs from the JavaScript
• Any false URL causes a disconnection and leads back to the login page
• But Heritrix enters the password only once per job (the login page is then marked as "already seen" and is not fetched again)
– As a result, we have crawled the articles but not the full PDF versions
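
In Heritrix 1.x, this kind of login is declared in the profile's credential store. A minimal sketch, assuming an HTML form login; the domain, login URI, form field names and values below are hypothetical placeholders, not Ouest-France's real ones:

<newObject name="credential-store"
           class="org.archive.crawler.datamodel.CredentialStore">
  <map name="credentials">
    <newObject name="ouest-france-login"
               class="org.archive.crawler.datamodel.credential.HtmlFormCredential">
      <string name="credential-domain">www.ouest-france.fr</string>
      <!-- hypothetical login form URI -->
      <string name="login-uri">http://www.ouest-france.fr/login</string>
      <string name="http-method">POST</string>
      <map name="form-items">
        <!-- hypothetical field names and placeholder values -->
        <string name="login">subscriber-login</string>
        <string name="password">subscriber-password</string>
      </map>
    </newObject>
  </map>
</newObject>

Because the credential is attached to the job, the form is submitted only once; any false URL that invalidates the session therefore locks the crawler out of the PDFs for the rest of the job. One possible mitigation, not covered in these slides, would be REJECT rules (e.g. org.archive.crawler.deciderules.MatchesRegExpDecideRule) on the spurious JavaScript-extracted URL patterns.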

7

8

Today…

• Do you crawl paid newspapers?
– Do you use passwords to crawl some publications?
– Or do you rely only on IP addresses?
– How do you store the passwords in NAS?
• And what about access to these archives?
– Is it necessary to save the passwords in WB?
– How do you communicate the passwords to the researchers?
