News and media websites harvesting



A daily crawl since December 2010

• The selective crawl contains 92 websites
• National daily newspapers (http://www.lemonde.fr)
• Regional daily newspapers (http://www.charentelibre.fr)
• News agencies (http://fr.reuters.com)
• Buzz websites (http://www.buzzactus.com)
• News portals (http://actu.orange.fr)


A specific profile “News”, based on “Page + 1 click”

• The crawl is stopped after 23 hours:

  <newObject name="RuntimeLimitEnforcer"
             class="org.archive.crawler.prefetch.RuntimeLimitEnforcer">
    <boolean name="enabled">true</boolean>
    <newObject name="RuntimeLimitEnforcer#decide-rules"
               class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules"/>
    </newObject>
    <long name="runtime-sec">82800</long>
    <string name="end-operation">Terminate job</string>
  </newObject>

• The scope of the crawl (see the order.xml sketch after this list):
  – max-hops = 1 (for the other crawls, we use 20)
  – max-trans-hops = 2 (for the other crawls, we use 3)
• Delay between queries to each server:
  – max-retries = 10 (for the other crawls, we use 30) and retry-delay-seconds = 60 (we use 900)
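As a rough illustration only, settings like these would typically appear in a Heritrix 1.x order.xml along the lines of the sketch below. It assumes a DecidingScope built from TooManyHopsDecideRule and TransclusionDecideRule, and a BdbFrontier for the retry policy; the actual "News" profile may organise these modules differently.

  <!-- Hop limits in the scope (sketch; module layout is assumed) -->
  <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
    <newObject name="scope#decide-rules"
               class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
        <newObject name="tooManyHops"
                   class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
          <integer name="max-hops">1</integer>
        </newObject>
        <newObject name="transclusion"
                   class="org.archive.crawler.deciderules.TransclusionDecideRule">
          <integer name="max-trans-hops">2</integer>
        </newObject>
      </map>
    </newObject>
  </newObject>

  <!-- Retry policy on the frontier (sketch) -->
  <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <integer name="max-retries">10</integer>
    <long name="retry-delay-seconds">60</long>
  </newObject>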


A few key statistics…

• For the first 3 quarters:
  – 81,672,059 URLs collected
  – 511.86 GB (compressed)
• In one year, it will represent about:
  – 109,000,000 URLs collected = 18 % of our annual budget
  – 700 GB (compressed) = 2.7 % of our annual budget
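(The one-year figures appear to be a simple pro-rata extrapolation of the three-quarter totals, assuming a constant collection rate: 81,672,059 × 4/3 ≈ 108.9 million URLs, and 511.86 GB × 4/3 ≈ 682 GB, i.e. roughly 700 GB.)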


Crawl quality

• The crawl finishes in about 8 hours
• The quality of the archives is quite good
• But the archives have their limits:
  – Some news articles are split across 2 pages on the live website (http://fr.reuters.com)
  – The architecture of the website (http://www.lemonde.fr)
  – Page loading time in the Wayback Machine
  – Compressed code (http://www.francesoir.fr/)


Regional daily newspapers
Example: Ouest-France

• It is the biggest title: 47 editions
• In the past, we tested the deposit of PDF files, without success
• Online, the newspaper's PDF edition is not free
  – A password is required to access the publication after subscription
• We added the password to the Heritrix profile (see the credential sketch after this list), but:
  – The login/password is valid for 3 months only
  – The crawler often gets disconnected
• A big part of the site is programmed in JavaScript
• Heritrix extracts a lot of false URLs from the JavaScript
• Any false URL causes a disconnect and leads back to the login page
• But Heritrix enters the password only once per job (the login page is then marked as "already seen" and is not collected again)
  – We have crawled the articles but not the full PDF versions
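For reference, a form-based login like this is normally declared in the Heritrix 1.x credential-store as an HtmlFormCredential. The sketch below only illustrates the general shape, assuming that class: the domain, login URI and form field names shown are placeholders, not the values actually used for Ouest-France.

  <newObject name="credential-store" class="org.archive.crawler.datamodel.CredentialStore">
    <map name="credentials">
      <newObject name="ouest-france"
                 class="org.archive.crawler.datamodel.credential.HtmlFormCredential">
        <string name="credential-domain">www.ouest-france.fr</string>
        <!-- login-uri and form field names below are placeholders -->
        <string name="login-uri">http://www.ouest-france.fr/login</string>
        <string name="http-method">POST</string>
        <map name="form-items">
          <string name="login">SUBSCRIBER_LOGIN</string>
          <string name="password">SUBSCRIBER_PASSWORD</string>
        </map>
      </newObject>
    </map>
  </newObject>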


Today…

• Do you crawl paid newspapers?
  – Do you use passwords to crawl some publications?
  – Or do you rely only on IP addresses?
  – How do you store the passwords in NAS?
• What about their access?
  – Is it necessary to save the passwords in WB?
  – How do you communicate the passwords to the researchers?