Информационный поиск, осень 2016: Получение данных для...

Crawling, main questions Practical considerations Spam Summary

Information RetrievalData Acquisition

Ilya [email protected]

University of Amsterdam

Ilya Markov [email protected] Information Retrieval 1


Course overview

Data Acquisition

Data Processing

Data Storage

EvaluationRankingQuery Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced



This lecture

Data Acquisition

Data Processing

Data Storage


Aggregated Search

Click Models


Offline

Online

Advanced



Data acquisition methods

Downloading

Feeds

API

Metadata harvesting

Crawling



Outline

1 Crawling, main questions

2 Practical considerations

3 Spam

4 Summary



Crawling

https://staff.fnwi.uva.nl/i.markov/

index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf

Content Storage


index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf

Content Storage


Content Storage


Content Storage

https://staff.fnwi.uva.nl/i.markov/ https://staff.fnwi.uva.nl/i.markov/



Main questions

What to crawl?

How to be polite?

How to keep the crawl fresh?



Outline

1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?


3 Spam

4 Summary



Outline




What to crawl?

Everything you possibly can

or the information suitable for your application



Taxonomy of crawlers

Taxonomy of CrawlersThe crawlers assign different importance to issues suchas freshness, quality, and volumeThe crawlers can be classified according to these threeaxes

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 16Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”



Taxonomy of pages

Taxonomy of Web pagesA challenging aspect is related with the taxonomy ofWeb pagesWeb pages can be classified in public or private, andstatic or dynamic

There are in practice infinitely many dynamic pagesWe cannot expect a crawler to download all of themMost crawlers choose a maximum depth of dynamic links to follow

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 23

Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”



Prioritizing pages to crawl

Random ordering

Breadth-first approach

Based on popularity

Links, e.g., PageRankVisits, e.g., click-through rate (CTR)



Outline




How to be polite?

A Web crawler must. . .

1 identify itself

2 obey the robots exclusion protocol

3 keep a low bandwidth usage in a given web site



Robot identification

Fill the user-agent field in the HTTP request

Include the word “crawler”, “robot”, “bot”, etc.



Robot exclusion protocol

Server-wise exclusionrobots.txt

Page-wise exclusion<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Cache exclusion<META NAME="ROBOTS" CONTENT="NOARCHIVE">



Bandwidth usage

Empirical thresholds

Adaptive politeness policy, e.g., 10 × t

crawl-delay:45 (in seconds in robots.txt)



Simple crawling thread implementation

Croft et al., “Search Engines, Information Retrieval in Practice”



Simple crawling thread implementation

Architecture of the CrawlersIn the short-term scheduler, enforcement of thepoliteness policy requires maintaining several queues,one for each site, and a list of pages to download ineach queue

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 21Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”



Outline





1 Identify changes in web content

2 Measure these changes

3 Predict changes

4 Select pages to update



Identifying changes




Measuring changes

Freshness

Fp(t) =

{1 if p has not changed since last crawl

0 otherwise

Age

Ap(t) =

∫ t

0P(p changed at time x)(t − x)dx



Measuring changes




Predicting changes

Age

Ap(t) =

∫ t

0P(p changed at time x)(t − x)dx

Time between two changes follows an exponential distribution

Ap(t) =

∫ t

0λe−λx(t − x)dx

λ – average number of changes



Example

One update a week, λ = 1/7

In the end of the week, the expected age is approx. 2.6




Selecting pages to update

Optimizing freshness or age?

Optimizing freshness

Frequently updated pages will never be freshStop updating such pages

Optimizing age

The older a page gets, the more it costs not to crawl itCannot stop updating pages



Practical implementation

Maintain several queues

A queue for news sites that is refreshed several times a day

A daily or weekly queue for popular or relevant sites

A large queue for the rest of the Web



Main questions

What to crawl?

How to be polite?




Outline



3 Spam

4 Summary



Distributed crawling

Requirements

Synchronize URLs between threadsMinimize communication overhead

Approach1 Compute a hash function

of the URL’s domain2 Send the same hash value

to the same thread3 Send URLs in batches

2.3 Key Design Points 185

Frontier

Duplicate URLeliminator

Custom URL filter

Link extractor

URL prioritizer

HTTP fetcher

DNS resolver &cache

URL distributor

Crawling process 1

Web servers

DNS servers

Host names

HTML pages

IP addresses

URLs

URLs

URLs

URLs

URLs

URLsURLs

URLs

Frontier

Duplicate URLeliminator

Custom URL filter

Link extractor

URL prioritizer

HTTP fetcher

DNS resolver &cache

URL distributor

Crawling process 2

Web servers

DNS servers

Host names

HTML pages

IP addresses

URLs

URLs

URLs

URLs

URLs

URLs

Fig. 2.1 Basic crawler architecture.

2.3 Key Design Points

Web crawlers download web pages by starting from one or more

seed URLs, downloading each of the associated pages, extracting theIlya Markov [email protected] Information Retrieval 32


Duplicates

Mirroring

Different URLs, same content

URL modifiers, e.g., /dir/page.html&jsessid=09A89732



Sitemap




Deep web

No incoming links

Password-protected

Dynamic content

Web forms



Deep web example



Outline



3 Spam

4 Summary



Spam

1 Cloacking

2 Content spam

3 Link spam



Cloacking

Serving crawlers and users with different content



Content spam

Keyword stuffing

Raise the keyword count, variety, and density

Hidden or invisible text

Unrelated text using same color as the backgroundUsing tiny font sizeHiding within HTML code

Doorway pages

Designed to rank highly within search resultsRedirect users to spam content

Scraper sites

“Scrape” other sources of contentCreate “content” for a website

https://en.wikipedia.org/wiki/Spamdexing


https://en.wikipedia.org/wiki/Spamdexing


Link spam

Link farms

Buy a large number of domainsCreate a large number of sitesLink them to each other

Blog spam, comment spam, wiki spam

Hidden links



Identifying spam

Picture taken from http://smerity.com/media/talks/ml_for_your_robotic_army/template.html


http://smerity.com/media/talks/ml_for_your_robotic_army/template.html


Fighting against keyword stuffing

Compare the term distribution to the average one

Check fraction of globally popular words

Check compression rate

Truncate long documents



Outline



3 Spam

4 Summary



Data acquisition summary

Crawling

SelectionPolitnessFreshness

Practical considerations

Distributed crawlingDuplicatesSitemapDeep web

Spam

CloakingContent spamLink spam



Materials

Croft et al., Chapters 3.1–3.4, 9.1.5

Baeza-Yates and Ribeiro-Neto, Chapter 12



Course overview

Data Acquisition

Data Processing

Data Storage


Aggregated Search

Click Models


Offline

Online

Advanced



Next lecture

Data Acquisition

Data Processing

Data Storage


Aggregated Search

Click Models


Offline

Online

Advanced


Информационный поиск, осень 2016: Получение данных для...

Documents