Информационный поиск, осень 2016: Получение данных для...

49
Crawling, main questions Practical considerations Spam Summary Information Retrieval Data Acquisition Ilya Markov [email protected] University of Amsterdam Ilya Markov [email protected] Information Retrieval 1

Upload: cs-center

Post on 15-Apr-2017

254 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Information RetrievalData Acquisition

Ilya [email protected]

University of Amsterdam

Ilya Markov [email protected] Information Retrieval 1

Page 2: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Course overview

Data Acquisition

Data Processing

Data Storage

EvaluationRankingQuery Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Ilya Markov [email protected] Information Retrieval 2

Page 3: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

This lecture

Data Acquisition

Data Processing

Data Storage

EvaluationRankingQuery Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Ilya Markov [email protected] Information Retrieval 3

Page 4: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Data acquisition methods

Downloading

Feeds

API

Metadata harvesting

Crawling

Ilya Markov [email protected] Information Retrieval 4

Page 5: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questions

2 Practical considerations

3 Spam

4 Summary

Ilya Markov [email protected] Information Retrieval 5

Page 6: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Crawling

https://staff.fnwi.uva.nl/i.markov/

index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf

Content Storage

https://staff.fnwi.uva.nl/i.markov/

index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf

Content Storage

https://staff.fnwi.uva.nl/i.markov/

Content Storage

https://staff.fnwi.uva.nl/i.markov/

Content Storage

https://staff.fnwi.uva.nl/i.markov/ https://staff.fnwi.uva.nl/i.markov/

Ilya Markov [email protected] Information Retrieval 6

Page 7: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Main questions

What to crawl?

How to be polite?

How to keep the crawl fresh?

Ilya Markov [email protected] Information Retrieval 7

Page 8: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?

2 Practical considerations

3 Spam

4 Summary

Ilya Markov [email protected] Information Retrieval 8

Page 9: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?

Ilya Markov [email protected] Information Retrieval 9

Page 10: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

What to crawl?

Everything you possibly can

or the information suitable for your application

Ilya Markov [email protected] Information Retrieval 10

Page 11: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Taxonomy of crawlers

Taxonomy of CrawlersThe crawlers assign different importance to issues suchas freshness, quality, and volumeThe crawlers can be classified according to these threeaxes

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 16Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”

Ilya Markov [email protected] Information Retrieval 11

Page 12: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Taxonomy of pages

Taxonomy of Web pagesA challenging aspect is related with the taxonomy ofWeb pagesWeb pages can be classified in public or private, andstatic or dynamic

There are in practice infinitely many dynamic pagesWe cannot expect a crawler to download all of themMost crawlers choose a maximum depth of dynamic links to follow

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 23

Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”

Ilya Markov [email protected] Information Retrieval 12

Page 13: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Prioritizing pages to crawl

Random ordering

Breadth-first approach

Based on popularity

Links, e.g., PageRankVisits, e.g., click-through rate (CTR)

Ilya Markov [email protected] Information Retrieval 13

Page 14: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?

Ilya Markov [email protected] Information Retrieval 14

Page 15: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

How to be polite?

A Web crawler must. . .

1 identify itself

2 obey the robots exclusion protocol

3 keep a low bandwidth usage in a given web site

Ilya Markov [email protected] Information Retrieval 15

Page 16: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Robot identification

Fill the user-agent field in the HTTP request

Include the word “crawler”, “robot”, “bot”, etc.

Ilya Markov [email protected] Information Retrieval 16

Page 17: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Robot exclusion protocol

Server-wise exclusionrobots.txt

Page-wise exclusion<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Cache exclusion<META NAME="ROBOTS" CONTENT="NOARCHIVE">

Ilya Markov [email protected] Information Retrieval 17

Page 18: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Bandwidth usage

Empirical thresholds

Adaptive politeness policy, e.g., 10 × t

crawl-delay:45 (in seconds in robots.txt)

Ilya Markov [email protected] Information Retrieval 18

Page 19: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Simple crawling thread implementation

Croft et al., “Search Engines, Information Retrieval in Practice”

Ilya Markov [email protected] Information Retrieval 19

Page 20: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Simple crawling thread implementation

Architecture of the CrawlersIn the short-term scheduler, enforcement of thepoliteness policy requires maintaining several queues,one for each site, and a list of pages to download ineach queue

Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 21Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”

Ilya Markov [email protected] Information Retrieval 20

Page 21: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?

Ilya Markov [email protected] Information Retrieval 21

Page 22: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

How to keep the crawl fresh?

1 Identify changes in web content

2 Measure these changes

3 Predict changes

4 Select pages to update

Ilya Markov [email protected] Information Retrieval 22

Page 23: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Identifying changes

Croft et al., “Search Engines, Information Retrieval in Practice”

Ilya Markov [email protected] Information Retrieval 23

Page 24: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Measuring changes

Freshness

Fp(t) =

{1 if p has not changed since last crawl

0 otherwise

Age

Ap(t) =

∫ t

0P(p changed at time x)(t − x)dx

Ilya Markov [email protected] Information Retrieval 24

Page 25: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Measuring changes

Croft et al., “Search Engines, Information Retrieval in Practice”

Ilya Markov [email protected] Information Retrieval 25

Page 26: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Predicting changes

Age

Ap(t) =

∫ t

0P(p changed at time x)(t − x)dx

Time between two changes follows an exponential distribution

Ap(t) =

∫ t

0λe−λx(t − x)dx

λ – average number of changes

Ilya Markov [email protected] Information Retrieval 26

Page 27: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Example

One update a week, λ = 1/7

In the end of the week, the expected age is approx. 2.6

Croft et al., “Search Engines, Information Retrieval in Practice”

Ilya Markov [email protected] Information Retrieval 27

Page 28: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Selecting pages to update

Optimizing freshness or age?

Optimizing freshness

Frequently updated pages will never be freshStop updating such pages

Optimizing age

The older a page gets, the more it costs not to crawl itCannot stop updating pages

Ilya Markov [email protected] Information Retrieval 28

Page 29: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Practical implementation

Maintain several queues

A queue for news sites that is refreshed several times a day

A daily or weekly queue for popular or relevant sites

A large queue for the rest of the Web

Ilya Markov [email protected] Information Retrieval 29

Page 30: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Main questions

What to crawl?

How to be polite?

How to keep the crawl fresh?

Ilya Markov [email protected] Information Retrieval 30

Page 31: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questions

2 Practical considerations

3 Spam

4 Summary

Ilya Markov [email protected] Information Retrieval 31

Page 32: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Distributed crawling

Requirements

Synchronize URLs between threadsMinimize communication overhead

Approach1 Compute a hash function

of the URL’s domain2 Send the same hash value

to the same thread3 Send URLs in batches

2.3 Key Design Points 185

Frontier

Duplicate URLeliminator

Custom URL filter

Link extractor

URL prioritizer

HTTP fetcher

DNS resolver &cache

URL distributor

Crawling process 1

Web servers

DNS servers

Host names

HTML pages

IP addresses

URLs

URLs

URLs

URLs

URLs

URLsURLs

URLs

Frontier

Duplicate URLeliminator

Custom URL filter

Link extractor

URL prioritizer

HTTP fetcher

DNS resolver &cache

URL distributor

Crawling process 2

Web servers

DNS servers

Host names

HTML pages

IP addresses

URLs

URLs

URLs

URLs

URLs

URLs

Fig. 2.1 Basic crawler architecture.

2.3 Key Design Points

Web crawlers download web pages by starting from one or more

seed URLs, downloading each of the associated pages, extracting theIlya Markov [email protected] Information Retrieval 32

Page 33: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Duplicates

Mirroring

Different URLs, same content

URL modifiers, e.g., /dir/page.html&jsessid=09A89732

Ilya Markov [email protected] Information Retrieval 33

Page 34: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Sitemap

Croft et al., “Search Engines, Information Retrieval in Practice”

Ilya Markov [email protected] Information Retrieval 34

Page 35: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Deep web

No incoming links

Password-protected

Dynamic content

Web forms

Ilya Markov [email protected] Information Retrieval 35

Page 36: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Deep web example

Ilya Markov [email protected] Information Retrieval 36

Page 37: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Deep web example

Ilya Markov [email protected] Information Retrieval 37

Page 38: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questions

2 Practical considerations

3 Spam

4 Summary

Ilya Markov [email protected] Information Retrieval 38

Page 39: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Spam

1 Cloacking

2 Content spam

3 Link spam

Ilya Markov [email protected] Information Retrieval 39

Page 40: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Cloacking

Serving crawlers and users with different content

Ilya Markov [email protected] Information Retrieval 40

Page 41: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Content spam

Keyword stuffing

Raise the keyword count, variety, and density

Hidden or invisible text

Unrelated text using same color as the backgroundUsing tiny font sizeHiding within HTML code

Doorway pages

Designed to rank highly within search resultsRedirect users to spam content

Scraper sites

“Scrape” other sources of contentCreate “content” for a website

https://en.wikipedia.org/wiki/Spamdexing

Ilya Markov [email protected] Information Retrieval 41

Page 42: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Link spam

Link farms

Buy a large number of domainsCreate a large number of sitesLink them to each other

Blog spam, comment spam, wiki spam

Hidden links

Ilya Markov [email protected] Information Retrieval 42

Page 43: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Identifying spam

Picture taken from http://smerity.com/media/talks/ml_for_your_robotic_army/template.html

Ilya Markov [email protected] Information Retrieval 43

Page 44: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Fighting against keyword stuffing

Compare the term distribution to the average one

Check fraction of globally popular words

Check compression rate

Truncate long documents

Ilya Markov [email protected] Information Retrieval 44

Page 45: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Outline

1 Crawling, main questions

2 Practical considerations

3 Spam

4 Summary

Ilya Markov [email protected] Information Retrieval 45

Page 46: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Data acquisition summary

Crawling

SelectionPolitnessFreshness

Practical considerations

Distributed crawlingDuplicatesSitemapDeep web

Spam

CloakingContent spamLink spam

Ilya Markov [email protected] Information Retrieval 46

Page 47: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Materials

Croft et al., Chapters 3.1–3.4, 9.1.5

Baeza-Yates and Ribeiro-Neto, Chapter 12

Ilya Markov [email protected] Information Retrieval 47

Page 48: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Course overview

Data Acquisition

Data Processing

Data Storage

EvaluationRankingQuery Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Ilya Markov [email protected] Information Retrieval 48

Page 49: Информационный поиск, осень 2016: Получение данных для поиска

Crawling, main questions Practical considerations Spam Summary

Next lecture

Data Acquisition

Data Processing

Data Storage

EvaluationRankingQuery Processing

Aggregated Search

Click Models

Present and Future of IR

Offline

Online

Advanced

Ilya Markov [email protected] Information Retrieval 49