Информационный поиск, осень 2016: Получение данных для...
TRANSCRIPT
Crawling, main questions Practical considerations Spam Summary
Information RetrievalData Acquisition
Ilya [email protected]
University of Amsterdam
Ilya Markov [email protected] Information Retrieval 1
Crawling, main questions Practical considerations Spam Summary
Course overview
Data Acquisition
Data Processing
Data Storage
EvaluationRankingQuery Processing
Aggregated Search
Click Models
Present and Future of IR
Offline
Online
Advanced
Ilya Markov [email protected] Information Retrieval 2
Crawling, main questions Practical considerations Spam Summary
This lecture
Data Acquisition
Data Processing
Data Storage
EvaluationRankingQuery Processing
Aggregated Search
Click Models
Present and Future of IR
Offline
Online
Advanced
Ilya Markov [email protected] Information Retrieval 3
Crawling, main questions Practical considerations Spam Summary
Data acquisition methods
Downloading
Feeds
API
Metadata harvesting
Crawling
Ilya Markov [email protected] Information Retrieval 4
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questions
2 Practical considerations
3 Spam
4 Summary
Ilya Markov [email protected] Information Retrieval 5
Crawling, main questions Practical considerations Spam Summary
Crawling
https://staff.fnwi.uva.nl/i.markov/
index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf
Content Storage
https://staff.fnwi.uva.nl/i.markov/
index.html bio.html publications.html teaching.html ilps.science.uva.nl ivi.uva.nl www.uva.nl cv/imarkov_cv/pdf
Content Storage
https://staff.fnwi.uva.nl/i.markov/
Content Storage
https://staff.fnwi.uva.nl/i.markov/
Content Storage
https://staff.fnwi.uva.nl/i.markov/ https://staff.fnwi.uva.nl/i.markov/
Ilya Markov [email protected] Information Retrieval 6
Crawling, main questions Practical considerations Spam Summary
Main questions
What to crawl?
How to be polite?
How to keep the crawl fresh?
Ilya Markov [email protected] Information Retrieval 7
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?
2 Practical considerations
3 Spam
4 Summary
Ilya Markov [email protected] Information Retrieval 8
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?
Ilya Markov [email protected] Information Retrieval 9
Crawling, main questions Practical considerations Spam Summary
What to crawl?
Everything you possibly can
or the information suitable for your application
Ilya Markov [email protected] Information Retrieval 10
Crawling, main questions Practical considerations Spam Summary
Taxonomy of crawlers
Taxonomy of CrawlersThe crawlers assign different importance to issues suchas freshness, quality, and volumeThe crawlers can be classified according to these threeaxes
Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 16Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”
Ilya Markov [email protected] Information Retrieval 11
Crawling, main questions Practical considerations Spam Summary
Taxonomy of pages
Taxonomy of Web pagesA challenging aspect is related with the taxonomy ofWeb pagesWeb pages can be classified in public or private, andstatic or dynamic
There are in practice infinitely many dynamic pagesWe cannot expect a crawler to download all of themMost crawlers choose a maximum depth of dynamic links to follow
Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 23
Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”
Ilya Markov [email protected] Information Retrieval 12
Crawling, main questions Practical considerations Spam Summary
Prioritizing pages to crawl
Random ordering
Breadth-first approach
Based on popularity
Links, e.g., PageRankVisits, e.g., click-through rate (CTR)
Ilya Markov [email protected] Information Retrieval 13
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?
Ilya Markov [email protected] Information Retrieval 14
Crawling, main questions Practical considerations Spam Summary
How to be polite?
A Web crawler must. . .
1 identify itself
2 obey the robots exclusion protocol
3 keep a low bandwidth usage in a given web site
Ilya Markov [email protected] Information Retrieval 15
Crawling, main questions Practical considerations Spam Summary
Robot identification
Fill the user-agent field in the HTTP request
Include the word “crawler”, “robot”, “bot”, etc.
Ilya Markov [email protected] Information Retrieval 16
Crawling, main questions Practical considerations Spam Summary
Robot exclusion protocol
Server-wise exclusionrobots.txt
Page-wise exclusion<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Cache exclusion<META NAME="ROBOTS" CONTENT="NOARCHIVE">
Ilya Markov [email protected] Information Retrieval 17
Crawling, main questions Practical considerations Spam Summary
Bandwidth usage
Empirical thresholds
Adaptive politeness policy, e.g., 10 × t
crawl-delay:45 (in seconds in robots.txt)
Ilya Markov [email protected] Information Retrieval 18
Crawling, main questions Practical considerations Spam Summary
Simple crawling thread implementation
Croft et al., “Search Engines, Information Retrieval in Practice”
Ilya Markov [email protected] Information Retrieval 19
Crawling, main questions Practical considerations Spam Summary
Simple crawling thread implementation
Architecture of the CrawlersIn the short-term scheduler, enforcement of thepoliteness policy requires maintaining several queues,one for each site, and a list of pages to download ineach queue
Web Crawling, Modern Information Retrieval, Addison Wesley, 2010 – p. 21Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”
Ilya Markov [email protected] Information Retrieval 20
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questionsWhat to crawl?How to be polite?How to keep the crawl fresh?
Ilya Markov [email protected] Information Retrieval 21
Crawling, main questions Practical considerations Spam Summary
How to keep the crawl fresh?
1 Identify changes in web content
2 Measure these changes
3 Predict changes
4 Select pages to update
Ilya Markov [email protected] Information Retrieval 22
Crawling, main questions Practical considerations Spam Summary
Identifying changes
Croft et al., “Search Engines, Information Retrieval in Practice”
Ilya Markov [email protected] Information Retrieval 23
Crawling, main questions Practical considerations Spam Summary
Measuring changes
Freshness
Fp(t) =
{1 if p has not changed since last crawl
0 otherwise
Age
Ap(t) =
∫ t
0P(p changed at time x)(t − x)dx
Ilya Markov [email protected] Information Retrieval 24
Crawling, main questions Practical considerations Spam Summary
Measuring changes
Croft et al., “Search Engines, Information Retrieval in Practice”
Ilya Markov [email protected] Information Retrieval 25
Crawling, main questions Practical considerations Spam Summary
Predicting changes
Age
Ap(t) =
∫ t
0P(p changed at time x)(t − x)dx
Time between two changes follows an exponential distribution
Ap(t) =
∫ t
0λe−λx(t − x)dx
λ – average number of changes
Ilya Markov [email protected] Information Retrieval 26
Crawling, main questions Practical considerations Spam Summary
Example
One update a week, λ = 1/7
In the end of the week, the expected age is approx. 2.6
Croft et al., “Search Engines, Information Retrieval in Practice”
Ilya Markov [email protected] Information Retrieval 27
Crawling, main questions Practical considerations Spam Summary
Selecting pages to update
Optimizing freshness or age?
Optimizing freshness
Frequently updated pages will never be freshStop updating such pages
Optimizing age
The older a page gets, the more it costs not to crawl itCannot stop updating pages
Ilya Markov [email protected] Information Retrieval 28
Crawling, main questions Practical considerations Spam Summary
Practical implementation
Maintain several queues
A queue for news sites that is refreshed several times a day
A daily or weekly queue for popular or relevant sites
A large queue for the rest of the Web
Ilya Markov [email protected] Information Retrieval 29
Crawling, main questions Practical considerations Spam Summary
Main questions
What to crawl?
How to be polite?
How to keep the crawl fresh?
Ilya Markov [email protected] Information Retrieval 30
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questions
2 Practical considerations
3 Spam
4 Summary
Ilya Markov [email protected] Information Retrieval 31
Crawling, main questions Practical considerations Spam Summary
Distributed crawling
Requirements
Synchronize URLs between threadsMinimize communication overhead
Approach1 Compute a hash function
of the URL’s domain2 Send the same hash value
to the same thread3 Send URLs in batches
2.3 Key Design Points 185
Frontier
Duplicate URLeliminator
Custom URL filter
Link extractor
URL prioritizer
HTTP fetcher
DNS resolver &cache
URL distributor
Crawling process 1
Web servers
DNS servers
Host names
HTML pages
IP addresses
URLs
URLs
URLs
URLs
URLs
URLsURLs
URLs
Frontier
Duplicate URLeliminator
Custom URL filter
Link extractor
URL prioritizer
HTTP fetcher
DNS resolver &cache
URL distributor
Crawling process 2
Web servers
DNS servers
Host names
HTML pages
IP addresses
URLs
URLs
URLs
URLs
URLs
URLs
Fig. 2.1 Basic crawler architecture.
2.3 Key Design Points
Web crawlers download web pages by starting from one or more
seed URLs, downloading each of the associated pages, extracting theIlya Markov [email protected] Information Retrieval 32
Crawling, main questions Practical considerations Spam Summary
Duplicates
Mirroring
Different URLs, same content
URL modifiers, e.g., /dir/page.html&jsessid=09A89732
Ilya Markov [email protected] Information Retrieval 33
Crawling, main questions Practical considerations Spam Summary
Sitemap
Croft et al., “Search Engines, Information Retrieval in Practice”
Ilya Markov [email protected] Information Retrieval 34
Crawling, main questions Practical considerations Spam Summary
Deep web
No incoming links
Password-protected
Dynamic content
Web forms
Ilya Markov [email protected] Information Retrieval 35
Crawling, main questions Practical considerations Spam Summary
Deep web example
Ilya Markov [email protected] Information Retrieval 36
Crawling, main questions Practical considerations Spam Summary
Deep web example
Ilya Markov [email protected] Information Retrieval 37
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questions
2 Practical considerations
3 Spam
4 Summary
Ilya Markov [email protected] Information Retrieval 38
Crawling, main questions Practical considerations Spam Summary
Spam
1 Cloacking
2 Content spam
3 Link spam
Ilya Markov [email protected] Information Retrieval 39
Crawling, main questions Practical considerations Spam Summary
Cloacking
Serving crawlers and users with different content
Ilya Markov [email protected] Information Retrieval 40
Crawling, main questions Practical considerations Spam Summary
Content spam
Keyword stuffing
Raise the keyword count, variety, and density
Hidden or invisible text
Unrelated text using same color as the backgroundUsing tiny font sizeHiding within HTML code
Doorway pages
Designed to rank highly within search resultsRedirect users to spam content
Scraper sites
“Scrape” other sources of contentCreate “content” for a website
https://en.wikipedia.org/wiki/Spamdexing
Ilya Markov [email protected] Information Retrieval 41
Crawling, main questions Practical considerations Spam Summary
Link spam
Link farms
Buy a large number of domainsCreate a large number of sitesLink them to each other
Blog spam, comment spam, wiki spam
Hidden links
Ilya Markov [email protected] Information Retrieval 42
Crawling, main questions Practical considerations Spam Summary
Identifying spam
Picture taken from http://smerity.com/media/talks/ml_for_your_robotic_army/template.html
Ilya Markov [email protected] Information Retrieval 43
Crawling, main questions Practical considerations Spam Summary
Fighting against keyword stuffing
Compare the term distribution to the average one
Check fraction of globally popular words
Check compression rate
Truncate long documents
Ilya Markov [email protected] Information Retrieval 44
Crawling, main questions Practical considerations Spam Summary
Outline
1 Crawling, main questions
2 Practical considerations
3 Spam
4 Summary
Ilya Markov [email protected] Information Retrieval 45
Crawling, main questions Practical considerations Spam Summary
Data acquisition summary
Crawling
SelectionPolitnessFreshness
Practical considerations
Distributed crawlingDuplicatesSitemapDeep web
Spam
CloakingContent spamLink spam
Ilya Markov [email protected] Information Retrieval 46
Crawling, main questions Practical considerations Spam Summary
Materials
Croft et al., Chapters 3.1–3.4, 9.1.5
Baeza-Yates and Ribeiro-Neto, Chapter 12
Ilya Markov [email protected] Information Retrieval 47
Crawling, main questions Practical considerations Spam Summary
Course overview
Data Acquisition
Data Processing
Data Storage
EvaluationRankingQuery Processing
Aggregated Search
Click Models
Present and Future of IR
Offline
Online
Advanced
Ilya Markov [email protected] Information Retrieval 48
Crawling, main questions Practical considerations Spam Summary
Next lecture
Data Acquisition
Data Processing
Data Storage
EvaluationRankingQuery Processing
Aggregated Search
Click Models
Present and Future of IR
Offline
Online
Advanced
Ilya Markov [email protected] Information Retrieval 49