
Page 1: Crawling and Ranking

Crawling and Ranking

Page 2: Crawling and Ranking

HTML (HyperText Markup Language)

• Describes the structure and content of a (web) document
• HTML 4.01: most common version, W3C standard
• XHTML 1.0: XML-ization of HTML 4.01, minor differences
• Validation (http://validator.w3.org/) against a schema: checks the conformity of a Web page with respect to recommendations, for accessibility:
– to all graphical browsers (IE, Firefox, Safari, Opera, etc.)
– to text browsers (lynx, links, w3m, etc.)
– to all other user agents, including Web crawlers

Page 3: Crawling and Ranking

The HTML language

• Text and tags

• Tags define structure
– Used, for instance, by a browser to lay out the document

• Header and Body

Page 4: Crawling and Ranking

HTML structure:

<!DOCTYPE html …>
<html lang="en">
  <head>
    <!-- Header of the document -->
  </head>
  <body>
    <!-- Body of the document -->
  </body>
</html>

Page 5: Crawling and Ranking

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example XHTML document</title>
</head>
<body>
<p>This is a <a href="http://www.w3.org/">link to the W3C</a></p>
</body>
</html>

Page 6: Crawling and Ranking

Header

• Appears between the tags <head> ... </head>

• Includes meta-data such as language, encoding…

• Also includes the document title

• Used by (e.g.) the browser to decipher the body

Page 7: Crawling and Ranking

Body• Between <body> ... </body> tags • The body is structured into sections, paragraphs,

lists, etc. <h1>Title of the page</h1> <h2>Title of a main section</h2> <h3>Title of a subsection</h3> . . .• <p> ... </p> define paragraphs• More block elements such as table, list…

Page 8: Crawling and Ranking

HTTP

• Application protocol

Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com

Server response:
HTTP/1.1 200 OK

• Two main HTTP methods: GET and POST

Page 9: Crawling and Ranking

GET

URL: http://www.google.com/search?q=BGU

Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com

Page 10: Crawling and Ranking

POST

• Used for submitting forms

POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
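
A hedged illustration of both methods using Python's standard urllib; the query parameter comes from the slides, the POST form field is a hypothetical placeholder, and real servers may reject such bare clients:

from urllib import parse, request

# GET: parameters are encoded into the URL's query string.
url = "http://www.google.com/search?" + parse.urlencode({"q": "BGU"})
try:
    with request.urlopen(url, timeout=5) as resp:
        print(resp.status, resp.reason)  # e.g. "200 OK"
except Exception as e:
    print("GET failed:", e)

# POST: parameters travel in the request body, form-urlencoded.
data = parse.urlencode({"field": "value"}).encode("ascii")  # hypothetical field
try:
    with request.urlopen("http://www.bgu.ac.il/php/test.php", data=data,
                         timeout=5) as resp:
        print(resp.status, resp.reason)
except Exception as e:
    print("POST failed:", e)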

Page 11: Crawling and Ranking

Status codes

• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)

• The first digit indicates the class of the response:
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error

Page 12: Crawling and Ranking

Authentication

• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.

• It can be used instead of plain HTTP to transmit sensitive data

GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp

(the Base64 string encodes the credentials toto:titi)

Page 13: Crawling and Ranking

Cookies

• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)

• Can be used to keep information on users between visits

• Often what is stored is a session ID
– Connected, on the server side, to all session information
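
A hedged sketch of the session-ID mechanism from the client side, using the third-party requests library; the URLs are illustrative only:

import requests

session = requests.Session()                # client-side cookie jar
session.get("https://example.com/login")    # server may reply: Set-Cookie: sessionid=...
print(session.cookies.get_dict())           # stored key/value pairs for this domain
session.get("https://example.com/profile")  # sessionid retransmitted automatically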

Page 14: Crawling and Ranking

Crawling

Page 15: Crawling and Ranking

Basics of Crawling

• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web

• Basic crawling algorithm (see the sketch below):
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL

Problem: The web is huge!
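
A minimal sketch of this four-step loop, breadth-first over discovered links; it assumes well-formed hrefs and ignores the politeness, robots.txt, and duplicate issues discussed on the following slides:

from collections import deque
from html.parser import HTMLParser
from urllib import request
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=100):
    frontier, seen = deque([seed]), {seed}   # 1. start from a given URL
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        try:                                 # 2. retrieve and process the page
            page = request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                         # unreachable or non-HTTP URL: skip
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:            # 3. discover new URLs
            new = urljoin(url, href)
            if new not in seen:
                seen.add(new)
                frontier.append(new)         # 4. repeat on each found URL
    return seen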

Page 16: Crawling and Ranking

Discovering new URLs

• Browse the "internet graph" (following e.g. hyperlinks)

• Site maps (sitemap.org)

Page 17: Crawling and Ranking

The internet graph

• At least 14.06 billion nodes = pages

• At least 140 billion edges = links

• Lots of "junk"

Page 18: Crawling and Ranking

Graph-browsing algorithms

• Depth-first

• Breadth-first

• Combinations..

• Parallel crawling

Page 19: Crawling and Ranking

Duplicates

• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing

• Trivial duplicates: same resource at the same canonized URL:

http://example.com:80/toto
http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!

Page 20: Crawling and Ranking

Near-duplicate detection

• Edit distance
– Good measure of similarity
– Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)

• Shingles: two documents are similar if they mostly share the same succession of k-grams (see the sketch below)
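
A minimal sketch of the shingle idea using word k-grams and Jaccard set similarity; the threshold 0.9 is an arbitrary illustrative choice, and real systems hash the shingles (e.g. MinHash) to scale:

def shingles(text, k=4):
    """The set of word k-grams of a document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(doc1, doc2, k=4, threshold=0.9):
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold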

Page 21: Crawling and Ranking

Crawling ethics

• robots.txt at the root of a Web server:
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard):
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100ms/1s between two repeated requests to the same Web server
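
A sketch of checking these per-server rules with Python's standard robots.txt parser before fetching (the URL is illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser("http://example.com/robots.txt")
rp.read()                                        # fetch and parse robots.txt
if rp.can_fetch("*", "http://example.com/search"):
    pass  # allowed: fetch, but still wait 100ms/1s between requests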

Page 22: Crawling and Ranking

Overview

• Crawl

• Retrieve relevant documents
– How? We need to define relevance, and to find relevant docs

• Rank
– How?

Page 23: Crawling and Ranking

Relevance

• Input: keyword (or set of keywords), “the web”

• First question: how to define the relevance of a page with respect to a keyword?

• Second question: how to store pages such that the relevant ones for a given keyword are easily retrieved?

Page 24: Crawling and Ranking

Relevance definition

• Boolean: based on the existence of a word in the document
– Synonyms
– Disadvantages?

• Word count
– Synonyms
– Disadvantages?

• Can we do better?

Page 25: Crawling and Ranking

TF-IDF
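
The slide's formula is not preserved in the transcript; the standard scheme weights a term t in a document d by its frequency in d, discounted by how many documents contain t at all, so frequent-but-ubiquitous terms score low. A minimal sketch (variants differ in normalization):

import math

def tf_idf(term, doc, corpus):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)), over a list of token lists."""
    tf = doc.count(term)                         # term frequency in doc
    df = sum(1 for d in corpus if term in d)     # number of docs containing term
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["web", "crawling"], ["web", "ranking"], ["page", "rank"]]
print(tf_idf("crawling", corpus[0], corpus))     # rare term: high weight
print(tf_idf("web", corpus[0], corpus))          # common term: lower weight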

Page 26: Crawling and Ranking

Storing pages

• Offline pre-processing can help online search

• Offline preprocessing includes stemming, stop-word removal…

• As well as the creation of an index

Page 27: Crawling and Ranking

Inverted Index
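
The slide's figure is not preserved in the transcript; an inverted index maps each term to the set of documents (its postings) containing it, so the documents relevant to a keyword are found with a single lookup. A minimal sketch:

from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "crawling the web", 2: "ranking web pages"}
index = build_index(docs)
print(index["web"])  # {1, 2}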

Page 28: Crawling and Ranking

More advanced text analysis

• N-grams

• HMM language models

• PCFG language models

• We will discuss all that later in the course!

Page 29: Crawling and Ranking

Ranking

Page 30: Crawling and Ranking

Why Ranking?

• Huge number of pages

• Huge even if we filter according to relevance
– Keep only pages that include the keywords

• A lot of the pages are not informative
– And anyway it is impossible for users to go through 10K results

Page 31: Crawling and Ranking

When to rank?

• Before retrieving results
– Advantage: offline!
– Disadvantage: huge set

• After retrieving results
– Advantage: smaller set
– Disadvantage: online, the user is waiting…

Page 32: Crawling and Ranking

How to rank?

• Observation: links are very informative!

• Not just for discovering new sites, but also for estimating the importance of a site

• CNN.com has more links to it than my homepage…

• Quality and Efficiency are key factors

Page 33: Crawling and Ranking

Authority and Hubness

• Authority: a site is very authoritative if it receives many citations. A citation from an important site has more weight than one from a less-important site.
A(v) = the authority of v

• Hubness: a good hub is a site that links to many authoritative sites.
H(v) = the hubness of v

Page 34: Crawling and Ranking

HITS

• Recursive dependency:
a(v) = Σ_{(u,v)∈E} h(u)
h(v) = Σ_{(v,u)∈E} a(u)
• Normalize (when?) by the square root of the sum of squares of the authority / hubness values
• Start by setting all values to 1
– We could also add bias

• We can show that a(v) and h(v) converge
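
A minimal sketch of these updates, where graph maps each node to the set of nodes it links to, all scores start at 1, and both score vectors are normalized after every iteration (one possible answer to the "when?"):

import math

def hits(graph, iterations=50):
    nodes = list(graph)
    a = {v: 1.0 for v in nodes}  # authority scores
    h = {v: 1.0 for v in nodes}  # hub scores
    for _ in range(iterations):
        a = {v: sum(h[u] for u in nodes if v in graph[u]) for v in nodes}
        h = {v: sum(a[u] for u in graph[v]) for v in nodes}
        for scores in (a, h):    # normalize to unit sum of squares
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return a, h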

Page 35: Crawling and Ranking

HITS (cont.)

• Works rather well if applied only to relevant web pages
– E.g. pages that include the input keywords

• The results are less satisfying if applied to the whole web

• On the other hand, ranking online (at query time) is a problem

Page 36: Crawling and Ranking

Google PageRank

• Works offline, i.e. computes for every web page a score that can then be used online

• Extremely efficient and high-quality

• The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]

Page 37: Crawling and Ranking

Random Surfer Model

• Consider a "random surfer"

• At each point, the surfer chooses a link on the current page and clicks on it

• A link is chosen with uniform distribution– A simplifying assumption..

• What is the probability of being, at a random time, at a web-page W?

Page 38: Crawling and Ranking

Recursive definition

• If PageRank reflects the probability of being at a web page (PR(W) = P(W)), then

PR(W) = PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))

where W1, …, Wn are the pages linking to W and O(Wi) is the out-degree of Wi

Page 39: Crawling and Ranking

Problems

• A random surfer may get stuck in one component of the graph

• May get stuck in loops

• “Rank Sink” Problem– Many Web pages have no inlinks/outlinks

Page 40: Crawling and Ranking

Damping Factor

• Add some probability d of "jumping" to a random page

• Now

PR(W) = (1-d) * [PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))] + d * (1/N)

where N is the number of pages in the index

Page 41: Crawling and Ranking

How to compute PR?

• Simulation

• Analytical methods– Can we solve the equations?

Page 42: Crawling and Ranking

Simulation: A random surfer algorithm

• Start from an arbitrary page
• Toss a coin to decide whether to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
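
A minimal sketch of this surfer, with jump probability d playing the role of the damping factor; a page with no outlinks always jumps, and the visit frequencies approximate PageRank:

import random
from collections import Counter

def simulate(graph, d=0.15, steps=100_000):
    pages = list(graph)
    visits = Counter()
    page = random.choice(pages)                  # start from an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        if random.random() < d or not graph[page]:
            page = random.choice(pages)          # first coin: jump to a random page
        else:
            page = random.choice(list(graph[page]))  # second coin: pick a link
    return {p: visits[p] / steps for p in pages}     # visit frequencies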

Page 43: Crawling and Ranking

Convergence

• Not guaranteed without the damping factor!
• (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component

• Claim: with damping, the probability of getting stuck forever is 0

• More difficult claim: with damping, convergence is guaranteed

Page 44: Crawling and Ranking

Markov Chain Monte Carlo (MCMC)

• A class of very useful algorithms for sampling a given distribution

• We first need to know what a Markov Chain is

Page 45: Crawling and Ranking

Markov Chain

• A finite or countably infinite state machine

• We will consider the case of finitely many states

• Transitions are associated with probabilities

• Markovian property: given the present state, future choices are independent of the past

Page 46: Crawling and Ranking

MCMC framework

• Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution

• Perform a random walk on the MC, keeping track of the proportion of state visits
– Discard samples made before "mixing"

• Return the proportions as an approximation of the correct distribution

Page 47: Crawling and Ranking

Properties of Markov Chains

• A Markov Chain defines a distribution on the different states (P(state)= probability of being in the state at a random time)

• We want conditions under which this distribution is unique, and under which a random walk approximates it

Page 48: Crawling and Ranking

Properties

• Periodicity
– A state i has period k if any return to state i must occur in multiples of k time steps
– Aperiodic: period = 1 for all states

• Reducibility
– An MC is irreducible if there is a probability 1 of (eventually) getting from every state to every state

• Theorem: a finite-state MC has a unique stationary distribution if it is aperiodic and irreducible

Page 49: Crawling and Ranking

Back to PageRank

• The MC is the web graph with the transition probabilities we have defined

• MCMC is the random walk algorithm

• Is the MC aperiodic? Irreducible?

• Why?

Page 50: Crawling and Ranking

Problem with MCMC

• In general no guarantees on convergence time
– Even for those "nice" MCs

• A lot of work on characterizing "nicer" MCs
– That will allow fast convergence

• In practice, for the web graph it converges rather slowly
– Why?

Page 51: Crawling and Ranking

A different approach

• Reconsider the equation system

PR(W) = (1-d) * [PR(W1) * (1/O(W1)) + … + PR(Wn) * (1/O(Wn))] + d * (1/N)

• A linear equation system!

Page 52: Crawling and Ranking

Transition Matrix

T = ( 0    0.33 0.33 0.33
      0    0    0.5  0.5
      0.25 0.25 0.25 0.25
      0    0    0    0    )

Stochastic matrix: each row holds the transition probabilities out of one page (the all-zero last row is a page with no outlinks)

Page 53: Crawling and Ranking

Eigenvector!

• PR (column vector) is the right eigenvector of the stochastic transition matrix
– I.e. the adjacency matrix normalized so that every column sums to 1

• The Perron-Frobenius theorem ensures that such a vector exists

• Unique under the same assumptions as before

Page 54: Crawling and Ranking

Direct solution

• Solving the equation system
– Via e.g. Gaussian elimination

• This is time-consuming

• Observation: the matrix is sparse

• So iterative methods work better here

Page 55: Crawling and Ranking

Power method

• Start with some arbitrary rank vector R0

• Compute Ri = A Ri-1

• If we happen to reach the eigenvector, we stay there

• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
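
A minimal sketch of this iteration with numpy, assuming a column-stochastic matrix A and folding in the damping term from the earlier PageRank equation:

import numpy as np

def pagerank_power(A, d=0.15, iterations=100):
    n = A.shape[0]
    r = np.full(n, 1.0 / n)              # arbitrary start vector R0
    for _ in range(iterations):
        r = (1 - d) * (A @ r) + d / n    # Ri = (1-d) * A @ R(i-1) + d/N
    return r

# Column-stochastic transition matrix of a 3-page graph (columns sum to 1).
A = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(pagerank_power(A))                 # PageRank scores, summing to 1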

Page 56: Crawling and Ranking

Power method (cont.)

• Every iteration is still “expensive”

• But since the matrix is sparse it becomes feasible

• Still, need a lot of tweaks and optimizations to make it work efficiently

Page 57: Crawling and Ranking

Other issues

• Accelerating computation

• Updates

• Distributed PageRank

• Mixed Model (Incorporating "static" importance)

• Personalized PageRank