
Page 1: Antonio Gulli

Antonio Gulli

Page 2: Antonio Gulli

AGENDA
- Overview of Spidering Technology … hey, how can I get that page?
- Overview of a Web Graph … tell me something about the arena
- Google Overview

Reference: Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti (Chap. 2), Morgan Kaufmann Publishers, 352 pages, cloth/hard-bound, ISBN 1-55860-754-4

Page 3: Antonio Gulli

Spidering: 24h, 7 days, "walking" over a graph, getting data.

What about the Graph? Recall the sample of 150 sites (last lesson).
- Directed graph G = (N, E)
- N changes (insert, delete): ~ 4-6 × 10^9 nodes
- E changes (insert, delete): ~ 10 links per node
- Size: 10 × 4×10^9 = 4×10^10 non-zero entries in the adjacency matrix

EX: suppose a 64-bit hash per URL; how much space is needed to store the adjacency matrix? (A worked estimate follows below.)
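One way to answer the exercise, as a back-of-the-envelope estimate; the 8-bytes-per-entry figure simply reflects the 64-bit hash, and the sparse edge-list representation is an illustrative choice, not prescribed by the slide:

```python
# Rough space estimate: store the sparse adjacency "matrix" as an edge list
# of 64-bit URL hashes (8 bytes per stored hash).
nodes = 4e9                                    # ~4-6 * 10^9 nodes, lower bound
links_per_node = 10
entries = nodes * links_per_node               # 4 * 10^10 non-zero entries
bytes_per_hash = 8                             # one 64-bit hash

dst_only = entries * bytes_per_hash            # destination hash per entry
src_dst_pairs = entries * 2 * bytes_per_hash   # (source, destination) pairs

print(dst_only / 1e9, "GB")                    # ~320 GB
print(src_dst_pairs / 1e9, "GB")               # ~640 GB
# A dense 0/1 matrix would instead need (4e9)^2 / 8 bytes ~ 2 * 10^18 B: hopeless.
```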

Page 4: Antonio Gulli

A Picture of the Web Graph

[DelCorso, Gulli, Romani .. WAW04]

[Figure: adjacency-matrix plot of the Web graph, with axes i and j.]

Q: sparse or not sparse?

21 million pages, 150 million links

Page 5: Antonio Gulli

A Picture of the Web Graph

[BRODER, www9]

Page 6: Antonio Gulli

A Picture of the Web Graph

[Figure: host graph, Stanford and Berkeley hosts.]

[Haveliwala, www12]

Q: what kind of sorting is this?

Page 7: Antonio Gulli

A Picture of the Web Graph

[Raghavan, www9] READ IT!!!!!

Page 8: Antonio Gulli

The Web’s Characteristics

Size
- Over a billion pages available
- 5-10 KB per page => tens of terabytes
- Size doubles every 2 years

Change
- 23% of pages change daily
- Half-life of about 10 days
- Bowtie structure

Page 9: Antonio Gulli

Search Engine Structure

[Architecture diagram. Components: Crawl Control, Crawlers, Ranking, Indexer, Page Repository, Query Engine, Collection Analysis; Indexes: Text, Structure, Utility; inputs/outputs: Queries, Results.]

Page 10: Antonio Gulli

Bubble ???

Page 11: Antonio Gulli

Link Extractor:
while(<there are pages with links still to extract>){
  <take a page p from the page repository>
  <extract the links contained in the a href tags>
  <extract the links contained in javascript>
  <extract …>
  <extract the links contained in framesets>
  <insert the extracted links into the priority queue, each with a priority
   depending on the chosen policy and: 1) consistent with the applied filters
   2) applying the normalization operations>
  <mark p as a page whose links have been extracted>
}

Downloaders:
while(<there are URLs assigned by the crawler managers>){
  <extract the URLs from the assignment queue>
  <download the pages p_i associated with the URLs from the network>
  <send the p_i to the page repository>
}

Crawler Manager:
<extract a bunch of URLs from the "priority queue" in order>
while(<there are URLs assigned by the crawler managers>){
  <extract the URLs and assign them to S>
  foreach u in S {
    if ( ( u not in "Already Seen Pages" ) ||
         ( u in "Already Seen Pages" && <the page on the Web server is more recent> )
         && ( <u is a URL accepted by the site's robots.txt> ) ) {
      <resolve u with respect to DNS>
      <send u to the downloaders' queue>
    }
  }
}

Crawler “cycle of life”
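As a concrete, drastically simplified illustration of this cycle, here is a minimal single-threaded sketch in Python. The seed URL, politeness delay, page limit, and naive priority are illustrative assumptions, not part of the slides; a real crawler runs the manager, downloaders, and link extractors as separate parallel components, as in the architecture on the next slide.

```python
# Minimal single-threaded sketch of the crawl cycle:
# priority queue -> robots.txt / already-seen checks -> download -> link extraction.
import heapq
import time
import urllib.robotparser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, cache={}):
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in cache:
        rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None
        cache[host] = rp
    rp = cache[host]
    return rp is None or rp.can_fetch("*", url)

def crawl(seed, max_pages=20, delay=1.0):
    frontier = [(0, seed)]          # priority queue of (priority, url)
    seen = {seed}                   # "already seen pages"
    repository = {}                 # page repository: url -> html
    while frontier and len(repository) < max_pages:
        _, url = heapq.heappop(frontier)
        if not allowed_by_robots(url):
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url      # normalization
            if link.startswith("http") and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (len(repository), link))  # naive priority
        time.sleep(delay)           # politeness: limit server load
    return repository
```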

Page 12: Antonio Gulli

Architecture of Incremental Crawler

[Architecture diagram; the legend ("LEGENDA") distinguishes data structures ("Strutture Dati") from software modules ("Moduli Software"). SPIDERS / INDEXERS.]

Software modules: Parallel Crawler Managers, Parallel Downloaders, DNS Resolvers, Parallel Link Extractors, Parsers, Page Analysis, Indexer.

Data structures: DNS Cache, Already Seen Pages, Robots.txt Cache, Priority Queue, Distributed Page Repository.

The downloading modules fetch pages from the INTERNET.

[Gulli, 98]

Page 13: Antonio Gulli

Crawling Issues

How to crawl?
- Quality: "Best" pages first
- Efficiency: Avoid duplication (or near-duplication)
- Etiquette: Robots.txt, server load concerns (minimize load)

How much to crawl? How much to index?
- Coverage: How big is the Web? How much do we cover?
- Relative Coverage: How much do competitors have?

How often to crawl?
- Freshness: How much has changed? How much has really changed? (why is this a different question?)

How to parallelize the process?

Page 14: Antonio Gulli

Page Selection
The crawler needs a method for choosing the next page to download. Given a page P, define how "good" that page is. Several metric types:
- Interest driven
- Popularity driven (PageRank, full vs. partial)
- BFS, DFS, Random
- Combined
- Random Walk

Potential quality measures: final indegree, final PageRank.

Page 15: Antonio Gulli

BFS
"… breadth-first search order discovers the highest quality pages during the early stages of the crawl" — 328 million URLs in the testbed.

[Najork 01]

Q: how is this related to SCC, power laws, and the domain hierarchy in a Web graph?

See more when we do PageRank.

Page 16: Antonio Gulli

[Plots: percentage overlap with the best x% by indegree, and with the best x% by PageRank, vs. x% crawled in order O(u). Stanford WebBase (179K pages, 1998) [Cho98]]

Page 17: Antonio Gulli

BFS & Spam (Worst-case scenario)

BFS depth = 2:
- 100 URLs on the queue, including a spam page
- Normal avg outdegree = 10
- Assume the spammer is able to generate dynamic pages with 1000 outlinks

BFS depth = 3: 2000 URLs on the queue; 50% belong to the spammer.

BFS depth = 4: 1.01 million URLs on the queue; 99% belong to the spammer.
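A quick back-of-the-envelope check of these numbers, assuming the 99 honest pages on the depth-2 queue each contribute 10 outlinks while every spammer page contributes 1000:

```python
# Back-of-the-envelope check of the BFS-and-spam numbers above.
normal_pages, spam_pages = 99, 1            # depth 2: 100 URLs, one is spam
normal_out, spam_out = 10, 1000             # outdegrees assumed on the slide

# Depth 3: children of the depth-2 queue.
d3_normal = normal_pages * normal_out       # 990 ordinary URLs
d3_spam = spam_pages * spam_out             # 1000 spammer URLs
print(d3_normal + d3_spam, d3_spam / (d3_normal + d3_spam))   # ~2000, ~50%

# Depth 4: every spammer page again emits 1000 spam links.
d4_normal = d3_normal * normal_out          # 9,900
d4_spam = d3_spam * spam_out                # 1,000,000
print(d4_normal + d4_spam, d4_spam / (d4_normal + d4_spam))   # ~1.01M, ~99%
```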

Page 18: Antonio Gulli

Can you trust words on the page?

Examples from July 2002

auctions.hitsoffice.com/

www.ebay.com/

Pornographic Content

Page 19: Antonio Gulli

A few spam technologies
- Cloaking: serve fake content to the search engine robot. DNS cloaking: switch IP address. Impersonate.
- Doorway pages: pages optimized for a single keyword that re-direct to the real target page.
- Keyword spam: misleading meta-keywords, excessive repetition of a term, fake "anchor text". Hidden text with colors, CSS tricks, etc.
- Link spamming: mutual admiration societies, hidden links, awards. Domain flooding: numerous domains that point or re-direct to a target page.
- Robots: fake click stream, fake query stream, millions of submissions via Add-Url.

[Cloaking diagram: "Is this a SearchEngine spider?" Y → SPAM page, N → Real Doc]

Meta-Keywords = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Page 20: Antonio Gulli

Parallel Crawlers
• The Web is too big to be crawled by a single crawler; the work should be divided
• Independent assignment
  • Each crawler starts with its own set of URLs
  • Follows links without consulting other crawlers
  • Reduces communication overhead
  • Some overlap is unavoidable

Page 21: Antonio Gulli

Parallel Crawlers
• Dynamic assignment
  • A central coordinator divides the web into partitions
  • Crawlers crawl their assigned partition
  • Links to other URLs are given to the central coordinator
• Static assignment
  • The web is partitioned and divided among the crawlers
  • Each crawler only crawls its part of the web

Page 22: Antonio Gulli

URL-Seen Problem
Need to check if a file has been parsed or downloaded before:
- after 20 million pages, we have "seen" over 100 million URLs
- each URL is 50 to 75 bytes on average

Options: compress URLs in main memory, or use disk:
- Bloom Filter (Archive) [we will discuss this later]
- disk access with caching (Mercator, Altavista)
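A minimal sketch of the Bloom-filter option, just to make the memory/accuracy trade-off concrete; the filter size, number of hash functions, and use of MD5 are illustrative choices, not taken from the slides or from the Archive's actual implementation:

```python
# Minimal Bloom filter for the URL-seen test: no false negatives,
# a tunable false-positive rate, far less memory than storing the URLs.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # derive num_hashes bit positions from salted MD5 digests
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://www.example.org/")
print("http://www.example.org/" in seen)    # True
print("http://www.example.org/x" in seen)   # almost certainly False
```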

Page 23: Antonio Gulli

Virtual Documents
P = "whitehouse.org", not yet reached. {P1…Pr} reached, and {P1…Pr} point to P. Insert into the index the anchors' context:

… George Bush, President of the U.S. lives at <a href=http://www.whitehouse.org>WhiteHouse</a> Washington …

[Figure: a Web page and its Virtual Document; index terms such as "bush", "White House".]

Page 24: Antonio Gulli

Focused Crawling
Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics.
- Topics specified by using exemplary documents (not keywords)
- Crawl the most relevant links
- Ignore irrelevant parts
- Leads to significant savings in hardware and network resources

Page 25: Antonio Gulli

Focused Crawling

The relevance of a page is scored with Bayes' rule over the terms it contains:

Pr(H_j | E) = Pr(E | H_j) Pr(H_j) / Σ_{i=1,…,n} Pr(E | H_i) Pr(H_i)

where the quantities of interest are:
Pr[document relevant | term t is present]
Pr[document irrelevant | term t is present]
Pr[term t is present | document is relevant]
Pr[term t is present | document is irrelevant]

Page 26: Antonio Gulli

An example of crawler: Polybot
- crawl of 120 million pages over 19 days
- 161 million HTTP requests
- 16 million robots.txt requests
- 138 million successful non-robots requests
- 17 million HTTP errors (401, 403, 404, etc.)
- 121 million pages retrieved
- slow during the day, fast at night
- peak of about 300 pages/s over a T3 line
- many downtimes due to attacks, crashes, revisions
http://cis.poly.edu/polybot/

[Suel 02]

Page 27: Antonio Gulli

Examples: Open Source

Nutch, also used by Overture
http://www.nutch.org

Heritrix, used by Archive.org
http://archive-crawler.sourceforge.net/index.html

Page 28: Antonio Gulli

Where are we? Spidering Technologies; Web Graph (a glimpse).

Now, some funny math on two crawling issues:
1) Hash for robust load balance
2) Mirror Detection

Page 29: Antonio Gulli

Consistent Hashing
A mathematical tool for: spidering, Web caches, P2P, routers, load balance, distributed file systems.
- Items and servers ← ID (hash function of m bits)
- Node identifiers mapped onto a 2^m ring
- Item k assigned to the first server with ID ≥ k
- What if a downloader goes down?
- What if a new downloader appears?
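A small sketch of the idea applied to assigning URLs to downloaders; the ring size (2^32), the hash function (MD5), and the names are illustrative choices, not prescribed by the slide:

```python
# Consistent hashing: items and servers share a 2^m ring; an item goes to the
# first server clockwise whose ID is >= the item's hash.
import bisect
import hashlib

M = 32                                     # ring of size 2^M

def ring_id(key, m=M):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % (2 ** m)

class ConsistentHash:
    def __init__(self, servers):
        self.ring = sorted((ring_id(s), s) for s in servers)

    def server_for(self, item):
        # first server clockwise with ID >= hash(item), wrapping around
        ids = [sid for sid, _ in self.ring]
        idx = bisect.bisect_left(ids, ring_id(item)) % len(self.ring)
        return self.ring[idx][1]

    def add(self, server):                 # a new downloader appears
        bisect.insort(self.ring, (ring_id(server), server))

    def remove(self, server):              # a downloader goes down
        self.ring = [(sid, s) for sid, s in self.ring if s != server]

ch = ConsistentHash(["d1", "d2", "d3"])
print(ch.server_for("http://www.example.org/page.html"))
ch.remove("d2")                            # only d2's items are reassigned
print(ch.server_for("http://www.example.org/page.html"))
```

The point of the construction is that when a downloader joins or leaves, only the items between that node and its ring predecessor move; the rest of the assignment is untouched.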

Page 30: Antonio Gulli

Duplicate/Near-Duplicate Detection

Duplication: exact match with fingerprints.
Near-duplication: approximate match.

Overview
- Compute syntactic similarity with an edit-distance measure
- Use a similarity threshold to detect near-duplicates
  - E.g., similarity > 80% => documents are "near duplicates"
  - Not transitive, though sometimes used transitively

Page 31: Antonio Gulli

Computing Near Similarity

Features:
- Segments of a document (natural or artificial breakpoints) [Brin95]
- Shingles (word N-grams) [Brin95, Brod98]: "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is

Similarity measure:
- TFIDF [Shiv95]
- Set intersection [Brod98] (specifically, Size_of_Intersection / Size_of_Union)
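A small sketch of word shingling and the set-intersection (Jaccard) measure; the 4-word shingle length matches the slide's example, while the second sample sentence is only illustrative:

```python
# Word 4-shingles and the Jaccard (intersection / union) similarity measure.
def shingles(text, k=4):
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose in the garden")
print(sorted(s1))        # the 3 distinct shingles from the slide's example
print(jaccard(s1, s2))   # similarity in [0, 1]
```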

Page 32: Antonio Gulli

Shingles + Set Intersection
Computing the exact set intersection of shingles between all pairs of documents is expensive and infeasible. Approximate it using a cleverly chosen subset of shingles from each document (a sketch).

Page 33: Antonio Gulli

Shingles + Set Intersection
Estimate size_of_intersection / size_of_union based on a short sketch ([Brod97, Brod98]):
- Create a "sketch vector" (e.g., of size 200) for each document
- Documents which share more than t (say 80%) corresponding vector elements are near-duplicates
- For doc D, sketch[i] is computed as follows:
  - Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
  - Let π_i be a specific random permutation on 0..2^m
  - Pick sketch[i] := MIN π_i(f(s)) over all shingles s in D

A minimal implementation sketch follows below.
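The following sketch mirrors the recipe above; since the slide does not fix how the permutations are produced, here each π_i is simulated by salting the fingerprint with the permutation index, and the shingle sets are illustrative:

```python
# Min-hash sketch: one minimum per "random permutation", 200 permutations
# as in the slide; sketch agreement estimates intersection / union.
import hashlib

M = 64                                       # fingerprint space 0..2^M

def fingerprint(shingle, salt=""):
    # hash the shingle (plus a per-permutation salt) into 0..2^M
    h = hashlib.sha1((salt + shingle).encode()).digest()
    return int.from_bytes(h[:8], "big") % (2 ** M)

def sketch(shingle_set, num_perms=200):
    # sketch[i] = min over all shingles of the i-th "permuted" fingerprint
    return [min(fingerprint(s, salt=str(i)) for s in shingle_set)
            for i in range(num_perms)]

def estimated_resemblance(sk1, sk2):
    # fraction of positions where the two sketches agree
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

doc1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
doc2 = {"a_rose_is_a", "rose_is_a_rose", "rose_in_the_garden"}
print(estimated_resemblance(sketch(doc1), sketch(doc2)))  # ~ Jaccard(doc1, doc2)
```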

Page 34: Antonio Gulli

Computing Sketch[i] for Doc1
[Diagram: the shingles of Document 1 mapped onto the number line 0..2^64.]
- Start with 64-bit shingles
- Permute on the number line with π_i
- Pick the min value

Page 35: Antonio Gulli

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram: the shingles of Document 1 and Document 2 mapped onto the number line 0..2^64; A and B mark the two minima.]
Are these equal?
Test for 200 random permutations: π_1, π_2, …, π_200

Page 36: Antonio Gulli

However…
[Diagram: Document 1 and Document 2 on the number line 0..2^64; A and B mark the minima.]
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).
This happens with probability: Size_of_intersection / Size_of_union

Page 37: Antonio Gulli

Mirror Detection
Mirroring is the systematic replication of web pages across hosts. It is the single largest cause of duplication on the web.
Host1/α and Host2/β are mirrors iff for all (or most) paths p, when http://Host1/α/p exists, http://Host2/β/p exists as well, with identical (or near-identical) content, and vice versa.

Page 38: Antonio Gulli

Mirror Detection example
http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/

Page 39: Antonio Gulli

Motivation: why detect mirrors?
- Smart crawling: fetch from the fastest or freshest server; avoid duplication
- Better connectivity analysis: combine inlinks; avoid double-counting outlinks
- Redundancy in result listings: "If that fails you can try: <mirror>/samepath"
- Proxy caching

Page 40: Antonio Gulli

Bottom-Up Mirror Detection [Cho00]

Maintain clusters of subgraphs:
- Initialize clusters of trivial subgraphs; group near-duplicate single documents into a cluster
- Subsequent passes: merge clusters of the same cardinality and corresponding linkage
- Avoid decreasing cluster cardinality

To detect mirrors we need:
- Adequate path overlap
- Contents of corresponding pages within a small time range

Page 41: Antonio Gulli

Can we use URLs to find mirrors?

[Diagram: two mirrored site trees, www.synthesis.org and synthesis.stanford.edu, each with pages a, b, c, d.]

www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html

synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
synthesis.stanford.edu/Docs/myr5/assessment
synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html

Page 42: Antonio Gulli

Where are we?
- Spidering
- Web Graph
- Some nice mathematical tools
- Many other funny algorithms for crawling issues…

Now, a glimpse of Google (thanks to Junghoo Cho, 3rd founder..)

Page 43: Antonio Gulli

Google: Scale

Number of pages indexed: 3B in November 2002

Index refresh interval: once per month ~ 1200 pages/sec

Number of queries per day: 200M in April 2003 ~ 2000 queries/sec

Runs on commodity Intel-Linux boxes

[Cho, 02]

Page 44: Antonio Gulli

Google: Other Statistics
- Average page size: 10 KB
- Average query size: 40 B
- Average result size: 5 KB
- Average number of links per page: 10
- Total raw HTML data size: 3B pages × 10 KB = 30 TB!
- Inverted index roughly the same size as the raw corpus: 30 TB for the index itself
- With appropriate compression, 3:1 → 20 TB of data residing on disk (and in memory!!!)

Page 45: Antonio Gulli

Google: Data Size and Crawling
An efficient crawl is very important:
- 1 page/sec → 1200 machines just for crawling
- Parallelization through a thread/event queue is necessary
- Complex crawling algorithm -- No, No!

A well-optimized crawler:
- ~ 100 pages/sec (10 ms/page)
- ~ 12 machines for crawling

Bandwidth consumption:
- 1200 × 10 KB × 8 bits ~ 100 Mbps
- One dedicated OC3 line (155 Mbps) for crawling ~ $400,000 per year
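These figures follow from simple arithmetic; a quick sanity check with the values taken straight from the slide:

```python
# Back-of-the-envelope check of the crawl figures above.
pages_per_sec = 1200            # rate needed to refresh 3B pages once per month
page_size_kb = 10
mbps = pages_per_sec * page_size_kb * 8 / 1000   # KB -> kilobits -> Mbps
print(mbps)                     # ~96 Mbps, i.e. roughly 100 Mbps

machines_slow = 1200 / 1        # at 1 page/sec per machine
machines_fast = 1200 / 100      # at 100 pages/sec per machine
print(machines_slow, machines_fast)              # 1200 vs 12 machines
```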

Page 46: Antonio Gulli

Google: Data Size, Query Processing
- Index size: 10 TB → 100 disks
- Typically less than 5 disks per machine
- Potentially a 20-machine cluster to answer a query
- If one machine goes down, the cluster goes down
- A two-tier index structure can be helpful:
  - Tier 1: popular (high-PageRank) page index
  - Tier 2: less popular page index
  - Most queries can be answered by the tier-1 cluster (with fewer machines)

Page 47: Antonio Gulli

Google: Implication of Query Load

2000 queries/sec
- Rule of thumb: 1 query/sec per CPU (depends on the number of disks, memory size, etc.)
- ~ 2000 machines just to answer queries
- 5 KB per answer page: 2000 × 5 KB × 8 bits ~ 80 Mbps
- Half a dedicated OC3 line (155 Mbps) ~ $300,000

Page 48: Antonio Gulli

Google: Hardware

50,000-machine Intel-Linux cluster
- Assuming 99.9% uptime (8 hours of downtime per year), 50 machines are always down: a nightmare for system administrators
- Assuming 3-year hardware replacement: set up, replace and dump 50 machines every day
- Heterogeneity is unavoidable

Page 49: Antonio Gulli
Page 50: Antonio Gulli

Shingles computation (for Web Clustering)

s1 = a_rose_is_a_rose_in_the = w1 w2 w3 w4 w5 w6 w7
s2 = rose_is_a_rose_in_the_garden = w1 w2 w3 w4 w5 w6 w7

0/1 representation (using ASCII codes). Assumption: a word is ~8 bytes → length(si) = 7 × 8 bytes = 56 bytes = 448 bits. This represents S, a polynomial with coefficients in 0/1 (Z2).

Rabin fingerprint: map the 448 bits into a K = 40-bit space, with low collision probability. Generate an irreducible polynomial P of degree K-1 (see Galois fields over Z2). F(S) = S mod P.

See "Some applications of Rabin's fingerprinting method" (1993), p. 2.
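A small sketch of the polynomial reduction in GF(2), the carry-less "mod" at the heart of the Rabin fingerprint. The tiny 5-bit modulus x^4 + x + 1 matches the toy example on the later slides; a real fingerprint would use an irreducible polynomial of much higher degree (around the K = 40 bits suggested here), and the helper names are only illustrative:

```python
# Rabin-style fingerprint over GF(2): the message is a polynomial with 0/1
# coefficients (a bit string), reduced modulo an irreducible polynomial P.

def gf2_mod(value, poly):
    """Reduce `value` modulo `poly` (both as integers) using
    carry-less (XOR) long division over GF(2)."""
    plen = poly.bit_length()
    while value.bit_length() >= plen:
        shift = value.bit_length() - plen
        value ^= poly << shift        # "subtract" (= XOR) a shifted copy of P
    return value

def fingerprint(text, poly=0b10011):  # toy modulus x^4 + x + 1
    bits = int.from_bytes(text.encode("ascii"), "big")   # 0/1 representation
    return gf2_mod(bits, poly)

print(bin(fingerprint("a_rose_is_a_rose_in_the")))
print(bin(fingerprint("rose_is_a_rose_in_the_garden")))
```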
Page 51: Antonio Gulli

Shingles computation (an example)

No explicit random permutation: it is better to work forcing length(si) = K*z, with z in N (for instance 480 bits). Induce a "random permutation" by shifting of 480-448 = 32 bit positions.

Simple example (one-char words): S1 = a_b_c_d, S2 = b_c_d_e (a is 97 in ASCII)

S1 (0/1) = 01100001 01100010 01100011 01100100 = 32 bits
S2 (0/1) = 01100010 01100011 01100100 01100101 = 32 bits

Assume K = 5 bits; pad to a multiple of 5:

S1 (0/1) = 01100001 01100010 01100011 01100100 000 = 35 bits
S2 (0/1) = 01100010 01100011 01100100 01100101 000 = 35 bits

Page 52: Antonio Gulli

Shingles computation (an example)

We choose x^4 + x + 1 = 10011.

FOR S1 (0/1):

01100 mod 10011 = 12 mod 19 = 12 = 01100

(01100 + 00101) mod 10011 = (12 + 5) mod 19 = 17 = 10001

(10001 + 10001) mod 10011 = (17 + 17) mod 19 = 15 = 01111

(01111 + 00110) mod 10011 = (15 + 6) mod 19 = 2 = 00010

(00010 + 00110) mod 10011 = (2 + 6) mod 19 = 8 = 01000

(01000 + 11001) mod 10011 = (8 + 25) mod 19 = 14 = 01110

(01110 + 00000) mod 10011 = 01110 FINGERPRINT

FOR S2 (0/1): proceed in the same way.

We can do it in O(1). EX: how????

Page 53: Antonio Gulli

Shingle Computation (using XOR and other operations)
- Since coefficients are 0 or 1, we can represent any such polynomial as a bit string
- Addition becomes XOR of these bit strings
- Multiplication is shift & XOR
- Modulo reduction is done by repeatedly substituting the highest power with the remainder of the irreducible polynomial (also shift & XOR)

Page 54: Antonio Gulli

Shingles computation (final step)

Given the set of fingerprints, take 1 out of m fingerprints (instead of the minimum). This is the set of fingerprints for a given document D.

Repeat the generation for each document D1, …, Dn.

We obtain the set of tuples T = {<fingerprint, Di>, …}. Sort T (external-memory sort for Syntactic Web Clustering). Linear scan counting adjacent tuples: all document pairs above a threshold are near-duplicates (cluster). A minimal sketch of this step follows below.
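A small in-memory sketch of the sort-and-scan step; the sample fingerprints, the value of m, the "1 out of m" selection rule (modular sampling), and the threshold are all illustrative assumptions, and a real system would use an external-memory sort:

```python
# Final clustering step: keep 1 in m fingerprints per document, sort the
# <fingerprint, doc> tuples, count adjacent tuples that share a fingerprint,
# and declare pairs above a threshold near-duplicates.
from collections import Counter
from itertools import combinations, groupby

def kept_fingerprints(fingerprints, m=2):
    return {fp for fp in fingerprints if fp % m == 0}   # "1 out of m"

docs = {
    "D1": kept_fingerprints({11, 24, 36, 58, 71}),
    "D2": kept_fingerprints({24, 36, 58, 90, 13}),
    "D3": kept_fingerprints({5, 17, 44, 63, 81}),
}

# Build and sort the tuple set T = {<fingerprint, doc>, ...}
T = sorted((fp, d) for d, fps in docs.items() for fp in fps)

# Linear scan: tuples with the same fingerprint are adjacent after sorting.
shared = Counter()
for fp, group in groupby(T, key=lambda t: t[0]):
    ds = [d for _, d in group]
    for pair in combinations(sorted(ds), 2):
        shared[pair] += 1

threshold = 2
clusters = [pair for pair, count in shared.items() if count >= threshold]
print(clusters)    # e.g. [('D1', 'D2')]
```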

Page 55: Antonio Gulli

Shingles computation (an example)

S1 (0/1) = 01100001 01100010 01100011 01100100 000
S2 (0/1) = 01100010 01100011 01100100 01100101 000

S_2 = (S_1 - 01100001 × 2^27) × 2^8 + 01100101 000

Precompute 2^27 mod 10011 = x
Precompute 2^8 mod 10011 = y

S_2 mod 10011 = [ (S_1 mod 10011 - (01100001 mod 10011) × x) × y + 01100101 000 ] mod 10011
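A runnable check of this O(1) rolling update, done with integer arithmetic modulo 19 to mirror the slide's toy example (a production Rabin fingerprint applies the same rolling trick with GF(2) polynomial arithmetic); the variable names are illustrative:

```python
# O(1) rolling update of the fingerprint when the shingle window slides by
# one word: drop the leading byte, shift, append the new bits, all mod P.
P = 0b10011                        # 19, the toy modulus x^4 + x + 1 as an integer

# S1 and S2 from the slide: "abcd" and "bcde", each padded with three 0 bits.
S1 = int.from_bytes(b"abcd", "big") << 3
S2 = int.from_bytes(b"bcde", "big") << 3

x = pow(2, 27, P)                  # precomputed 2^27 mod P
y = pow(2, 8, P)                   # precomputed 2^8 mod P

f1 = S1 % P                        # fingerprint of S1, computed directly
lead = ord("a")                    # 01100001, the byte sliding out of the window
tail = ord("e") << 3               # 01100101 000, the bits sliding in

f2 = ((f1 - (lead % P) * x) * y + tail) % P   # constant-time update
print(f2 == S2 % P)                # True: matches the fingerprint computed directly
```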