December 20, 2002 CUL Metadata WG Meeting 1
Focused Crawling and Collection Synthesis
Donna Bergmark
Cornell Information Systems
Outline
• Crawlers
• Collection Synthesis
• Focused Crawling
• Some Results
• Student Project (Fall 2002)
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
Crawlers – some background
• Resource discovery
• Crawlers and internet history
• Crawling and crawlers
• Mercator
Resource Discovery
• Finding info on the Web
– Surfing (random strategy, goal is serendipity)
– Searching (inverted indices; specific info)
– Crawling (“all” the info)
• Uses for crawling
– Find stuff
– Gather stuff
– Check stuff
Crawlers and internet history
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; archie
• 1994 (early) – first crawlers
• 1996 – search engines abound
• 1998 – focused crawling
• 1999 – web graph studies
• 2002 – use for digital libraries
Crawling and Crawlers
• Web overlays the internet
• A crawl overlays the web
[Figure: diagram of a crawl over the web; label: seed]
Crawler Issues
• The web is so big
• Visit Order
• The URL itself
• Politeness
• Robot Traps
• The hidden web
• System Considerations
Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbid access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
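A minimal sketch of how a polite crawler might honor the exclusion file, using Python's standard urllib.robotparser module. The rules shown, the host name any-server, and the agent name "example-bot" are illustrative assumptions, not taken from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def allowed(url: str, agent: str = "example-bot") -> bool:
    """Return True if the assumed rules above permit fetching url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse in memory, no network access
    return parser.can_fetch(agent, url)

print(allowed("http://any-server/index.html"))
print(allowed("http://any-server/cgi-bin/search"))
```

Because adherence is voluntary, the check only helps if the crawler calls it before every fetch.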
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
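Two common defenses against such traps can be sketched as follows: a visited set breaks cycles, and a depth limit bounds endless chains of generated links. The in-memory graph and the limit of 3 are assumptions made for illustration; a production crawler such as Mercator may handle traps differently.

```python
from collections import deque

def crawl(links: dict[str, list[str]], seed: str, max_depth: int = 3) -> list[str]:
    """Breadth-first walk over an in-memory link graph, returning visit order."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in visited:  # skip pages already queued or visited
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Hypothetical graph: a and b link to each other (a cycle), and
# /cal/N always links to /cal/N+1 (an effectively endless chain).
graph = {"a": ["b", "/cal/0"], "b": ["a"]}
graph.update({f"/cal/{i}": [f"/cal/{i + 1}"] for i in range(100)})
print(crawl(graph, "a"))
```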
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web
System Issues
• Crawlers are complicated systems
• Efficiency is of utmost importance
• Crawlers are demanding of system and network resources
Mercator Features
• Written in Java
• One file configures a crawl
• Can add your own code
– Extend one or more of Mercator’s base classes
– Add totally new classes called by your own
• Industrial-strength crawler:
– uses its own DNS and java.net package
Collection Synthesis
• The NSDL
– National Science Digital Library
– Educational materials for K-thru-grave
– A collection of digital collections
• Collection (automatically derived)
– 20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall
Crawler is the Key
• A general search engine is good for precise results, few in number
• A search engine must cover all topics, not just scientific
• For automatic collection assembly, a Web crawler is needed
• A focused crawler is the key
Focused Crawling
[Figure: two crawl trees from root R with numbered nodes: “Breadth-first crawl” (nodes 1-7) and “Focused crawl” (nodes 1-5, two nodes marked X)]
Collections and Clusters
• Traditional – document universe is divided into clusters, or collections
• Each collection represented by its centroid
• Web – size of document universe is infinite
• Agglomerative clustering is used instead
• Two aspects:
– Collection descriptor
– Rule for when items belong to that Collection
[Figure: two panels labeled Q = 0.2 and Q = 0.6]
The Setup
A virtual collection of items about Chebyshev Polynomials
Adding a Centroid
An empty collection of items about Chebyshev Polynomials
Document Vector Space
• Classic information retrieval technique
• Each word is a dimension in N-space
• Each document is a vector in N-space
Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
• Normalize the weights
Both the “centroid” and the downloaded document are term vectors
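Two term vectors of this kind can be compared with a cosine correlation. The following is a minimal sketch, assuming whitespace tokenization, raw term counts as weights, and normalization to unit length; the slides do not say which tokenizer or normalization was actually used, and the sample strings are made up.

```python
import math
from collections import Counter

def term_vector(text: str) -> dict[str, float]:
    """Map each word to its count, scaled so the vector has unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def correlation(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine correlation of two unit-length term vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical centroid text and downloaded page text.
centroid = term_vector("chebyshev polynomials orthogonal polynomials")
page = term_vector("chebyshev polynomials are orthogonal")
print(round(correlation(centroid, page), 3))
```

Storing vectors as sparse dictionaries keeps the cost proportional to the number of distinct terms in a page rather than to N.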
Agglomerate
A collection with 3 items about Chebyshev Polynomials
Where does the Centroid come from?
“Chebyshev Polynomials”
A really good centroid for a collection about Chebyshev Polynomials
Building a Centroid
1. Google(“Chebyshev Polynomials”) → {url1 … url-n}
2. Let H be a hash (k,v) where k = word, v = freq
3. For each url in {url1 … url-n} do
     D ← download(url)
     V ← term vector(D)
     For each term t in V do
       If t not in H, add it with value 0
       H(t)++
4. Compute tf-idf weights. C ← top 20 terms.
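The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the search and download steps are replaced by a list of already-fetched document texts, the function name build_centroid is hypothetical, and the idf table is passed in by the caller, with unknown terms defaulting to an idf of 1.0.

```python
from collections import Counter

def build_centroid(docs: list[str], idf: dict[str, float], top_n: int = 20) -> dict[str, float]:
    """Sum term frequencies over docs, weight them by idf, keep the top_n terms."""
    freq = Counter()                       # H: word -> total frequency
    for text in docs:                      # stands in for download(url)
        freq.update(text.lower().split())  # add this document's term counts
    # tf-idf weight; terms missing from the idf table get 1.0 (an assumption)
    weights = {t: f * idf.get(t, 1.0) for t, f in freq.items()}
    # highest weight first, ties broken alphabetically for a stable result
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return dict(top)

# Hypothetical inputs: two tiny "pages" and an idf table that zeroes stop words.
docs = ["the chebyshev polynomials", "the roots of chebyshev polynomials"]
idf = {"the": 0.0, "of": 0.0}
print(build_centroid(docs, idf, top_n=3))
```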
Dictionary
• Given centroids C1, C2, C3 …
• Dictionary is C1 + C2 + C3 …
– Terms are union of terms in Ci
– Term Frequencies are total frequency in Ci
– Document Frequency is how many C’s have t
– Term IDF is as from Berkeley
• Dictionary is 300-500 terms
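A small sketch of how centroids might be merged into such a dictionary. The function name build_dictionary, the sample centroids, and the representation of a centroid as a term-to-frequency mapping are assumptions for illustration; the IDF values are left out because the slides only say they come from Berkeley.

```python
from collections import Counter

def build_dictionary(centroids: list[dict[str, int]]) -> dict[str, dict[str, int]]:
    """Merge centroids into one dictionary with per-term tf and df."""
    tf = Counter()  # total frequency of each term over all centroids
    df = Counter()  # number of centroids that contain each term
    for c in centroids:
        tf.update(c)         # adds the frequencies term by term
        df.update(c.keys())  # counts each centroid once per term
    return {t: {"tf": tf[t], "df": df[t]} for t in sorted(tf)}

# Hypothetical centroids for two topics.
c1 = {"chebyshev": 5, "polynomial": 3}
c2 = {"polynomial": 2, "fourier": 4}
print(build_dictionary([c1, c2]))
```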
Focused Crawling
• Recall the cartoon for a focused crawl:
[Figure: the focused crawl cartoon, with numbered nodes, root R, and two nodes marked X]
• A simple way to do it is with 2 “knobs”
Focusing the Crawl
• Threshold: page is on-topic if correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than the cutoff
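The two knobs can be sketched on a small in-memory graph. How distance is counted is an assumption here: an on-topic page has distance 0, and each off-topic page is one step further than its parent. The graph, the precomputed correlations, and the function name focused_crawl are all hypothetical; a real crawl would fetch pages and score them against the centroids.

```python
from collections import deque

def focused_crawl(links: dict[str, list[str]], corr: dict[str, float],
                  seed: str, threshold: float, cutoff: int) -> list[str]:
    """Return the on-topic pages reached from seed, in visit order."""
    collected = []
    seen = {seed}
    queue = deque([(seed, 0)])  # (page, distance inherited from its parent)
    while queue:
        page, parent_dist = queue.popleft()
        if corr[page] >= threshold:  # knob 1: on-topic test
            dist = 0                 # an on-topic page resets the distance
            collected.append(page)
        else:
            dist = parent_dist + 1   # one more off-topic step
        if dist < cutoff:            # knob 2: expand only pages near a nugget
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist))
    return collected

# Hypothetical link graph and correlations.
links = {"R": ["a", "b"], "a": ["c"], "b": ["d"], "c": ["e"]}
corr = {"R": 0.9, "a": 0.1, "b": 0.5, "c": 0.2, "d": 0.6, "e": 0.8}
print(focused_crawl(links, corr, "R", threshold=0.3, cutoff=1))
print(focused_crawl(links, corr, "R", threshold=0.3, cutoff=3))
```

In this toy graph, page e sits behind two off-topic pages, so it is only reached when the cutoff is large enough to let the crawl pass through them.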
Illustration
[Figure: example crawl tree with nodes numbered 1-7; labels: Cutoff = 1, Corr >= threshold]
Min-avg-max correlation vs. crawl length
[Figure: correlation (y-axis, 0 to 0.8) vs. No. documents downloaded (x-axis, 0 to 120000); series: Maximum, Average, Minimum; annotations: Closest, Furthest]
Collection “Evaluation”
• Assume higher correlations are good
• With human relevance assessments, one can also compute a “precision” curve
• Precision P(n), after considering the n most highly ranked items, is the number of relevant items among them divided by n.
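As a small worked example of the precision curve, assume the relevance judgments are given as a list of booleans in rank order. The judgments below are made up for illustration.

```python
def precision_curve(relevant: list[bool]) -> list[float]:
    """P(n) for n = 1..len(relevant), given human judgments in rank order."""
    curve, hits = [], 0
    for n, is_relevant in enumerate(relevant, start=1):
        hits += is_relevant      # True counts as 1, False as 0
        curve.append(hits / n)   # relevant items so far, divided by n
    return curve

# Hypothetical judgments for the five highest-ranked items.
print(precision_curve([True, True, False, True, False]))
```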
[Figure: Cutoff = 0, Threshold = 0.3]
Precision vs. Rank
[Figure: Precision (y-axis, 0 to 1.2) vs. Rank (x-axis, 0 to 60); series: Crawling]
Tunneling with Cutoff
• Nugget – dud – dud – … – dud – nugget
Notation: 0 – X – X – … – X – 0
• Fixed cutoff: 0 – X1 – X2 – … – Xc
• Adaptive cutoff: 0 – X1 – X2 – … – X?
Statistics Collected
• 500,000 documents
• Number of seeds: 4
• Path data for all but seeds
• 6620 completed paths (0-x…x-0)
• 100,000s of incomplete paths (0-x…x..)
Nuggets that are x steps from a nugget
[Figure: # nuggets (y-axis, 0 to 1000) vs. X, the number of links from a nugget (x-axis, 0 to 14)]
Nuggets that are x steps from a seed and/or a nugget
[Figure: # nuggets (y-axis, 0 to 1200) vs. X, the number of links from a nugget (x-axis, 0 to 14); series label: from seeds]
Better parents have better children.
[Figure: Number of nodes (y-axis, 0 to 0.25) vs. Correlation bracket (x-axis, 1 to 17); series: General Population, children of .45-.5 nodes]
Using the Empirical Observations
• Use the path history
• Use the page quality - cosine correlation
• Current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this is a nugget, otherwise:
1 or (1 − corr) × exp(2 × parent’s distance / cutoff)
Results
• Details in the ECDL paper
• Smaller frontier → more docs/second
• More documents downloaded in same time
• Higher-scoring documents were downloaded
• Cutoff of 20 averaged 7 steps at the cutoff
Fall 2002 Student Project
[Figure: pipeline diagram; labels: Query, Mercator, Centroid, Collection Description, Term vectors, Centroids, Dictionary, Collection URLs, Chebyshev P.s, HTML]
Conclusion
• We’ve covered crawling – history, technology, use
• Focused crawling with tunneling
• Adaptive cutoff with tunneling
• We have a good experimental setup for exploring automatic collection synthesis