1 webbase and stanford digital library project junghoo cho stanford university
Post on 21-Dec-2015
220 views
TRANSCRIPT
1
WebBase and Stanford Digital Library Project
Junghoo Cho
Stanford University
Economic Weaknesses
Information Loss
Information Overload
Service Heterogeneity
Physical Barriers
• Interoperability
• Value Filtering
• Mobile Access
• IP Infrastructure
• Archival Repository
Technologies forTechnologies forDigital LibrariesDigital Libraries
2
RepositoryMulticastEngine
WWW
FeatureRepository
RetrievalIndexes
Webbase API
Web CrawlerWeb
CrawlerWeb CrawlerWeb Crawlers
Client Client Client Client
Client ClientWebBase Architecture
3
4
What is a Crawler?
web
init
get next url
get page
extract urls
initial urls
to visit urls
visited urls
web pages
5
Crawling Issues (1)
Load at visited web sites Load at crawlers Scope of the crawl
6
Crawling Issues (2)
Maintaining pages “fresh”– How does the web change over time?
– What does fresh page/database mean?
– How can we increase “freshness”?
7
Outline
Web evolution experiments Freshness metrics Crawling policy
8
Web Evolution Experiment
How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to
change?
9
Experimental Setup
February 17 to June 24, 1999 270 sites visited (with permission)
– identified 400 sites with highest “page rank”
– contacted administrators
720,000 pages collected– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
10
Page Change
Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days
Is this correct?
1 day
changes
page visited
11
Average Change Intervalfr
actio
n of
pag
es
12
Change Interval - By Domainfr
actio
n of
pag
es
13
Modeling Web Evolution
Poisson process with rate T is time to next event fT(t) = e-t (t > 0)
14
Change Interval of Pages
for pages thatchange every
10 days on average
interval in days
frac
tion
of c
hang
esw
ith g
iven
inte
rval
Poisson model
15
Change Metrics
Freshness– Freshness of page ei at time t is
F( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise
eiei
......
web database
– Freshness of the database S at time t is
F( S ; t ) = F( ei ; t )N
1 N
i=1
16
Change Metrics
Age– Age of page ei at time t is
A( ei ; t ) = 0 if ei is up-to-date at time t t - (modification ei time) otherwise
eiei
......
web database– Age of the database S at time t is
A( S ; t ) = A( ei ; t )N
1 N
i=1
17
Change Metrics
F(ei)
A(ei)
0
0
1
time
time
update refresh
F( S ) = lim F(S ; t ) dtt1 t
0t
F( ei ) = lim F(ei ; t ) dtt1 t
0t
Time averages:
similar for age...
18
Trick Question
Two page database e1 changes daily e2 changes once a week Can visit pages once a week How should we visit pages?
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– ?
e1
e2
e1
e2
webdatabase
19
Proportional Often Not Good!
Visit fast changing e1 get 1/2 day of freshness
Visit slow changing e2 get 1/2 week of freshness
Visiting e2 is a better deal!
20
Selecting Optimal Refresh Frequency
• Analysis is complex• Shape of curve is the same in all cases• Holds for any distribution g( )
21
Optimal Refresh Frequency for Age
• Analysis is also complex• Shape of curve is the same in all cases• Holds for any distribution g( )
22
Comparing Policies
Freshness AgeProportional 0.12 400 daysUniform 0.57 5.6 daysOptimal 0.62 4.3 days
Based on Statistics from experimentand revisit frequency of every month
23
Summary
Maintaining the collection fresh:– Web evolution experiment
– Change metrics
– Optimal policy
Intuitive policy does not always perform well– Should be careful in deciding revisit policy