1 webbase and stanford digital library project junghoo cho stanford university

23
1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

Post on 21-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

1

WebBase and Stanford Digital Library Project

Junghoo Cho

Stanford University

Page 2: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

Technologies forTechnologies forDigital LibrariesDigital Libraries

2

Page 3: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

RepositoryMulticastEngine

WWW

FeatureRepository

RetrievalIndexes

Webbase API

Web CrawlerWeb

CrawlerWeb CrawlerWeb Crawlers

Client Client Client Client

Client ClientWebBase Architecture

3

Page 4: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

4

What is a Crawler?

web

init

get next url

get page

extract urls

initial urls

to visit urls

visited urls

web pages

Page 5: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

5

Crawling Issues (1)

Load at visited web sites Load at crawlers Scope of the crawl

Page 6: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

6

Crawling Issues (2)

Maintaining pages “fresh”– How does the web change over time?

– What does fresh page/database mean?

– How can we increase “freshness”?

Page 7: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

7

Outline

Web evolution experiments Freshness metrics Crawling policy

Page 8: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

8

Web Evolution Experiment

How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to

change?

Page 9: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

9

Experimental Setup

February 17 to June 24, 1999 270 sites visited (with permission)

– identified 400 sites with highest “page rank”

– contacted administrators

720,000 pages collected– 3,000 pages from each site daily

– start at root, visit breadth first (get new & old pages)

– ran only 9pm - 6am, 10 seconds between site requests

Page 10: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

10

Page Change

Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days

Is this correct?

1 day

changes

page visited

Page 11: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

11

Average Change Intervalfr

actio

n of

pag

es

Page 12: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

12

Change Interval - By Domainfr

actio

n of

pag

es

Page 13: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

13

Modeling Web Evolution

Poisson process with rate T is time to next event fT(t) = e-t (t > 0)

Page 14: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

14

Change Interval of Pages

for pages thatchange every

10 days on average

interval in days

frac

tion

of c

hang

esw

ith g

iven

inte

rval

Poisson model

Page 15: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

15

Change Metrics

Freshness– Freshness of page ei at time t is

F( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise

eiei

......

web database

– Freshness of the database S at time t is

F( S ; t ) = F( ei ; t )N

1 N

i=1

Page 16: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

16

Change Metrics

Age– Age of page ei at time t is

A( ei ; t ) = 0 if ei is up-to-date at time t t - (modification ei time) otherwise

eiei

......

web database– Age of the database S at time t is

A( S ; t ) = A( ei ; t )N

1 N

i=1

Page 17: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

17

Change Metrics

F(ei)

A(ei)

0

0

1

time

time

update refresh

F( S ) = lim F(S ; t ) dtt1 t

0t

F( ei ) = lim F(ei ; t ) dtt1 t

0t

Time averages:

similar for age...

Page 18: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

18

Trick Question

Two page database e1 changes daily e2 changes once a week Can visit pages once a week How should we visit pages?

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– e1 e2 e1 e2 e1 e2 ... [uniform]

– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]

– ?

e1

e2

e1

e2

webdatabase

Page 19: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

19

Proportional Often Not Good!

Visit fast changing e1 get 1/2 day of freshness

Visit slow changing e2 get 1/2 week of freshness

Visiting e2 is a better deal!

Page 20: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

20

Selecting Optimal Refresh Frequency

• Analysis is complex• Shape of curve is the same in all cases• Holds for any distribution g( )

Page 21: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

21

Optimal Refresh Frequency for Age

• Analysis is also complex• Shape of curve is the same in all cases• Holds for any distribution g( )

Page 22: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

22

Comparing Policies

Freshness AgeProportional 0.12 400 daysUniform 0.57 5.6 daysOptimal 0.62 4.3 days

Based on Statistics from experimentand revisit frequency of every month

Page 23: 1 WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

23

Summary

Maintaining the collection fresh:– Web evolution experiment

– Change metrics

– Optimal policy

Intuitive policy does not always perform well– Should be careful in deciding revisit policy