Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University

Post on 19-Dec-2015




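Problem

Application
– Web search engines/crawlers
– Data warehouse
– ...

[Diagram: a remote database that receives updates and a local database that answers queries; the local database is kept in sync by polling the remote one]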


Challenge: How to keep pages “fresh”?

How does the web change over time?
– Web evolution experiment

What does a fresh page/database mean?
– Change metrics

How can we increase “freshness”?
– Crawl policy


Web Evolution Experiment

How often does a web page change?
How do we model web changes?
What is the lifespan of a page?
How long does it take for 50% of the web to change?


Experimental Setup

February 17 to June 24, 1999

270 sites visited (with permission)
– identified 400 sites with highest “page rank”
– contacted administrators

720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests


How Often Does a Page Change?

Example: 50 visits to page, 5 changes
average change interval = 50/5 = 10 days

Is this correct?

[Figure: timeline in one-day steps marking when the page changes and when the page is visited]
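One way to probe the question above is a small simulation. The sketch below assumes, purely for illustration, that changes arrive as a Poisson process and that the page is visited exactly once a day, so a visit can reveal at most one change per day. The function name `naive_interval_estimate` and its parameters are hypothetical and do not come from the slides.

```python
import random


def naive_interval_estimate(rate_per_day, n_days, seed=0):
    """Estimate the change interval as (days monitored) / (changes detected).

    Assumption for illustration: changes follow a Poisson process with
    `rate_per_day`, and the page is visited exactly once per day.
    """
    rng = random.Random(seed)
    next_change = rng.expovariate(rate_per_day)  # time of the first change
    actual_changes = 0
    detected_changes = 0
    for day in range(1, n_days + 1):
        changed_since_last_visit = False
        # Consume every change that happened before today's visit.
        while next_change < day:
            actual_changes += 1
            changed_since_last_visit = True
            next_change += rng.expovariate(rate_per_day)
        # A daily visit only reveals whether the page changed, not how often.
        if changed_since_last_visit:
            detected_changes += 1
    if detected_changes == 0:
        return float("inf"), actual_changes, detected_changes
    return n_days / detected_changes, actual_changes, detected_changes


# A page that really changes twice a day (true interval: 0.5 days).
estimate, actual, detected = naive_interval_estimate(2.0, 1000)
print(f"true interval 0.5 days, naive estimate {estimate:.2f} days")
print(f"{actual} actual changes, {detected} detected")
```

With daily visits the naive estimate can never drop below one day, so under these assumptions it overstates the interval for pages that change faster than they are visited.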


Average Change Interval

[Figure: average change interval; y-axis: fraction of pages]


Modeling Web Evolution

Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
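As a hedged illustration of this model, the sketch below draws times to the next change from an exponential distribution with rate λ (written `lam` in the code) and compares the empirical fraction of times no longer than t with the probability 1 − e^{−λt} that follows from the exponential density. The rate value and the function names are made up for the example.

```python
import math
import random


def empirical_cdf_at(t, lam, n_samples=100_000, seed=0):
    """Fraction of simulated times-to-next-change that are at most t."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.expovariate(lam) <= t)
    return hits / n_samples


def poisson_cdf(t, lam):
    """P(T <= t) for the time T to the next event of a Poisson process."""
    return 1.0 - math.exp(-lam * t)


lam = 0.1  # hypothetical rate: one change every 10 days on average
for t in (1, 10, 30):
    print(t, round(empirical_cdf_at(t, lam), 3), round(poisson_cdf(t, lam), 3))
```

The two columns should agree closely if the sampler and the formula describe the same distribution.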


Change Interval of Pages

[Figure: change intervals for pages that change every 10 days on average. x-axis: interval in days; y-axis: fraction of changes with given interval. The plot includes the Poisson model.]


Change Metrics

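Freshness
– Freshness of page $e_i$ at time $t$ is

$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$

[Diagram: page $e_i$ on the web and its copy $e_i$ in the database]

– Freshness of the database $S$ at time $t$ is

$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$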


Change Metrics

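Age
– Age of page $e_i$ at time $t$ is

$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$

[Diagram: page $e_i$ on the web and its copy $e_i$ in the database]

– Age of the database $S$ at time $t$ is

$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$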
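The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Page` record, its fields, and the sample data are hypothetical, and the "modification time" is assumed here to be the earliest source change that the local copy has not yet picked up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical record for one page e_i and its local copy."""
    last_sync: float  # when the local copy was last refreshed
    change_times: List[float] = field(default_factory=list)  # source changes

    def first_unsynced_change(self, t):
        """Earliest source change in (last_sync, t], or None if up-to-date."""
        pending = [c for c in self.change_times if self.last_sync < c <= t]
        return min(pending) if pending else None


def page_freshness(page, t):
    """F(e_i; t): 1 if the copy is up-to-date at time t, else 0."""
    return 1 if page.first_unsynced_change(t) is None else 0


def page_age(page, t):
    """A(e_i; t): 0 if up-to-date, else time since the assumed modification."""
    changed_at = page.first_unsynced_change(t)
    return 0.0 if changed_at is None else t - changed_at


def database_freshness(pages, t):
    """F(S; t): average page freshness over the N pages."""
    return sum(page_freshness(p, t) for p in pages) / len(pages)


def database_age(pages, t):
    """A(S; t): average page age over the N pages."""
    return sum(page_age(p, t) for p in pages) / len(pages)


pages = [
    Page(last_sync=0.0, change_times=[3.0, 8.0]),  # stale since day 3
    Page(last_sync=5.0, change_times=[2.0]),       # changed before the sync
    Page(last_sync=1.0, change_times=[]),          # never changed
    Page(last_sync=4.0, change_times=[9.0]),       # stale since day 9
]
print(database_freshness(pages, t=10.0))  # 2 of 4 pages fresh: 0.5
print(database_age(pages, t=10.0))        # (7 + 0 + 0 + 1) / 4 = 2.0
```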


Change Metrics

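[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting from 0) plotted against time, with update and refresh events marked]

Time averages:

$$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$$

$$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$$

similar for age...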
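For a single page, the time averages can be approximated over a finite horizon from a trace of update and refresh times. The sketch below is an illustration under stated assumptions: the copy starts up-to-date, age is measured from the first change since the last refresh, and the finite horizon stands in for the limit. The function name and the sample trace are hypothetical.

```python
def time_averages(change_times, refresh_times, horizon):
    """Return (time-averaged freshness, time-averaged age) over [0, horizon].

    Assumptions for this sketch: the copy starts up-to-date at time 0, age is
    measured from the first change since the last refresh, and a refresh that
    coincides with a change is processed after it.
    """
    events = [(t, 0) for t in change_times] + [(t, 1) for t in refresh_times]
    events = sorted(e for e in events if 0 <= e[0] <= horizon)
    stale_since = None   # time of the first change not yet picked up
    stale_time = 0.0     # total time with F(e_i; t) = 0
    age_integral = 0.0   # integral of A(e_i; t) dt

    def close_stale_period(end):
        nonlocal stale_time, age_integral
        length = end - stale_since
        stale_time += length
        age_integral += length * length / 2.0  # area under a line of slope 1

    for t, kind in events:
        if kind == 0 and stale_since is None:
            stale_since = t              # page just became stale
        elif kind == 1 and stale_since is not None:
            close_stale_period(t)        # refresh makes the copy fresh again
            stale_since = None
    if stale_since is not None:
        close_stale_period(horizon)      # still stale when the trace ends
    return 1.0 - stale_time / horizon, age_integral / horizon


# Changes on days 2, 3 and 12; refreshes on days 5, 10 and 15.
freshness, age = time_averages([2, 3, 12], [5, 10, 15], horizon=20)
print(freshness, age)  # roughly 0.7 and 0.45
```

In this trace the copy is stale on days 2 to 5 and 12 to 15, which gives 6 stale days out of 20.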


Refresh Order

Fixed order
– Example: Explicit list of URLs to visit

Random order
– Example: Start from seed URLs & follow links

Purely random
– Example: Refresh pages on demand, as requested by user

[Diagram: pages $e_i$ on the web and their copies $e_i$ in the database]


Freshness vs. Order

$r = \lambda / f$ = average change frequency / average revisit frequency
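The curves from this slide did not survive in the transcript. As a rough stand-in, two cases have simple closed forms under the Poisson model when r is taken as the change rate divided by the revisit rate and every page shares the same change rate. If each page is refreshed at regular intervals of 1/f, the chance it is still fresh s time units after a refresh is e^{−λs}, and averaging over one interval gives (1 − e^{−r})/r. If refresh instants for a page are themselves random with rate f, the copy is fresh whenever the most recent event was a refresh rather than a change, which gives 1/(1 + r). These expressions are derived here under those assumptions, not copied from the slide, and the function names are hypothetical.

```python
import math


def freshness_fixed_order(r):
    """Time-averaged freshness when each page is refreshed every 1/f."""
    if r == 0:
        return 1.0  # pages never change, so copies are always fresh
    return (1.0 - math.exp(-r)) / r


def freshness_purely_random(r):
    """Time-averaged freshness when refresh instants are Poisson with rate f."""
    return 1.0 / (1.0 + r)


for r in (0.5, 1.0, 2.0, 5.0):
    print(r, round(freshness_fixed_order(r), 3),
          round(freshness_purely_random(r), 3))
```

For every r > 0 the regular schedule scores higher than the purely random one in this model, since e^r > 1 + r.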


Trick Question

Two-page database

e1 changes daily

e2 changes once a week

Can visit pages once a week

How should we visit pages?
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– ?

[Diagram: pages e1 and e2 on the web and their copies e1 and e2 in the database]


Proportional Often Not Good!

Visit fast-changing e1: get 1/2 day of freshness

Visit slow-changing e2: get 1/2 week of freshness

Visiting e2 is a better deal!
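The comparison can also be checked numerically. The sketch below is one possible reading of the two-page example, not the authors' analysis: it assumes Poisson changes (e1 once a day, e2 once a week on average), a total budget of one visit per week, and regular revisits of each page, for which the time-averaged freshness of a page is (1 − e^{−r})/r with r the change rate divided by the visit rate. All names and the budget interpretation are assumptions.

```python
import math


def page_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, regular visits)."""
    if visit_rate == 0:
        return 0.0  # never refreshed: assumed stale in the long run
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r


def policy_freshness(share_e1, budget=1 / 7, rate_e1=1.0, rate_e2=1 / 7):
    """Average freshness of {e1, e2} when e1 gets `share_e1` of all visits.

    Rates are per day; the default budget is one visit per week in total.
    """
    f1 = share_e1 * budget
    f2 = (1.0 - share_e1) * budget
    return (page_freshness(rate_e1, f1) + page_freshness(rate_e2, f2)) / 2


policies = {
    "only e1": 1.0,
    "only e2": 0.0,
    "uniform": 0.5,
    "proportional": 7 / 8,  # visits split 7:1, matching the change rates
}
for name, share in policies.items():
    print(f"{name:12s} {policy_freshness(share):.3f}")
```

Under these assumptions, spending every visit on e2 scores highest, uniform comes next, and the proportional split scores below uniform.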


Selecting Optimal Refresh Frequency

• Analysis is complex
• Shape of curve is the same in all cases
• Holds for any distribution g(λ)


Optimal Refresh Frequency for Age

• Analysis is also complex
• Shape of curve is the same in all cases
• Holds for any distribution g(λ)


Comparing Policies

Policy         Freshness   Age
Proportional   0.12        400 days
Uniform        0.57        5.6 days
Optimal        0.62        4.3 days

Based on statistics from the experiment and a revisit frequency of once a month


Summary

Keeping the collection fresh:
– Web evolution experiment
– Change metrics
– Optimal policy

Intuitive policy does not always perform well
– Should be careful in deciding revisit policy


Future work

Weighted freshness model
Non-Poisson process model
Change frequency estimation