
Modeling and Managing Content Changes in Text Databases

Panos Ipeirotis, New York University
Alexandros Ntoulas, UCLA
Junghoo Cho, UCLA
Luis Gravano, Columbia University


Metasearchers Provide Access to Text Databases

[Figure: the query "thrombopenia" goes to a metasearcher, which provides access to the NYTimes Archives, PubMed, and USPTO databases.]

• A large number of hidden-web databases are available
• Their contents are not accessible through Google
• Each database must be queried separately

Broadcasting queries to all databases is not feasible (~100,000 DBs)


Metasearchers Provide Access to Text Databases

[Figure: the same metasearcher diagram for the query "thrombopenia", annotated with per-database document counts for the word (26,887, 0, and 42) and a "??" marking the choice of which databases to query.]

Database selection relies on simple content summaries: vocabulary and word frequencies
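As a rough illustration of how such summaries can drive database selection, the sketch below ranks databases by the document frequency of the query word and drops databases where the word does not appear. The helper `select_databases` and the placeholder names `db_a`, `db_b`, `db_c` are assumptions made for this example; only the three counts come from the slide, and real selection algorithms use more refined scoring.

```python
def select_databases(summaries, query_word, k=2):
    """Return up to k database names whose summary contains query_word,
    ordered by decreasing document frequency."""
    scored = [(name, words.get(query_word, 0)) for name, words in summaries.items()]
    # Databases with a zero count are not worth querying for this word.
    matching = [(name, df) for name, df in scored if df > 0]
    matching.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in matching[:k]]


# Placeholder database names; the counts are the ones shown on the slide.
summaries = {
    "db_a": {"thrombopenia": 26887},
    "db_b": {"thrombopenia": 0},
    "db_c": {"thrombopenia": 42},
}

print(select_databases(summaries, "thrombopenia"))
```

With a stale summary, a word such as a newly emerged term would have a count of zero and the database would be skipped, which is the failure mode the rest of the talk addresses.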


Extracting Content Summaries from Text Databases

For hidden-web databases (query-only access):
• Send queries to the database
• Retrieve the top matching documents
• Use the document sample as the database representative

For “crawlable” databases:
• Retrieve documents by following links (crawling)
• Stop when all documents have been retrieved

A content summary contains:
• The words in the sample (or crawl)
• The document frequency of each word in the sample (or crawl)

PubMed (11,868,552 documents)

Word            #Documents
aids            123,826
cancer          1,598,896
heart           706,537
hepatitis       124,320
thrombopenia    26,887
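A minimal sketch of how a content summary of this kind could be computed from a set of retrieved documents. The function name `build_content_summary`, the simple tokenizer, and the three toy documents are assumptions for illustration, not the extraction pipeline used in the study.

```python
from collections import Counter
import re


def build_content_summary(documents):
    """Map each word to the number of documents that contain it."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so a word counts once per document, however often it repeats.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    return dict(doc_freq)


# Toy document sample (hypothetical text).
sample = [
    "Hepatitis and heart disease",
    "Heart surgery outcomes",
    "Cancer of the heart is rare; cancer screening",
]

summary = build_content_summary(sample)
print(summary["heart"], summary["cancer"])
```

The same function applies to either a query-based sample or a full crawl; only the input document set differs.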


Never-update Policy

Current practice: construct the summary once and never update it

The extracted (old) summary may:
• Miss new words (from new documents)
• Contain obsolete words (from deleted documents)
• Provide inaccurate frequency estimates

NY Times, #Docs per word

Word        Oct 29, 2004    Mar 29, 2005
tsunami     (0)             250
recount     2,302           (0)
grokster    2               78


Research Challenge

Updating summaries is costly!

Challenge: maintain good quality of summaries while minimizing the number of updates

• If summaries do not change: problem solved!
• If summaries change: estimate the rate of change and schedule updates accordingly


Outline

Do content summaries change over time?

Which database properties affect the rate of change?

How to schedule updates with constrained resources?


Data for our Study: 152 Web Databases

• Randomly picked from the Open Directory
• Multiple domains
• Multiple topics
• Searchable (to construct summaries by querying)
• Crawlable (to retrieve full contents)

Examples: www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu, …


Data for our Study: 152 Web Databases

Study period: Oct 2002 – Oct 2003
• 52 weekly snapshots for each database
• Approximately 5 million pages in each snapshot
• 65 GB per snapshot (3.3 TB total)

For each week and each database, we built:
• A complete summary (by scanning all pages)
• An approximate summary (by query-based sampling)


Measuring Changes over Time

Recall: How many words in the current summary are also in the old (extracted) summary?
• Shows how well old summaries cover the current (unknown) vocabulary
• Higher values are better

Precision: How many words in the old (extracted) summary are still in the current summary?
• Shows how many obsolete words exist in the old summaries
• Higher values are better

Results for complete summaries (similar for approximate)
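The two measures above can be sketched as set operations on the vocabularies of the old and current summaries. This is the unweighted reading of the slide's definitions; the function names are assumptions, and the words "election" and "recovery" are hypothetical additions to the NY Times example words.

```python
def summary_recall(old_words, current_words):
    """Fraction of the current vocabulary that the old summary covers."""
    current = set(current_words)
    if not current:
        return 1.0  # nothing to cover
    return len(current & set(old_words)) / len(current)


def summary_precision(old_words, current_words):
    """Fraction of the old vocabulary that is still present (not obsolete)."""
    old = set(old_words)
    if not old:
        return 1.0  # no words that could be obsolete
    return len(old & set(current_words)) / len(old)


old = {"recount", "grokster", "election"}
current = {"tsunami", "grokster", "election", "recovery"}

print(summary_recall(old, current), summary_precision(old, current))
```

Here the old summary misses two of the four current words (recall 0.5) and one of its three words is obsolete (precision 2/3).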


Summaries over Time: Conclusions

Databases (and their summaries) are not static

Quality of old summaries deteriorates over time

Quality decreases for both complete and approximate content summaries (see paper for details)

How often should we refresh the summaries?


Outline

Do content summaries change over time?

Which database properties affect the rate of change?

How to schedule updates with constrained resources?


Survival Analysis

Survival analysis: a collection of statistical techniques for predicting “the time until an event occurs”
• Initially used to measure the length of survival of patients under different treatments (hence the name)
• Used to measure the effect of different parameters (e.g., weight, race) on survival time

We want to predict the “time until next update” and find the database properties that affect this time


Survival Analysis for Summary Updates

“Survival time of a summary”: time until the current database summary is “sufficiently different” from the old one (i.e., an update is required)

The old summary changes at time t if:

KL divergence(current, old) > τ

where τ is the change sensitivity threshold

Survival analysis estimates the probability that a database summary changes within time t
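A minimal sketch of this change test. The slides do not say how word probabilities are derived from the summaries or how unseen words are handled, so the normalization of document frequencies, the additive `smoothing` constant, and the value passed as `tau` are all assumptions made for the example. The counts are the NY Times figures from the never-update slide.

```python
import math


def word_distribution(summary, vocabulary, smoothing=0.5):
    """Turn document frequencies into a probability distribution over a
    shared vocabulary. Additive smoothing (an assumption) keeps every
    probability positive so the logarithm below is always defined."""
    total = sum(summary.get(w, 0) + smoothing for w in vocabulary)
    return {w: (summary.get(w, 0) + smoothing) / total for w in vocabulary}


def kl_divergence(current, old, smoothing=0.5):
    """KL divergence of the current word distribution from the old one."""
    vocabulary = set(current) | set(old)
    p = word_distribution(current, vocabulary, smoothing)
    q = word_distribution(old, vocabulary, smoothing)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


def needs_update(current, old, tau):
    """The old summary counts as changed once the divergence exceeds tau."""
    return kl_divergence(current, old) > tau


old = {"tsunami": 0, "recount": 2302, "grokster": 2}
current = {"tsunami": 250, "recount": 0, "grokster": 78}

print(needs_update(current, old, tau=0.1))  # tau chosen arbitrarily
```

A larger τ tolerates more drift before an update is triggered, so the summary "survives" longer.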


Modeling Goals

Goal: estimate a database-specific survival time distribution

The exponential distribution S(t) = exp(-λt) is common for survival times
• λ captures the rate of change
• Need to estimate λ for each database
• Preferably, infer λ from database properties (with no “training”)

Intuitive (and wrong) approach: data + multiple regression
• The study contains a large number of “incomplete” observations
• The target variable S(t) is typically not Gaussian
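To make the exponential model concrete, the sketch below evaluates S(t) = exp(-λt), the complementary probability of a change within t weeks, and the mean survival time 1/λ. The function names and the rate of 0.05 per week are assumptions chosen for illustration.

```python
import math


def survival_probability(rate, weeks):
    """S(t) = exp(-rate * t): probability the summary is still unchanged after t weeks."""
    return math.exp(-rate * weeks)


def change_probability(rate, weeks):
    """Probability that the summary has changed within t weeks."""
    return 1.0 - survival_probability(rate, weeks)


def mean_survival_time(rate):
    """Expected time until change for an exponential distribution."""
    return 1.0 / rate


rate = 0.05  # hypothetical rate of change per week
print(mean_survival_time(rate))
print(round(change_probability(rate, 10), 3))
```

A database with twice the rate has half the mean survival time, which is why a per-database estimate of λ translates directly into an update interval.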


Survival Times and “Incomplete” Data

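[Figure: timeline, in weeks, of the “survival times” observed for a database. Each X marks an observed change; intervals that have not ended by week 52, the end of the study, are the “censored” cases.]

• Many observations are “incomplete” (aka “censored”)
• Censored data give partial information (the database did not change)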


Using “Censored” Data

[Figure: observed survival times with three fitted curves: S(t) best fit ignoring censored data; S(t) best fit using censored data “as-is”; S(t) best fit using censored data.]

• By ignoring censored cases we underestimate survival times, and therefore perform more update operations than needed
• By using censored cases “as-is” we (again) get underestimates
• Survival analysis “extends” the lifetime of “censored” cases
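The effect of the three treatments can be sketched under the exponential model, where the maximum-likelihood rate with right-censored data is the number of observed changes divided by the total observed time. The function name and the six toy durations are assumptions; the analysis in the talk itself uses Cox regression rather than this closed form.

```python
def exponential_rate_estimates(durations, observed):
    """Estimate the exponential rate three ways.

    durations[i]: weeks until a change was seen, or until the study ended.
    observed[i]:  True if a change was seen, False if the case is censored.
    """
    n_events = sum(observed)
    total_time = sum(durations)
    event_time = sum(d for d, seen in zip(durations, observed) if seen)
    return {
        # Drop censored cases entirely.
        "ignore_censored": n_events / event_time,
        # Pretend every censored case changed exactly when the study ended.
        "censored_as_is": len(durations) / total_time,
        # Censoring-aware MLE: events divided by total time at risk.
        "censoring_aware": n_events / total_time,
    }


# Toy data: four observed changes and two cases censored at week 52.
durations = [5, 8, 12, 20, 52, 52]
observed = [True, True, True, True, False, False]

estimates = exponential_rate_estimates(durations, observed)
for name, rate in estimates.items():
    print(name, round(rate, 4), "mean survival:", round(1 / rate, 1), "weeks")
```

Both shortcuts give a larger rate than the censoring-aware estimate, that is, shorter predicted survival times and more updates than necessary.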


Database Properties and Survival Times

For our analysis, we use Cox Proportional Hazards Regression
• Uses “censored” data effectively (i.e., cases where the database did not change within time T)
• Derives the effect of database properties on the rate of change. E.g., “if you double the size of a database, it changes twice as fast”
• Makes no assumptions about the form of the survival function
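A small sketch of what a proportional hazards model expresses, using the standard form h(t | x) = h0(t) · exp(β · x) and the resulting identity S(t | x) = S0(t)^exp(β · x). The coefficient below is hypothetical and chosen only so that doubling the size doubles the rate of change, mirroring the quoted example; it is not a fitted value from the study, and no regression is performed here.

```python
import math


def hazard_ratio(beta, x_a, x_b):
    """Relative rate of change of database a versus database b under
    h(t | x) = h0(t) * exp(beta . x). The baseline h0(t) cancels out."""
    return math.exp(sum(b * (xa - xb) for b, xa, xb in zip(beta, x_a, x_b)))


def survival_with_covariates(baseline_survival, beta, x):
    """S(t | x) = S0(t) ** exp(beta . x) for a given baseline value S0(t)."""
    return baseline_survival ** math.exp(sum(b * xi for b, xi in zip(beta, x)))


# Hypothetical coefficient on log2(size): ln 2 means "double size, double rate".
beta = [math.log(2)]
small = [math.log2(10_000)]
large = [math.log2(20_000)]

print(round(hazard_ratio(beta, large, small), 6))
```

Because the baseline cancels in the ratio, the effect of a property can be stated without assuming any particular shape for the survival function.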


Cox PH Regression Results

Examined the effect of:
• Change-sensitivity threshold τ (higher τ gives longer survival)
• Topic (does not matter, except for health-related sites)
• Size (larger databases change faster!)
• Number of words (does not matter)
• Differences of summaries extracted in consecutive weeks (sites that changed frequently in the past change frequently in the future)
• Domain (details in next slide)


Baseline Survival Functions by Domain

Effect of domain:

GOV changes slower than any other domain

EDU changes fast in the short term, but slower in the long term

COM and other commercial sites change faster than the rest


Results of Cox PH Analysis

Cox PH analysis gives a formula for predicting the time between updates for any database

Rate of change depends on:
• Domain
• Database size
• History of change
• Threshold τ

By knowing the time between updates, we can schedule update operations better!


Outline

Do content summaries change over time?

Which database properties affect the rate of change?

How to schedule updates with constrained resources?


Deriving an Update Policy

Naïve policy:
• Updates all databases at the same time (i.e., assumes identical change rates)
• Suboptimal use of resources

Our policy:
• Use the change rate as predicted by survival analysis
• Exploit database-specific estimates for the rate of change


Scheduling Updates

                                      Average time between updates
Database          Rate of change λ    10 weeks        40 weeks
Tom’s Hardware    0.088               5 weeks         46 weeks
USPS              0.023               12 weeks        34 weeks

With plentiful resources, we update sites according to their rate of change

When resources are constrained, we update sites that change “too frequently” less often
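One way to get a feel for the intervals in the table is to compute, under the exponential model, the time-averaged probability that a summary is still valid when it is refreshed every T weeks: (1 - exp(-λT)) / (λT). This measure and the function name `expected_freshness` are assumptions for illustration; the sketch only evaluates the table's intervals and does not reproduce the optimization that produced them.

```python
import math


def expected_freshness(rate, interval):
    """Average of S(t) = exp(-rate * t) over one refresh interval of
    `interval` weeks: (1 - exp(-rate * interval)) / (rate * interval)."""
    x = rate * interval
    return (1.0 - math.exp(-x)) / x


# (rate of change, interval with plentiful resources, interval when constrained),
# taken from the table above.
table = {
    "Tom's Hardware": (0.088, 5, 46),
    "USPS": (0.023, 12, 34),
}

for name, (rate, plentiful, constrained) in table.items():
    print(
        name,
        round(expected_freshness(rate, plentiful), 2),
        round(expected_freshness(rate, constrained), 2),
    )
```

For the fast-changing site, the value drops sharply at the long interval, which illustrates why a constrained budget may be better spent on databases whose summaries stay valid longer.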


Scheduling Results

Clever scheduling improves quality of summaries (according to KL, precision and recall)

Our policy allows users to optimally select change thresholds according to available resources, or vice versa (see paper)


Updating Content Summaries: Contributions

Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases

Change frequency model: showed that database characteristics can predict the time between updates

Scheduling algorithms: devised update policies that exploit the “survival model” and use available resources efficiently


Current and Future Work

Current:
• Compared with machine learning techniques
• Applied the technique to web crawling

Future:
• Apply survival analysis for refreshing database statistics (materialized views, index statistics, …)
• Examine the efficiency of survival analysis models
• Create generative models for modeling database changes


Thank you!

Questions?


Related Work

Brewington & Cybenko, WWW9, Computer 2000

Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003

Coffman et al., Journal of Scheduling, 1998

Olston & Widom, SIGMOD 2002


Measuring Changes over Time

KL divergence: How similar is the word distribution in the old and current summaries?
• Identical summaries: KL = 0
• Higher values are worse

Results for complete summaries (similar for approximate)