Post on 26-Mar-2015
Slide 1
Modeling and Managing Content Changes in Text Databases
Panos Ipeirotis, New York University
Alexandros Ntoulas, UCLA
Junghoo Cho, UCLA
Luis Gravano, Columbia University

Slide 2
Metasearchers Provide Access to Text Databases
[Diagram: the query "thrombopenia" is sent to a metasearcher that sits in front of NYTimes Archives, PubMed, and USPTO]
- Large number of hidden-web databases available
- Contents not accessible through Google
- Need to query each database separately
- Broadcasting queries to all databases not feasible (~100,000 DBs)

Slide 3
Metasearchers Provide Access to Text Databases
[Diagram: same setup; the content summaries of the three databases list "thrombopenia" with 26,887, 0, and 42 documents; which database should the metasearcher pick?]
- Database selection relies on simple content summaries: vocabulary, word frequencies

Slide 4
Extracting Content Summaries from Text Databases
- For hidden-web databases (query-only access):
  - Send queries to database
  - Retrieve top matching documents
  - Use document sample as database representative
- For crawlable databases:
  - Retrieve documents by following links (crawling)
  - Stop when all documents retrieved
- Content summary contains:
  - Words in sample (or crawl)
  - Document frequency of each word in sample (or crawl)

PubMed (11,868,552 documents)
Word          #Documents
aids             123,826
cancer         1,598,896
heart            706,537
hepatitis        124,320
thrombopenia      26,887

Slide 5
Never-update Policy
- Current practice: construct summary once, never update
- Extracted (old) summary may:
  - Miss new words (from new documents)
  - Contain obsolete words (from deleted documents)
  - Provide inaccurate frequency estimates

Word       #Docs, NY Times (Oct 29, 2004)   #Docs, NY Times (Mar 29, 2005)
tsunami    (0)                              250
recount    2,302                            (0)
grokster   2                                78

Slide 6
Research Challenge
- Updating summaries is costly!
- Challenge: maintain good quality of summaries, and minimize number of updates
  - If summaries do not change: problem solved!
  - If summaries change: estimate rate of change and schedule updates

Slide 7
Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?

Slide 8
Data for our Study: 152 Web Databases
- Randomly picked from Open Directory
- Multiple domains
- Multiple topics
- Searchable (to construct summaries by querying)
- Crawlable (to retrieve full contents)
- www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu

Slide 9
Data for our Study: 152 Web Databases
- Study period: Oct 2002 to Oct 2003
  - 52 weekly snapshots for each database
  - 5 million pages in each snapshot (approx.)
  - 65 GB per snapshot (3.3 TB total)
- For each week and each database, we built:
  - Complete summary (by scanning all pages)
  - Approximate summary (by query-based sampling)

Slide 10
Measuring Changes over Time
- Recall: How many words in current summary are also in old (extracted) summary?
  - Shows how well old summaries cover the current (unknown) vocabulary
  - Higher values are better
- Precision: How many words in old (extracted) summary are still in current summary?
  - Shows how many obsolete words exist in the old summaries
  - Higher values are better
[Figure: results for complete summaries (similar for approximate)]

Slide 11
Summaries over Time: Conclusions
- Databases (and their summaries) are not static
- Quality of old summaries deteriorates over time
- Quality decreases for both complete and approximate content summaries (see paper for details)
- How often should we refresh the summaries?

Slide 12
Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?

Slide 13
Survival Analysis
- Survival analysis: a collection of statistical techniques for predicting the time until an event occurs
- Initially used to measure length of survival of patients under different treatments (hence the name)
- Used to measure effect of different parameters (e.g., weight, race) on survival time
- We want to predict time until next update and find database properties that affect this time

Slide 14
Survival Analysis for Summary Updates
- Survival time of summary: time until current database summary is sufficiently different from the old one (i.e., an update is required)
- Old summary changes at time t if: KL divergence(current, old) > change-sensitivity threshold
- Survival analysis estimates probability that a database summary changes within time t

Slide 15
Modeling Goals
- Goal: estimate database-specific survival time distribution
- Exponential distribution S(t) = exp(-λt), common for survival times
  - λ captures rate of change
  - Need to estimate λ for each database
  - Preferably, infer λ from database properties (with no training)
- Intuitive (and wrong) approach: data + multiple regression
  - Study contains a large number of incomplete observations
  - Target variable S(t) typically not Gaussian

Slide 16
Survival Times and Incomplete Data
[Figure: survival times for a database, plotted by week; observations still running at week 52, the end of the study, are censored cases]
- Many observations are incomplete (aka censored)
- Censored data give partial information (database did not change)

Slide 17
Using Censored Data
[Figure: best fits of S(t) when ignoring censored data, when using censored data as-is, and when using censored data]
- By ignoring censored cases we get underestimates, so we perform more update operations than needed
- By using censored cases as-is we (again) get underestimates
- Survival analysis extends the lifetime of censored cases

Slide 18
Database Properties and Survival Times
For our analysis, we use Cox Proportional Hazards Regression
- Effectively uses censored data (i.e., database did not change within time T)
- Derives effect of database properties on rate of change
  - E.g., if you double the size of a database, it changes twice as fast
- No assumptions about the form of the survival function

Slide 19
Cox PH Regression Results
Examined effect of:
- Change-sensitivity threshold (higher threshold, longer survival)
- Domain (details in next slide)
- Topic (does not matter, except for health-related sites)
- Size (larger databases change faster!)
- Number of words (does not matter)
- Differences of summaries extracted in consecutive weeks (sites that changed frequently in the past change frequently in the future)
[Legend: rate of change increases / rate of change decreases]

Slide 20
Baseline Survival Functions by Domain
Effect of domain:
- GOV changes slower than any other domain
- EDU changes fast in the short term, but slower in the long term
- COM and other commercial sites change faster than the rest

Slide 21
Results of Cox PH Analysis
- Cox PH analysis gives a formula for predicting the time between updates for any database
- Rate of change depends on:
  - domain
  - database size
  - history of change
  - threshold
- By knowing time between updates we can schedule update operations better!

Slide 22
Outline
- Do content summaries change over time?
- Which database properties affect the rate of change?
- How to schedule updates with constrained resources?
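
Note on the censored-data argument (Slides 15 to 17): the effect of censoring on the estimated rate of change can be sketched with the textbook maximum-likelihood estimate for an exponential model with right-censored data, λ = (number of observed changes) / (total observed weeks). This is a simplified illustration, not the Cox PH regression used in the study, and the observations, function names, and variable names below are hypothetical.

```python
import math

def exponential_rate_mle(observations):
    """Estimate the rate of an exponential survival model S(t) = exp(-rate * t).

    observations: list of (weeks, changed) pairs. changed=False marks a
    censored case: the summary had not changed when observation stopped.
    Standard MLE with right-censoring: observed events / total exposure time.
    """
    events = sum(1 for _, changed in observations if changed)
    exposure = sum(weeks for weeks, _ in observations)
    return events / exposure if exposure > 0 else 0.0

def survival(rate, t):
    """Probability that a summary is still unchanged after t weeks."""
    return math.exp(-rate * t)

# Hypothetical observations for one database: three observed changes and two
# cases that were still unchanged at week 52, the end of the study.
observations = [(4, True), (10, True), (16, True), (52, False), (52, False)]

# 1) Ignore censored cases entirely.
rate_ignore = exponential_rate_mle([o for o in observations if o[1]])

# 2) Use censored cases "as-is", pretending a change happened at week 52.
rate_as_is = exponential_rate_mle([(w, True) for w, _ in observations])

# 3) Censoring-aware estimate: censored weeks count as exposure, not as events.
rate_censored = exponential_rate_mle(observations)

for name, rate in [("ignore censored", rate_ignore),
                   ("censored as-is", rate_as_is),
                   ("censoring-aware", rate_censored)]:
    print(f"{name}: rate={rate:.4f}, mean survival={1 / rate:.1f} weeks, "
          f"S(10)={survival(rate, 10):.2f}")
```

With these made-up numbers, the first two estimates give a higher rate of change (shorter survival time) than the censoring-aware one, which is the underestimation that Slide 17 warns about.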

Slide 23
Deriving an Update Policy
- Naive policy:
  - Updates all databases at the same time (i.e., assumes identical change rates)
  - Suboptimal use of resources
- Our policy:
  - Use change rate as predicted by survival analysis
  - Exploit database-specific estimates for rate of change

Slide 24
Scheduling Updates

Database         Rate of change   Average time between updates
                                  10 weeks      40 weeks
Tom's Hardware   0.088            5 weeks       46 weeks
USPS             0.023            12 weeks      34 weeks

- With plentiful resources, we update sites according to their rate of change
- When resources are constrained, we update sites that change too frequently less often

Slide 25
Scheduling Results
- Clever scheduling improves quality of summaries (according to KL, precision, and recall)
- Our policy allows users to optimally select change thresholds according to available resources, or vice versa (see paper)

Slide 26
Updating Content Summaries: Contributions
- Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases
- Change frequency model: showed that database characteristics can predict time between updates
- Scheduling algorithms: devised update policies that exploit the survival model and use available resources efficiently

Slide 27
Current and Future Work
- Current:
  - Compared with machine learning techniques
  - Applied technique for web crawling
- Future:
  - Apply survival analysis for refreshing db statistics (materialized views, index statistics, ...)
  - Examine efficiency of survival analysis models
  - Create generative models for modeling database changes

Slide 28
Thank you!
Questions?

Slide 29
Related Work
- Brewington & Cybenko, WWW9, Computer 2000
- Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003
- Coffman, J. Scheduling, 1998
- Olston & Widom, SIGMOD 2002

Slide 30
Measuring Changes over Time
- KL divergence: How similar is the word distribution in old and current summaries?
  - Identical summaries: KL = 0
  - Higher values are worse
[Figure: results for complete summaries (similar for approximate)]
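
Note on the change measures (Slides 10, 14, and 30): a minimal sketch of how recall, precision, and KL divergence between an old and a current content summary might be computed. It assumes a summary is a non-empty mapping from word to document frequency, that word distributions are obtained by normalizing those frequencies, and that add-one smoothing is used so that words missing from one summary do not produce a division by zero. The smoothing choice and all function names are assumptions for illustration, not taken from the slides. The sample counts are the NY Times figures from Slide 5, with the (0) entries treated as absent words.

```python
import math

def recall(old, current):
    """Fraction of words in the current summary that are also in the old one."""
    return len(old.keys() & current.keys()) / len(current)

def precision(old, current):
    """Fraction of words in the old summary that are still in the current one."""
    return len(old.keys() & current.keys()) / len(old)

def kl_divergence(current, old, smoothing=1.0):
    """KL divergence between the word distributions of two summaries.

    Both distributions are smoothed over the union vocabulary (an assumption
    made here), so identical summaries still give exactly 0.
    """
    vocab = old.keys() | current.keys()
    cur_total = sum(current.values()) + smoothing * len(vocab)
    old_total = sum(old.values()) + smoothing * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (current.get(word, 0) + smoothing) / cur_total
        q = (old.get(word, 0) + smoothing) / old_total
        kl += p * math.log(p / q)
    return kl

# Word -> number of documents, from the NY Times example on Slide 5.
old_summary = {"recount": 2302, "grokster": 2}       # Oct 29, 2004
current_summary = {"tsunami": 250, "grokster": 78}   # Mar 29, 2005

print("recall:", recall(old_summary, current_summary))
print("precision:", precision(old_summary, current_summary))
print("KL(current, old):", kl_divergence(current_summary, old_summary))
```

In this toy case only "grokster" appears in both summaries, so recall and precision are both 0.5, and the KL divergence is positive because the word distributions differ.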