uncertainty lineage data bases very large data bases 19752006

37
Uncertainty Lineage Data Bases Very Large Data Bases 1975 2006

Upload: david-singleton

Post on 18-Dec-2015

225 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

UncertaintyLineage

DataBases

VeryLargeDataBases

19752006

Page 2: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

ULDBs: Databases with ULDBs: Databases with Uncertainty and LineageUncertainty and Lineage

Omar Benjelloun, Anish Das Sarma,

Alon Halevy, Jennifer WidomStanford InfoLab

Page 3: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

33

MotMotivivationation

• Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete,

imprecise, fuzzy, inaccurate, ...)

• Many of the same applications need to track the lineage of their data

Neither uncertainty nor lineage are supported by conventional DBMSs

Coincidence or Fate?Coincidence or Fate?

Page 4: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

44

Sample Applications Needing Sample Applications Needing Uncertainty and LineageUncertainty and Lineage

•Scientific databases

•Sensor databases

•Data cleaning

•Data integration

• Information extraction

Page 5: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

55

Trio ProjectTrio Project

Building a new kind of DBMS in which:1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

Page 6: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

66

Lineage and UncertaintyLineage and Uncertainty

•Lots of independent work in lineage and uncertainty (related work at end of talk)

•Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?Coincidence or Fate?

Page 7: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

77

Lineage and UncertaintyLineage and Uncertainty

• Lineage...1) Enables simple and consistent

representation of uncertain data2) Correlates uncertainty in query results

with uncertainty in the input data3) Can make computation over uncertain

data more efficient

Page 8: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

88

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 9: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

99

Running Example: Crime Running Example: Crime SolverSolver

Saw(witness,car) Drives(person,car)

Suspects(person) = πperson(Saw ⋈ Drives)

Page 10: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1010

UncertaintyUncertainty

•An uncertain database represents a set of possible instances. Examples:– Amy saw either a Honda or a Toyota

– Jimmy drives a Toyota, a Mazda, or both

– Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

– Hank is a suspect with confidence 0.7

Page 11: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1111

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences

Page 12: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1212

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

Three possibleinstances

witness carAmy { Honda, Toyota,

Mazda }

=

Page 13: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1313

Six possibleinstances

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe): uncertainty about

presence3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

Page 14: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1414

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences: weighted

uncertaintySaw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

Page 15: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1515

Data Models for Data Models for UncertaintyUncertainty•Our model (so far) is not especially new•We spent some time exploring the

space of models for uncertainty [ICDE 2006]

•Tension between understandability and expressiveness– Our model is understandable– But it is not complete, or even closed under

common operations

Page 16: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1616

Closure and Closure and CompletenessCompleteness

•Completeness Can represent all sets of possible

instances

•Closure Can represent results of operations

•Note: Completeness Closure

Page 17: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1717

Model (so far) Not ClosedModel (so far) Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈ Drives)

?

??

Does not correctlycapture possibleinstances in theresult

CANNOT

Page 18: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1818

LineageLineage to the Rescue to the Rescue

•Lineage: “where data came from”– Internal lineage– External lineage (not covered in this

talk)

• In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)

Page 19: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

1919

Example with LineageExample with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

?

?

?

Suspects = πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2)

λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

Page 20: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2020

ULDBsULDBs

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences4. Lineage

ULDBs are Closed and Complete

Page 21: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2121

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 22: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2222

Querying ULDBsQuerying ULDBs

•Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

D + ResultD + Result

Page 23: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2323

Well-Behaved ULDBsWell-Behaved ULDBs

• If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved

• Intuitively (details in paper):– Acyclic: No cycles in the lineage– Deterministic: Non-empty lineages of distinct

alternatives are distinct– Uniform: Alternatives of same tuple are derived

from the same set of tuples

Page 24: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2424

ULDB MinimalityULDB Minimality

• Data-minimality• Does every alternative appear in some

possible instance? (no extraneous alternatives)

• Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)

•Lineage-minimality

Page 25: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2525

Data-Minimality ExamplesData-Minimality ExamplesExtraneous ‘?’

. . .

10 (Billy, Honda) ∥ (Frank, Honda)

. . .

. . .

20

Billy ∥ Frank

. . .

?λ(20,1)=(10,1); λ(20,2)=(10,2)

extraneous

Page 26: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2626

Data-Minimality ExamplesData-Minimality ExamplesExtraneous alternative

(Diane, Mazda) ∥ (Diane, Acura)

Dianeextraneous

(Diane, Mazda)

(Diane, Acura)

?

? ?

Page 27: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2727

Data-MinimizationData-Minimization

• Extraneous alternative theorem:– An alternative is extraneous iff it is (possibly

transitively) derived from multiple alternatives of the same tuple.

• Extraneous “?” theorem– A “?” on tuple t is extraneous iff

1. it is derived from base tuples without “?”2. t has as many alternatives as the product of the

number in its base tuples

• Minimization algorithm based on the theorems

(see paper)

Page 28: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2828

Data-minimalLineage-minimal

Lineage-minimal

Data-minimal

Data-minimize

Queries

Lineage-minimize

ULDB Properties and ULDB Properties and OperationsOperations

Mem

bers

hip

Extraction

Page 29: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2929

Membership QuestionsMembership Questions

• Does a given tuple t appear in some (all) possible instance(s) of R ?– Polynomial algorithms based on Data-

minimization

• Is a given table T one of (all of) the possible instances of R ?– NP-Hard

t?, T?

RR

I1, I2, …, InI1, I2, …, In

possibleinstances

Page 30: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3030

ExtractionExtraction

•Extraction algorithm in paper

Suspects Eats

Saw Drives

Page 31: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3131

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 32: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3232

ConfidencesConfidences

• Confidences supplied with base data

• Trio computes confidences on query results– Default probabilistic interpretation– Can choose to plug in different arithmetic

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda): 1.0

Suspects

Jimmy: 0.12 ∥ Bill: 0.24

Hank: 0.6

?

?

?

0.3 0.4

0.6

ProbabilisticProbabilistic Min

Page 33: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3333

Query Processing with Query Processing with ConfidencesConfidences• Previous approach (probabilistic

databases)– Each operator computes confidences during

query execution– Only certain query plans allowed

• In ULDBs– Confidence of alternative A is function of

confidences in its transitive lineage

• Our approach: Decouple data and confidence computation– Use any query plan for data computation– Compute confidences on-demand using

lineage– Can give arbitrarily large improvements

Page 34: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3434

Current Work: AlgorithmsCurrent Work: Algorithms

•Algorithms: confidence computation, extraneous data, membership questions– Minimize lineage traversal– Memoization– Batch computations

Page 35: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3535

The Trio TrioThe Trio Trio

• Data Model– ULDBs (Coming: incomplete relations;

continuous uncertainty; correlation uncertainty)

• Query Language– Simple extension to SQL– Query uncertainty, confidences, and lineage

• System– Did you see our demo? – Version 1: Entirely on top of conventional DBMS– Surprisingly easy and complete, reasonably

efficient

TriQL

Page 36: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

3636

Brief Related WorkBrief Related Work

•Uncertainty– Modeling

•C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90]

– Systems•ProbView [LLRS97], MYSTIQ [BDM+05],

ORION [CSP05], Trio [BDHW05]

•Lineage– DBNotes [CTV05], Data Warehouses

[CW03]

Page 37: Uncertainty Lineage Data Bases Very Large Data Bases 19752006

but don’t forgetthe lineage…

Thank YouThank You

Search “stanford trio”

(or, http://i.stanford.edu/trio)