Trio: A System for Data, Uncertainty, and Lineage


Upload: lindsey

Post on 07-Jan-2016

Page 1: Trio:  A System for Data, Uncertainty, and Lineage

Trio: A System for Data, Uncertainty, and Lineage

Search “stanford trio”

http://i.stanford.edu/trio

Page 2: Trio:  A System for Data, Uncertainty, and Lineage

People

Current
• Jennifer Widom (faculty)
• Omar Benjelloun (post-doc)
• Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD)
• Michi Mutsuzaki (MS)
• Tomoe Sugihara (visitor)

Incoming
• Martin Theobald (post-doc)
• Raghu Murthy (MS)
• Ander de Keijzer (visitor)

Alums
• Alon Halevy, Ashok Chandra (visitors)
• Chris Hayworth (MS)

Page 3: Trio:  A System for Data, Uncertainty, and Lineage

Why Uncertainty + Lineage?

Many applications seem to need both

From a technical standpoint, it turns out that lineage...

1. Enables simple and consistent representation of uncertain data

2. Correlates uncertainty in query results with uncertainty in the input data

3. Can make computation over uncertain data more efficient

Page 4: Trio:  A System for Data, Uncertainty, and Lineage

Trio Components

1. Data Model
   ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model

2. Query Language
   TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior

3. System
   Version 1: Complete system and GUI built on top of conventional DBMS

Page 5: Trio:  A System for Data, Uncertainty, and Lineage

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = π_person(Saw ⋈ Drives)

Page 6: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 7: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

=

witness  car
Amy      { Honda, Toyota, Mazda }

Three possible instances

Page 8: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura) ?

Six possible instances
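
The count of six can be checked by enumeration: three choices for Amy's tuple times two for Betty's (present or absent). The sketch below is only an illustration; the list-of-pairs encoding and the name `possible_instances` are assumptions made here, not part of Trio.

```python
from itertools import product

# Each x-tuple is (alternatives, maybe); maybe=True stands for the '?' annotation.
# This encoding is an illustrative assumption, not Trio's internal representation.
saw = [
    ([("Amy", "Honda"), ("Amy", "Toyota"), ("Amy", "Mazda")], False),
    ([("Betty", "Acura")], True),
]

def possible_instances(xtuples):
    """Yield every possible instance as a frozenset of ordinary tuples."""
    # A '?' x-tuple gets one extra choice, None, meaning "absent".
    choices = [alts + [None] if maybe else alts for alts, maybe in xtuples]
    for pick in product(*choices):
        yield frozenset(t for t in pick if t is not None)

instances = list(possible_instances(saw))
print(len(instances))  # 6
```

Dropping the `True` flag on Betty's tuple brings the count back to the three instances of the previous slide.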

Page 9: Trio:  A System for Data, Uncertainty, and Lineage

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6 ?

Six possible instances, each with a probability
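
As a rough illustration, the probability of each instance can be computed by multiplying the confidences of the chosen alternatives. This assumes that distinct tuples are independent and that a '?' tuple is absent with the leftover probability (0.4 for Betty); both are assumptions of this sketch, and the function name is hypothetical.

```python
from itertools import product

# Alternatives with confidences, taken from the slide.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]

def instance_probabilities(xtuples):
    """Map each possible instance to its probability (independence assumed)."""
    choices = []
    for alts in xtuples:
        absent = 1.0 - sum(c for _, c in alts)
        opts = list(alts)
        if absent > 1e-9:          # leftover mass means the tuple may be absent
            opts.append((None, absent))
        choices.append(opts)
    probs = {}
    for pick in product(*choices):
        inst = frozenset(t for t, _ in pick if t is not None)
        p = 1.0
        for _, c in pick:
            p *= c
        probs[inst] = probs.get(inst, 0.0) + p
    return probs

probs = instance_probabilities(saw)
for inst, p in probs.items():
    print(sorted(inst), round(p, 2))
```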

Page 10: Trio:  A System for Data, Uncertainty, and Lineage

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty [ICDE 06, journal]

• Tension between understandability and expressiveness
– Our model is understandable
– But it is not complete, or even closed under common operations

Page 11: Trio:  A System for Data, Uncertainty, and Lineage

Our Model is Not Closed

Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)

Suspects = π_person(Saw ⋈ Drives)

Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?

Cannot correctly capture possible instances in the result
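
The mismatch can be checked by brute force. The sketch below evaluates the query on every possible instance of Saw and Drives, then compares the outcomes with what a Suspects table using only alternatives and '?' would allow. It reads the three '?' marks on the slide as maybe-annotations on the three Suspects tuples, which is an assumption about the slide layout; all function names are illustrative.

```python
from itertools import product

saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def instances(xtuples):
    """All possible instances of a relation whose tuples carry no '?'."""
    return [set(pick) for pick in product(*xtuples)]

def suspects(saw_inst, drives_inst):
    """Projection on person of (Saw join Drives), on ordinary relations."""
    return frozenset(p for (_, c1) in saw_inst
                       for (p, c2) in drives_inst if c1 == c2)

# Correct answer: run the query on each possible instance of the input.
true_results = {suspects(s, d) for s in instances(saw) for d in instances(drives)}

# Without lineage: Jimmy ?, Billy || Frank ?, Hank ? (None means "absent").
naive = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
naive_results = {frozenset(x for x in pick if x is not None)
                 for pick in product(*naive)}

print(len(true_results), len(naive_results))  # 4 12
```

The representation without lineage admits instances such as {Jimmy, Hank} that can never occur, because Jimmy requires Cathy to have seen a Mazda and Hank requires a Honda.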

Page 12: Trio:  A System for Data, Uncertainty, and Lineage

Lineage to the Rescue

Lineage
• Captures “where data came from”
• In Trio: A function λ from alternatives to other alternatives (or external sources)

Page 13: Trio:  A System for Data, Uncertainty, and Lineage

Example with Lineage

ID  Saw (witness,car)
11  (Cathy, Honda) ∥ (Cathy, Mazda)

ID  Drives (person,car)
21  (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22  (Billy, Honda) ∥ (Frank, Honda)
23  (Hank, Honda)

Suspects = π_person(Saw ⋈ Drives)

ID  Suspects
31  Jimmy ?
32  Billy ∥ Frank ?
33  Hank ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

Correctly captures possible instances in the result
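
A minimal sketch of how lineage restores the correct instances: a derived alternative is present exactly when every base alternative in its lineage was chosen. The dictionaries below transcribe the slide, reading λ(33) = (11,1), 23 as (11,1),(23,1) because tuple 23 has a single alternative; the encoding and the name `result_instances` are assumptions of this sketch.

```python
from itertools import product

# Base tuples keyed by ID; alternative j of tuple i is written (i, j), 1-based.
base = {
    11: [("Cathy", "Honda"), ("Cathy", "Mazda")],
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

# Derived alternatives and their conjunctive lineage, from the slide.
suspects = {
    (31, 1): ("Jimmy", [(11, 2), (21, 2)]),
    (32, 1): ("Billy", [(11, 1), (22, 1)]),
    (32, 2): ("Frank", [(11, 1), (22, 2)]),
    (33, 1): ("Hank",  [(11, 1), (23, 1)]),
}

def result_instances(base, derived):
    """Possible instances of the derived table, one per choice of base alternatives."""
    ids = sorted(base)
    out = set()
    for pick in product(*(range(1, len(base[i]) + 1) for i in ids)):
        chosen = set(zip(ids, pick))
        out.add(frozenset(
            person for person, lineage in derived.values()
            if all(ref in chosen for ref in lineage)
        ))
    return out

print(sorted(sorted(r) for r in result_instances(base, suspects)))
# [[], ['Billy', 'Hank'], ['Frank', 'Hank'], ['Jimmy']]
```

These four instances are exactly the ones obtained by running the query on each possible instance of Saw and Drives.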

Page 14: Trio:  A System for Data, Uncertainty, and Lineage

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete [VLDB 06]

Page 15: Trio:  A System for Data, Uncertainty, and Lineage

ULDBs: Lineage

• Conjunctive lineage sufficient for most operations

• Duplicate-elimination: Disjunctive lineage

• Difference: Negative lineage

• General case after multiple operations/queries: Boolean formula

Page 16: Trio:  A System for Data, Uncertainty, and Lineage

ULDBs: Interesting Questions

• Data-minimality: extraneous alternatives, extraneous “?”

• Lineage-minimality: harder

• Membership: tuple and table, some-instance and all-instances

• Coexistence: multiple tuples

• Extraction: remove tables, retain possible-instances

Page 17: Trio:  A System for Data, Uncertainty, and Lineage

Example: Extraneous Data

[Figure: base tuple (Diane, Mazda) ∥ (Diane, Acura); derived tuples Diane ?, (Diane, Mazda) ?, (Diane, Acura) ?; annotation: “extraneous”]

Page 18: Trio:  A System for Data, Uncertainty, and Lineage

Example: Coexistence

[Figure: base tuple (Diane, Mazda) ∥ (Diane, Acura); derived tuples (Diane, Mazda) ?, (Diane, Acura) ?, Mazda ?, Acura ?; annotation: “Can’t coexist”]

Page 19: Trio:  A System for Data, Uncertainty, and Lineage

Querying ULDBs: Semantics

Query Q on ULDB D

[Commutative diagram: D maps to its possible instances D1, D2, …, Dn; applying Q on each instance gives Q(D1), Q(D2), …, Q(Dn). The implementation of Q (operational semantics) takes D directly to D’ = D + Result, which is a representation of the instances Q(D1), Q(D2), …, Q(Dn).]

Page 20: Trio:  A System for Data, Uncertainty, and Lineage

Querying ULDBs: TriQL

Basic TriQL: SQL with new semantics

• Obeys commutative diagram for uncertain data

• Tracks lineage

• Query results: new table or on-the-fly

Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()

Page 21: Trio:  A System for Data, Uncertainty, and Lineage

Additional TriQL Constructs

[Language manual on web site]

• “Horizontal subqueries”: refer to tuple alternatives as a relation

• Unmerged (horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-specified confidences [done]

• Data modification statements

Page 22: Trio:  A System for Data, Uncertainty, and Lineage

Confidence Computation

• Confidences computed on-demand based on lineage
– Confidence of alternative A is a function of confidences in λ*(A)
– Permits any query plan for data computation

• Default probabilistic interpretation, but queries can override

SELECT person, min(conf(Saw),conf(Drives)) as conf
FROM Saw, Drives
WHERE Saw.car = Drives.car
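
To make the contrast concrete, the sketch below computes the confidence of one join result in two ways: as a product over its lineage, which is one reading of the default probabilistic interpretation when the base alternatives are independent, and as the minimum, matching the override in the query above. The Saw confidence comes from the earlier slide; the Drives tuple, its confidence, and both function names are made up for illustration.

```python
from math import prod

# Confidences of base alternatives. The Saw value is from the slides;
# the Drives tuple and its 0.8 confidence are hypothetical.
conf = {
    ("Saw", "Amy", "Honda"): 0.5,
    ("Drives", "Jimmy", "Honda"): 0.8,
}

# Conjunctive lineage of the result "Jimmy": it exists only if both
# base alternatives exist.
lineage = [("Saw", "Amy", "Honda"), ("Drives", "Jimmy", "Honda")]

def probabilistic_conf(lineage, conf):
    """Product of base confidences; valid only if they are independent."""
    return prod(conf[a] for a in lineage)

def min_conf(lineage, conf):
    """Query-specified alternative: the weakest base confidence."""
    return min(conf[a] for a in lineage)

print(probabilistic_conf(lineage, conf))  # 0.4
print(min_conf(lineage, conf))            # 0.5
```

Because both functions read only the lineage and the base confidences, they can run after the data has been computed, which is what allows any query plan for the data computation itself.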

Page 23: Trio:  A System for Data, Uncertainty, and Lineage

Trio System: Version 1

Standard relational DBMS
• Encoded Data Tables: “Verticalize”, shared IDs for alternatives, columns for confidence and “?”
• Lineage Tables: one per result table, uses unique IDs
• Trio Metadata: table types, schema-level lineage structure
• Trio Stored Procedures: conf(), lineage() “==>”, lineage*() “==>>”

Trio API and translator (Python)
• Standard SQL

Clients: command-line client, TrioExplorer (GUI client)
• DDL commands
• TriQL queries
• Schema browsing
• Table browsing
• Explore lineage
• On-demand confidence computation
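
One way to picture the encoding is a single relational table with one row per alternative. The sketch below uses SQLite from Python; the table name `saw_enc` and the columns `aid`, `xid`, `conf`, and `maybe` are hypothetical choices for this illustration and are not taken from the Trio implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "verticalized" encoding: one row per alternative.
conn.execute("""
    CREATE TABLE saw_enc (
        aid     INTEGER PRIMARY KEY,  -- unique ID of the alternative
        xid     INTEGER NOT NULL,     -- ID shared by alternatives of one tuple
        witness TEXT,
        car     TEXT,
        conf    REAL,                 -- confidence of the alternative
        maybe   INTEGER               -- 1 if the tuple carries '?'
    )
""")
rows = [
    (1, 1, "Amy", "Honda", 0.5, 0),
    (2, 1, "Amy", "Toyota", 0.3, 0),
    (3, 1, "Amy", "Mazda", 0.2, 0),
    (4, 2, "Betty", "Acura", 0.6, 1),
]
conn.executemany("INSERT INTO saw_enc VALUES (?, ?, ?, ?, ?, ?)", rows)

# Regroup the rows into tuples with alternatives using the shared ID.
xtuples = {}
for xid, witness, car, conf, maybe in conn.execute(
    "SELECT xid, witness, car, conf, maybe FROM saw_enc ORDER BY xid, aid"
):
    xtuples.setdefault(xid, []).append((witness, car, conf, bool(maybe)))

for xid, alts in xtuples.items():
    print(xid, alts)
```

A lineage table in the same spirit would pair the unique ID of a derived alternative with the unique IDs of the alternatives it came from.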

Page 24: Trio:  A System for Data, Uncertainty, and Lineage

Current & Future Topics

Algorithms: confidence computation, coexistence, extraneous data
• Minimize lineage traversal
• Memoization
• Batch operations

System
• Full query language
• More internal processing ?
– Storage and indexing
– Statistics and query optimization

Page 25: Trio:  A System for Data, Uncertainty, and Lineage

Current & Future Topics

• Top-K by confidence

• Extend basic uncertainty model
– Incomplete relations
– Continuous uncertainty
– Correlated uncertainty ?

• External lineage, update lineage, versioning