Post on 20-Jan-2015

Page 1: Trio Meeting

Trio: A System for Data, Uncertainty, and Lineage

Search “stanford trio”

http://i.stanford.edu/trio

Page 2: Trio Meeting

People

Current

• Jennifer Widom (faculty)

• Omar Benjelloun (post-doc)

• Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD)

• Michi Mutsuzaki (MS)

• Tomoe Sugihara (visitor)

Incoming

• Martin Theobald (post-doc)

• Raghu Murthy (MS)

• Ander de Keijzer (visitor)

Alums

• Alon Halevy, Ashok Chandra (visitors)

• Chris Hayworth (MS)

Page 3: Trio Meeting

Why Uncertainty + Lineage?

Many applications seem to need both

From a technical standpoint, it turns out that lineage...

1. Enables simple and consistent representation of uncertain data

2. Correlates uncertainty in query results with uncertainty in the input data

3. Can make computation over uncertain data more efficient

Page 4: Trio Meeting

Trio Components

1. Data Model
   ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model

2. Query Language
   TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior

3. System
   Version 1: Complete system and GUI built on top of conventional DBMS

Page 5: Trio Meeting

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = π_person(Saw ⋈ Drives)

Page 6: Trio Meeting

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 7: Trio Meeting

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

=

witness   car
Amy       { Honda, Toyota, Mazda }

Three possible instances

Page 8: Trio Meeting

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura) ?

Six possible instances

Page 9: Trio Meeting

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability
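
The six instances and their probabilities can be enumerated directly. The sketch below is only an illustration, not Trio code: it assumes that x-tuples are independent of one another and that a confidence total below 1 (Betty's 0.6) leaves the remainder as the probability that the tuple is absent. The names `choices` and `possible_instances` are made up for this example.

```python
from itertools import product

# Each x-tuple is a list of (alternative, confidence) pairs.
# Assumption: if the confidences sum to less than 1, the remainder is
# the probability that the tuple is absent (the '?' case).
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]

def choices(xtuple):
    """All choices for one x-tuple: each alternative, plus absence if allowed."""
    options = list(xtuple)
    absent = 1.0 - sum(conf for _, conf in xtuple)
    if absent > 1e-9:
        options.append((None, absent))  # None marks "tuple not present"
    return options

def possible_instances(relation):
    """Yield (instance, probability), assuming x-tuples are independent."""
    for combo in product(*(choices(x) for x in relation)):
        instance = frozenset(alt for alt, _ in combo if alt is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        yield instance, prob

instances = list(possible_instances(saw))
for inst, p in instances:
    print(sorted(inst), round(p, 2))
```

Three choices for Amy times two for Betty (present or absent) give the six instances, and the probabilities sum to 1.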

Page 10: Trio Meeting

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty [ICDE 06, journal]

• Tension between understandability and expressiveness

  – Our model is understandable

  – But it is not complete, or even closed under common operations

Page 11: Trio Meeting

Our Model is Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects = π_person(Saw ⋈ Drives)

Suspects

Jimmy ?

Billy ∥ Frank ?

Hank ?

CANNOT correctly capture possible instances in the result
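
A brute-force check makes the problem concrete. The sketch below is an illustration under assumptions, not part of Trio: it reads the candidate result as three tuples that each carry a ‘?’, evaluates the query on every possible instance of the inputs, and compares that with the instances the candidate would allow. The helper names are hypothetical.

```python
from itertools import product

# Base x-tuples from the slide (no '?' on any of them).
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def instances(relation):
    """Every way of picking one alternative (or None for absence) per x-tuple."""
    return [frozenset(t for t in combo if t is not None)
            for combo in product(*relation)]

def suspects(saw_inst, drives_inst):
    """pi_person(Saw join Drives) on ordinary, certain relations."""
    return frozenset(p for (_, c1) in saw_inst
                     for (p, c2) in drives_inst if c1 == c2)

# Correct answer: run the query on every possible instance of the inputs.
true_results = {suspects(s, d) for s in instances(saw) for d in instances(drives)}

# Candidate representation without lineage (assumed reading of the slide):
# Jimmy ?, Billy || Frank ?, Hank ?
naive = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
naive_results = set(instances(naive))

# Instances the candidate allows that the query can never produce.
spurious = naive_results - true_results
print(len(true_results), len(naive_results), len(spurious))
```

The query has only four possible results, while the candidate admits twelve, including {Jimmy, Hank}, which would require Cathy to have seen both a Mazda and a Honda.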

Page 12: Trio Meeting

Lineage to the Rescue

Lineage

• Captures “where data came from”

• In Trio: A function λ from alternatives to other alternatives (or external sources)

Page 13: Trio Meeting

Example with Lineage

ID   Saw (witness,car)
11   (Cathy, Honda) ∥ (Cathy, Mazda)

ID   Drives (person,car)
21   (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22   (Billy, Honda) ∥ (Frank, Honda)
23   (Hank, Honda)

Suspects = π_person(Saw ⋈ Drives)

ID   Suspects
31   Jimmy ?
32   Billy ∥ Frank ?
33   Hank ?

λ(31) = (11,2), (21,2)

λ(32,1) = (11,1), (22,1); λ(32,2) = (11,1), (22,2)

λ(33) = (11,1), 23

Correctly captures possible instances in the result
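
The effect of λ can be checked by enumeration. In this sketch (an illustration, not Trio's implementation) a derived alternative is present exactly when every alternative in its lineage has been chosen. The dictionaries and the function name are hypothetical, and single-alternative tuples are written with an explicit alternative number 1, where the slide abbreviates (23,1) as 23.

```python
from itertools import product

# Base x-tuples: ID -> number of alternatives (none of them has a '?').
base = {11: 2, 21: 2, 22: 2, 23: 1}

# Derived alternatives with conjunctive lineage:
# (id, alt) -> (value, set of base (id, alt) pairs it depends on).
derived = {
    (31, 1): ("Jimmy", {(11, 2), (21, 2)}),
    (32, 1): ("Billy", {(11, 1), (22, 1)}),
    (32, 2): ("Frank", {(11, 1), (22, 2)}),
    (33, 1): ("Hank",  {(11, 1), (23, 1)}),
}

def suspects_instances():
    """Possible instances of Suspects implied by the base data and lineage."""
    ids = sorted(base)
    results = set()
    for picks in product(*(range(1, base[i] + 1) for i in ids)):
        chosen = set(zip(ids, picks))  # one alternative per base x-tuple
        results.add(frozenset(v for v, lin in derived.values() if lin <= chosen))
    return results

lineage_results = suspects_instances()
print(sorted(sorted(r) for r in lineage_results))
```

With lineage, only four instances remain: the empty set, {Jimmy}, {Billy, Hank}, and {Frank, Hank}.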

Page 14: Trio Meeting

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete [VLDB 06]

Page 15: Trio Meeting

ULDBs: Lineage

• Conjunctive lineage sufficient for most operations

• Duplicate-elimination: Disjunctive lineage

• Difference: Negative lineage

• General case after multiple operations/queries: Boolean formula
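
One way to picture these lineage forms is as Boolean formulas over base alternatives. The sketch below is an assumed encoding for illustration only (nested tuples and a hypothetical `holds` function), not how Trio stores lineage.

```python
# Lineage as a Boolean formula over base alternatives, written as nested tuples:
#   ("var", name)          a base alternative
#   ("and", f1, f2, ...)   conjunctive lineage (e.g. join)
#   ("or", f1, f2, ...)    disjunctive lineage (e.g. duplicate elimination)
#   ("not", f)             negative lineage (e.g. difference)

def holds(formula, present):
    """True if the derived alternative exists when exactly the base
    alternatives in `present` exist."""
    op = formula[0]
    if op == "var":
        return formula[1] in present
    if op == "and":
        return all(holds(f, present) for f in formula[1:])
    if op == "or":
        return any(holds(f, present) for f in formula[1:])
    if op == "not":
        return not holds(formula[1], present)
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical base alternatives s1, d1, d2, r1, t1.
join = ("and", ("var", "s1"), ("var", "d1"))
dedup = ("or", join, ("and", ("var", "s1"), ("var", "d2")))
diff = ("and", ("var", "r1"), ("not", ("var", "t1")))

print(holds(dedup, {"s1", "d2"}), holds(diff, {"r1", "t1"}))
```

Nesting the three operators covers the general case after several queries, where lineage becomes an arbitrary Boolean formula.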

Page 16: Trio Meeting

ULDBs: Interesting Questions

• Data-minimality: extraneous alternatives, extraneous “?”

• Lineage-minimality: harder

• Membership: tuple and table, some-instance and all-instances

• Coexistence: multiple tuples

• Extraction: remove tables, retain possible-instances

Page 17: Trio Meeting

Example: Extraneous Data

[Figure: the tuple (Diane, Mazda) ∥ (Diane, Acura); derived tuples (Diane, Mazda), (Diane, Acura), and Diane carrying ‘?’ annotations; label “extraneous”]

Page 18: Trio Meeting

Example: Coexistence

[Figure: the tuple (Diane, Mazda) ∥ (Diane, Acura); derived tuples (Diane, Mazda), (Diane, Acura), Mazda, and Acura carrying ‘?’ annotations; label “Can’t coexist”]

Page 19: Trio Meeting

Querying ULDBs: Semantics

Query Q on ULDB D

[Commutative diagram]

D → (possible instances) → D1, D2, …, Dn

D1, D2, …, Dn → (Q on each instance) → Q(D1), Q(D2), …, Q(Dn)

D → (implementation of Q: operational semantics) → D’ = D + Result

D’ → (representation of instances) → Q(D1), Q(D2), …, Q(Dn)
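
The requirement expressed by the diagram can be written as one equation. This is a sketch in notation assumed here, not taken from the slides: PI(D) denotes the set of possible instances of D, and each instance of D’ is read as an instance of D paired with the query result on that instance.

```latex
\[
  PI(D') \;=\; \bigl\{\, \bigl(D_i,\; Q(D_i)\bigr) \;:\; D_i \in PI(D) \,\bigr\},
  \qquad PI(D) = \{D_1, D_2, \ldots, D_n\}
\]
```

Both paths through the diagram must therefore arrive at the same set of result instances Q(D1), Q(D2), …, Q(Dn).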

Page 20: Trio Meeting

Querying ULDBs: TriQL

Basic TriQL: SQL with new semantics

• Obeys commutative diagram for uncertain data

• Tracks lineage

• Query results: new table or on-the-fly

Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()

Page 21: Trio Meeting

Additional TriQL Constructs

[Language manual on web site]

• “Horizontal subqueries”: refer to tuple alternatives as a relation

• Unmerged (horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-specified confidences [done]

• Data modification statements

Page 22: Trio Meeting

Confidence Computation

• Confidences computed on-demand based on lineage

  – Confidence of alternative A is a function of the confidences in λ*(A)

  – Permits any query plan for data computation

• Default probabilistic interpretation, but queries can override

    SELECT person, min(conf(Saw),conf(Drives)) AS conf
    FROM Saw, Drives
    WHERE Saw.car = Drives.car
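
The difference between the default and the override can be shown on a single derived alternative. The sketch below is an illustration under assumptions: the confidence values are invented, the lineage is conjunctive and already expressed over base alternatives, and the base alternatives are independent, so the probabilistic reading reduces to a product. The function names are hypothetical.

```python
from math import prod

# Hypothetical confidences for two base alternatives that join on car = Honda.
conf = {
    ("Saw", "Amy", "Honda"): 0.5,
    ("Drives", "Jimmy", "Honda"): 0.8,
}

# Conjunctive lineage of the derived Suspects alternative "Jimmy".
lineage = [("Saw", "Amy", "Honda"), ("Drives", "Jimmy", "Honda")]

def default_conf(lineage, conf):
    """Probabilistic reading: independent base alternatives, so multiply."""
    return prod(conf[a] for a in lineage)

def min_conf(lineage, conf):
    """Query-specified override, as in min(conf(Saw), conf(Drives))."""
    return min(conf[a] for a in lineage)

print(round(default_conf(lineage, conf), 2), min_conf(lineage, conf))
```

Because both functions read only the lineage and the base confidences, the data itself can be computed by any query plan and the confidence filled in later, on demand.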

Page 23: Trio Meeting

Trio System: Version 1

[Architecture diagram]

Clients: Command-line client, TrioExplorer (GUI client)

• DDL commands

• TriQL queries

• Schema browsing

• Table browsing

• Explore lineage

• On-demand confidence computation

Trio API and translator (Python)

• Issues standard SQL to the DBMS

Standard relational DBMS

• Encoded Data Tables: “verticalize”, shared IDs for alternatives, columns for confidence and “?”

• Lineage Tables: one per result table, uses unique IDs

• Trio Metadata: table types, schema-level lineage structure

• Trio Stored Procedures: conf(), lineage() “==>”, lineage*() “==>>”
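
The encoded data tables can be pictured with a small relational sketch. Everything below is an assumption made for illustration: the table name `saw_enc`, the column names `aid`, `xid`, `conf`, and `maybe`, and the use of SQLite are not taken from the Trio system, and lineage tables are not modeled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical encoding: one row per alternative ("verticalized").
    -- xid is shared by the alternatives of one x-tuple;
    -- aid is unique per alternative; maybe = 1 marks a '?' tuple.
    CREATE TABLE saw_enc (
        aid INTEGER PRIMARY KEY, xid INTEGER,
        witness TEXT, car TEXT, conf REAL, maybe INTEGER
    );
    INSERT INTO saw_enc VALUES
        (1, 11, 'Amy',   'Honda',  0.5, 0),
        (2, 11, 'Amy',   'Toyota', 0.3, 0),
        (3, 11, 'Amy',   'Mazda',  0.2, 0),
        (4, 12, 'Betty', 'Acura',  0.6, 1);
""")

# Reassemble x-tuples from the flat rows by grouping on the shared ID.
xtuples = {}
query = "SELECT xid, witness, car, conf, maybe FROM saw_enc ORDER BY xid, aid"
for xid, witness, car, conf, maybe in con.execute(query):
    xtuples.setdefault(xid, []).append(((witness, car), conf, bool(maybe)))

for xid, alts in xtuples.items():
    print(xid, alts)
```

Keeping one ordinary row per alternative is what lets plain SQL on a conventional DBMS do most of the work, with the shared ID recovering which alternatives are mutually exclusive.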

Page 24: Trio Meeting

Current & Future Topics

Algorithms: confidence computation, coexistence, extraneous data

• Minimize lineage traversal

• Memoization

• Batch operations

System

• Full query language

• More internal processing ?

  – Storage and indexing

  – Statistics and query optimization

Page 25: Trio Meeting

Current & Future Topics

• Top-K by confidence

• Extend basic uncertainty model

  – Incomplete relations

  – Continuous uncertainty

  – Correlated uncertainty ?

• External lineage, update lineage, versioning