trio: a system for data, uncertainty, and lineage search “stanford trio”

25
Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio” http://i.stanford.edu/trio

Post on 21-Dec-2015

223 views

Category:

Documents


3 download

TRANSCRIPT

Trio: A System for Data, Uncertainty, and Lineage

Search “stanford trio”http://i.stanford.edu/trio

2

People

Current• Jennifer Widom (faculty)• Omar Benjelloun (post-doc)• Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD)• Michi Mutsuzaki (MS)• Tomoe Sugihara (visitor)

Incoming• Martin Theobald (post-doc)• Raghu Murthy (MS)• Ander de Keijzer (visitor)

Alums• Alon Halevy, Ashok Chandra (visitors)• Chris Hayworth (MS)

3

Why Uncertainty + Lineage?

Many applications seem to need both

From a technical standpoint, it turns out that

lineage...1. Enables simple and consistent

representation of uncertain data

2. Correlates uncertainty in query results with uncertainty in the input data

3. Can make computation over uncertain data more efficient

4

Trio Components

1. Data Model ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model

2. Query Language TriQL: Simple extension to SQL, well-defined

semantics and intuitive behavior

3. System Version 1: Complete system and GUI built

on top of conventional DBMS

5

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = πperson(Saw ⋈ Drives)

6

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

7

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

witness car

Amy { Honda, Toyota, Mazda }

=

Three possibleinstances

8

Six possibleinstances

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

9

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

10

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty [ICDE 06, journal]

• Tension between understandability and expressiveness– Our model is understandable

– But it is not complete, or even closed under common operations

11

Our Model is Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈ Drives)

???

Does not correctlycapture possibleinstances in theresult

CANNOT

12

Lineage to the Rescue

Lineage• Captures “where data came from”

• In Trio: A function λ from alternatives to other alternatives (or external sources)

13

Example with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

Suspects = πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2)

λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

14

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete[VLDB 06]

15

ULDBs: Lineage

• Conjunctive lineage sufficient for most operations

• Duplicate-elimination: Disjunctive lineage

• Difference: Negative lineage

• General case after multiple operations/queries: Boolean formula

16

ULDBs: Interesting Questions

• Data-minimality: extraneous alternatives, extraneous “?”

• Lineage-minimality: harder

• Membership: tuple and table, some-instance and all-instances

• Coexistence: multiple tuples

• Extraction: remove tables, retain possible-instances

17

Example: Extraneous Data

(Diane, Mazda) ∥ (Diane, Acura)

Dianeextraneous

(Diane, Mazda)

(Diane, Acura)

?

??

18

Example: Coexistence

Mazda

Acura

(Diane, Mazda) ∥ (Diane, Acura)

(Diane, Mazda)

(Diane, Acura)

?

??

?Can’t coexist

19

Querying ULDBs: Semantics

Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

operational semanticsD + ResultD + Result

20

Querying ULDBs: TriQL

Basic TriQL: SQL with new semantics• Obeys commutative diagram for uncertain data

• Tracks lineage

• Query results: new table or on-the-fly

Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()

21

Additional TriQL Constructs

[Language manual on web site]

• “Horizontal subqueries”Refer to tuple alternatives as a relation

• Unmerged (horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-specified confidences [done]

• Data modification statements

22

Confidence Computation

• Confidences computed on-demand based on lineage—Confidence of alternative A is function of

confidences in λ*(A)

—Permits any query plan for data computation

• Default probabilistic interpretation, but queries can override

SELECT person, min(conf(Saw),conf(Drives)) as confFROM Saw, DrivesWHERE Saw.car = Drives.car

23

Trio System: Version 1

Standard relational DBMS

Trio API and translator(Python)

Trio API and translator(Python)

Command-lineclient

Command-lineclient

TrioMetadat

a

TrioExplorer(GUI client)

TrioExplorer(GUI client)

Trio Stored

Procedures

EncodedData

TablesLineageTables

Standard SQL• “Verticalize”• Shared IDs for alternatives• Columns for confidence,“?”• One per result table• Uses unique IDs

• Table types• Schema-level lineage structure

• conf()• lineage() “==>”• lineage*() “==>>”

• DDL commands• TriQL queries• Schema browsing• Table browsing• Explore lineage• On-demand confidence computation

24

Current & Future Topics

Algorithms: confidence computation, coexistence

extraneous data• Minimize lineage traversal• Memoization• Batch operations

System• Full query language• More internal processing ?

– Storage and indexing– Statistics and query optimization

25

Current & Future Topics

• Top-K by confidence

• Extend basic uncertainty model—Incomplete relations

—Continuous uncertainty

—Correlated uncertainty ?

• External lineage, update lineage, versioning