
Post on 26-May-2020


Example: Sales Prediction Workflow - Stanford University

PANDA: A System for Provenance and Data

Jennifer Widom

Example: Sales Prediction Workflow

[Figure: sales prediction workflow diagram. Data sets: CustList1 … CustListn, CatalogItems, BuyingPatterns, ItemSales. Processing nodes: Dedup, Split, Europe, USA, Union, Predict, ItemAgg.]

Example: Sales Prediction Workflow

[Figure: sales prediction workflow diagram]

Item        Demand
Cowboy Hat  high

Backward Tracing

Name      Item        Prob
Amelie    Cowboy Hat  .98
Jacques   Cowboy Hat  .98
Isabelle  Cowboy Hat  .98

Name      Address
Amelie    … Paris, Texas
Jacques   … Paris, Texas
Isabelle  … Paris, Texas

Example: Sales Prediction Workflow

[Figure: sales prediction workflow diagram, with USA and CustListn highlighted]

Backward Tracing

Name      Address
Amelie    … Paris, Texas
Jacques   … Paris, Texas
Isabelle  … Paris, Texas

Name      Address
Amelie    65, quai d'Orsay, Paris
Jacques   39, rue de Bretagne, Paris
Isabelle  20, rue d'Orsel, Paris

Example: Sales Prediction Workflow

[Figure: sales prediction workflow diagram, with CustListn, Split, and Europe highlighted]

Backward Tracing

Name      Address
Amelie    65, quai d'Orsay, Paris
Jacques   39, rue de Bretagne, Paris
Isabelle  20, rue d'Orsel, Paris

Forward Propagation

Name      Address
Amelie    … Paris, France
Jacques   … Paris, France
Isabelle  … Paris, France

Item   Demand
Beret  high

Provenance (very general)

Where data came from

How it was derived, manipulated, combined, processed, …

How it has evolved over time


Uses for Provenance (very general)

Explanation
  Sources and evolution of data; deeper understanding

Verification
  Buggy or stale source data? Buggy processing?
  Auditing

Recomputation
  Propagate changes to affected “downstream” data

Some Application Domains

Sales prediction workflows ☺

Scientific-data workflows
  Including human-curated data
  Including evolving versions of data

Any analytic pipeline

“Extract-transform-load” (ETL) processes

Information-extraction pipelines


Third Time’s a Charm

1. Data Warehousing project (long ago)
  Lineage of relational views in warehouse: formal foundations, system/caching issues
  Lineage in ETL pipelines: foundations & algorithms

2. “Trio” project (recently)
  Data + Uncertainty + Lineage
  Lineage primarily in support of uncertainty

Isn’t provenance the same thing as lineage? Pretty much.

Haven’t you worked on it before? Yes.

Panda’s Ambitions

Previous provenance work tends to be…

Either data-based or process-based

Either fine-grained or coarse-grained

Focused on modeling and capturing provenance

Geared to specific functions or domains

Panda will…

Capture both: “data-oriented workflows”

Cover the spectrum in a unified fashion

Also support provenance operators and queries

End with a general-purpose open-source system

Remainder of Talk

Processing nodes and provenance capture

Exploiting provenance
  Built-in operations
  Ad-hoc queries
  Other uses

Current research

Processing Nodes

[Figure: sales prediction workflow diagram]

Relational
  Structured, well-understood operations

Opaque (or semi-opaque)
  Incomplete knowledge of internals

Provenance Capture: Model

Ultimate model yet to be determined
  From bipartite graph to Open Provenance Model

Goals:
  Support provenance at spectrum of granularities
  Mesh data-oriented and process-oriented provenance
  Composability/transitivity
  Understandability

For remainder of talk: think bipartite graph
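The bipartite view can be sketched in a few lines of Python. This is an illustration only: the class `ProvenanceGraph`, its methods, the simplified wiring of the workflow, and the intermediate data set names (`Deduped`, `Predictions`, `ItemDemand`) are assumptions made for the example, not Panda's model. Data sets sit on one side, processing nodes on the other, and following edges transitively gives coarse-grained backward tracing.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Bipartite graph: data sets on one side, processing nodes on the other.

    Hypothetical structure for illustration; not Panda's actual model.
    """

    def __init__(self):
        self.inputs = defaultdict(set)  # processing node -> input data sets
        self.producer = {}              # data set -> node that produced it

    def add_step(self, node, inputs, output):
        """Record that `node` read `inputs` and produced `output`."""
        self.inputs[node].update(inputs)
        self.producer[output] = node

    def upstream(self, data_set):
        """All data sets that `data_set` transitively derives from."""
        seen = set()
        stack = [data_set]
        while stack:
            current = stack.pop()
            node = self.producer.get(current)
            if node is None:  # a source data set: nothing upstream
                continue
            for parent in self.inputs[node]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Simplified, assumed wiring of part of the sales prediction workflow.
graph = ProvenanceGraph()
graph.add_step("Dedup", ["CustList1", "CustList2"], "Deduped")
graph.add_step("Predict", ["Deduped", "BuyingPatterns"], "Predictions")
graph.add_step("ItemAgg", ["Predictions"], "ItemDemand")

print(sorted(graph.upstream("ItemDemand")))
# ['BuyingPatterns', 'CustList1', 'CustList2', 'Deduped', 'Predictions']
```

Composability shows up here as plain graph reachability: tracing through two steps is the same as tracing through each step in turn.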

Provenance Capture: Interface

Processing nodes provide provenance information along with output

Eager or Lazy

[Figure: sales prediction workflow diagram]

Provenance Capture: Interface

[Figure: sales prediction workflow diagram, with callout “Straightforward provenance”]

Relational, other well-understood transformations
  Automatic (previous work)

Opaque (or semi-opaque)
  Provided via provenance interface, or inferred

Worst case: no access to fine-grained provenance
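A processing node that supplies provenance alongside its output might look like the sketch below. Everything in it is assumed for illustration: the function `dedup`, the rule of matching on a normalized name, and the `(output, provenance)` return shape are not taken from Panda. It shows the eager case, where links are recorded while the node runs.

```python
def dedup(customers):
    """Deduplicate customer records by name, reporting provenance eagerly.

    Returns (output, provenance), where provenance[i] lists the input
    positions that were merged into output record i.
    The name-based matching rule is a made-up stand-in for real dedup logic.
    """
    output = []
    provenance = []
    index_by_name = {}
    for position, record in enumerate(customers):
        key = record["name"].strip().lower()
        if key in index_by_name:
            # Duplicate: link this input to the existing output record.
            provenance[index_by_name[key]].append(position)
        else:
            index_by_name[key] = len(output)
            output.append(record)
            provenance.append([position])
    return output, provenance


customers = [
    {"name": "Amelie", "address": "Paris"},
    {"name": "Jacques", "address": "Paris"},
    {"name": "amelie ", "address": "Paris"},
]
deduped, links = dedup(customers)
print(links)  # [[0, 2], [1]]
```

A lazy variant would return only `output` and compute the links when a trace is requested.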

Provenance Operations: Basic

[Figure: sales prediction workflow diagram]

Backward tracing
  Where did the Cowboy Hat record come from?

Forward tracing
  Which sales predictions did Amelie contribute to?
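The two basic operations can be sketched over fine-grained, element-level links. The element ids and the `DERIVED_FROM` table below are invented for the example, and the functions assume an acyclic workflow.

```python
# Hypothetical element-level provenance: output element -> input elements.
DERIVED_FROM = {
    "pred:Amelie/CowboyHat": {"cust:Amelie"},
    "pred:Jacques/CowboyHat": {"cust:Jacques"},
    "pred:Amelie/Beret": {"cust:Amelie"},
    "item:CowboyHat": {"pred:Amelie/CowboyHat", "pred:Jacques/CowboyHat"},
    "item:Beret": {"pred:Amelie/Beret"},
}


def backward_trace(element):
    """Source elements (those with no recorded inputs) behind `element`."""
    parents = DERIVED_FROM.get(element)
    if not parents:
        return {element}
    sources = set()
    for parent in parents:
        sources |= backward_trace(parent)
    return sources


def forward_trace(element):
    """Final output elements that `element` contributed to."""
    children = {out for out, ins in DERIVED_FROM.items() if element in ins}
    if not children:
        return {element}
    results = set()
    for child in children:
        results |= forward_trace(child)
    return results


print(sorted(backward_trace("item:CowboyHat")))  # ['cust:Amelie', 'cust:Jacques']
print(sorted(forward_trace("cust:Amelie")))      # ['item:Beret', 'item:CowboyHat']
```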

Additional Functionality

[Figure: sales prediction workflow diagram]

Forward propagation
  Update all affected predictions after customers move from Texas to France

Additional Functionality

[Figure: sales prediction workflow diagram]

Refresh
  Get latest prediction for Cowboy Hat sales (only) based on modified buying patterns
  ≈ Backward tracing + Forward propagation
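As a rough sketch of refresh as backward tracing followed by forward propagation: the data, the toy `item_agg` rule with its 2.0 threshold, and the `refresh` function are all hypothetical. The sketch also assumes the set of contributing inputs has not changed since the output was first computed.

```python
# Hypothetical input data: customer -> probability of buying a Cowboy Hat.
buy_prob = {"Amelie": 0.98, "Jacques": 0.98, "Isabelle": 0.98, "Bob": 0.10}


def item_agg(probs):
    """Toy aggregate: expected number of buyers mapped to a demand label."""
    return "high" if sum(probs) >= 2.0 else "low"


# Provenance recorded when the output element was first computed.
provenance = {"Cowboy Hat": ["Amelie", "Jacques", "Isabelle"]}
output = {"Cowboy Hat": item_agg(buy_prob[c] for c in provenance["Cowboy Hat"])}


def refresh(item):
    """Recompute one output element from the current values of its inputs."""
    contributors = provenance[item]               # backward tracing
    latest = [buy_prob[c] for c in contributors]  # read current input values
    output[item] = item_agg(latest)               # forward propagation
    return output[item]


before = output["Cowboy Hat"]
buy_prob["Amelie"] = 0.05   # buying patterns change
buy_prob["Jacques"] = 0.05
after = refresh("Cowboy Hat")
print(before, "->", after)  # high -> low
```

Only the selected output element is recomputed; other items and non-contributing customers such as `Bob` are never touched.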

Provenance Queries

[Figure: sales prediction workflow diagram]

How many people from each country contributed to the Cowboy Hat prediction?

Which customer list contributed the most to the top 100 predicted items?

Provenance Queries

[Figure: sales prediction workflow diagram]

For a specific customer list, which items have higher demand than for the entire customer set?

Which customers have more duplication — those processed by USA or by Europe?

Query language goals
  Declarative ad-hoc queries à la database systems
  Seamlessly combine provenance and data
  Amenable to optimization
    Many interesting optimization issues
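To make "combine provenance and data" concrete, here is a sketch using ordinary SQL through Python's `sqlite3` module. The schema, the `item_provenance` relation, and the rows are invented for the example; this is not Panda's query language. It answers one of the earlier questions: how many people from each country contributed to the Cowboy Hat prediction?

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT PRIMARY KEY, country TEXT);
    -- Hypothetical provenance relation: which customer contributed to which item.
    CREATE TABLE item_provenance (item TEXT, customer TEXT);

    INSERT INTO customers VALUES
        ('Amelie', 'France'), ('Jacques', 'France'),
        ('Isabelle', 'France'), ('Bob', 'USA');
    INSERT INTO item_provenance VALUES
        ('Cowboy Hat', 'Amelie'), ('Cowboy Hat', 'Jacques'),
        ('Cowboy Hat', 'Bob'), ('Beret', 'Isabelle');
""")

# Join provenance links with ordinary data, then aggregate.
rows = conn.execute("""
    SELECT c.country, COUNT(*) AS contributors
    FROM item_provenance AS p
    JOIN customers AS c ON c.name = p.customer
    WHERE p.item = 'Cowboy Hat'
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)  # [('France', 2), ('USA', 1)]
```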


Current Research

System
  Workflow infrastructure
  Basic provenance capture & operations

Algorithms
  For refresh problem (also foundations & implementation)

Theory
  Decision problems: eager/lazy, intermediate data
  Provenance foundations for semi-opaque transformations

See recent paper

Panda System (version 0.1)

[Figure: Panda system architecture. Components: Command-line Client, Graphical Interface, Panda Layer, SQLite, File System, Workflow Table (Panda), Provenance Predicate Tables (Panda), Forward Filter Tables (Panda), Data Tables (user), SQL Transformations (user), Python Transformations (user).]

Arbitrary acyclic workflows

SQL processing nodes: automatic capture

Python processing nodes: one-one, one-many

Backward-tracing and forward-propagation

Future Work

The large majority of this talk

Stay tuned

Backup slides

Optimization Issues

Query-driven provenance capture

Eager vs. lazy provenance capture
  Space-time & query-update tradeoffs
  Processing-node dependent
    Ex: Dedup, ItemAgg

Retain intermediate data sets?

Fine-grained vs. coarse-grained

Approximate provenance
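The eager/lazy tradeoff can be illustrated with a toy ItemAgg. The function names and data are assumptions for the example. Eager capture stores links as the node runs (more space), while lazy capture stores nothing and rescans the input when a trace is requested (more time, and the input must be retained).

```python
from collections import defaultdict

# Hypothetical Predict output: (customer, item, probability).
predictions = [
    ("Amelie", "Cowboy Hat", 0.98),
    ("Jacques", "Cowboy Hat", 0.98),
    ("Isabelle", "Beret", 0.90),
]


def item_agg_eager(rows):
    """Aggregate and store, per output item, the contributing input positions."""
    totals = defaultdict(float)
    links = defaultdict(list)
    for position, (_name, item, prob) in enumerate(rows):
        totals[item] += prob
        links[item].append(position)
    return dict(totals), dict(links)


def item_agg_lazy(rows):
    """Aggregate only; provenance is recovered later on demand."""
    totals = defaultdict(float)
    for _name, item, prob in rows:
        totals[item] += prob
    return dict(totals)


def trace_lazy(rows, item):
    """Recompute provenance by rescanning the retained input for `item`."""
    return [pos for pos, (_n, row_item, _p) in enumerate(rows) if row_item == item]


totals, links = item_agg_eager(predictions)
print(links["Cowboy Hat"])                    # [0, 1]
print(trace_lazy(predictions, "Cowboy Hat"))  # [0, 1]
```

Lazy tracing is easy here because the grouping key alone identifies the contributing inputs. A node without such a simple predicate would plausibly favor eager capture.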

Panda System (version 0.1)

[Figure: Panda system architecture diagram]

System: Key Features

Arbitrary acyclic workflows

SQL processing nodes
  Automatic provenance capture

Python processing nodes
  One-one and one-many transformations only

Backward-tracing and forward-propagation
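One-one and one-many transformations are convenient because a wrapper can record provenance without looking inside the user's code. The sketch below is hypothetical: `run_one_many` and `predict_items` are invented names and do not reflect Panda's Python interface.

```python
def run_one_many(transform, rows):
    """Apply `transform` to each input row; it returns a list of output rows.

    Each call sees exactly one input row, so the wrapper can record which
    input every output came from without inspecting the transformation.
    """
    output, provenance = [], []
    for position, row in enumerate(rows):
        for result in transform(row):
            output.append(result)
            provenance.append(position)  # this output came from input `position`
    return output, provenance


def predict_items(customer):
    """Toy one-many transformation: one customer yields several predictions."""
    name, address = customer
    items = ["Beret", "Scarf"] if address.endswith("France") else ["Cowboy Hat"]
    return [(name, item) for item in items]


customers = [("Amelie", "Paris, France"), ("Bob", "Paris, Texas")]
predicted, provenance = run_one_many(predict_items, customers)
print(predicted)   # [('Amelie', 'Beret'), ('Amelie', 'Scarf'), ('Bob', 'Cowboy Hat')]
print(provenance)  # [0, 0, 1]
```

A many-one transformation such as an aggregate would not fit this wrapper, since a single output can depend on many inputs.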

Panda System (version 0.1)

[Figure: Panda system architecture diagram. Operation: Create Table, shown with Data Tables (user).]

Panda System (version 0.1)

[Figure: Panda system architecture diagram. Operation: Create SQL Transformation, shown with Workflow Table (Panda), Provenance Predicate Tables (Panda), SQL Transformations (user), Data Tables (user).]

Panda System (version 0.1)

[Figure: Panda system architecture diagram. Operation: Create Python Transformation, shown with Workflow Table (Panda), Provenance Predicate Tables (Panda), Data Tables (user), Forward Filter Tables (Panda), Python Transformations (user).]

Panda System (version 0.1)

[Figure: Panda system architecture diagram. Operation: Refresh Output Element, shown with Workflow Table (Panda), Provenance Predicate Tables (Panda), Data Tables (user), SQL Transformations (user), Forward Filter Tables (Panda), Python Transformations (user).]

Provenance-Based Refresh

Problem
  Exploit provenance to efficiently compute the up-to-date value of selected output elements

Challenges:
  Formally defining provenance & refresh
    When is new value up-to-date version of old one? Provenance as key
  When can selective refresh be done correctly and efficiently?
    Properties of transformations, workflows, provenance
  Algorithms for wide class of transformations and workflows

See recent paper