example: sales prediction workflow - stanford university€¦ · example: sales prediction workflow...
TRANSCRIPT
PANDAPANDAA System for Provenance and DataA System for Provenance and Data
Jennifer Widom
Example: Sales Prediction Workflow
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . .. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
2
Jennifer Widom
?
Example: Sales Prediction Workflow
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
Item DemandCowboy Hat high
Name Item ProbAmelie Cowboy Hat .98
Jacques Cowboy Hat .98Isabelle Cowboy Hat .98
Backward TracingBackward Tracing
??
3
Name AddressAmelie … Paris,
TexasJacques … Paris,
TexasIsabelle … Paris,
Texas
Jennifer Widom
USAUSA
Name AddressAmelie … Paris,
TexasJacques … Paris,
TexasIsabelle … Paris,
Texas
Example: Sales Prediction Workflow
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
Name AddressAmelie 65, quai d'Orsay, Paris
Jacques 39, rue de Bretagne, Paris
Isabelle 20, rue d‘Orsel, Paris
?
Backward TracingBackward Tracing
4
CustListnCustListn
Jennifer Widom
SplitSplit
EuropeEurope
USA
Dedup Union Predict ItemAgg
Name AddressAmelie … Paris, France
Jacques … Paris, France
Isabelle … Paris, France
Example: Sales Prediction Workflow
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
ItemSalesItemSales
. . .
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Name AddressAmelie 65, quai d'Orsay, Paris
Jacques 39, rue de Bretagne, Paris
Isabelle 20, rue d‘Orsel, Paris
Item DemandBeret high
Backward TracingBackward TracingForward PropagationForward Propagation
5
CustListnCustListnCustListnCustListn
Jennifer Widom
Provenance (very general)
Where data came from
How it was derived, manipulated, combined, processed, …
How it has evolved over time
6
Jennifer Widom
Uses for Provenance (very general)
ExplanationSources and evolution of data; deeper understanding
VerificationBuggy or stale source data? Buggy processing?Auditing
RecomputationPropagate changes to affected “downstream” data
7
Jennifer Widom
Some Application Domains
Sales prediction workflows ☺
Scientific-data workflowsIncluding human-curated dataIncluding evolving versions of data
Any analytic pipeline
“Extract-transform-load” (ETL) processes
Information-extraction pipelines
8
Jennifer Widom
Third Time’s a Charm
1. Data Warehousing project (long ago)Lineage of relational views in warehouse:
formal foundations, system/caching issuesLineage in ETL pipelines: foundations & algorithms
2. “Trio” project (recently)Data + Uncertainty + LineageLineage primarily in support of uncertainty
Isn’t provenance the same thing as lineage?Isn’t provenance the same thing as lineage?
Haven’t you worked on it before?Haven’t you worked on it before?
Pretty muchPretty much
YesYes
9
Jennifer Widom
Panda’s Ambitions
Previous provenance work tends to be…
Either data-based or process-based
Either fine-grained or coarse-grained
Focused on modeling and capturing provenance
Geared to specific functions or domains
10
Panda will…Panda will…
Capture both: “data-oriented workflows”Capture both: “data-oriented workflows”
Cover the spectrum in a unified fashionCover the spectrum in a unified fashion
Also support provenance operators and queriesAlso support provenance operators and queries
End with a general-purpose open-source system
End with a general-purpose open-source system
Jennifer Widom
Remainder of Talk
Processing nodes and provenance capture
Exploiting provenanceBuilt-in operationsAd-hoc queriesOther uses
Current research
11
Jennifer Widom
Union
Europe
USA
Dedup Predict ItemAggSplit Union ItemAggDedup Split
USA
Europe
Predict
Processing Nodes
RelationalStructured, well-understood operations
Opaque (or semi-opaque)Incomplete knowledge of internals
12
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
ItemSalesItemSales
. . .
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Jennifer Widom
Provenance Capture — Model
Ultimate model yet to be determinedFrom bipartite graph to Open Provenance Model
Goals:Support provenance at spectrum of granularitiesMesh data-oriented and process-oriented provenanceComposability/transitivity
For remainder of talk: think bipartite graph
13
UnderstandabilityUnderstandability
Jennifer Widom
Processing nodes provide provenanceinformation along with output
Eager or Lazy
Provenance Capture — Interface
14
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
ItemSalesItemSales
. . .
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Europe
USA
Dedup Union Predict ItemAggSplit
Jennifer Widom
Relational, other well-understood transformationsAutomatic — previous work
Opaque (or semi-opaque)Provided via provenance interface, or inferred
Worst Case:No access to fine-grained provenance
Worst Case:No access to fine-grained provenance
Split
Europe
Dedup
USA
Split
Europe
Dedup
USA
Union ItemAggPredict
Provenance Capture — Interface
15
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
ItemSalesItemSales
. . .
CatalogItems
CatalogItems
BuyingPatternsBuying
PatternsStraightforward
provenanceStraightforward
provenance
Jennifer Widom
Provenance Operations — Basic
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
16
Backward tracingWhere did the Cowboy Hat record come from?
Forward tracingWhich sales predictions did Amelie contribute to?
Jennifer Widom
Additional Functionality
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
17
Forward propagationUpdate all affected predictions after customersmove from Texas to France
Jennifer Widom
Additional Functionality
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
18
RefreshGet latest prediction for Cowboy Hat sales (only)based on modified buying patterns
≈ Backward tracing + Forward propagation
Jennifer Widom
Provenance Queries
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
19
How many people from each country contributed to the Cowboy Hat prediction?
Which customer list contributed the most to the top 100 predicted items?
Jennifer Widom
Provenance Queries
CustListnCustListn
CustListn-1CustListn-1
CustList2CustList2
CustList1CustList1
Europe
USA
Dedup Union Predict
ItemSalesItemSales
. . . ItemAgg
CatalogItems
CatalogItems
BuyingPatternsBuying
Patterns
Split
20
For a specific customer list, which items have higher demand than for the entire customer set?
Which customers have more duplication — those processed by USA or by Europe?
Jennifer Widom
Provenance Queries
21
For a specific customer list, which items have higher demand than for the entire customer set?
Which customers have more duplication — those processed by USA or by Europe?
Query language goalsDeclarative ad-hoc queries à la database systems
Seamlessly combine provenance and data
Amenable to optimizationMany interesting optimization issues
Jennifer Widom
Current Research
SystemWorkflow infrastructureBasic provenance capture & operations
AlgorithmsFor refresh problem(also foundations & implementation)
TheoryDecision problems: eager/lazy, intermediate dataProvenance foundations for semi-opaque transformations
22
Jennifer Widom
Current Research
SystemWorkflow infrastructureBasic provenance capture & operations
AlgorithmsFor refresh problem(also foundations & implementation)
TheoryDecision problems: eager/lazy, intermediate dataProvenance foundations for semi-opaque transformations
23
See recentpaper
See recentpaper
Jennifer Widom
Panda System (version 0.1)
24
SQLite
Panda Layer
Command-line Client
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Graphical Interface
Forward FilterTables (Panda)
PythonTransformations
(user)
File System
Arbitrary acyclic workflowsSQL processing nodesautomatic capture
Python processing nodes one-one, one-many
Backward-tracing and forward-propagation
Jennifer Widom
Future Work
25
The large majority of this talkStay tunedStay tuned
Backup slides
Jennifer Widom
Optimization Issues
Query-driven provenance capture
Eager vs. lazy provenance captureSpace-time & query-update tradeoffsProcessing-node dependent
Ex: Dedup, ItemAgg
Retain intermediate data sets?
Fine-grained vs. coarse-grained
Approximate provenance
27
Jennifer Widom
Panda System (version 0.1)
28
SQLite
Panda Layer
Command-line Client
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Graphical Interface
Forward FilterTables (Panda)
PythonTransformations
(user)
File System
Jennifer Widom
System: Key Features
Arbitrary acyclic workflows
SQL processing nodesAutomatic provenance capture
Python processing nodesOne-one and one-many transformations only
Backward-tracing and forward-propagation
29
Jennifer Widom
Panda System (version 0.1)
30
SQLite
Panda Layer
Command-line Client
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
File System
CreateTable
DataTables(user)
Graphical Interface
Jennifer Widom
Panda System (version 0.1)
31
SQLite
Panda Layer
Command-line Client
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
File System
CreateSQL
Transformation
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
SQLTransformations
(user)
DataTables(user)
Graphical Interface
Jennifer Widom
Panda System (version 0.1)
32
SQLite
Panda Layer
Command-line Client
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
File System
CreatePython
Transformation
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
Graphical Interface
Jennifer Widom
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
SQLTransformations
(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
Workflow Table
(Panda)
ProvenancePredicate Tables
(Panda)
DataTables(user)
Forward FilterTables (Panda)
PythonTransformations
(user)
SQLTransformations
(user)
Panda System (version 0.1)
33
SQLite
Panda Layer
Command-line Client
File System
RefreshOutputElement
Graphical Interface
Jennifer Widom
Provenance-Based Refresh
ProblemExploit provenance to efficiently compute the up-to-datevalue of selected output elements
Challenges:Formally defining provenance & refresh
When is new value up-to-date version of old one?Provenance as key
When can selective refresh be done correctly and efficiently?
Properties of transformations, workflows, provenance
Algorithms for wide class of transformations and workflows
34
See recentpaper
See recentpaper