TRANSCRIPT
Susan B. Davidson U. Penn
Sanjeev Khanna U. Penn
Tova Milo Tel-Aviv U.
Debmalya Panigrahi MIT
Sudeepa Roy U. Penn
Provenance Views for Module Privacy
Data-oriented Workflows Must Be Secure
Ref. Tova Milo’s keynote, PODS 2011
Provenance Views for Module Privacy, PODS 2011
[Figure: example workflow with modules Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees]
In an execution of the workflow, data (values) appear on the edges
[Figure: sequence data values d1–d7 on the workflow edges, e.g. TGCCGTGTGGCTAAAT…]
Workflows
Vertices = Modules/Programs
Edges = Dataflow
Biologist’s workspace
Which sequences have been used to produce this tree?
How has this tree been generated?
[Figure: workflow from source s to tree t, with modules Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees]
Need for Provenance
[Figure: sequence data values with unknown provenance]
• Enable sharing and reuse
• Ensure repeatability and debugging
Need for Provenance vs. Need for Privacy
[Figure: workflow from s to t with modules Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees]
Workflow USER: How has this result been produced? (wants all data values)
Workflow OWNER: My data is sensitive! My module is proprietary! The flow/structure should not be revealed!
Module Privacy
Module f takes input x, produces output y = f(x)
User should not be able to guess (x, f(x)) pairs with high probability (over any number of executions)
Output value f(x) is private, not the algorithm for f
Example: module f with inputs x1, x2, x3, x4 and outputs y1, y2, y3:
f(x1, x2, x3, x4) = <y1, y2, y3>
Module Privacy: Motivation
[Figure: x = Medical Record of patient P flows through modules Process Record, f = Check for AIDS, Check for Cancer, and Create Report, producing a report; f(x) answers "Does P have AIDS?", another module answers "Does P have cancer?"]
Patient P's concern: whether P has AIDS should not be inferable from his medical record
Module owner's concern: no one should be able to simulate the module and use it elsewhere
Module Privacy in a Workflow
Private modules (no a priori knowledge available to the user), e.g. a module for AIDS detection
Public modules (full knowledge available to the user), e.g. sorting and reformatting modules
PODS 2011
[Figure: workflow with modules m1, m2, m3, data attributes a1–a7, and data sharing between modules]
n modules connected as a DAG
For a private module f with input x, the value f(x) should not be revealed
Module Privacy with Secure View
Privacy definition: L-diversity [MGKV '06]
  By hiding some input/output attributes, each x has L different equivalent possibilities for f(x)
  The output view is called a 'secure view'
Differential privacy? [Dwork '06, DMNS '06, …]
  (Usual) random noise cannot be added: scientific experiments must be repeatable
  Any f should always map any x to the same f(x)
Standalone Module Privacy

Module f with inputs x1, x2 and output y; functional dependency: x1, x2 → y
Relation R for f (y = x1 ⊕ x2):

x1 x2 y
0  0  0
0  1  1
1  0  1
1  1  0

A view: projection of R on the visible attributes
Possible world: a relation that agrees with R on the visible attributes (and respects the functional dependency)
Γ-standalone-private view: every input x can be mapped to Γ different outputs by the possible worlds (privacy parameter Γ, e.g. Γ = 2)

Example: with y hidden, the possible worlds include R itself and the relation that differs on the last row:

x1 x2 y
0  0  0
0  1  1
1  0  1
1  1  1
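The Γ-standalone-privacy definition above can be checked directly by enumerating possible worlds. A minimal Python sketch for the XOR example on this slide (the function names and structure are ours, not from the paper):

```python
from itertools import product

ATTRS = ("x1", "x2", "y")
# Relation R for the XOR module from the slide: y = x1 XOR x2.
R = {(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)}

def project(rel, visible):
    """Projection of a relation on the visible attributes."""
    idx = [ATTRS.index(a) for a in visible]
    return {tuple(t[i] for i in idx) for t in rel}

def possible_worlds(visible):
    """All functions {0,1}^2 -> {0,1} whose relation agrees with R
    on the visible attributes (and respects x1, x2 -> y)."""
    inputs = sorted({t[:2] for t in R})
    worlds = []
    for ys in product([0, 1], repeat=len(inputs)):
        w = {x + (y,) for x, y in zip(inputs, ys)}
        if project(w, visible) == project(R, visible):
            worlds.append(dict(zip(inputs, ys)))
    return worlds

def is_gamma_private(visible, gamma):
    """Gamma-standalone-private: every input can be mapped to at
    least gamma distinct outputs by the possible worlds."""
    worlds = possible_worlds(visible)
    inputs = sorted({t[:2] for t in R})
    return all(len({w[x] for w in worlds}) >= gamma for x in inputs)

print(is_gamma_private(("x1", "x2"), 2))       # y hidden: True
print(is_gamma_private(("x1", "x2", "y"), 2))  # nothing hidden: False
```

Hiding the output y makes every world consistent with the view, so each input has both outputs possible; hiding nothing leaves R as the only possible world.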
Workflow Module Privacy

A view: same as before
Possible world: a relation that agrees with R on the visible attributes (and respects ALL functional dependencies)
Γ-workflow-private view: privacy for each private module, as before

Workflow W with modules m1, m2, m3; relation R has n functional dependencies:
1. a1, a2 → a3, a4, a5
2. a3, a4 → a6
3. a4, a5 → a7

Relation R for W:
a1 a2 a3 a4 a5 a6 a7
0  0  0  0  1  1  0
0  1  1  1  0  0  1
1  0  1  1  0  0  1
1  1  1  0  1  1  1
Secure-View Optimization Problem
Conflicting interests: the owner wants privacy, the user wants provenance
Hiding each data attribute has a cost

Secure-view problem: minimize the total cost of the hidden attributes while guaranteeing Γ-workflow-privacy for all private modules
Let’s start with a Single Module
How hard is the secure-view problem for a standalone module?

PROBLEM-1: given a set V of visible attributes, is V safe?
  R given explicitly: communication complexity Ω(N), N = #rows of R
  R given succinctly: computation complexity coNP-hard in k = #attributes of R
PROBLEM-2: find a safe subset V* with minimum cost, given a safety oracle
  Communication complexity: 2^Ω(k) oracle calls are needed
Any Upper Bound?
The trivial brute-force algorithm solves the problem in time O(2^k · N^2), where k = #attributes and N = #rows of R
  It can return ALL safe subsets: useful for the next step
Not so bad in practice: k is not too large for a single module
  A module is reused in many workflows
  Expert knowledge from the module designers can be used to speed up the process
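The brute force can be sketched for the standalone XOR example: enumerate all 2^k attribute subsets, test each with the possible-worlds check, and collect every safe hidden subset. The per-attribute costs below are invented for illustration:

```python
from itertools import combinations, product

ATTRS = ("x1", "x2", "y")
R = {(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)}
COST = {"x1": 3, "x2": 2, "y": 4}  # hypothetical per-attribute hiding costs

def safe(visible, gamma):
    """Possible-worlds check: every input must admit >= gamma outputs."""
    inputs = sorted({t[:2] for t in R})
    proj = lambda rel: {tuple(t[ATTRS.index(a)] for a in visible) for t in rel}
    poss = {x: set() for x in inputs}
    for ys in product([0, 1], repeat=len(inputs)):
        w = {x + (y,) for x, y in zip(inputs, ys)}
        if proj(w) == proj(R):          # world agrees on visible attrs
            for x, y in zip(inputs, ys):
                poss[x].add(y)
    return all(len(s) >= gamma for s in poss.values())

def all_safe_hidden_subsets(gamma):
    """Brute force over all 2^k hidden-attribute subsets."""
    out = []
    for r in range(len(ATTRS) + 1):
        for hidden in combinations(ATTRS, r):
            visible = tuple(a for a in ATTRS if a not in hidden)
            if safe(visible, gamma):
                out.append(frozenset(hidden))
    return out

safe_sets = all_safe_hidden_subsets(2)
cheapest = min(safe_sets, key=lambda s: sum(COST[a] for a in s))
print(sorted(cheapest))  # hiding x2 alone suffices, at cost 2
```

For XOR, hiding any single attribute already gives Γ = 2, so the cheapest safe subset is the single cheapest attribute under these invented costs.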
Moving on to General Workflows
Workflows have arbitrary data sharing, arbitrary (DAG) connections, and interactions between private and public modules
Trivial algorithms are not good: they lead to running time exponential in n
We use the (list of) standalone safe subsets for the private modules
First consider: workflows where all modules are private
Two steps:
1. Show that any combination of safe subsets for standalone privacy is also safe for workflow privacy (Composability)
2. Find the minimum-cost safe subset for the workflow (Optimization)
Composability
Key idea: when a module m is placed in a workflow and the same attribute subset V is hidden, the set of possible worlds shrinks, but the set of possible outputs for each input does not
The proof involves showing the existence of a suitable possible world
The "all-private workflow" assumption is necessary
Optimally Combining Standalone Solutions
Any combination of standalone safe subsets works; we want one with minimum cost
Solve the optimization problem for the workflow, given the list of options for each individual module
Even the simplest version (no data sharing) is NP-hard
In the paper: approximation algorithms and matching hardness results for different versions; bounded data sharing admits a better approximation ratio
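Assuming composability, the combining step can be sketched as a brute force: pick one standalone safe subset per private module and minimize the cost of the union of hidden attributes (shared attributes, which make the problem set-cover-like, are paid for once). The module options and costs below are hypothetical:

```python
from itertools import product

# Hypothetical standalone safe-subset lists (hidden-attribute options)
# for two private modules sharing attribute a3, with invented costs.
SAFE_OPTIONS = {
    "m1": [frozenset({"a3"}), frozenset({"a1", "a2"})],
    "m2": [frozenset({"a3", "a6"}), frozenset({"a6"})],
}
COST = {"a1": 3, "a2": 1, "a3": 2, "a6": 2}

def min_cost_combination(options, cost):
    """Pick one safe subset per module; by composability the union is
    workflow-safe. Minimize the cost of the union of hidden attributes
    (brute force, exponential in the number of modules)."""
    best_cost, best_hidden = float("inf"), None
    for choice in product(*options.values()):
        hidden = frozenset().union(*choice)
        c = sum(cost[a] for a in hidden)
        if c < best_cost:
            best_cost, best_hidden = c, hidden
    return best_cost, best_hidden

c, hidden = min_cost_combination(SAFE_OPTIONS, COST)
print(c, sorted(hidden))  # 4 ['a3', 'a6']
```

Here hiding {a3} for m1 and {a3, a6} for m2 overlap on a3, so the combined cost (4) beats the cheaper-looking per-module choices taken independently.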
Workflows with Public Modules

Public modules are difficult to handle: composability does not work
  Example: a private module f1 with f1(x) = y feeds a public module f2 with f2(y) = y; since f2 is fully known to the user, it can leak f1's output
Solution: privatize some public modules
  Names of "privatized" modules are not revealed, so composability works again
  Privatization has an additional cost, and yields worse approximation results
Related Work
Workflow privacy (mainly access control): Chebotko et al. '08, Gil et al. '07, '10
Secure provenance: Braun et al. '08, Hasan et al. '07, Lyle-Martin '10
Privacy-preserving data mining: surveys by Aggarwal-Yu '08, Verykios et al. '04
Privacy in statistical databases: survey by Dwork '08
Conclusion and Future Work

This is a first step toward handling module privacy in a network of modules

Future directions:
1. Explore alternative notions of privacy / partial background knowledge
2. Explore alternative "privatization" techniques for public modules
3. Handle infinite/very large attribute domains
Thank You.
Questions?