Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing
Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun
PARC
| First Name | Last Name | Age | State | ZIP |
|---|---|---|---|---|
| John | Steinbeck | 32 | CA | 94043 |
| Jimi | Hendrix | 27 | WA | 01000 |
| Isaac | Asimov | -15 | NY | NULL |
What about data quality?
Alice does not know data quality prior to acquisition
Dirty data costs US businesses ~$600 billion annually[1]
Data cleaning accounts for up to 80% of development time
[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute, 2002
[Pie chart: Data Cleaning 80%, Data Exploration 20%]
Privacy concerns for Bob
Alice: How many rows are complete?
Bob: All of them
Trust and privacy concerns for Alice
Problem
Privacy-Preserving Data Quality Assessment
Data Quality Metrics
Integrity constraints on attributes
• =, >, [ ]
• e.g., age > 0
Dependency constraints across 2+ attributes
• if, while, for
• e.g., if state == CA, then ZIP in [94000, 96199]
Many data quality metrics [1,2]: Completeness, Validity, Uniqueness, Consistency, Timeliness
[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & management, 40(2), 2002
[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171, 1996
Data Quality Metrics
Completeness
Percentage of elements that are properly populated
Check for values such as NULL, “”, …
Data Quality Metrics
Validity
Percentage of elements whose attributes possess meaningful values
Data Quality Metrics
Consistency
Degree to which the data attributes satisfy dependency constraints
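The three metrics above can be sketched in the clear (no privacy protection yet) on the sample table from the slides. This is only an illustration: the function names, the choice to measure completeness per cell, and the use of `age > 0` and the CA/ZIP rule as the only constraints are assumptions made for the example.

```python
from fractions import Fraction

# Sample table from the slides; None stands for NULL.
ROWS = [
    {"first": "John", "last": "Steinbeck", "age": 32, "state": "CA", "zip": "94043"},
    {"first": "Jimi", "last": "Hendrix", "age": 27, "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov", "age": -15, "state": "NY", "zip": None},
]


def completeness(rows):
    """Fraction of cells that are populated (neither NULL nor empty)."""
    cells = [value for row in rows for value in row.values()]
    populated = [value for value in cells if value not in (None, "")]
    return Fraction(len(populated), len(cells))


def validity(rows):
    """Fraction of rows satisfying the integrity constraint age > 0."""
    valid = [row for row in rows if row["age"] is not None and row["age"] > 0]
    return Fraction(len(valid), len(rows))


def consistency(rows):
    """Fraction of rows satisfying: if state == CA, then ZIP in [94000, 96199]."""
    def satisfies_rule(row):
        if row["state"] != "CA":
            return True  # the rule only constrains CA rows
        return row["zip"] is not None and 94000 <= int(row["zip"]) <= 96199

    return Fraction(sum(satisfies_rule(row) for row in rows), len(rows))


print(completeness(ROWS), validity(ROWS), consistency(ROWS))
```

With only the single CA rule, every row counts as consistent; a fuller rule set (for example one covering WA ZIP codes) would be needed to flag the Hendrix row.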
Desired Privacy Properties
Query Privacy: Bob should not learn the data quality constraint parameters or the resulting values.
Data Privacy: Alice should not learn anything from Bob’s data besides the quality metric.
Application: High-Fidelity Cyber Threat Mitigation
[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005
[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008
[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006
[Figure: four data tables, each with columns IP, Port, Time, UID, APT]
Solutions
• Rely on existing cryptographic primitives
• Develop custom solution
Private Set Intersection
Set intersection or cardinality of set intersection
[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004
[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012
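Private Set Intersection: Completeness
Alice: {NULL}, expanded to {(1, NULL), (2, NULL), …, (n, NULL)}
Bob: {d1, …, dn}, expanded to {(1, d1), (2, d2), …, (n, dn)}
The PSI-CA approach is inefficient: both input sets contain n elements, one per tuple.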
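One way to read the slide's figure is that each value is tagged with its position so that repeated NULLs remain distinct set elements. The sketch below mimics that encoding in the clear; a real PSI-CA protocol would compute the same cardinality without revealing either set. The helper name `tag` and the sample column are hypothetical.

```python
def tag(values):
    """Pair each value with its 1-based position so duplicates stay distinct."""
    return {(index, value) for index, value in enumerate(values, start=1)}


# Hypothetical column held by Bob; None stands for NULL.
bob_column = ["94043", "01000", None, "98101", None]
n = len(bob_column)

alice_set = tag([None] * n)  # {(1, NULL), ..., (n, NULL)}
bob_set = tag(bob_column)    # {(1, d1), ..., (n, dn)}

# The intersection cardinality equals the number of NULLs in Bob's column.
null_count = len(alice_set & bob_set)
print(null_count, len(alice_set), len(bob_set))
```

Both inputs have one element per tuple, which is why the cost of this approach grows with the size of the database.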
Encrypted-domain Computation
Given the ciphertexts E(d1) and E(d2), the product E(d1) * E(d2) decrypts to d1 + d2 [1].
[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, 1999
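The additive property can be checked directly from the form of a Paillier ciphertext. The derivation below uses the usual textbook notation, which is not on the slides: n is the public modulus, g the public generator, and r a random value coprime to n; plaintexts are taken modulo n.

```latex
\begin{align*}
E(d; r) &= g^{d}\, r^{n} \bmod n^{2} \\
E(d_1; r_1)\cdot E(d_2; r_2) &= g^{d_1 + d_2}\,(r_1 r_2)^{n} \bmod n^{2}
  = E(d_1 + d_2;\ r_1 r_2) \\
E(d; r)^{k} &= g^{k d}\,(r^{k})^{n} \bmod n^{2} = E(k d;\ r^{k})
\end{align*}
```

The last line shows that raising a ciphertext to a known constant k multiplies the plaintext by k, which is what allows a weighted sum to be computed on encrypted inputs.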
Select & Aggregate Setup
Goal: Alice has a binary selector vector u, Bob has a data vector v. Alice should learn the sum of the selected elements of v.
Query Privacy: Bob should not learn the selector vector.
Data Privacy: Alice should not learn anything other than the selected aggregate.
[Figure: Secure Select & Aggregate Protocol]
Select & Aggregate Protocol
1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes the dot product of u and v using the additive homomorphic property, and sends it to Alice.
3. Alice decrypts the dot product.
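The three steps can be sketched with a toy Paillier implementation. Everything here is an assumption made for illustration: the tiny primes (insecure), the generator g = n + 1, and all function names. It needs Python 3.8+ for the modular inverse via `pow`, and it is not the authors' implementation.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """E(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Recover m from c using the private pair (lam, mu)."""
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def bob_aggregate(n, enc_u, v):
    """Step 2: one exponentiation and one multiplication per element."""
    n2 = n * n
    acc = 1  # trivial ciphertext of 0
    for ciphertext, value in zip(enc_u, v):
        acc = acc * pow(ciphertext, value, n2) % n2
    return acc


def select_and_aggregate(u, v):
    n, priv = keygen()                       # Alice's key pair
    enc_u = [encrypt(n, bit) for bit in u]   # Step 1: Alice -> Bob
    enc_sum = bob_aggregate(n, enc_u, v)     # Step 2: Bob -> Alice
    return decrypt(n, priv, enc_sum)         # Step 3: Alice decrypts


print(select_and_aggregate([0, 0, 0, 1, 1, 1, 0, 0], [0, 1, 4, 6, 7, 2, 0, 1]))
```

The aggregate must stay below n for decryption to return the true sum; with realistic key sizes this is rarely a constraint.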
Select & Aggregate Complexity
Cannot afford O(#tuples) complexity for large databases.
| Operation | Alice | Bob |
|---|---|---|
| # Encryptions | K | 0 |
| # Decryptions | 1 | 0 |
| # Multiplications | 0 | K |
| # Exponentiations | 0 | K |
| # Transmissions | K | 1 |
Key Idea
1. Find a suitable low-dimensional representation.
2. Use Select & Aggregate to evaluate quality metric.
Completeness Evaluation Setup
Example: Alice wants to find the number of NULL values in Bob’s data.
Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Alice generates a HashMap, Bob generates a Counting HashMap.
HashMap (Alice): [0, …, H(NULL): 1, …, 0]
Counting HashMap (Bob): [H(b1): 23, …, H(NULL): 5, …, H(bt): 2]
Completeness Evaluation Protocol
Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.
The parties evaluate Select & Aggregate on Alice’s HashMap and Bob’s Counting HashMap.
By construction, the protocol reveals the number of NULLs to Alice.
[Figure: Alice’s HashMap and Bob’s Counting HashMap enter the Secure Select & Aggregate Protocol, which outputs 5.]
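A minimal sketch of how the two maps might be built, shown in the clear: in the protocol the final dot product would be evaluated under encryption by Select & Aggregate. The table size `NUM_BUCKETS`, the SHA-256 based bucket function, and the sample column are assumptions for this example, not details from the slides.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical table size agreed on by both parties


def bucket(value):
    """Map a value (None stands for NULL) to a bucket index via SHA-256."""
    token = "\x00NULL" if value is None else "v:" + str(value)
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def alice_hashmap(target):
    """Binary vector with a single 1 at the bucket of the value Alice looks for."""
    u = [0] * NUM_BUCKETS
    u[bucket(target)] = 1
    return u


def bob_counting_hashmap(column):
    """Per-bucket occurrence counts of Bob's column."""
    v = [0] * NUM_BUCKETS
    for value in column:
        v[bucket(value)] += 1
    return v


# Hypothetical column with 5 NULLs, mirroring the counts on the slide.
bob_column = ["94043"] * 23 + [None] * 5 + ["98101"] * 2
u = alice_hashmap(None)
v = bob_counting_hashmap(bob_column)

# In the protocol this dot product is computed on encrypted u.
null_count = sum(ui * vi for ui, vi in zip(u, v))
print(null_count)
```

If another value happens to hash to the same bucket as NULL, the count is an overestimate, so the table size trades accuracy against the number of encryptions.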
Validity Evaluation Setup
Example: Alice wants to know how many of Bob’s entries are in the range [C,E].
Query Privacy: Bob does not discover the range of Alice’s searches.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.
Histogram of attribute (Bob): 0 1 4 6 7 2 0 1
Binary vector (Alice): 0 0 0 1 1 1 0 0
[Figure: histogram with bins labeled A, B, C, D, E, F, G, Z]
Validity Evaluation Protocol
As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.
By construction, the protocol reveals the number of “valid” values to Alice.
The protocol works for arbitrary range queries, uniqueness, and timeliness.
[Figure: binary vector 00011100 and histogram 01467201 enter the Secure Select & Aggregate Protocol, which outputs 15.]
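The histogram and selector construction can be sketched as follows, again in the clear. The bin labels A to H and the queried range D to F are hypothetical, chosen only so that the vectors match the ones on the slide; in the protocol the final sum would be obtained via Select & Aggregate.

```python
def histogram(values, bins):
    """Bob's side: count how many values fall into each agreed bin."""
    index = {label: position for position, label in enumerate(bins)}
    counts = [0] * len(bins)
    for value in values:
        counts[index[value]] += 1
    return counts


def range_selector(bins, low, high):
    """Alice's side: 1 for every bin inside her valid range, 0 elsewhere."""
    return [1 if low <= label <= high else 0 for label in bins]


bins = list("ABCDEFGH")  # hypothetical bin labels agreed on by both parties

# Hypothetical attribute values that reproduce the slide's histogram.
bob_values = ["B"] + ["C"] * 4 + ["D"] * 6 + ["E"] * 7 + ["F"] * 2 + ["H"]

counts = histogram(bob_values, bins)
selector = range_selector(bins, "D", "F")

# In the protocol this dot product is computed on the encrypted selector.
valid_count = sum(s * c for s, c in zip(selector, counts))
print(counts, selector, valid_count)
```

The vectors have one entry per bin, so the cost depends on the number of bins and not on the number of tuples.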
Consistency Evaluation Setup
Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State and ZIP code.
Query Privacy: Bob does not discover which dependencies Alice is checking.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.
Observed dependencies (Bob): 10111001
Expected dependencies (Alice): 11011011
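Alice and Bob agree upon an ordering of attribute values.
They also agree on a vectorization (flattening) pattern.
Need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.

Desired Dependencies

| ZIP | CA | MA | MN | … |
|---|---|---|---|---|
| 94304 | 1 | 0 | 0 | 0 |
| 55414 | 0 | 0 | 1 | 0 |
| 02139 | 0 | 1 | 0 | 0 |
| 94305 | 1 | 0 | 0 | 0 |
| … | | | | |

Observed Dependencies

| ZIP | CA | MA | MN | … |
|---|---|---|---|---|
| 94304 | 0 | 0 | 1 | 0 |
| 55414 | 0 | 0 | 1 | 0 |
| 02139 | 0 | 1 | 0 | 0 |
| 94305 | 1 | 0 | 0 | 0 |
| … | | | | |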
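The flattening step can be illustrated on the State and ZIP values listed above. This sketch works in the clear and restricts itself to the three named states and four ZIP codes (the elided rows and columns are left out); the row-major ordering and the names are assumptions for the example.

```python
# Agreed ordering of attribute values (only the values shown on the slide).
STATES = ["CA", "MA", "MN"]
ZIPS = ["94304", "55414", "02139", "94305"]


def flatten(pairs):
    """Row-major binary vector over the agreed (ZIP, state) ordering."""
    return [1 if (zip_code, state) in pairs else 0
            for zip_code in ZIPS for state in STATES]


# Alice's desired (ZIP, state) associations.
desired = {("94304", "CA"), ("55414", "MN"), ("02139", "MA"), ("94305", "CA")}

# Associations observed in Bob's data; 94304 is paired with the wrong state.
observed = {("94304", "MN"), ("55414", "MN"), ("02139", "MA"), ("94305", "CA")}

alice_vector = flatten(desired)
bob_vector = flatten(observed)

# In the protocol this dot product is computed via Select & Aggregate.
consistent = sum(a * b for a, b in zip(alice_vector, bob_vector))
print(alice_vector, bob_vector, consistent)
```

Three of Bob's four observed associations match Alice's rules; the mismatched 94304 entry contributes nothing to the sum.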
Consistency Evaluation Protocol
Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.
The protocol reveals the number of “valid” dependencies to Alice.
It works for dependencies among arbitrary attribute combinations.
[Figure: expected dependencies 11011011 and observed dependencies 10111001 enter the Secure Select & Aggregate Protocol, which outputs 4.]
Computational Complexity
Example dataset: AZ 2012 votes, with attribute values D, R, L, G
# uniques = # bins = 4
# tuples = 2,306,559

| Metrics | Proposed Protocols | Using PSI-CA |
|---|---|---|
| Completeness | O(# uniques) | O(# tuples) |
| Validity, Timeliness, Uniqueness | O(# histogram bins) | O(# tuples) |
| Consistency | O((# histogram bins)^m) | O((# tuples)^m) |
Conclusions & Discussion
• An important subclass of privacy-preserving data mining. Precursor to collaboration among untrusting entities.
• Existing protocols, e.g., PSI-CA, have high computational overhead.
• Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.
• Future work:
  – DQ for non-numeric attributes.
  – Efficient protocols for testing sparse dependencies.
  – Extremely difficult: private evaluation of reliability of data.
{jfreudig,srane}@parc.com