data-mining on gbytes of encrypted data · data mining • classification • regression •...

23
Data-Mining on GBytes of Encrypted Data Valeria Nikolaenko In collaboration with Dan Boneh (Stanford); Udi Weinsberg, Stratis Ioannidis, Marc Joye, Nina Taft (Technicolor).

Upload: others

Post on 05-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Data-Mining on GBytes of Encrypted DataEncrypted Data

Valeria Nikolaenko

In collaboration withDan Boneh (Stanford); Udi Weinsberg, Stratis Ioannidis,

Marc Joye, Nina Taft (Technicolor).

Page 2: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

2Valeria Nikolaenko

Page 3: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

MotivationUsers

Data

Data mining engineData

Data model

Engine learns nothing more than the model!

3Valeria Nikolaenko

Privacy concern!

Page 4: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Data Mining

• Classification• Regression• Clustering• Summarization

: linear regression

: matrix factorization• Summarization• Dependency modeling

4Valeria Nikolaenko

: matrix factorization

Main challenge: make these algorithms privacy preserving and efficient.

Page 5: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Contribution

• Design of a practical system forprivacy preserving linear regression

• Implementation• Experiments on real datasets• Experiments on real datasets

Comparison to state of the art:• Hall et al.’11: 2 days vs 3 min• Graepel et al.’12: 10 min vs 2 sec

5Valeria Nikolaenko

Page 6: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

6Valeria Nikolaenko

Page 7: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Computations on Encrypted Data

• 2009, C. Gentry – FHE

• 1979, A. Shamir1988, BGW

(slow for our problems)

Secret sharing1988, BGW

• 1982, A.C. Yao – Garbled circuits

• Our approach: hybrid of Yao and hom. encryption

(huge communication overhead)

7Valeria Nikolaenko

Secret sharing

Page 8: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Yao’s Garbled Circuits

“garbled

C(x) C(x)

8Valeria Nikolaenko

circuit“garbled circuit”

01

01

01

...

...01

G01

G11

G02

G12

G03

G13

...

...G0

nG1

n

x=010...1 Nothing is leaked about x,other than C(x)!

Page 9: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Data Mining System Architecture

User Evaluator

Crypto Service Provider (CSP)User i

CreateĈ

9Valeria Nikolaenko

Create“garbled circuit” ĈĈ

{Gx11, Gx2

2, …, Gxnn }

Compute C(x1, …, xn)

C(x1, …, xn)

Ĉ

{G01, G0

2, …, G0n}

{G11, G1

2, …, G1n}

xi

Page 10: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

System Properties

�Evaluator learns the model, not the inputs

Problems:• Not scalable with the number of users• Not scalable with the number of users• Users need to be online

10Valeria Nikolaenko

Page 11: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

11Valeria Nikolaenko

Page 12: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Linear Regression I

• (f,r) – from users• solve for β

bA =βr

12Valeria Nikolaenko

bA =β

f

r = <f, β>

Page 13: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Linear Regression II

Users

f1, r1Training engine… Training engine

fn, rn

Training engine learns nothing about f’s and r’s,other than β!

β

13Valeria Nikolaenko

Page 14: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

14Valeria Nikolaenko

Page 15: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Construct Matrices

Users

f1Tf1

Evaluator

computes A = ∑fiTfi

in the circuit

Phase1: compute and ∑= iT

i ffA ∑= iT

i rfb

15Valeria Nikolaenko

fnTfn

For additions can use homomorphic encryption:

)()](),([ 2121 AAHEAHEAHE +→

in the circuit

Page 16: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Construct Matrices

Users

EvaluatorHE(f1

Tf1)

• computesHE(A) = HE(∑f Tf )

Phase1: compute and ∑= iT

i ffA ∑= iT

i rfb

16Valeria Nikolaenko

For additions can use homomorphic encryption:

)()](),([ 2121 AAHEAHEAHE +→

HE(fnTfn)

HE(A) = HE(∑fiTfi )

• decrypts in the circuit

Circuit – independent of the number of usersUsers – offline

Page 17: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Solve Linear System

Input: HE(A),HE(b)

Decrypt A and b

bA =βPhase2: solve for β

Cholesky: compute L, s.t. A= LLT

Solve for β: L(LT β)=b

17Valeria Nikolaenko

β

Page 18: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Privacy Preserving Regression System

Evaluator Crypto Service Provider (CSP)

HEpk(f iTf i)

pk

(pk, sk)←HE.KeyGen

Create “garbled circuit” Ĉ

ComputeHEpk(A = ∑fiTfi),HEpk(b = ∑fiTri)

HEpk(f iTr i)

pk

pkPhase1

{G01, G0

2, …, G0n}

{G11, G1

2, …, G1n}

18Valeria Nikolaenko

Send ĈĈHEpk(f i

Tr i)

β

Garbled inputs

ComputeC(HEpk(A),HEpk(b))

Phase2

Page 19: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

System Properties

�Evaluator learns the model, not the inputs�Scalable with the number of users�Users can be offline

19Valeria Nikolaenko

Extensions:• Masking instead of decryption in circuit• Protection against malicious

Evaluator and CSP

Page 20: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

20Valeria Nikolaenko

Page 21: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Performance• Hybrid vs. Pure-Yao: 100 times improvement in time!

• For up to 20 features, 1000’s users• time < 3 min• communication < 1GB

21Valeria Nikolaenko

• communication < 1GB

• For 100 million of users, 20 features: 8.75 hours

• Tested on real datasets

Page 22: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Conclusion

• Privacy-preserving data-mining is efficient• Our approach can be used in practice

• Current work: matrix factorization

22Valeria Nikolaenko

• Current work: matrix factorization

• Future work:– implement other data mining algorithms– improving implementation to support high

parallelization

Page 23: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling

Thank you!

23Valeria Nikolaenko

Questions? [email protected]