data-mining on gbytes of encrypted data · data mining • classification • regression •...

Post on 05-Aug-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Data-Mining on GBytes of Encrypted DataEncrypted Data

Valeria Nikolaenko

In collaboration withDan Boneh (Stanford); Udi Weinsberg, Stratis Ioannidis,

Marc Joye, Nina Taft (Technicolor).

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

2Valeria Nikolaenko

MotivationUsers

Data

Data mining engineData

Data model

Engine learns nothing more than the model!

3Valeria Nikolaenko

Privacy concern!

Data Mining

• Classification• Regression• Clustering• Summarization

: linear regression

: matrix factorization• Summarization• Dependency modeling

4Valeria Nikolaenko

: matrix factorization

Main challenge: make these algorithms privacy preserving and efficient.

Contribution

• Design of a practical system forprivacy preserving linear regression

• Implementation• Experiments on real datasets• Experiments on real datasets

Comparison to state of the art:• Hall et al.’11: 2 days vs 3 min• Graepel et al.’12: 10 min vs 2 sec

5Valeria Nikolaenko

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

6Valeria Nikolaenko

Computations on Encrypted Data

• 2009, C. Gentry – FHE

• 1979, A. Shamir1988, BGW

(slow for our problems)

Secret sharing1988, BGW

• 1982, A.C. Yao – Garbled circuits

• Our approach: hybrid of Yao and hom. encryption

(huge communication overhead)

7Valeria Nikolaenko

Secret sharing

Yao’s Garbled Circuits

“garbled

C(x) C(x)

8Valeria Nikolaenko

circuit“garbled circuit”

01

01

01

...

...01

G01

G11

G02

G12

G03

G13

...

...G0

nG1

n

x=010...1 Nothing is leaked about x,other than C(x)!

Data Mining System Architecture

User Evaluator

Crypto Service Provider (CSP)User i

CreateĈ

9Valeria Nikolaenko

Create“garbled circuit” ĈĈ

{Gx11, Gx2

2, …, Gxnn }

Compute C(x1, …, xn)

C(x1, …, xn)

Ĉ

{G01, G0

2, …, G0n}

{G11, G1

2, …, G1n}

xi

System Properties

�Evaluator learns the model, not the inputs

Problems:• Not scalable with the number of users• Not scalable with the number of users• Users need to be online

10Valeria Nikolaenko

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

11Valeria Nikolaenko

Linear Regression I

• (f,r) – from users• solve for β

bA =βr

12Valeria Nikolaenko

bA =β

f

r = <f, β>

Linear Regression II

Users

f1, r1Training engine… Training engine

fn, rn

Training engine learns nothing about f’s and r’s,other than β!

β

13Valeria Nikolaenko

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

14Valeria Nikolaenko

Construct Matrices

Users

f1Tf1

Evaluator

computes A = ∑fiTfi

in the circuit

Phase1: compute and ∑= iT

i ffA ∑= iT

i rfb

15Valeria Nikolaenko

fnTfn

For additions can use homomorphic encryption:

)()](),([ 2121 AAHEAHEAHE +→

in the circuit

Construct Matrices

Users

EvaluatorHE(f1

Tf1)

• computesHE(A) = HE(∑f Tf )

Phase1: compute and ∑= iT

i ffA ∑= iT

i rfb

16Valeria Nikolaenko

For additions can use homomorphic encryption:

)()](),([ 2121 AAHEAHEAHE +→

HE(fnTfn)

HE(A) = HE(∑fiTfi )

• decrypts in the circuit

Circuit – independent of the number of usersUsers – offline

Solve Linear System

Input: HE(A),HE(b)

Decrypt A and b

bA =βPhase2: solve for β

Cholesky: compute L, s.t. A= LLT

Solve for β: L(LT β)=b

17Valeria Nikolaenko

β

Privacy Preserving Regression System

Evaluator Crypto Service Provider (CSP)

HEpk(f iTf i)

pk

(pk, sk)←HE.KeyGen

Create “garbled circuit” Ĉ

ComputeHEpk(A = ∑fiTfi),HEpk(b = ∑fiTri)

HEpk(f iTr i)

pk

pkPhase1

{G01, G0

2, …, G0n}

{G11, G1

2, …, G1n}

18Valeria Nikolaenko

Send ĈĈHEpk(f i

Tr i)

β

Garbled inputs

ComputeC(HEpk(A),HEpk(b))

Phase2

System Properties

�Evaluator learns the model, not the inputs�Scalable with the number of users�Users can be offline

19Valeria Nikolaenko

Extensions:• Masking instead of decryption in circuit• Protection against malicious

Evaluator and CSP

Outline

• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance

20Valeria Nikolaenko

Performance• Hybrid vs. Pure-Yao: 100 times improvement in time!

• For up to 20 features, 1000’s users• time < 3 min• communication < 1GB

21Valeria Nikolaenko

• communication < 1GB

• For 100 million of users, 20 features: 8.75 hours

• Tested on real datasets

Conclusion

• Privacy-preserving data-mining is efficient• Our approach can be used in practice

• Current work: matrix factorization

22Valeria Nikolaenko

• Current work: matrix factorization

• Future work:– implement other data mining algorithms– improving implementation to support high

parallelization

Thank you!

23Valeria Nikolaenko

Questions? valerini@stanford.edu

top related