Privacy preserving data mining – Introduction and randomization techniques
Li Xiong
CS573 Data Privacy and Anonymity



What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence

February 12, 2009

Privacy preserving data mining

- Support data mining while preserving privacy
  - Sensitive raw data
  - Sensitive mining results

Seminal work

- Privacy preserving data mining, Agrawal and Srikant, 2000
  - Centralized data
  - Data randomization (additive noise)
  - Decision tree classifier
- Privacy preserving data mining, Lindell and Pinkas, 2000
  - Distributed data mining
  - Secure multi-party computation
  - Decision tree classifier

Taxonomy of PPDM algorithms

- Data distribution
  - Centralized
  - Distributed – privacy preserving distributed data mining
- Approaches
  - Input perturbation – additive noise (randomization), multiplicative noise, generalization, swapping, sampling
  - Output perturbation – rule hiding
  - Crypto techniques – secure multiparty computation
- Data mining algorithms
  - Classification
  - Association rule mining
  - Clustering

Input Perturbation

- Reveal entire database, but randomize entries

[Diagram: the database holds entries x1, ..., xn and releases the randomized entries x1+ε1, ..., xn+εn to the user]

- Add random noise εi to each database entry xi
- For example, if the distribution of the noise has mean 0, the user can compute the average of the xi
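The additive-noise scheme above can be sketched in a few lines (a minimal sketch; the function name, the noise range, and the data are my choices, not the slide's):

```python
import random

random.seed(0)  # for reproducibility of this sketch

def perturb(values, alpha=25):
    # Release x_i + eps_i, with eps_i drawn uniformly from [-alpha, alpha] (mean 0)
    return [x + random.uniform(-alpha, alpha) for x in values]

ages = [random.randint(20, 60) for _ in range(100_000)]
released = perturb(ages)

# Because E[eps] = 0, the average of the released entries estimates the true average
true_mean = sum(ages) / len(ages)
est_mean = sum(released) / len(released)
```

With 100,000 entries the two averages agree closely even though every individual entry is off by up to ±25.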

Randomization techniques

- Privacy preserving data mining, Agrawal and Srikant, 2000
  - Seminal work on decision tree classifier
- Limiting Privacy Breaches in Privacy-Preserving Data Mining, Evfimievski and Gehrke, 2003
  - Refined privacy definition
  - Association rule mining

Randomization Based Decision Tree Learning
(Agrawal and Srikant '00)

- Basic idea: perturb data with value distortion
  - User provides x_i + r instead of x_i
  - r is a random value
    - Uniform: uniform distribution between [-α, α]
    - Gaussian: normal distribution with μ = 0 and standard deviation σ
- Hypothesis
  - Miner doesn't see the real data or can't reconstruct real values
  - Miner can reconstruct enough information to build a decision tree for classification

Randomization Approach

[Diagram: original records (e.g., "50 | 40K | ...", "30 | 70K | ...") pass through a randomizer, producing "65 | 20K | ...", "25 | 60K | ..."; the classification algorithm then builds its model from the randomized data. A random number is added to Alice's age, so 30 becomes 65 (30 + 35).]

Classification

- Classification
  - predicts categorical class labels (discrete or nominal)
- Prediction (regression)
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Fraud detection

Motivating Example for Classification – Fruit Identification

Skin   | Color | Size  | Flesh | Conclusion
Hairy  | Brown | Large | Hard  | Safe
Hairy  | Green | Large | Hard  | Safe
Smooth | Red   | Large | Soft  | Dangerous
Hairy  | Green | Large | Soft  | Safe
Smooth | Red   | Small | Hard  | Dangerous

Another Example – Credit Approval

Name   | Age | Income | Credit
Clark  | 35  | High   | Excellent
Milton | 38  | High   | Excellent
Neo    | 25  | Medium | Fair
...    | ... | ...    | ...

- Classification rule:
  - If age = "31...40" and income = high then credit_rating = excellent
- Future customers
  - Paul: age = 35, income = high ⇒ excellent credit rating
  - John: age = 20, income = medium ⇒ fair credit rating

Classification—A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects

February 12, 2008 – Data Mining: Concepts and Techniques

Training Dataset

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31..40 | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31..40 | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31..40 | medium | no      | excellent     | yes
31..40 | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Output: A Decision Tree for "buys_computer"

age?
├── <=30   → student?
│            ├── no  → no
│            └── yes → yes
├── 31..40 → yes
└── >40    → credit rating?
             ├── excellent → no
             └── fair      → yes
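Read as code, the buys_computer tree is just nested conditionals. A sketch (the function name is mine, and numeric thresholds are assumed for the <=30 / 31..40 / >40 branches):

```python
def buys_computer(age, student, credit_rating):
    # Root split on age; the 31..40 branch is always "yes",
    # <=30 splits on student, >40 splits on credit rating.
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:
        return "yes"
    else:
        return "yes" if credit_rating == "fair" else "no"
```

For example, buys_computer(35, "no", "excellent") follows the middle branch and returns "yes".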

Algorithm for Decision Tree Induction

- ID3 (Iterative Dichotomiser), C4.5
- CART (Classification and Regression Trees)
- Basic algorithm (a greedy algorithm) – tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - A test attribute is selected that "best" separates the data into partitions
    - Heuristic or statistical measure
  - Samples are partitioned recursively based on selected attributes
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  - There are no samples left

Attribute Selection Measures

- Idea: select the attribute that partitions samples into homogeneous groups
- Measures
  - Information gain (ID3)
  - Gain ratio (C4.5)
  - Gini index (CART)

Attribute Selection Measure: Information Gain (ID3)

- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
- Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

- Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

- Information gain – the difference between the original information requirement and the new information requirement obtained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)
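A minimal sketch of these two formulas, applied to the 14-tuple buys_computer training set from the Training Dataset slide (function names are mine):

```python
from collections import Counter
from math import log2

# The buys_computer training set:
# (age, income, student, credit_rating, buys_computer)
D = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    # Info(D) = -sum_i p_i log2 p_i
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    # Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j), splitting on column `col`
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info([r[-1] for r in rows]) - info_a
```

On this data Info(D) ≈ 0.940 bits and Gain(age) ≈ 0.246, the largest of the four attributes, which is why age is the root split in the decision tree slide.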

Attribute Selection Measure: Gini index (CART)

- If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

- Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

- The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node
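The Gini formulas can be sketched the same way (function names are mine):

```python
def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1 - sum((c / n) ** 2 for c in counts.values())

def gini_split(d1, d2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# With 9 "yes" and 5 "no" tuples: gini(D) = 1 - (9/14)^2 - (5/14)^2 ≈ 0.459
labels = ["yes"] * 9 + ["no"] * 5
```

A split that perfectly separates the classes drives gini_split to 0, the best possible value.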

Information-Gain for Continuous-Value Attributes

- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split-point for A
- Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
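A sketch of that split-point search, using entropy as the expected-information measure (function names are mine):

```python
from collections import Counter
from math import log2

def info(labels):
    # Info(D) = -sum_i p_i log2 p_i
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # Sort by attribute value, try the midpoint between each adjacent pair of
    # distinct values, and keep the one with minimum expected information.
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [lab for _, lab in pairs]
    best_cost, best_mid = float("inf"), None
    for i in range(len(vs) - 1):
        if vs[i] == vs[i + 1]:
            continue
        mid = (vs[i] + vs[i + 1]) / 2
        left, right = ls[: i + 1], ls[i + 1:]
        cost = (len(left) * info(left) + len(right) * info(right)) / len(ls)
        if cost < best_cost:
            best_cost, best_mid = cost, mid
    return best_mid
```

For values [1, 2, 3, 10, 11, 12] whose first three tuples are "no" and last three "yes", the midpoint (3 + 10) / 2 = 6.5 separates the classes perfectly and is selected.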

Randomization Approach

[Diagram repeated: original records pass through a randomizer before the classification algorithm builds a model; a random number is added to Alice's age, so 30 becomes 65 (30 + 35).]


Randomization Approach Overview

[Diagram: original records (e.g., "50 | 40K | ...", "30 | 70K | ...") pass through a randomizer, producing "65 | 20K | ...", "25 | 60K | ..." (30 becomes 65 = 30 + 35). The distributions of Age and of Salary are then reconstructed from the randomized data, and the classification algorithm builds its model from the reconstructed distributions.]

Original Distribution Reconstruction

- x_1, x_2, ..., x_n are the n original data values
  - Drawn from n iid random variables X_1, X_2, ..., X_n similar to X
- Using value distortion, the given values are w_1 = x_1 + y_1, w_2 = x_2 + y_2, ..., w_n = x_n + y_n
  - The y_i's are from n iid random variables Y_1, Y_2, ..., Y_n similar to Y
- Reconstruction problem: given F_Y and the w_i's, estimate F_X

Original Distribution Reconstruction: Method

- Bayes' theorem for continuous distributions
- The estimated density function:

  f_X'(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X(a) / ∫ f_Y(w_i − z) f_X(z) dz ]

- Iterative estimation
  - The initial estimate for f_X at j = 0: uniform distribution
  - Iterative step:

    f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} [ f_Y(w_i − a) f_X^j(a) / ∫ f_Y(w_i − z) f_X^j(z) dz ]

  - Stopping criterion: χ² test between successive iterations
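A discretized sketch of the iterative scheme (the grid, noise model, sample size, and iteration count are my choices, not the paper's): the original values sit at 2 and 3, the noise is triangular on [-2, 2], and the loop applies the update formula above on a grid of candidate values.

```python
import random

def reconstruct(w, noise_pdf, grid, iters=100):
    # f^{j+1}(a) = (1/n) sum_i  fY(w_i - a) f^j(a) / sum_z fY(w_i - z) f^j(z)
    f = [1.0 / len(grid)] * len(grid)        # initial estimate: uniform
    for _ in range(iters):
        new = [0.0] * len(grid)
        for wi in w:
            denom = sum(noise_pdf(wi - z) * fz for z, fz in zip(grid, f))
            for k, a in enumerate(grid):
                new[k] += noise_pdf(wi - a) * f[k] / denom
        f = [v / len(w) for v in new]
    return f

random.seed(1)
xs = [random.choice([2, 3]) for _ in range(500)]
# triangular noise on [-2, 2] (sum of two uniforms), pdf (2 - |d|) / 4
ws = [x + random.uniform(-1, 1) + random.uniform(-1, 1) for x in xs]
pdf = lambda d: max(0.0, 2 - abs(d)) / 4
grid = list(range(10))
est = reconstruct(ws, pdf, grid)
```

After iterating, the estimated density concentrates its mass near the true values 2 and 3 even though the individual observations were perturbed by up to ±2.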

Reconstruction of Distribution

[Plot: Number of People (0–1200) vs. Age (roughly 20–60), comparing the Original, Randomized, and Reconstructed distributions.]

Original Distribution Construction for Decision Tree

- When to reconstruct distributions?
- Global
  - Reconstruct for each attribute once at the beginning
  - Build the decision tree using the reconstructed data
- ByClass
  - First split the training data
  - Reconstruct for each class separately
  - Build the decision tree using the reconstructed data
- Local
  - First split the training data
  - Reconstruct for each class separately
  - Reconstruct at each node while building the tree

Accuracy vs. Randomization Level

[Plot: Accuracy (40–100%) vs. randomization level (10–200) for Fn 3, comparing Original, Randomized, and ByClass.]

More Results

- Global performs worse than ByClass and Local
- ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy
- Overall, all are much better than the Randomized accuracy

Privacy level

Is the privacy level sufficiently measured?

How to Measure Privacy Breach

- Weak: no single database entry has been revealed
- Stronger: no single piece of information is revealed (what's the difference from the "weak" version?)
- Strongest: the adversary's beliefs about the data have not changed

Kullback-Leibler Distance

- Measures the "difference" between two probability distributions
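The slide does not give the formula; for discrete distributions the KL distance is KL(P ‖ Q) = Σ_x P(x) log2( P(x) / Q(x) ), which can be sketched as (function name mine):

```python
from math import log2

def kl(p, q):
    # KL(P || Q) = sum_x P(x) log2( P(x) / Q(x) );
    # terms with P(x) = 0 contribute 0, and Q(x) is assumed > 0 wherever P(x) > 0
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, kl([1.0, 0.0], [0.5, 0.5]) is 1 bit: observing a point mass where a fair coin was expected conveys one bit of surprise. Note KL is not symmetric, so it is a "distance" only informally.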

Privacy of Input Perturbation

- X is a random variable, R is the randomization operator, Y = R(X) is the perturbed database
- Measure mutual information between the original and randomized databases
  - Average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y:

    E_y( KL( P_{X|Y=y} || P_X ) )

  - Intuition: if this distance is small, then Y leaks little information about the actual values of X
- Why is this definition problematic?

Is the randomization sufficient?

Name: Age database        Randomized entries
Gladys: 85                Gladys: 72
Doris:  90                Doris:  110
Beryl:  82                Beryl:  85

- Age is an integer between 0 and 90
- Randomize database entries by adding random integers between -20 and 20
- The randomization operator has to be public (why?)
- Doris's age is 90!!

Privacy Definitions

- Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value
- Better: consider some property Q(x)
  - Adversary has a priori probability P_i that Q(x_i) is true
- Privacy breach if revealing y_i = R(x_i) significantly changes the adversary's probability that Q(x_i) is true
  - Intuition: the adversary learned something about entry x_i (namely, the likelihood of property Q holding for this entry)

Example

- Data: 0 ≤ x ≤ 1000, p(x=0) = 0.01, p(x=k) = 0.00099 for k ≠ 0
- Reveal y = R(x)
- Three possible randomization operators R:
  - R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
  - R2(x) = x + ξ mod 1001, ξ uniform in [-100, 100]
  - R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
- Which randomization operator is better?

Some Properties

- Q1(x): x = 0; Q2(x): x ∉ {200, ..., 800}
- What are the a priori probabilities for a given x that these properties hold?
  - Q1(x): 1%, Q2(x): 40.5%
- Now suppose the adversary learned that y = R(x) = 0. What are the probabilities of Q1(x) and Q2(x)?
  - If R = R1 then Q1(x): 71.6%, Q2(x): 83%
  - If R = R2 then Q1(x): 4.8%, Q2(x): 100%
  - If R = R3 then Q1(x): 2.9%, Q2(x): 70.8%
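These numbers can be checked with Bayes' rule (a sketch; the helper names are mine). Each operator defines a transition probability p(y | x); the posterior of a property Q given the observation y = 0 follows by summing prior-weighted likelihoods over the finite domain. The results match the slide's percentages up to rounding.

```python
N = 1001
prior = [0.00099] * N
prior[0] = 0.01              # p(x=0) = 1%, p(x=k) = 0.099% otherwise

def lik_R1(x, y):
    # R1: keep x with prob. 0.2, otherwise a uniform value in 0..1000
    return (0.2 if x == y else 0.0) + 0.8 / N

def lik_R2(x, y):
    # R2: y = (x + xi) mod 1001, xi uniform over the 201 integers in [-100, 100]
    d = (y - x) % N
    return 1 / 201 if (d <= 100 or d >= N - 100) else 0.0

def lik_R3(x, y):
    # R3: apply R2 with prob. 0.5, otherwise a uniform value
    return 0.5 * lik_R2(x, y) + 0.5 / N

def posterior(lik, y, prop):
    # P(Q(x) | R(x) = y) by Bayes' rule
    z = sum(prior[x] * lik(x, y) for x in range(N))
    return sum(prior[x] * lik(x, y) for x in range(N) if prop(x)) / z

Q1 = lambda x: x == 0
Q2 = lambda x: not (200 <= x <= 800)
```

For instance, posterior(lik_R1, 0, Q1) ≈ 0.717: seeing R1(x) = 0 lifts the adversary's belief that x = 0 from 1% to about 72%.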

Privacy Breaches

- R1(x) leaks information about property Q1(x)
  - Before seeing R1(x), the adversary thinks that the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is 72%
- R2(x) leaks information about property Q2(x)
  - Before seeing R2(x), the adversary thinks that the probability of x ∉ {200, ..., 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, ..., 800} is 100%
- The randomization operator should be such that the posterior distribution is close to the prior distribution for any property

Privacy Breach: Definitions

- Q(x) is some property; ρ1, ρ2 are probabilities
  - ρ1 ∼ "very unlikely", ρ2 ∼ "very likely"
- Straight privacy breach: P(Q(x)) ≤ ρ1, but P(Q(x) | R(x) = y) ≥ ρ2 [Evfimievski et al.]
  - Q(x) is unlikely a priori, but likely after seeing the randomized value of x
- Inverse privacy breach: P(Q(x)) ≥ ρ2, but P(Q(x) | R(x) = y) ≤ ρ1
  - Q(x) is likely a priori, but unlikely after seeing the randomized value of x

How to check privacy breach

- How to ensure that the randomization operator hides every property?
  - There are 2^|X| properties
  - Often the randomization operator has to be selected even before the distribution P_X is known (why?)
- Idea: look at the operator's transition probabilities
  - How likely is x_i to be mapped to a given y?
  - Intuition: if all possible values of x_i are equally likely to be randomized to a given y, then revealing y = R(x_i) will not reveal much about the actual value of x_i

Amplification

- The randomization operator is γ-amplifying for y if

  ∀ x1, x2 ∈ V_x:  p(x1 → y) / p(x2 → y) ≤ γ   [Evfimievski et al.]

- For given ρ1, ρ2, no straight or inverse privacy breaches occur if

  ρ2 (1 − ρ1) / ( ρ1 (1 − ρ2) ) > γ

Amplification: Example

- R1(x) = x with prob. 20%; a uniformly random number with prob. 80%
- R2(x) = x + ξ mod 1001, ξ uniform in [-100, 100]
- R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%
- For R3:

  p(x → y) = ½ (1/201 + 1/1001)   if y ∈ [x − 100, x + 100]
             ½ (0 + 1/1001)       otherwise

- Fractional difference = 1 + 1001/201 < 6 (= γ)
- Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50%
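The γ computation for R3 and the breach bound can be checked directly (a sketch; function names are mine):

```python
N = 1001

def p_R3(x, y):
    # R3's transition probability: 1/2 * (1/201) within distance 100 (mod 1001),
    # plus 1/2 * (1/1001) from the uniform branch everywhere
    d = min((y - x) % N, (x - y) % N)
    return (0.5 / 201 if d <= 100 else 0.0) + 0.5 / N

def amplification(p, y):
    # gamma for y: max over x1, x2 of p(x1 -> y) / p(x2 -> y)
    probs = [p(x, y) for x in range(N)]
    return max(probs) / min(probs)

def breach_free(rho1, rho2, gamma):
    # Evfimievski et al.: no straight or inverse breach if
    # rho2 (1 - rho1) / (rho1 (1 - rho2)) > gamma
    return rho2 * (1 - rho1) / (rho1 * (1 - rho2)) > gamma

gamma = amplification(p_R3, 0)   # 1 + 1001/201, just under 6
```

With ρ1 = 0.14 and ρ2 = 0.50 the bound evaluates to 0.43 / 0.07 ≈ 6.14 > γ, confirming the slide's conclusion.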

Coming up

- Multiplicative noise
- Output perturbation