
Analyzing Private Network Data

Using Output Perturbation

Gerome Miklau

Joint work with

Michael Hay, David Jensen, Don Towsley

University of Massachusetts, Amherst

Vibhor Rastogi, Dan Suciu

University of Washington

September 11, 2009

Sunday, September 13, 2009

Social network analysis -- transformed

[Figure: two networks side by side. Left (1977): Zachary's karate club, a small graph of 34 named nodes. Right (2008): a global instant-messaging network, shown as a dense numbered graph.]

"Zachary's Karate Club" (1977): 34 nodes, 78 edges. W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research.

Global IM network (2008): 180 million nodes, 1.3 billion edges. J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. Conference on the World Wide Web.


The opportunity is not being realized

• Common outcomes:

• No availability

• Limited availability: only within institutions that own the data, or among a limited set of researchers who have negotiated access.

• Availability, at a cost: privacy of participants may be violated, or bias and inaccuracy introduced into the released data.

Major obstacle to data dissemination: privacy risk.


Analysis of private networks

Example analyses based on network topology:

• Properties of the degree distribution

• Motif analysis

• Community structure

• Processes on networks: routing, rumors, infection

• Resiliency

Can we permit analysts to study networks without revealing sensitive information about participants?



Synthetic data release

Original network → Anonymization → released network (DATA OWNER | ANALYST)

[Figure: the original network with named nodes, shown alongside three anonymized releases: naive anonymization (nodes relabeled with numbers), randomized edges, and node clustering (a summary graph over groups of nodes).]

• Randomize edges [Rastogi, VLDB 07] [Ying, SDM 2008]

• Clustering/summarization [Campan, PinKDD 08] [Hay, VLDB 08] [Cormode, VLDB 08] [Cormode, VLDB 09]

• Create topological similarity [Liu, SIGMOD 08] [Zhou, ICDE 08] [Zou, VLDB 09]


Output perturbation

• Dwork, McSherry, Nissim, and Smith [Dwork, TCC 06] describe an output perturbation mechanism satisfying differential privacy.

• Comparatively few results exist for these techniques applied to graphs.

[Figure: DATA OWNER | ANALYST. The data owner holds the original network G; the analyst submits a query Q and receives the noisy result Q(G) + random noise.]


Synthetic data release v. output perturbation

Synthetic data release — Usability: good; Privacy: no formal guarantees; Accuracy: no formal guarantees; Scalability: sometimes bad.

[Figure: synthetic networks handed to the analyst, versus the analyst querying G and receiving Q(G) + noise.]

Output perturbation — Usability: bad for practical analyses; Privacy: formal guarantees; Accuracy: provable bounds; Scalability: very good.

The best of both worlds?

• Synthetic data release

• Output perturbation

[Figure: the data owner fits a model M to the original network and releases synthetic networks sampled from M.]

• Model-based synthetic data


Outline of the talk

• Differential privacy defined, adapted to graphs

• The accuracy of two topological analyses under differential privacy:

• Motif analysis

• Degree distribution

• Future goals and open questions


The differential guarantee

Two graphs are neighbors if they differ by at most one edge.

[Figure: DATA OWNER | ANALYST. The data owner holds neighboring graphs G and G', which differ in a single edge between two named participants. For the same query Q, the analyst receives Q(G) + noise and Q(G') + noise. The two output distributions are indistinguishable.]


Differential privacy

A randomized algorithm A provides ε-differential privacy if, for all neighboring graphs G and G', and for any set of outputs S:

Pr[A(G) ∈ S] ≤ e^ε · Pr[A(G') ∈ S]

ε is a privacy parameter; smaller ε means stronger privacy. ε is usually small: e.g., if ε = 0.1 then e^ε ≈ 1.10.

[Dwork, TCC 06]
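The inequality can be spot-checked numerically for Laplace-noise mechanisms of the kind introduced on the next slide. A minimal sketch; the query answers 4 and 5 on neighboring graphs, and the evaluation grid, are illustrative assumptions:

```python
import math

def laplace_pdf(x, mu, scale):
    # Density of the Laplace distribution centered at mu with the given scale
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

eps = 0.1
sensitivity = 1.0            # answers on neighboring graphs differ by at most 1
scale = sensitivity / eps
q_G, q_G_prime = 4.0, 5.0    # hypothetical answers on neighboring graphs G, G'

# epsilon-DP requires output densities to differ by at most e^eps everywhere.
worst_ratio = max(
    laplace_pdf(x / 10.0, q_G, scale) / laplace_pdf(x / 10.0, q_G_prime, scale)
    for x in range(-500, 500)
)
assert worst_ratio <= math.exp(eps) + 1e-12
```

The worst-case ratio is attained on the side of the true answers farthest from G', where it equals e^ε exactly.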


Calibrating noise

• The following algorithm A for answering Q is ε-differentially private:

A(G) = Q(G) + Laplace(ΔQ / ε)

Here Q(G) is the true answer, and Laplace(ΔQ / ε) is a sample from a Laplace distribution whose scale is the sensitivity ΔQ divided by the privacy parameter ε.

ΔQ: the maximum change in Q over any two neighboring graphs.

[Dwork, TCC 06]
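The mechanism above fits in a few lines. This is an illustrative sketch, not the authors' code; the inverse-CDF sampler and the example numbers (a degree of 4, sensitivity 1, ε = 0.5, matching the next slide's table) are assumptions:

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sample from the Laplace(0, scale) distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    # A(G) = Q(G) + Laplace(sensitivity / epsilon)  [Dwork, TCC 06]
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# e.g. the degree of one node (sensitivity 1) at epsilon = 0.5: 4 + Lap(2)
noisy_degree = laplace_mechanism(4.0, 1.0, 0.5)
```

The noise has mean zero, so averaging many independent answers recovers the truth; that observation is what the later constrained-inference slides exploit.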


Examples of query sensitivity

[Figure: example graph G on Alice, Bob, Carol, Dave, Ed, Fred, Greg, Harry; density curves of Laplace(2) and Laplace(4).]

Examples at ε = 0.5:

query Q | sensitivity ΔQ | truth Q(G) | noisy answer Q(G) + Lap(ΔQ / ε)
deg_A (degree of node A) | 1 | deg_Dave(G) = 4 | 4 + Lap(2)
cnt_i (# nodes with degree i) | 2 | cnt_4(G) = 4 | 4 + Lap(4)
D = [deg_Dave, deg_Ed, deg_Fred] | 2 | D(G) = [4, 4, 2] | [4 + Lap(4), 4 + Lap(4), 2 + Lap(4)]


Differential privacy for networks

• edge ε-differential privacy: algorithm output is largely indistinguishable whether any single edge is present or absent. Noise: Laplace(ΔQ / ε).

• k-edge ε-differential privacy: algorithm output is largely indistinguishable whether any set of k edges is present or absent. Noise: Laplace(ΔQ · k / ε).

• node ε-differential privacy: algorithm output is largely indistinguishable whether any single node (and all its edges) is present or absent.

A participant's sensitive information is not a single edge.

Example: suppose ΔQ = 1. Then Laplace(100) noise satisfies both 1-edge 0.01-differential privacy and 10-edge 0.1-differential privacy.
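The scale arithmetic behind that example can be written down directly; a small sketch (the helper name is mine):

```python
def laplace_scale(sensitivity, epsilon, k=1):
    # Noise scale needed for k-edge epsilon-differential privacy
    return sensitivity * k / epsilon

# With sensitivity 1, a single Laplace(100) draw simultaneously satisfies
# 1-edge 0.01-DP and 10-edge 0.1-DP: both calibrations demand the same scale.
assert abs(laplace_scale(1, 0.01, k=1) - 100) < 1e-9
assert abs(laplace_scale(1, 0.1, k=10) - 100) < 1e-9
```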


Accurate motif analysis is hard

• Motif analysis measures the frequency of occurrence of small subgraphs in a network.

• Common example: transitivity in the network: when A is friends with B and C, are B and C also friends?

• Q_TRIANGLE: return the number of triangles in the graph.

[Figure: graphs G and G', each containing nodes A and B connected to the same n-2 other nodes; in G' the edge (A, B) is also present, completing n-2 triangles.]

Q_TRIANGLE(G) = 0        Q_TRIANGLE(G') = n - 2

High sensitivity: ΔQ_TRIANGLE = O(n)

For improved accuracy, see [Rastogi, PODS 09] and [Nissim, STOC 07].
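The worst case on this slide is easy to reproduce with a direct triangle counter; a sketch with n = 10 (the graph construction and function name are mine):

```python
def count_triangles(edges):
    # Count triangles in an undirected simple graph given as a list of edges.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge via common neighbors; divide by 3.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# The slide's worst case: A and B share n-2 = 8 common neighbors.
g = [("A", i) for i in range(8)] + [("B", i) for i in range(8)]
assert count_triangles(g) == 0
assert count_triangles(g + [("A", "B")]) == 8  # one edge changes the count by n-2
```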

Accurate degree sequence estimation is possible

• Degree sequence: the list of degrees of each node in a graph.

• A widely studied property of networks.

[Figure: example graph on eight named participants with degree sequence [1, 1, 2, 2, 4, 4, 4, 4]; inverse cumulative degree distribution CF(x) of an Orkut crawl.]


Alternative queries for the degree sequence

[Figure: neighboring graphs G and G' on Alice, Bob, Carol, Dave, Ed, Fred, Greg, Harry.]

count queries:
deg_A — return the degree of node A
rnk_i — return the ith degree in ascending order

vector queries:
D = [deg_A, deg_B, ...]
S = [rnk_1, rnk_2, ..., rnk_n]

D(G) = [1, 4, 1, 4, 4, 2, 4, 2]    D(G') = [1, 4, 1, 3, 3, 2, 4, 2]    ΔD = 2
S(G) = [1, 1, 2, 2, 4, 4, 4, 4]    S(G') = [1, 1, 2, 2, 3, 3, 4, 4]    ΔS = 2
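The sensitivities on this slide can be checked mechanically from the degree vectors; a sketch (the vectors are copied from the slide, the helper names are mine):

```python
def sorted_degrees(deg):
    # S: the degree vector sorted ascending (the vector of rank queries)
    return sorted(deg)

def l1_change(a, b):
    # Total change between two query answers (a sensitivity witness)
    return sum(abs(x - y) for x, y in zip(a, b))

# Degree vectors of the slide's neighboring graphs G and G'
D_G  = [1, 4, 1, 4, 4, 2, 4, 2]
D_G2 = [1, 4, 1, 3, 3, 2, 4, 2]

assert l1_change(D_G, D_G2) == 2                                  # Delta-D = 2
assert sorted_degrees(D_G)  == [1, 1, 2, 2, 4, 4, 4, 4]
assert sorted_degrees(D_G2) == [1, 1, 2, 2, 3, 3, 4, 4]
assert l1_change(sorted_degrees(D_G), sorted_degrees(D_G2)) == 2  # Delta-S = 2
```

Removing one edge lowers two degrees by one each, so both the attributed vector D and the sorted vector S change by 2 in total.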


Using the sort constraint

• The noisy output of the sorted degree query is not (in general) sorted.

• We derive a new sequence by computing the closest non-decreasing sequence, i.e., minimizing L2 distance.

[Figure: S(G), the true degree sequence; noisy observations (ε = 1.0); and the inferred degree sequence.]
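One standard way to compute the L2-closest non-decreasing sequence is the pool-adjacent-violators algorithm; the slide does not name the procedure used, so this is an assumed equivalent sketch:

```python
def closest_nondecreasing(seq):
    # Pool Adjacent Violators: the L2-closest non-decreasing sequence.
    # Maintain blocks as [sum, count]; merge backwards while block means decrease.
    blocks = []
    for x in seq:
        blocks.append([float(x), 1])
        # mean(prev) > mean(last), compared via cross-multiplication
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    result = []
    for total, count in blocks:
        result.extend([total / count] * count)  # each block flattens to its mean
    return result

assert closest_nondecreasing([14, 9, 10, 15]) == [11, 11, 11, 15]
```

Out-of-order runs collapse to their mean, which is exactly how noise gets averaged out over flat stretches of the true degree sequence.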


Experimental results

power law, α = 1.5, n = 5M

[Figure: two panels of inverse cumulative degree distributions showing original, noisy, and inferred sequences: ε = .01 (10-edge, 0.1-differential privacy) and ε = .001 (100-edge, 0.1-differential privacy).]


Experimental results, continued

[Figure: original, noisy, and inferred inverse cumulative degree distributions at ε = .001 and ε = .01 for three graphs: power law (α = 1.5, n = 5M), LiveJournal (n = 5.3M), and Orkut (n = 3.1M).]


Inference does not weaken privacy

[Figure: DATA OWNER | ANALYST. The analyst (1) formulates a query S with an associated constraint set γ_S, (2) submits S to the data owner, who returns the noisy answer S̃ = S(G) + noise, and (3) performs inference using γ_S to obtain the consistent estimate S̄.]


Accuracy is improved without sacrificing privacy!

• Improved accuracy is possible because standard Laplace noise is sufficient (but not necessary) for differential privacy.

• By using constraints and inference, we effectively apply a different noise distribution: more noise where it is needed, less otherwise.

• An immediate consequence: the accuracy achieved depends on the input sequence.

Mean squared error of S̃: Θ(n / ε²)
Mean squared error of S̄: O(d log³ n / ε²), where d is the number of distinct degrees


Toward differentially-private synthetic data

• To realize the benefits of synthetic data, the data owner can release noisy parameters of a network model.

• Baseline: the degree distribution as the network model

• Deriving the power-law parameter

• Measuring the clustering coefficient

[Figure: DATA OWNER | ANALYST. The data owner fits a model to the original network and releases noisy model parameters M̃; the analyst draws sample networks from the model. Annotations: "very accurate"; "not constrained by deg. distr."]


A few open questions

• For graph measure X, how accurately can we estimate X under differential privacy?

• Is it socially acceptable to offer weaker privacy protection to high-degree nodes (as in k-edge differential privacy)?

• Can we generate accurate synthetic networks under differential privacy?

• Can the analyst explore an approximate synthetic network, then query the original network under output perturbation?


Questions?

Additional details on our work may be found here:

• [Hay, ICDM 09] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In International Conference on Data Mining (ICDM), to appear, 2009.

• [Hay, CoRR 09] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. CoRR, abs/0904.0942, 2009. (under review)

• [Rastogi, PODS 09] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output perturbation for queries with joins. In Principles of Database Systems (PODS), 2009.

• [Hay, VLDB 08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural identification in anonymized social networks. In VLDB Conference, 2008.


References

• [Liu, SIGMOD 08] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008.

• [Zhou, ICDE 08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008.

• [Zou, VLDB 09] L. Zou, L. Chen, and T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. In VLDB Conference, 2009.

• [Ying, SDM 2008] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In SIAM International Conference on Data Mining, 2008.

• [Cormode, VLDB 08] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. In VLDB Conference, 2008.


References (cont'd)

• [Cormode, VLDB 09] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy. Class-based graph anonymization for social network data. In VLDB Conference, 2009.

• [Campan, PinKDD 08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.

• [Rastogi, VLDB 07] V. Rastogi, S. Hong, and D. Suciu. The boundary between privacy and utility in data publishing. In VLDB, pages 531-542, 2007.

• [Dwork, TCC 06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, 2006.

• [Nissim, STOC 07] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75-84, 2007.



3. UNATTRIBUTED HISTOGRAMS

To support unattributed histograms, one could use the query sequence L. However, it contains "extra" information—the attribution of each count to a particular range—which is irrelevant for an unattributed histogram. Our observation is that this extra information has a cost in terms of utility. By eliminating this extra information from the query, we can boost accuracy.

We formulate a new query as follows. Since the association between L[i] and i is not required, any permutation of the unit-length counts is a correct response for the unattributed histogram. Let S denote the query sequence that returns the permutation of L that is in ascending order. If U is the set of all unit-length counts U = {c([x_j]) | x_j ∈ dom}, we write rank_i(U) to refer to the ith element in sorted, ascending order. We define S = ⟨rank_1(U), ..., rank_n(U)⟩.

Example 2. In the example in Fig 1, we have L(I) = ⟨2, 0, 10, 2⟩ while S(I) = ⟨0, 2, 2, 10⟩.

To answer S under differential privacy, one can add the same magnitude of noise as with L because S has the same sensitivity as L. (Appendix C states a general result about sorting by value. Informally, suppose L[i] = x and a record is added so L[i] becomes x + 1. In S, let j be the largest index such that S[j] = x; then S[j] becomes x + 1.) Thus, the following algorithm is ε-differentially private:

S̃(I) = S(I) + ⟨Lap(1/ε)⟩^n

While the accuracy of L̃ and S̃ is the same, S implies a powerful set of constraints: the sequence must be in order. We denote the constraint set γ_S, which consists of all the inequalities S[i] ≤ S[i+1] for 1 ≤ i < n. We show next how to exploit these constraints to boost accuracy.

3.1 Constrained Inference: computing S̄

If the differentially private output s̃ = S̃(I) is out of order, then it violates the inequality constraint set γ_S. Our constrained inference approach finds the sequence s̄ that is closest to s̃ and in order. How to compute s̄ = min_L2(s̃, γ_S) is not obvious. It turns out the solution has a surprisingly elegant closed form. Before stating the solution, we give examples.

Table 1: Examples of private outputs s̃ = S̃(I) and their closest ordered sequences s̄ = min_L2(s̃, γ_S).

noisy output s̃ | inferred output s̄ | distance ||s̃ − s̄||²
⟨9, 10, 14⟩ | ⟨9, 10, 14⟩ | 0
⟨9, 14, 10⟩ | ⟨9, 12, 12⟩ | 8
⟨14, 9, 10, 15⟩ | ⟨11, 11, 11, 15⟩ | 14

Example 3. Table 1 shows examples of s̃ and its closest ordered sequence s̄. The first example, s̃ = (9, 10, 14), is already ordered, so s̄ is equal to s̃. In the second example, s̃ = (9, 14, 10), the last two elements are out of order. The closest ordered sequence is s̄ = (9, 12, 12). In the last example, the sequence is in order except for s̃[1]. While changing the first element from 14 to 9 would make it ordered, this is not the closest solution: its distance from s̃ would be (14 − 9)² = 25, but s̄ has distance 14 from s̃.

The minimum L2 solution is given in the following theorem. The proof appears in Appendix E.1. Let s̃[i, j] be the subsequence of j − i + 1 elements ⟨s̃[i], s̃[i+1], ..., s̃[j]⟩. Let M[i, j] be the mean of these elements, i.e. M[i, j] = Σ_{k=i}^{j} s̃[k] / (j − i + 1).

Theorem 1. Denote L_k = min_{j ∈ [k, n]} max_{i ∈ [1, j]} M[i, j] and U_k = max_{i ∈ [1, k]} min_{j ∈ [i, n]} M[i, j]. The minimum L2 solution s̄ = min_L2(s̃, γ_S) is unique and given by: s̄[k] = L_k = U_k.

We compute min_L2(s̃, γ_S) using a dynamic program based on U_k, which can be written as U_k = max(U_{k−1}, min_{j ∈ [k, n]} M[k, j]) for k ≥ 2. At each step k, the only computation is to find the minimum cumulative average M[k, j] for j = k, ..., n and compare it to U_{k−1}. Finding the minimum average takes linear time for each k ∈ [1, n], so the total runtime is O(n²).

Example 4. To compute s̄ when s̃ = ⟨14, 9, 10, 15⟩, we first compute U_1 = max_{i ∈ [1, 1]} min_{j ∈ [i, 4]} M[i, j], which is simply min_{j ∈ [1, 4]} M[1, j], i.e. the smallest mean among the means of subsequences starting from 1. The minimum is M[1, 3] = 11. To compute U_2, we first compute the smallest mean starting from 2, which is min_{j ∈ [2, 4]} M[2, j] = M[2, 2] = 9. We then compare this to U_1 and take the maximum. Thus U_2 = 11, as is U_3. At U_4, the smallest mean is 15, which is larger than 11, so U_4 = 15. Thus s̄ = (11, 11, 11, 15).
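The U_k recurrence above translates directly into the quadratic-time program the text describes; a sketch (variable names are mine):

```python
def constrained_inference(s_noisy):
    # O(n^2) dynamic program: s_bar[k] = U_k = max(U_{k-1}, min_{j >= k} M[k, j]),
    # where M[k, j] is the mean of s_noisy[k..j].
    n = len(s_noisy)
    out, prev = [], float("-inf")
    for k in range(n):
        total, best = 0.0, float("inf")
        for j in range(k, n):
            total += s_noisy[j]
            best = min(best, total / (j - k + 1))  # cumulative average M[k, j]
        prev = max(prev, best)  # U_k = max(U_{k-1}, min_j M[k, j])
        out.append(prev)
    return out

# Reproduces Example 4:
assert constrained_inference([14, 9, 10, 15]) == [11, 11, 11, 15]
```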

3.2 Utility Analysis: the accuracy of S̄

In this section, we analyze S̄ and show that it has much better utility than S̃ (and therefore much better utility than L̃ for unattributed histograms). Before presenting the theoretical statement of utility, we first give an example that illustrates under what conditions S̄ is likely to reduce error.

[Figure 2: Example sequence S(I), noisy estimates s̃ = S̃(I), and inferred estimates s̄ = S̄(I), at ε = 1.0.]

Example 5. Fig 2 shows a sequence S(I) (open circles) along with a sampled s̃ (gray circles) and the inferred s̄ (black squares). For the first twenty positions, s̄ is very close to the true answer—the true sequence S(I) is uniform (S[i] = 10 for i ∈ [1, 20]) and the noise of s̃ is effectively averaged out. The error is higher at the twenty-first position, which is a unique count in S(I), and constrained inference does not refine the noisy answer: s̄[21] = s̃[21] = 11.4. The expected error of S̃ (gray vertical bars) is uniformly equal to 2, whereas the expected error of S̄ (black bars) varies: smallest in the middle of the uniform subsequence and largest at the twenty-first position (though still less than the expected error of S̃).



[Figure: KS distance vs. k (1 to 10000, log scale) between the true degree distribution and the estimates s̃, s̄, and a sampled variant, for three graphs: livejournal, power, and orkut.]
