Data Privacy and Anonymization

Big Data and Attacks on Privacy: How to Properly Anonymize Social Networks and Databases (and Keep Them That Way) AC 298r Final Presentation Ryan Lee and Jeffrey Wang

Upload: jeffrey-wang

Post on 28-Nov-2014

DESCRIPTION

In the world of Big Data, there has been a lot of research into efficient algorithms that help us gain statistical insight from the large databases that record much of our lives. However, as our digital footprint grows, many databases that were originally considered anonymous can now be re-identified. How do we make sure that doesn't happen?

TRANSCRIPT

Page 1: Data Privacy and Anonymization

Big Data and Attacks on Privacy: How to Properly Anonymize Social Networks and Databases (and Keep Them That Way)

AC 298r Final Presentation

Ryan Lee and Jeffrey Wang

Page 2: Data Privacy and Anonymization

Obligatory Social Network Stats

http://www.mediabistro.com/alltwitter/files/2013/11/growth-of-social-media-2013.jpg

Page 3: Data Privacy and Anonymization

Uses of Social Data: Research

Bollen et al. (2011). CS109, Harvard Univ., Fall 2013.

Christakis & Fowler (2010). Christakis & Fowler (2007).

Page 4: Data Privacy and Anonymization

Uses of Social Data: Marketing

Facebook.com

Bio-Rad

Page 5: Data Privacy and Anonymization

Chang, R., Lee, A., Ghoniem, M., Kosara, R., Ribarsky, W., Yang, J., ... & Sudjianto, A. (2008). Scalable and interactive visual analysis of financial wire transactions for fraud detection. Information visualization, 7(1), 63-76.

Uses of Social Data: Government

Page 6: Data Privacy and Anonymization

Challenge: Privacy

Page 7: Data Privacy and Anonymization

Naive Approach: Anonymization

Name               Favorite Pizza   Favorite Course
Ryan Lee           Supreme          AC298r
Jeffrey Wang       Pepperoni        AC298r
Daniel Weinstock   Anchovies        AC298r

Page 8: Data Privacy and Anonymization

Naive Approach: Anonymization

Name               Favorite Pizza   Favorite Course
Ryan Lee           Supreme          AC298r
Jeffrey Wang       Pepperoni        AC298r
Daniel Weinstock   Anchovies        AC298r

Page 9: Data Privacy and Anonymization

Priority: Security

Page 10: Data Privacy and Anonymization

Concern: Digital Footprint

NSA Data Warehouse

Page 11: Data Privacy and Anonymization

Deanonymization is Possible

Sweeney, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002

Page 12: Data Privacy and Anonymization

Netflix Prize 2

Page 13: Data Privacy and Anonymization

Netflix De-anon: How they did it

● 500,000 record dataset was super-sparse

Netflix “Anonymized” Data

Public Data (IMDb, Twitter, blogs, etc.)

Match if:
● time difference < threshold
● movie rating difference < threshold

Names
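
The matching rule on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the attack as actually carried out: the record layout, the names `match_score` and `best_match`, the tolerance values, and the toy data are all assumptions made for the example.

```python
from datetime import date

def match_score(anon_ratings, public_ratings, max_days=14, max_stars=1):
    """Count movies whose rating and date both agree within the tolerances."""
    hits = 0
    for movie, (stars, day) in public_ratings.items():
        if movie not in anon_ratings:
            continue
        anon_stars, anon_day = anon_ratings[movie]
        close_rating = abs(anon_stars - stars) <= max_stars
        close_date = abs((anon_day - day).days) <= max_days
        if close_rating and close_date:
            hits += 1
    return hits

def best_match(anon_db, public_ratings):
    """Return the anonymized record with the most matching movies, plus all scores."""
    scores = {rid: match_score(rec, public_ratings) for rid, rec in anon_db.items()}
    return max(scores, key=scores.get), scores

# Hypothetical "anonymized" records: movie -> (stars, rating date).
anon_db = {
    "user_17": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 3, 9))},
    "user_42": {"Movie A": (1, date(2005, 6, 1)), "Movie C": (4, date(2005, 6, 2))},
}
# Hypothetical public reviews posted under a real name.
public_profile = {"Movie A": (5, date(2005, 3, 3)), "Movie B": (3, date(2005, 3, 10))}

candidate, scores = best_match(anon_db, public_profile)
```

Because the dataset is sparse, even a handful of approximate matches can single out one record, which is what ties the public name to the "anonymized" row.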

Page 14: Data Privacy and Anonymization

Surnames in Genomic Sequences

TACATA is a real last name...

Page 15: Data Privacy and Anonymization

“Anonymized” Cell Phone Data

de Montjoye, Y. A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). Unique in the Crowd: The privacy bounds of human mobility. Scientific reports, 3.

Page 16: Data Privacy and Anonymization

Defenses (lol JK)

Page 17: Data Privacy and Anonymization

K-Anonymity

Sweeney, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
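
A table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. A minimal sketch of that check follows; the function name `k_anonymity`, the column names, and the toy rows are assumptions for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical released table.
rows = [
    {"zip": "02138", "gender": "F", "birth_year": 1980},
    {"zip": "02138", "gender": "F", "birth_year": 1980},
    {"zip": "02139", "gender": "M", "birth_year": 1975},
    {"zip": "02139", "gender": "M", "birth_year": 1975},
    {"zip": "02139", "gender": "M", "birth_year": 1975},
]

k = k_anonymity(rows, ["zip", "gender", "birth_year"])
```

Here the smallest group has two rows, so the table is 2-anonymous with respect to these three columns. Adding a single row with a unique combination would drop k to 1.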

Page 18: Data Privacy and Anonymization

A Tough Problem

Date of birth, gender, and ZIP code are enough to uniquely identify 87% of the US population

Sweeney, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002

Page 19: Data Privacy and Anonymization

Solution?

First      Last      Age   Race
Harry      Stone     34    African American
John       Reyser    36    Caucasian
Beatrice   Stone     34    African American
John       Delgado   22    Hispanic

Sweeney, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002

Page 20: Data Privacy and Anonymization

Solution: Suppression and Generalization

First      Last      Age   Race
Harry      Stone     34    African American
John       Reyser    36    Caucasian
Beatrice   Stone     34    African American
John       Delgado   22    Hispanic

k=2: Polynomial Solution! (Simplex Matching)

k>=3: NP-Hard (Graph Decomposition)

Sweeney, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
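
To make suppression and generalization concrete, here is a small sketch applied to the four rows on this slide. It assumes the rows have already been paired by hand (finding a good grouping is the hard part) and uses one simple rule per column: keep a value the group agrees on, widen numeric values to a range, and replace anything else with `*`. The function name `generalize_group` and the pairing are illustrative assumptions.

```python
def generalize_group(group, numeric=("Age",)):
    """Make all rows in a group identical by generalizing or suppressing cells."""
    merged = {}
    for col in group[0]:
        values = {row[col] for row in group}
        if len(values) == 1:
            merged[col] = group[0][col]                   # everyone agrees: keep
        elif col in numeric:
            merged[col] = f"{min(values)}-{max(values)}"  # generalize to a range
        else:
            merged[col] = "*"                             # suppress
    return [dict(merged) for _ in group]

table = [
    {"First": "Harry", "Last": "Stone", "Age": 34, "Race": "African American"},
    {"First": "John", "Last": "Reyser", "Age": 36, "Race": "Caucasian"},
    {"First": "Beatrice", "Last": "Stone", "Age": 34, "Race": "African American"},
    {"First": "John", "Last": "Delgado", "Age": 22, "Race": "Hispanic"},
]

# Hand-picked pairing for k=2 (an assumption, not computed by a matching algorithm).
groups = [[table[0], table[2]], [table[1], table[3]]]
released = [row for g in groups for row in generalize_group(g)]
```

The two Stone rows only lose the first name, while the two John rows lose the last name and race and have their ages widened to a range. How much information survives depends heavily on which rows are grouped together.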

Page 21: Data Privacy and Anonymization

Differential Privacy

● Whether or not a user chooses to participate in the database, the probability of any given output changes by at most a factor of e^ε

Dwork, ICALP, 2006
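
One standard way to achieve differential privacy for a counting query is the Laplace mechanism: add noise with scale 1/ε to the true count, since adding or removing one person changes a count by at most 1. The sketch below is an illustration only; the function names, the toy rows, and the choice of ε are assumptions.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(rows, predicate, epsilon, rng=random):
    """Answer a counting query (sensitivity 1) with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical survey data.
rows = [{"pizza": "Supreme"}, {"pizza": "Pepperoni"}, {"pizza": "Anchovies"},
        {"pizza": "Pepperoni"}]

def likes_pepperoni(row):
    return row["pizza"] == "Pepperoni"

noisy = private_count(rows, likes_pepperoni, epsilon=0.5)
```

A smaller ε means more noise and stronger privacy. Each released answer spends privacy budget, so repeating the same query many times and averaging would gradually reveal the true count.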

Page 22: Data Privacy and Anonymization

Anonymity in Social Networks

Peter S. Bearman, James Moody, and Katherine Stovel, Chains of affection: The structure of adolescent romantic and sexual networks, American Journal of Sociology 110, 44-91 (2004).

http://www-personal.umich.edu/~mejn/networks/addhealth.gif

High School Dating Network

Page 23: Data Privacy and Anonymization

Information-rich Network Structure

Backstrom, L., & Kleinberg, J. (2013). Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook. arXiv preprint arXiv:1310.6753.

Page 24: Data Privacy and Anonymization

Attacks on Social Networks

● Passive: Find yourselves

● Active: structural steganography

http://www.cse.psu.edu/~asmith/courses/privacy598d/www/lec-notes/Attacking%20Social%20Network%20FINAL.pdf

No isomorphism

No automorphism
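
Assuming the automorphism condition refers to the subgraph an active attacker plants (it should have no symmetry other than the identity, so each planted node can be told apart later), the check can be sketched by brute force for small graphs. The function name `automorphisms` and the example graphs are illustrative assumptions, and the approach only scales to a handful of nodes.

```python
from itertools import permutations

def automorphisms(nodes, edges):
    """Return every relabeling of nodes that maps the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        if all(frozenset((mapping[u], mapping[v])) in edge_set for u, v in edges):
            found.append(mapping)
    return found

# A 3-node path can be flipped end to end, so it has a non-trivial automorphism.
path_nodes = ["a", "b", "c"]
path_edges = [("a", "b"), ("b", "c")]

# A tree with legs of length 1, 2 and 3 around a center has no symmetry.
tree_nodes = ["c", "x1", "y1", "y2", "z1", "z2", "z3"]
tree_edges = [("c", "x1"), ("c", "y1"), ("y1", "y2"),
              ("c", "z1"), ("z1", "z2"), ("z2", "z3")]

path_count = len(automorphisms(path_nodes, path_edges))
tree_count = len(automorphisms(tree_nodes, tree_edges))
```

A count of 1 means only the identity mapping preserves the edges, which is the property the attacker wants for the planted structure.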

Page 25: Data Privacy and Anonymization

Obfuscating Social Networks

Zhou and Pei, KAIS, 2011

Page 26: Data Privacy and Anonymization

Step 1: Construct Min-DFS Tree for Neighborhood

Zhou and Pei, KAIS, 2011

Page 27: Data Privacy and Anonymization

2 Useful Properties

1. Degree distributions in social networks follow a power law

2. Social networks typically have a small diameter (six degrees of separation)
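
Both properties can be measured directly on an adjacency list. The sketch below computes a degree histogram and the diameter by breadth-first search on a toy graph. The function names and the graph are assumptions for illustration, and the diameter routine assumes the graph is connected.

```python
from collections import Counter, deque

def degree_histogram(adj):
    """Map each degree to the number of vertices that have it."""
    return Counter(len(neighbors) for neighbors in adj.values())

def diameter(adj):
    """Longest shortest path between any two vertices (connected graphs only)."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# Hypothetical toy network: one hub, several low-degree vertices.
adj = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub", "e"],
    "e": ["d"],
}

hist = degree_histogram(adj)
diam = diameter(adj)
```

Even in this tiny example most vertices have degree 1 and a single hub keeps every pair within a few hops, which is the shape the two properties describe.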

Page 28: Data Privacy and Anonymization

Step 2: Anonymize Similar Vertices

Zhou and Pei, KAIS, 2011

Page 29: Data Privacy and Anonymization

Step 3: ??? => Step 4: Profit!

Zhou and Pei, KAIS, 2011

Page 30: Data Privacy and Anonymization

thanks

bye