Pattern Recognition and Applications Lab
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
Privacy
Giorgio Fumera
http://pralab.diee.unica.it
Outline
• Introduction
  – privacy issues in the information society
  – privacy in data release: microdata
• Techniques for protecting the privacy of microdata
  – identity disclosure: k-anonymity
  – attribute disclosure: ℓ-diversity, t-closeness
  – differential privacy
• Application examples
  – privacy-preserving data mining
  – location data
  – social networks
• Privacy issues in cloud scenarios
Resources
Ch. 9 Privacy
• V. Ciriani et al., Theory of Privacy and Anonymity, in: Algorithms and Theory of Computation Handbook (2nd ed.), M. Atallah and M. Blanton (eds.), CRC Press, 2009
  http://spdp.di.unimi.it/papers/cdfs-theory_privacy_anonymity.pdf
• S. De Capitani di Vimercati et al., Data Privacy: Definitions and Techniques, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 20(6): 793–818 (2012)
  http://spdp.di.unimi.it/papers/ijufks2012.pdf
• P. Samarati and S. De Capitani di Vimercati, Cloud Security: Issues and Concerns, in: Encyclopedia of Cloud Computing, S. Murugesan and I. Bojanova (eds.), Wiley, 2016
  http://spdp.di.unimi.it/papers/sd-cloud_security.pdf
Introduction
Privacy issues in the information society
Privacy: a multifaceted concept whose meaning is context-dependent.
In the ICT field, several aspects lead to privacy issues:
– huge amounts of personal data are collected, stored, and processed (including user-generated data)
– unclear data ownership
– users’ lack of control over their own data
– restricted access to information and its expensive processing are no longer valid protection measures
The rapid evolution of the ICT landscape leads to ever-changing privacy issues and privacy protection needs.
Privacy issues in the information society
Main kinds of data that are collected, stored, analysed and shared in digital form:
– personal information acquired during online activities in everyday life
  • Internet browsing
  • social networks
  • online transactions
  • ...
– data released by public and private organisations (e.g., census data, business data, medical data) for research or statistical purposes, or because of laws and regulations
  • aggregate statistical data
  • data about specific individuals or organisations
– data outsourced for storage and computation (cloud services)
Privacy issues in the information society
Examples of privacy protection needs:
– the identity of users should be protected
– sensitive information about users should be kept private
– users’ actions (e.g., Web browsing data) should not be traceable
Protecting privacy is increasingly difficult due to:
– the availability of different information sources whose analysis and correlation (linking) can leak information not intended for disclosure
– the availability of sophisticated techniques (e.g., data mining) to automatically analyse and correlate huge sources of information
Privacy issues in the information society
In the current ICT landscape, users interact with remote information sources to use online services and to retrieve data.
Three main technological aspects of privacy can be identified in this context:
– privacy of the user
– privacy of the communication
– privacy of the information
Privacy of the user
Protecting the identities of the parties that communicate through a network, to avoid tracing
– who is communicating with whom
– who is interacting with which server or searching for which data
Main solution: techniques and protocols to guarantee an anonymous communication (e.g., Onion Routing, Tor)
– sender anonymity
– recipient anonymity
Privacy of the communication
Two main aspects related to confidentiality of the information:
– protecting the content of personal information sent through a network
  • main kind of technique: encryption protocols (e.g., SSL)
– protecting the content of service requests against misuse by providers (e.g., against user profiling)
  • private information retrieval
  • secure multi-party computation
  • privacy-preserving statistical analysis
  • privacy-preserving data mining
Privacy of the information
Privacy of the information refers to data about individuals and organizations that are collected, stored and possibly publicly released by public and private organizations. It involves:
– definition of privacy policies (e.g., EU’s General Data Protection Regulation, GDPR)
  • data holder’s responsibility for data use and dissemination
  • user’s rights on data use, dissemination, disclosure, correction
– development of technologies for ensuring data protection
Main issue: protecting the anonymity of data owners
– identity disclosure protection (against re-identification)
– attribute disclosure protection (sensitive data)
– inference channel protection (inference, data association)
Privacy of the information
To protect user anonymity, specific norms limit the use of collected data to specific purposes (historical, statistical or scientific), provided that appropriate safeguards are applied.
Safeguards depend on the data release method. Two main data release methods exist:
– macrodata and statistical tables
– microdata
This course focuses on microdata privacy protection.
Data release: macrodata, statistical databases
Main form of data release in the past:
– macrodata: aggregate information (statistics) on users or organizations, usually in the form of two-dimensional tables
– statistical tables: databases from which only aggregate statistics can be retrieved by users through a DBMS
Different organizations need to make these kinds of data publicly available, e.g.:
– government agencies: historical data (e.g., census data, medical data)
– private organizations: business-related data (e.g., products and sales)
Some examples:
– EUROSTAT (the statistical office of the European Union)
– ISTAT (Italian National Institute of Statistics)
Data release: macrodata, statistical databases
Main protection techniques:
– macrodata: selective obfuscation of sensitive cells
– statistical databases
  • restricting the statistical queries that can be made or the data that can be published
  • returning a modified result to the user, with the modification applied either at storage time or at run time
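The two protection strategies for statistical databases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the minimum query-set size of 3 and the uniform noise are hypothetical choices, not values taken from the slides.

```python
import random

# Hypothetical toy records (age, salary); not taken from the slides.
RECORDS = [(34, 30000), (45, 52000), (29, 41000), (51, 38000), (38, 47000)]

MIN_QUERY_SET_SIZE = 3  # assumed threshold for query restriction


def restricted_mean_salary(predicate):
    """Query restriction: refuse queries that select too few records."""
    salaries = [salary for age, salary in RECORDS if predicate(age)]
    if len(salaries) < MIN_QUERY_SET_SIZE:
        return None  # query refused
    return sum(salaries) / len(salaries)


def perturbed_mean_salary(predicate, rng, noise_scale=1000.0):
    """Run-time perturbation: return the true mean plus bounded random noise."""
    salaries = [salary for age, salary in RECORDS if predicate(age)]
    if not salaries:
        return None
    true_mean = sum(salaries) / len(salaries)
    return true_mean + rng.uniform(-noise_scale, noise_scale)


rng = random.Random(0)
print(restricted_mean_salary(lambda age: age > 30))  # four records match: answered
print(restricted_mean_salary(lambda age: age > 50))  # one record matches: refused
print(perturbed_mean_salary(lambda age: age > 30, rng))
```

Refusing small query sets prevents a query from isolating a single respondent, while perturbation trades some accuracy for protection. Real systems need more than this, since a sequence of allowed queries can still be combined to isolate an individual.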
Data release: microdata
Nowadays the release of microdata, i.e., data about specific individuals or organizations (respondents), is necessary
– pros: increased flexibility and availability of information to users
– cons: increasing risks of privacy breaches against the anonymity of respondents
Microdata are usually released as two-dimensional tables.
[Figure: a toy example for medical data]
Privacy issues in microdata
Basic measure to protect user anonymity: de-identification
– encrypting identifiers
– removing identifiers
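A minimal sketch of both de-identification options follows. The records and field names are hypothetical, and a keyed hash (HMAC) is used here as a simple stand-in for encrypting identifiers; it is not necessarily the technique the slides have in mind.

```python
import hashlib
import hmac

# Hypothetical toy records; names and values are invented for illustration.
records = [
    {"name": "Alice Rossi", "phone": "555-0101", "zip": "09123", "disease": "flu"},
    {"name": "Bruno Melis", "phone": "555-0102", "zip": "09124", "disease": "asthma"},
]

IDENTIFIERS = ("name", "phone")
SECRET_KEY = b"keep-this-key-private"  # assumed to be known only to the data holder


def remove_identifiers(record):
    """Drop explicit identifiers and keep every other attribute."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}


def pseudonymize(record):
    """Replace identifiers with a keyed-hash pseudonym (stand-in for encryption)."""
    material = "|".join(record[k] for k in IDENTIFIERS).encode("utf-8")
    pseudonym = hmac.new(SECRET_KEY, material, hashlib.sha256).hexdigest()[:16]
    out = remove_identifiers(record)
    out["pid"] = pseudonym
    return out


removed = [remove_identifiers(r) for r in records]
pseudonymized = [pseudonymize(r) for r in records]
print(removed[0])
print(pseudonymized[0])
```

In both outputs the ZIP code survives untouched, so attributes other than the explicit identifiers remain available for linking.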
Privacy issues in microdata: data linking
However, de-identification does not guarantee anonymity.
Other attributes, named quasi-identifiers (e.g., date of birth, sex, ZIP code), can be linked with external and publicly available information to
– re-identify respondents
– reduce the uncertainty on their identities
– infer sensitive information not intended for disclosure
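The linking attack can be illustrated with a small join on the quasi-identifiers. All records below are hypothetical; the public list plays the role of an external source such as a voter roll.

```python
# De-identified medical microdata (hypothetical): identifiers removed,
# quasi-identifiers kept.
medical = [
    {"dob": "1964-09-27", "sex": "F", "zip": "94139", "disease": "hypertension"},
    {"dob": "1970-03-11", "sex": "M", "zip": "94139", "disease": "asthma"},
    {"dob": "1970-03-11", "sex": "M", "zip": "94139", "disease": "flu"},
]

# Publicly available list with names (hypothetical).
public = [
    {"name": "Sue Doe", "dob": "1964-09-27", "sex": "F", "zip": "94139"},
    {"name": "Bob Roe", "dob": "1970-03-11", "sex": "M", "zip": "94139"},
]

QUASI_IDENTIFIERS = ("dob", "sex", "zip")


def qi_key(row):
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(row[attr] for attr in QUASI_IDENTIFIERS)


def link(public_rows, released_rows):
    """Map each named person to the sensitive values of matching released rows."""
    index = {}
    for row in released_rows:
        index.setdefault(qi_key(row), []).append(row["disease"])
    return {p["name"]: index.get(qi_key(p), []) for p in public_rows}


candidates = link(public, medical)
print(candidates)
```

Here one person matches a single released row and is re-identified together with her diagnosis, while the other matches two rows, so the uncertainty about his record is reduced to two candidates.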
Data linking: toy example
Real-world examples of data linking
• U.S. census data (2000)
• America OnLine (AOL) incident (2006)
• Netflix incident (2006)
U.S. census data (2000)
A study carried out in 2006 showed that a considerable fraction of the U.S. population can be uniquely identified by
– gender
– location (either ZIP code or county)
– date of birth (year, year and month, full date)
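The kind of measurement behind such a study can be sketched as follows: count the fraction of people whose combination of gender, location and date of birth is unique, at different granularities of the date. The six-person population is hypothetical and only shows the mechanics; it says nothing about the actual census figures.

```python
from collections import Counter

# Hypothetical population: (gender, ZIP code, date of birth).
population = [
    ("F", "09123", "1980-04-12"),
    ("F", "09123", "1980-04-25"),
    ("M", "09123", "1980-04-12"),
    ("M", "09123", "1980-07-30"),
    ("F", "09124", "1980-04-02"),
    ("M", "09124", "1981-07-01"),
]


def generalize_dob(dob, level):
    """Truncate an ISO date to year, year and month, or the full date."""
    return {"year": dob[:4], "month": dob[:7], "full": dob}[level]


def unique_fraction(people, level):
    """Fraction of people whose quasi-identifier combination is unique."""
    keys = [(g, z, generalize_dob(d, level)) for g, z, d in people]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


for level in ("year", "month", "full"):
    print(level, unique_fraction(population, level))
```

The finer the date of birth, the larger the fraction of unique individuals, which is why coarsening quasi-identifiers is a natural protection measure.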
P. Golle, Revisiting the Uniqueness of Simple Demographics in the US Population, Proc. WPES’06, pp. 77–80, ACM, 2006. Available at: https://crypto.stanford.edu/~pgolle/papers/census.pdf
AOL incident (2006)
America OnLine (AOL, an Internet services and media company) released in 2006 around 20 million search records of 650,000 customers for research purposes.
Records were de-identified by replacing personal identifiers with numerical identifiers (ID).
Records contained
– ID
– the term(s) used for the search
– the timestamp
– whether the user clicked on a result, and the corresponding website
AOL incident (2006)
A sample of the data released by AOL, related to user IDs 116874 and 117020:
116874 thompson water seal 2006-05-24 11:31:36 1 http://www.thompsonwaterseal.com
116874 knbt 2006-05-31 07:57:28
116874 knbt.com 2006-05-31 08:09:30 1 http://www.knbt.com
117020 texas penal code 2006-03-03 17:57:38 1 http://www.capitol.state.tx.us
117020 homicide in hook texas 2006-03-08 09:47:35
117020 homicide in bowle county 2006-03-08 09:48:25 6 http://www.tdcj.state.tx.us
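Because the same numerical ID is attached to every query of a user, the records can be grouped into per-user search profiles. The sketch below uses the sample records above; writing them as tab-separated fields is an assumption made for this example.

```python
from collections import defaultdict

# Sample records, written as tab-separated fields
# (the tab delimiter is an assumption made for this sketch).
RAW_LINES = [
    "116874\tthompson water seal\t2006-05-24 11:31:36\t1\thttp://www.thompsonwaterseal.com",
    "116874\tknbt\t2006-05-31 07:57:28",
    "116874\tknbt.com\t2006-05-31 08:09:30\t1\thttp://www.knbt.com",
    "117020\ttexas penal code\t2006-03-03 17:57:38\t1\thttp://www.capitol.state.tx.us",
    "117020\thomicide in hook texas\t2006-03-08 09:47:35",
    "117020\thomicide in bowle county\t2006-03-08 09:48:25\t6\thttp://www.tdcj.state.tx.us",
]


def parse(line):
    """Split one record; the last two fields exist only if a result was clicked."""
    fields = line.split("\t")
    user_id, query, timestamp = fields[:3]
    clicked_url = fields[4] if len(fields) > 4 else None
    return {"id": user_id, "query": query, "time": timestamp, "url": clicked_url}


# Group all queries by numerical ID to obtain one profile per user.
profiles = defaultdict(list)
for line in RAW_LINES:
    record = parse(line)
    profiles[record["id"]].append(record["query"])

for user_id, queries in profiles.items():
    print(user_id, queries)
```

Each profile collects places, interests and concerns of one person, so the search terms themselves act as quasi-identifiers even though the personal identifier was replaced.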
AOL incident (2006)
Two reporters of the New York Times were able to re-identify the AOL customer with ID 4417749:
• Thelma Arnold
• 62-year-old widow
• living in Lilburn
The released data were immediately removed.
Netflix incident (2006)
Netflix (an online movie rental service) launched the Netflix Prize competition in 2006, offering $1 million to anyone who could improve its movie recommendation algorithm, which is based on stored customer data.
To this aim, Netflix released 100 million records containing the ratings given by 500,000 users to the movies they had rented.
Records were de-identified by replacing personal identifiers with numerical identifiers.
Netflix incident (2006)
Some researchers were able to de-anonymize the data by comparing it with publicly available ratings on the Internet Movie Database (IMDB).
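The idea of matching released ratings against public ones can be sketched as a simple scoring procedure. This is a deliberately simplified illustration with hypothetical data, not the algorithm the researchers actually used; the scoring rule and the margin threshold are assumptions.

```python
# Released, pseudonymous ratings (hypothetical): ID -> {movie: rating}.
released = {
    "u1": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "u2": {"Movie A": 3, "Movie B": 2, "Movie E": 5},
    "u3": {"Movie C": 4, "Movie D": 5, "Movie E": 1},
}

# Ratings posted publicly under a real name (hypothetical).
public_profile = {"Movie A": 5, "Movie C": 4, "Movie D": 1}


def score(candidate, known):
    """Number of publicly known ratings that the candidate reproduces exactly."""
    return sum(1 for movie, rating in known.items() if candidate.get(movie) == rating)


def best_match(released_ratings, known, min_margin=2):
    """Return the best-scoring ID only if it clearly beats the runner-up."""
    ranked = sorted(
        released_ratings,
        key=lambda uid: score(released_ratings[uid], known),
        reverse=True,
    )
    best, runner_up = ranked[0], ranked[1]
    margin = score(released_ratings[best], known) - score(released_ratings[runner_up], known)
    return best if margin >= min_margin else None


print(best_match(released, public_profile))
print(best_match(released, {"Movie E": 5}))
```

A handful of known ratings is enough to single out one pseudonymous user here, after which all of that user's other ratings are disclosed as well. A single known rating leaves the match ambiguous, so no ID is returned.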
As an example, a lesbian mother was re-identified, causing the disclosure of her sexual orientation.
A planned follow-up contest was canceled in 2010 after a privacy lawsuit.
Research participant identification (2013)
A study published in 2013 showed that the identities of people who participate in genetic research studies can be discovered by cross-referencing their data with publicly available information:
https://www.nature.com/news/privacy-protections-the-genome-hacker-1.12940
Relevant sources of microdata: an example
Statistical institutes:
– EUROSTAT – The statistical office of the European Union
  • https://ec.europa.eu/eurostat/
  • information on available microdata: https://ec.europa.eu/eurostat/web/microdata/public-microdata
– ISTAT – Istituto Nazionale di Statistica (Italian National Institute of Statistics)
  • https://www.istat.it – https://www.istat.it/en
  • information on available microdata: https://www.istat.it/en/analysis-and-products/microdata-files