1
Privacy in Data Management
Sharad Mehrotra
2
Privacy - definitions
Generic
- Privacy is the interest that individuals have in sustaining a 'personal space', free from interference by other people and organizations.
Information Privacy
- The degree to which an individual can determine which personal information is to be shared with whom and for what purpose.
- The evolving relationship between technology and the legal right to, or public expectation of privacy in the collection and sharing of data
Identity privacy (anonymity)
- Anonymity of an element (belonging to a set) refers to the property of that element of not being identifiable within the set, i.e., being indistinguishable from the other elements of the set
3
Means of achieving privacy
Information Security is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption.
Enforcing security in information processing applications:
1. Law
2. Access control
3. Data encryption
4. Data transformation (statistical disclosure control)
Techniques used depend on:
- Application semantics / functionality requirements
- Nature of data
- Privacy requirements / metrics
Privacy is contextual
4
Overview
Study the nature of privacy in the context of data-centric applications:
1. Privacy-preserving data publishing for data mining applications
2. Secure outsourcing of data: “Database as a Service” (DAS)
3. Privacy-preserving implementation of pervasive spaces
4. Secure data exchange and sharing between multiple parties
5
Privacy-Preserving / Anonymized Data Publishing
6
Why Anonymize?
For Data Sharing
- Give real(istic) data to others to study without compromising privacy of individuals in the data
- Allows third parties to try new analysis and mining techniques not thought of by the data owner
For Data Retention and Usage
- Various requirements prevent companies from retaining customer information indefinitely
- E.g., Google progressively anonymizes IP addresses in search logs
- Internal sharing across departments (e.g., billing → marketing)
7
Why Privacy?
Data subjects have an inherent right and expectation of privacy
“Privacy” is a complex concept (beyond the scope of this tutorial)
- What exactly does “privacy” mean? When does it apply?
- Could there exist societies without a concept of privacy?
Concretely: at collection, “small print” outlines privacy rules
- Most companies have adopted a privacy policy
- E.g., AT&T privacy policy: att.com/gen/privacy-policy?pid=2506
Significant legal framework relating to privacy
- UN Declaration of Human Rights, US Constitution
- HIPAA, Video Privacy Protection Act, Data Protection Acts
8
Case Study: US Census
Raw data: information about every US household
- Who, where; age, gender, racial, income and educational data
Why released: determine representation, planning
How anonymized: aggregated to geographic areas (ZIP code)
- Broken down by various combinations of dimensions
- Released in full after 72 years
Attacks: no reports of successful deanonymization
- Recent attempts by FBI to access raw data rebuffed
Consequences: greater understanding of US population
- Affects representation, funding of civil projects
- Rich source of data for future historians and genealogists
9
Case Study: Netflix Prize
Raw data: 100M dated ratings from 480K users on 18K movies
Why released: improve prediction of ratings for unlabeled examples
How anonymized: exact details not described by Netflix
- All direct customer information removed
- Only a subset of the full data; dates modified; some ratings deleted; movie title and year published in full
Attacks: dataset is claimed vulnerable [Narayanan Shmatikov 08]
- Attack links data to IMDb, where the same users also rated movies
- Find matches based on similar ratings or dates in both
Consequences: rich source of user data for researchers
- Unclear if attacks are a threat: no lawsuits or apologies yet
10
Case Study: AOL Search Data
Raw data: 20M search queries for 650K users from 2006
Why released: allow researchers to understand search patterns
How anonymized: user identifiers removed
- All searches from the same user linked by an arbitrary identifier
Attacks: many successful attacks identified individual users
- Ego-surfers: people typed in their own names
- ZIP codes and town names identify an area
- NY Times identified user 4417749 as a 62-year-old GA widow [Barbaro Zeller 06]
Consequences: CTO resigned, two researchers fired
- Well-intentioned effort failed due to inadequate anonymization
11
Three Abstract Examples
“Census” data recording incomes and demographics
- Schema: (SSN, DOB, Sex, ZIP, Salary)
- Tabular data: best represented as a table
“Video” data recording movies viewed
- Schema: (Uid, DOB, Sex, ZIP), (Vid, title, genre), (Uid, Vid)
- Graph data: graph properties should be retained
“Search” data recording web searches
- Schema: (Uid, Kw1, Kw2, …)
- Set data: each user has a different set of keywords
Each example has different anonymization needs
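A minimal sketch, in Python, of how the three data shapes might be represented; all concrete field names and values beyond the schemas above are illustrative assumptions.

```python
# Three data shapes: tabular ("census"), graph ("video"), set ("search").
from dataclasses import dataclass

@dataclass
class CensusRow:                     # tabular: one row per individual
    ssn: str
    dob: str
    sex: str
    zip: str
    salary: int

# "Video": a bipartite graph between users and videos
users  = {1: ("1970-01-01", "F", "92617")}   # Uid -> (DOB, Sex, ZIP)
videos = {7: ("Some Title", "drama")}        # Vid -> (title, genre)
views  = {(1, 7)}                            # edges: (Uid, Vid) pairs

# "Search": each user has a different set of keywords
searches = {1: {"privacy", "anonymization"}} # Uid -> set of keywords
```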
12
Models of Anonymization
Interactive model (akin to statistical databases)
- Data owner acts as “gatekeeper” to the data
- Researchers pose queries in some agreed language
- Gatekeeper gives an (anonymized) answer, or refuses to answer
“Send me your code” model
- Data owner executes the code on their system and reports the result
- Cannot be sure that the code is not malicious
Offline, aka “publish and be damned” model
- Data owner somehow anonymizes the data set
- Publishes the results to the world, and retires
- Our focus in this tutorial: seems to model most real releases
13
Objectives for Anonymization
Prevent (high-confidence) inference of associations
- Prevent inference of salary for an individual in “census”
- Prevent inference of an individual’s viewing history in “video”
- Prevent inference of an individual’s search history in “search”
- All aim to prevent linking sensitive information to an individual
Prevent inference of the presence of an individual in the data set
- Satisfying “presence” also satisfies “association” (not vice versa)
- Presence in a data set can violate privacy (e.g., STD clinic patients)
Have to model what knowledge might be known to the attacker
- Background knowledge: facts about the data set (X has salary Y)
- Domain knowledge: broad properties of the data (illness Z rare in men)
14
Utility
Anonymization is meaningless if the utility of the data is not considered
- The empty data set has perfect privacy, but no utility
- The original data has full utility, but no privacy
What is “utility”? Depends what the application is…
- For a fixed query set, can look at max or average distortion
- Problem for publishing: want to support unknown applications!
- Need some way to quantify the utility of alternative anonymizations
15
Measures of Utility
Define a surrogate measure and try to optimize
- Often based on the “information loss” of the anonymization
- Simple example: number of rows suppressed in a table
Give a guarantee for all queries in some fixed class
- Hope the class is representative, so other uses have low distortion
- Costly: some methods enumerate all queries, or all anonymizations
Empirical evaluation
- Perform experiments with a reasonable workload on the result
- Compare to results on the original data (e.g., Netflix prize problems)
Combinations of multiple methods
- Optimize for some surrogate, but also evaluate on real queries
16
Definitions of Technical Terms
Identifiers: uniquely identify an individual, e.g., Social Security Number (SSN)
- Step 0: remove all identifiers
- Was not enough for AOL search data
Quasi-identifiers (QI), such as DOB, Sex, ZIP code
- Enough to partially identify an individual in a dataset
- DOB + Sex + ZIP is unique for 87% of US residents [Sweeney 02]
Sensitive attributes (SA): the associations we want to hide
- Salary in the “census” example is considered sensitive
- Not always well-defined: only some “search” queries are sensitive
- In “video”, the association between user and video is sensitive
- SA can be identifying: bonus may identify salary…
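A minimal sketch of the linkage attack that makes quasi-identifiers dangerous: joining a “de-identified” table with a public record on (DOB, Sex, ZIP). All records here are hypothetical.

```python
medical = [   # identifiers removed; QI and sensitive attribute remain
    {"dob": "1961-07-31", "sex": "F", "zip": "02138", "diagnosis": "..."},
]
voter_list = [   # public data: names alongside the same QI attributes
    {"name": "Jane Roe", "dob": "1961-07-31", "sex": "F", "zip": "02138"},
]

QI = ("dob", "sex", "zip")
for m in medical:
    matches = [v for v in voter_list if all(v[k] == m[k] for k in QI)]
    if len(matches) == 1:   # unique QI combination -> re-identification
        print(matches[0]["name"], "->", m["diagnosis"])
```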
17
Summary of Anonymization Motivation
Anonymization is needed for safe data sharing and retention
- Many legal requirements apply
Various privacy definitions possible
- Primarily, prevent inference of sensitive information
- Under some assumptions of background knowledge
Utility of the anonymized data needs to be carefully studied
- Different data types imply different classes of query
18
Privacy issues in data outsourcing (DAS) and cloud computing applications
19
Motivation
23
Example: DAS - Secure outsourcing of data management
Issues:
- Confidential information in the data needs to be protected
- Features: support queries on the data (SQL, keyword-based search queries, XPath queries, etc.)
- Performance: bulk of the work should be done on the server; reduce communication overhead, client-side storage, and post-processing of results
[Architecture: the data owner/client connects over the Internet to the service provider's server, which hosts the DB]
24
Security model for DAS applications
Adversaries (A):
- Inside attackers: authorized users with malicious intent
- Outside attackers: hackers, snoopers
Attack models:
- Passive attacks: A wants to learn confidential information
- Active attacks: A wants to learn confidential information and actively modifies data and/or queries
Trust in the server:
- Untrusted: normal hardware; data and computation visible
- Semi-trusted: trusted co-processors with limited storage
- Trusted: all hardware is trusted and tamper-proof
25
Secure data storage & querying in DAS
[Architecture: the data owner/client connects over the Internet to the service provider's server, which hosts the DB]

Security concern: “ssn”, “salary”, and “credit rating” are confidential.

R:
| ssn | name | credit rating | salary | age |
|-----|------|---------------|--------|-----|
| 780 | John | bad           | 34K    | 32  |
| 876 | Mary | good          | 29K    | 40  |
| :   | :    | :             | :      | :   |

How to execute queries on encrypted data?
e.g., Select * from R where salary ∈ [25K, 35K]

Encrypt the sensitive column values
- Trivial solution: retrieve all rows to the client, decrypt them, and check the predicate
- We can do better: use secure indices for query evaluation on the server
26
Data storage
R: original table (plain text); RS: server-side table (encrypted + indexed)

R:
| ssn | name  | sex    | credit rating | sal | age |
|-----|-------|--------|---------------|-----|-----|
| 345 | Tom   | Male   | Bad           | 34K | 32  |
| 876 | Mary  | Female | Good          | 29K | 40  |
| 234 | Jerry | Male   | Good          | 45K | 34  |
| 780 | John  | Male   | Bad           | 39K | 33  |

RS (server-side data):
| etuple                     | bucket |
|----------------------------|--------|
| (^#&*%T%&4&7ERGTty^Q!%^&*  | B2     |
| &^$^G@UG^g&@^&&#G@@#(GW    | B1     |
| &*#($T%#$@$R@@$#@^FG$%&    | B3     |
| &*#($T%#$@$R@@$#@^FG$%&    | B2     |

Client-side metadata: salary partitioned into buckets B0 = [0, 20K), B1 = [20K, 30K), B2 = [30K, 40K), B3 = [40K, 50K)

- Encrypt the rows
- Partition salary values into buckets
- Index the etuples by their bucket labels
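A minimal sketch of this storage step, with the Fernet cipher from the Python `cryptography` package standing in for the encryption scheme (which the slides leave unspecified); bucket boundaries follow the figure above.

```python
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # stays with the data owner/client
cipher = Fernet(key)

BOUNDARIES = [0, 20_000, 30_000, 40_000, 50_000]   # buckets B0..B3

def bucket(salary):
    """Map a salary to its bucket label (client-side metadata)."""
    for i in range(len(BOUNDARIES) - 1):
        if BOUNDARIES[i] <= salary < BOUNDARIES[i + 1]:
            return f"B{i}"
    raise ValueError("salary out of bucketized range")

def to_server_row(row):
    """Encrypt the whole tuple; the server sees only the bucket label."""
    etuple = cipher.encrypt(json.dumps(row).encode())
    return {"etuple": etuple, "bucket": bucket(row["sal"])}

R = [{"ssn": 345, "name": "Tom",  "sal": 34_000},
     {"ssn": 876, "name": "Mary", "sal": 29_000},
     {"ssn": 780, "name": "John", "sal": 39_000}]
RS = [to_server_row(r) for r in R]   # what the server stores
```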
27
Querying encrypted data
ssn
name
sex credit rating
sal age
345
Tom Male Bad 34k 32
876
Mary Female
Good 29k 40
234 Jerry Male Good 45k 34
780 John Male Bad 39k 33
Client side Table (plain text) R
Server side Table (encrypted + indexed) RS
Client-side query
Server-side query
Select etuple from RS where bucket = B1 ∨ B2
Select * from R where sal [25K, 35K]
Client side Table (plain text) R
etuple bucket
(^#&*%T%&4&7ERGTty^Q!%^&*
B2
&^$^G@UG^g&@^&&#G@@#(GW
B1
&*#($T%#$@$R@@$#@^FG$%&
B3
&*#($T%#$@$R@@$#@^FG$%&
B2
False positive
Client side data
0 20 30 40 50
B0 B1 B2 B3
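Continuing the storage sketch from the previous slide (same `cipher`, `BOUNDARIES`, and `RS`), a minimal sketch of query translation and client-side post-processing:

```python
import json

def buckets_for_range(lo, hi):
    """Labels of buckets whose interval overlaps [lo, hi]; this yields
    the server-side condition 'bucket = B1 OR bucket = B2'."""
    return {f"B{i}"
            for i in range(len(BOUNDARIES) - 1)
            if BOUNDARIES[i] <= hi and BOUNDARIES[i + 1] > lo}

def range_query(lo, hi):
    labels = buckets_for_range(lo, hi)
    # Server side: coarse filter on opaque labels, no plaintext involved
    candidates = [row["etuple"] for row in RS if row["bucket"] in labels]
    # Client side: decrypt and discard false positives
    results = []
    for etuple in candidates:
        r = json.loads(cipher.decrypt(etuple))
        if lo <= r["sal"] <= hi:
            results.append(r)
    return results

# John's 39K row comes back from the server (bucket B2) but is
# filtered out by the client as a false positive.
print(range_query(25_000, 35_000))   # -> Tom (34K) and Mary (29K)
```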
28
Problems to address
Security analysis
- Goal: hide the confidential information in the data from server-side adversaries (DB admins, etc.)
- Quantitative measures of disclosure risk
Quality of partitioning (bucketization)
- Data partitioning schemes
- Cost measures
Tradeoff
- Balancing the two competing goals of security and performance
Continued later…
29
Privacy in Cloud Computing
What is cloud computing? Many definitions exist.
- “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” [NIST]
- “Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized service-level agreements.” [Luis M. Vaquero et al., Madrid, Spain]
30
Privacy in Cloud Computing
Actors
- Service providers: provide software services (e.g., Google, Yahoo, Microsoft, IBM)
- Service users: personal, business, government
- Infrastructure providers: provide the computing infrastructure required to host services
Three cloud service models
- Cloud Software as a Service (SaaS): use the provider's applications over a network
- Cloud Platform as a Service (PaaS): deploy customer-created applications to a cloud
- Cloud Infrastructure as a Service (IaaS): rent processing, storage, network capacity, and other fundamental computing resources
31
Privacy in Cloud Computing
Examples of cloud computing services:
- Web-based email
- Photo storage
- Spreadsheet applications
- File transfer
- Online medical record storage
- Social network applications
32
Privacy in Cloud Computing
Privacy issues in cloud computing
- The cloud increases security and privacy risks
Data
- Creation, storage, and communication growing at an exponential rate
- Data replicated across large geographic distances
- Data contain personally identifiable information
- Data stored at untrusted hosts
Enormous risks for data privacy
- Loss of control of sensitive data
- Risk of sharing sensitive data with marketing
Another problem: technology is ahead of the law
- Does the user or the hosting company own the data?
- Can the host deny a user access to their own data?
- If the host company goes out of business, what happens to the users' data it holds?
- How does the host protect the user's data?
33
Privacy in Cloud Computing
Solutions
- Today, the cloud does not offer any privacy
- Awareness
- Some effort:
  - ACM Cloud Computing Security Workshop, November 2009
  - ACM Symposium on Cloud Computing, June 2010
Privacy in cloud computing at UCI
- Recently launched a project on privacy preservation in cloud computing
- General approach: personal privacy middleware
34
Privacy preservation in Pervasive Spaces
40
Privacy in data sharing and exchange
41
Extra material
42
Example: Detecting a pre-specified set of events
No ordinary coffee room, but one that is monitored!
- There are rules that apply
- If a rule is violated, penalties may be imposed
But all is not unfair: individuals have a right to privacy!
- “Until an individual has had more than his quota of coffee, his identity will not be revealed”
Just like a coffee room!!
43
Issues to be addressed
Modeling pervasive spaces: how to capture events of interest
- E.g., “Tom had his 4th cup of coffee for the day”
Privacy goal: guarantee anonymity to individuals
- What are the necessary and sufficient conditions?
Solution
- Design should satisfy the necessary and sufficient conditions
- Practical/scalable
44
Basic events, Composite events & Rules
Model of pervasive space: a stream of basic events
Composite event: one or more sequences of basic events
Rule: (composite event, action)
Rules apply to groups of individuals, e.g.:
- Coffee room rules apply to everyone
- Server room rule applies to everyone except administrators, etc.
Stream of basic events emitted by the sensors of the pervasive space:
e1: <Tom, coffee-room, *, enter>
e2: <Tom, coffee-room, coffee-cup, dispense>
:
ek: <Bill, coffee-room, coffee-maker, exit>
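A minimal sketch of this event model; `BasicEvent` mirrors the <person, location, object, action> tuples above, while the rule encoding is an illustrative assumption.

```python
from typing import NamedTuple

class BasicEvent(NamedTuple):
    person: str
    location: str
    obj: str        # '*' matches any object
    action: str

stream = [
    BasicEvent("Tom",  "coffee-room", "*",            "enter"),
    BasicEvent("Tom",  "coffee-room", "coffee-cup",   "dispense"),
    BasicEvent("Bill", "coffee-room", "coffee-maker", "exit"),
]

# Rule = (composite event, action), scoped to a group of individuals
rule = {
    "applies_to": "everyone",          # or, e.g., everyone except admins
    "composite_event": "4th coffee-cup dispense in a day",
    "action": "reveal identity",
}
```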
45
Composite-events & automaton templates
Composite-event templates

“A student drinks more than 3 cups of coffee”
- e1 ≡ <u ∈ STUDENT, coffee_room, coffee_cup, dispense>
- Automaton: S0 → 1 → 2 → 3 → SF, advancing on each occurrence of e1 and self-looping on ¬e1

“A student tries to access the IBM machine in the server room”
- e1 ≡ <u ∈ STUDENT, server_room, *, entry>
- e2 ≡ <ū, server_room, *, exit>
- e3 ≡ <ū, server_room, IBM-mc, login-attempt>
- Automaton: S0 → 1 on e1; state 1 self-loops on ¬(e3 ∨ e2), returns to S0 on e2, and reaches SF on e3
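A minimal sketch of the first template's automaton: states S0, 1, 2, 3, SF, advancing on each matching event e1 and implicitly self-looping otherwise. The class and helper names are hypothetical.

```python
def matches_e1(event, user):
    """e1 = <u in STUDENT, coffee_room, coffee_cup, dispense>."""
    return event == (user, "coffee_room", "coffee_cup", "dispense")

class CupAutomaton:
    FINAL = 4                            # SF is reached on the 4th cup

    def __init__(self, user):
        self.user, self.state = user, 0  # start in S0

    def step(self, event):
        if matches_e1(event, self.user):
            self.state += 1              # advance on e1; else self-loop
        return self.state >= self.FINAL  # rule fires once SF is reached

a = CupAutomaton("Tom")
fired = False
for e in [("Tom", "coffee_room", "coffee_cup", "dispense")] * 4:
    fired = a.step(e)
print(fired)   # True: Tom drank more than 3 cups
```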
46
System architecture & adversary
[Architecture: secure sensor nodes (SSNs) send events to the server, which maintains the rules DB and encrypted state information (automaton instances)]

Basic assumptions about SSNs:
- Trusted hardware (sensors are tamper-proof)
- Secure data capture and generation of basic events by the SSN
- Limited computation and storage capacity: can carry out encryption/decryption with a secret key common to all SSNs, and automaton transitions
Thin trusted middleware to obfuscate origin of events
47
Privacy goal & Adversary’s knowledge
Minimum requirement to ensure anonymity: state information (automatons) is always kept encrypted on the server
Ensure k-anonymity for each individual
- (k-anonymity is achieved when each individual is indistinguishable from at least k-1 other individuals associated with the space)
Passive adversary (A): a server-side snooper who wants to deduce the identity of the individual associated with a basic event
- A knows all rules of the space and the automaton structures
- A can observe all server-side activities
- A has unlimited computational power
48
Basic protocol
Protocol between a secure sensor node (SSN) and the server:
1. SSN: generate basic event e
2. SSN → server: encrypted query for automatons that make a transition on e
3. Server → SSN: return automatons that (possibly) match e (encrypted match)
4. SSN: decrypt the automatons, advance their states if necessary, and associate an encrypted label with each new state
5. SSN → server: write back the encrypted automatons
6. Server: store the updated automatons

Question: Does encryption ensure anonymity? NO! The pattern of automaton accesses may reveal identity.
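A minimal sketch of one round of the loop above, assuming a Fernet key shared by all SSNs; the state layout and helper names are hypothetical. The server stores and returns only ciphertext and opaque labels.

```python
import json, secrets
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # common to all SSNs, never leaves them
cipher = Fernet(key)

server_store = {}                # server: opaque label -> encrypted state

def ssn_handle(event, candidate_labels):
    """One round: fetch, decrypt, advance, re-encrypt, write back."""
    for label in candidate_labels:            # returned by server's match
        state = json.loads(cipher.decrypt(server_store.pop(label)))
        if state["user"] == event["user"]:    # transition applies
            state["count"] += 1               # advance automaton state
        # Re-encrypt under a fresh random label whether or not the state
        # changed, so the server cannot track individual automatons
        server_store[secrets.token_hex(8)] = \
            cipher.encrypt(json.dumps(state).encode())
```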
49
Example
Example rules, each triggered by “U enters kitchen” followed by a second event:
- R1: U enters kitchen; U takes coffee
- R2: U enters kitchen; U opens fridge
- R3: U enters kitchen; U opens microwave

R1, R2, and R3 all apply to Tom: when Tom enters the kitchen, 3 automatons fire.
Only R1 and R2 apply to Bill: when Bill enters the kitchen, 2 automatons fire.

On an event, the number of rows retrieved from the state table can disclose the identity of the individual.
50
Characteristic access patterns of automatons
The characteristic access patterns of rows can potentially reveal the identity of the automaton in spite of encryption
[Figure: three automatons x, y, and z track the rules applicable to Tom, driven by events such as “Tom enters kitchen”, “Tom takes coffee”, “Tom opens fridge”, “Tom leaves coffee pot empty”, and “Tom leaves fridge open”]

Characteristic patterns of x:
- P1: {x,y,z} → {x,y}
Characteristic patterns of y:
- P2: {x,y,z} → {x,y} → {y}
- P3: {x,y,z} → {y,z} → {y}
Characteristic patterns of z:
- P4: {x,y,z} → {y,z}
The set of rules applicable to an individual may be unique, and can therefore potentially identify the individual.
51
Solution scheme
Formalized the notion of indistinguishability of automatons in terms of their access patterns
Identified “event clustering” as a mechanism for inducing indistinguishability for achieving k-anonymity
Proved the difficulty of checking for k-anonymity Characterized the class of event-clustering schemes that achieve k-
anonymity
Proposed an efficient clustering algorithm to minimize average execution overhead for protocol
Implemented a prototype system
Challenges: Designing a truly secure sensing-infrastructure is challenging Key management issues Are there other interesting notions of privacy in pervasive space?