1
Privacy in Data Management
Sharad Mehrotra
2
Privacy - definitions
Generic
- Privacy is the interest that individuals have in sustaining a 'personal space', free from interference by other people and organizations.
Information Privacy
- The degree to which an individual can determine which personal information is to be shared with whom and for what purpose.
- The evolving relationship between technology and the legal right to, or public expectation of privacy in the collection and sharing of data
Identity privacy (anonymity)
- Anonymity of an element (belonging to a set) refers to the property of that element of not being identifiable within the set, i.e., being indistinguishable from the other elements of the set
3
Means of achieving privacy
Information Security is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption.
Enforcing security in information processing applications:
1. Law
2. Access control
3. Data encryption
4. Data transformation (statistical disclosure control)
Techniques used depend on:
- Application semantics / functionality requirements
- Nature of data
- Privacy requirements / metrics
Privacy is contextual
4
Overview
Study the nature of privacy in the context of data-centric applications:
1. Privacy-preserving data publishing for data mining applications
2. Secure outsourcing of data: “Database as a Service” (DAS)
3. Privacy-preserving implementation of pervasive spaces
4. Secure data exchange and sharing between multiple parties
5
Privacy-Preserving / Anonymized Data Publishing
6
Why Anonymize?
For Data Sharing
- Give real(istic) data to others to study without compromising privacy of individuals in the data
- Allows third parties to try new analysis and mining techniques not thought of by the data owner
For Data Retention and Usage
- Various requirements prevent companies from retaining customer information indefinitely
- E.g., Google progressively anonymizes IP addresses in search logs
- Internal sharing across departments (e.g., billing → marketing)
7
Why Privacy?
Data subjects have an inherent right and expectation of privacy
“Privacy” is a complex concept (beyond the scope of this tutorial)
- What exactly does “privacy” mean? When does it apply?
- Could there exist societies without a concept of privacy?
Concretely: at collection, “small print” outlines privacy rules
- Most companies have adopted a privacy policy
- E.g., AT&T privacy policy: att.com/gen/privacy-policy?pid=2506
Significant legal framework relating to privacy
- UN Declaration of Human Rights, US Constitution
- HIPAA, Video Privacy Protection Act, Data Protection Acts
8
Case Study: US Census
Raw data: information about every US household
- Who, where; age, gender, racial, income and educational data
Why released: determine representation, planning
How anonymized: aggregated to geographic areas (ZIP code)
- Broken down by various combinations of dimensions
- Released in full after 72 years
Attacks: no reports of successful deanonymization
- Recent attempts by FBI to access raw data rebuffed
Consequences: greater understanding of US population
- Affects representation, funding of civil projects
- Rich source of data for future historians and genealogists
9
Case Study: Netflix Prize
Raw data: 100M dated ratings from 480K users on 18K movies
Why released: improve prediction of ratings for unlabeled examples
How anonymized: exact details not described by Netflix
- All direct customer information removed
- Only a subset of the full data; dates modified; some ratings deleted; movie title and year published in full
Attacks: dataset is claimed vulnerable [Narayanan Shmatikov 08]
- Attack links data to IMDb, where the same users also rated movies
- Find matches based on similar ratings or dates in both
Consequences: rich source of user data for researchers
- Unclear if attacks are a threat: no lawsuits or apologies yet
10
Case Study: AOL Search Data
Raw data: 20M search queries for 650K users from 2006
Why released: allow researchers to understand search patterns
How anonymized: user identifiers removed
- All searches from the same user linked by an arbitrary identifier
Attacks: many successful attacks identified individual users
- Ego-surfers: people typed in their own names
- ZIP codes and town names identify an area
- NY Times identified user 4417749 as a 62-year-old GA widow [Barbaro Zeller 06]
Consequences: CTO resigned, two researchers fired
- Well-intentioned effort failed due to inadequate anonymization
11
Three Abstract Examples
“Census” data recording incomes and demographics
- Schema: (SSN, DOB, Sex, ZIP, Salary)
- Tabular data: best represented as a table
“Video” data recording movies viewed
- Schema: (Uid, DOB, Sex, ZIP), (Vid, title, genre), (Uid, Vid)
- Graph data: graph properties should be retained
“Search” data recording web searches
- Schema: (Uid, Kw1, Kw2, …)
- Set data: each user has a different set of keywords
Each example has different anonymization needs
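A minimal sketch, in Python, of how the three data shapes might be represented; all concrete field names and values beyond the schemas above are illustrative assumptions.

```python
# Three data shapes: tabular ("census"), graph ("video"), set ("search").
from dataclasses import dataclass

@dataclass
class CensusRow:                     # tabular: one row per individual
    ssn: str
    dob: str
    sex: str
    zip: str
    salary: int

# "Video": a bipartite graph between users and videos
users  = {1: ("1970-01-01", "F", "92617")}   # Uid -> (DOB, Sex, ZIP)
videos = {7: ("Some Title", "drama")}        # Vid -> (title, genre)
views  = {(1, 7)}                            # edges: (Uid, Vid) pairs

# "Search": each user has a different set of keywords
searches = {1: {"privacy", "anonymization"}} # Uid -> set of keywords
```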
12
Models of Anonymization
Interactive model (akin to statistical databases)
- Data owner acts as “gatekeeper” to the data
- Researchers pose queries in some agreed language
- Gatekeeper gives an (anonymized) answer, or refuses to answer
“Send me your code” model
- Data owner executes the code on their system and reports the result
- Cannot be sure that the code is not malicious
Offline, aka “publish and be damned” model
- Data owner somehow anonymizes the data set
- Publishes the results to the world, and retires
- Our focus in this tutorial: seems to model most real releases
13
Objectives for Anonymization
Prevent (high-confidence) inference of associations
- Prevent inference of salary for an individual in “census”
- Prevent inference of an individual’s viewing history in “video”
- Prevent inference of an individual’s search history in “search”
- All aim to prevent linking sensitive information to an individual
Prevent inference of the presence of an individual in the data set
- Satisfying “presence” also satisfies “association” (not vice versa)
- Presence in a data set can violate privacy (e.g., STD clinic patients)
Have to model what knowledge might be known to the attacker
- Background knowledge: facts about the data set (X has salary Y)
- Domain knowledge: broad properties of the data (illness Z rare in men)
14
Utility
Anonymization is meaningless if the utility of the data is not considered
- The empty data set has perfect privacy, but no utility
- The original data has full utility, but no privacy
What is “utility”? Depends what the application is…
- For a fixed query set, can look at max or average distortion
- Problem for publishing: want to support unknown applications!
- Need some way to quantify the utility of alternative anonymizations
15
Measures of Utility
Define a surrogate measure and try to optimize
- Often based on the “information loss” of the anonymization
- Simple example: number of rows suppressed in a table
Give a guarantee for all queries in some fixed class
- Hope the class is representative, so other uses have low distortion
- Costly: some methods enumerate all queries, or all anonymizations
Empirical evaluation
- Perform experiments with a reasonable workload on the result
- Compare to results on the original data (e.g., Netflix prize problems)
Combinations of multiple methods
- Optimize for some surrogate, but also evaluate on real queries
16
Definitions of Technical Terms
Identifiers: uniquely identify an individual, e.g., Social Security Number (SSN)
- Step 0: remove all identifiers
- Was not enough for AOL search data
Quasi-identifiers (QI), such as DOB, Sex, ZIP code
- Enough to partially identify an individual in a dataset
- DOB + Sex + ZIP is unique for 87% of US residents [Sweeney 02]
Sensitive attributes (SA): the associations we want to hide
- Salary in the “census” example is considered sensitive
- Not always well-defined: only some “search” queries are sensitive
- In “video”, the association between user and video is sensitive
- SA can be identifying: bonus may identify salary…
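A minimal sketch of the linkage attack that makes quasi-identifiers dangerous: joining a “de-identified” table with a public record on (DOB, Sex, ZIP). All records here are hypothetical.

```python
medical = [   # identifiers removed; QI and sensitive attribute remain
    {"dob": "1961-07-31", "sex": "F", "zip": "02138", "diagnosis": "..."},
]
voter_list = [   # public data: names alongside the same QI attributes
    {"name": "Jane Roe", "dob": "1961-07-31", "sex": "F", "zip": "02138"},
]

QI = ("dob", "sex", "zip")
for m in medical:
    matches = [v for v in voter_list if all(v[k] == m[k] for k in QI)]
    if len(matches) == 1:   # unique QI combination -> re-identification
        print(matches[0]["name"], "->", m["diagnosis"])
```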
17
Summary of Anonymization Motivation
Anonymization is needed for safe data sharing and retention
- Many legal requirements apply
Various privacy definitions possible
- Primarily, prevent inference of sensitive information
- Under some assumptions of background knowledge
Utility of the anonymized data needs to be carefully studied
- Different data types imply different classes of query
18
Privacy issues in data outsourcing (DAS) and cloud computing applications
19
Motivation
23
Example: DAS - Secure outsourcing of data management
Issues:
- Confidential information in the data needs to be protected
- Features: support queries on the data (SQL, keyword-based search queries, XPath queries, etc.)
- Performance: bulk of the work should be done on the server; reduce communication overhead, client-side storage, and post-processing of results
[Architecture: the data owner/client connects over the Internet to the service provider's server, which hosts the DB]
24
Security model for DAS applications
Adversaries (A):
- Inside attackers: authorized users with malicious intent
- Outside attackers: hackers, snoopers
Attack models:
- Passive attacks: A wants to learn confidential information
- Active attacks: A wants to learn confidential information and actively modifies data and/or queries
Trust in the server:
- Untrusted: normal hardware; data and computation visible
- Semi-trusted: trusted co-processors with limited storage
- Trusted: all hardware is trusted and tamper-proof
25
Secure data storage & querying in DAS
[Architecture: the data owner/client connects over the Internet to the service provider's server, which hosts the DB]

Security concern: “ssn”, “salary”, and “credit rating” are confidential.

R:
| ssn | name | credit rating | salary | age |
|-----|------|---------------|--------|-----|
| 780 | John | bad           | 34K    | 32  |
| 876 | Mary | good          | 29K    | 40  |
| :   | :    | :             | :      | :   |

How to execute queries on encrypted data?
e.g., Select * from R where salary ∈ [25K, 35K]

Encrypt the sensitive column values
- Trivial solution: retrieve all rows to the client, decrypt them, and check the predicate
- We can do better: use secure indices for query evaluation on the server
26
Data storage
R: original table (plain text); RS: server-side table (encrypted + indexed)

R:
| ssn | name  | sex    | credit rating | sal | age |
|-----|-------|--------|---------------|-----|-----|
| 345 | Tom   | Male   | Bad           | 34K | 32  |
| 876 | Mary  | Female | Good          | 29K | 40  |
| 234 | Jerry | Male   | Good          | 45K | 34  |
| 780 | John  | Male   | Bad           | 39K | 33  |

RS (server-side data):
| etuple                     | bucket |
|----------------------------|--------|
| (^#&*%T%&4&7ERGTty^Q!%^&*  | B2     |
| &^$^G@UG^g&@^&&#G@@#(GW    | B1     |
| &*#($T%#$@$R@@$#@^FG$%&    | B3     |
| &*#($T%#$@$R@@$#@^FG$%&    | B2     |

Client-side metadata: salary partitioned into buckets B0 = [0, 20K), B1 = [20K, 30K), B2 = [30K, 40K), B3 = [40K, 50K)

- Encrypt the rows
- Partition salary values into buckets
- Index the etuples by their bucket labels
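A minimal sketch of this storage step, with the Fernet cipher from the Python `cryptography` package standing in for the encryption scheme (which the slides leave unspecified); bucket boundaries follow the figure above.

```python
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # stays with the data owner/client
cipher = Fernet(key)

BOUNDARIES = [0, 20_000, 30_000, 40_000, 50_000]   # buckets B0..B3

def bucket(salary):
    """Map a salary to its bucket label (client-side metadata)."""
    for i in range(len(BOUNDARIES) - 1):
        if BOUNDARIES[i] <= salary < BOUNDARIES[i + 1]:
            return f"B{i}"
    raise ValueError("salary out of bucketized range")

def to_server_row(row):
    """Encrypt the whole tuple; the server sees only the bucket label."""
    etuple = cipher.encrypt(json.dumps(row).encode())
    return {"etuple": etuple, "bucket": bucket(row["sal"])}

R = [{"ssn": 345, "name": "Tom",  "sal": 34_000},
     {"ssn": 876, "name": "Mary", "sal": 29_000},
     {"ssn": 780, "name": "John", "sal": 39_000}]
RS = [to_server_row(r) for r in R]   # what the server stores
```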
27
Querying encrypted data
ssn
name
sex credit rating
sal age
345
Tom Male Bad 34k 32
876
Mary Female
Good 29k 40
234 Jerry Male Good 45k 34
780 John Male Bad 39k 33
Client side Table (plain text) R
Server side Table (encrypted + indexed) RS
Client-side query
Server-side query
Select etuple from RS where bucket = B1 ∨ B2
Select * from R where sal [25K, 35K]
Client side Table (plain text) R
etuple bucket
(^#&*%T%&4&7ERGTty^Q!%^&*
B2
&^$^G@UG^g&@^&&#G@@#(GW
B1
&*#($T%#$@$R@@$#@^FG$%&
B3
&*#($T%#$@$R@@$#@^FG$%&
B2
False positive
Client side data
0 20 30 40 50
B0 B1 B2 B3
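Continuing the storage sketch from the previous slide (same `cipher`, `BOUNDARIES`, and `RS`), a minimal sketch of query translation and client-side post-processing:

```python
import json

def buckets_for_range(lo, hi):
    """Labels of buckets whose interval overlaps [lo, hi]; this yields
    the server-side condition 'bucket = B1 OR bucket = B2'."""
    return {f"B{i}"
            for i in range(len(BOUNDARIES) - 1)
            if BOUNDARIES[i] <= hi and BOUNDARIES[i + 1] > lo}

def range_query(lo, hi):
    labels = buckets_for_range(lo, hi)
    # Server side: coarse filter on opaque labels, no plaintext involved
    candidates = [row["etuple"] for row in RS if row["bucket"] in labels]
    # Client side: decrypt and discard false positives
    results = []
    for etuple in candidates:
        r = json.loads(cipher.decrypt(etuple))
        if lo <= r["sal"] <= hi:
            results.append(r)
    return results

# John's 39K row comes back from the server (bucket B2) but is
# filtered out by the client as a false positive.
print(range_query(25_000, 35_000))   # -> Tom (34K) and Mary (29K)
```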
28
Problems to address
Security analysis
- Goal: hide the confidential information in the data from server-side adversaries (DB admins, etc.)
- Quantitative measures of disclosure risk
Quality of partitioning (bucketization)
- Data partitioning schemes
- Cost measures
Tradeoff
- Balancing the two competing goals of security and performance
Continued later…
29
Privacy in Cloud Computing
What is cloud computing? Many definitions exist.
- “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” [NIST]
- “Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized service-level agreements.” [Luis M. Vaquero et al., Madrid, Spain]
30
Privacy in Cloud Computing
Actors
- Service providers: provide software services (e.g., Google, Yahoo, Microsoft, IBM)
- Service users: personal, business, government
- Infrastructure providers: provide the computing infrastructure required to host services
Three cloud service models
- Cloud Software as a Service (SaaS): use the provider's applications over a network
- Cloud Platform as a Service (PaaS): deploy customer-created applications to a cloud
- Cloud Infrastructure as a Service (IaaS): rent processing, storage, network capacity, and other fundamental computing resources
31
Privacy in Cloud Computing
Examples of cloud computing services:
- Web-based email
- Photo storage
- Spreadsheet applications
- File transfer
- Online medical record storage
- Social network applications
32
Privacy in Cloud Computing
Privacy issues in cloud computing
- The cloud increases security and privacy risks
Data
- Creation, storage, and communication growing at an exponential rate
- Data replicated across large geographic distances
- Data contain personally identifiable information
- Data stored at untrusted hosts
Enormous risks for data privacy
- Loss of control of sensitive data
- Risk of sharing sensitive data with marketing
Another problem: technology is ahead of the law
- Does the user or the hosting company own the data?
- Can the host deny a user access to their own data?
- If the host company goes out of business, what happens to the users' data it holds?
- How does the host protect the user's data?
33
Privacy in Cloud Computing
Solutions
- Today, the cloud does not offer any privacy
- Awareness
- Some effort:
  - ACM Cloud Computing Security Workshop, November 2009
  - ACM Symposium on Cloud Computing, June 2010
Privacy in cloud computing at UCI
- Recently launched a project on privacy preservation in cloud computing
- General approach: personal privacy middleware
34
Privacy preservation in Pervasive Spaces
40
Privacy in data sharing and exchange
41
Extra material
42
Example: Detecting a pre-specified set of events
No ordinary coffee room, but one that is monitored!
- There are rules that apply
- If a rule is violated, penalties may be imposed
But all is not unfair: individuals have a right to privacy!
- “Until an individual has had more than his quota of coffee, his identity will not be revealed”
Just like a coffee room!!
43
Issues to be addressed
Modeling pervasive spaces: how to capture events of interest
- E.g., “Tom had his 4th cup of coffee for the day”
Privacy goal: guarantee anonymity to individuals
- What are the necessary and sufficient conditions?
Solution
- Design should satisfy the necessary and sufficient conditions
- Practical/scalable
44
Basic events, Composite events & Rules
Model of pervasive space: a stream of basic events
Composite event: one or more sequences of basic events
Rule: (composite event, action)
Rules apply to groups of individuals, e.g.:
- Coffee room rules apply to everyone
- Server room rule applies to everyone except administrators, etc.
Stream of basic events emitted by the sensors of the pervasive space:
e1: <Tom, coffee-room, *, enter>
e2: <Tom, coffee-room, coffee-cup, dispense>
:
ek: <Bill, coffee-room, coffee-maker, exit>
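A minimal sketch of this event model; `BasicEvent` mirrors the <person, location, object, action> tuples above, while the rule encoding is an illustrative assumption.

```python
from typing import NamedTuple

class BasicEvent(NamedTuple):
    person: str
    location: str
    obj: str        # '*' matches any object
    action: str

stream = [
    BasicEvent("Tom",  "coffee-room", "*",            "enter"),
    BasicEvent("Tom",  "coffee-room", "coffee-cup",   "dispense"),
    BasicEvent("Bill", "coffee-room", "coffee-maker", "exit"),
]

# Rule = (composite event, action), scoped to a group of individuals
rule = {
    "applies_to": "everyone",          # or, e.g., everyone except admins
    "composite_event": "4th coffee-cup dispense in a day",
    "action": "reveal identity",
}
```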
45
Composite-events & automaton templates
Composite-event templates

“A student drinks more than 3 cups of coffee”
- e1 ≡ <u ∈ STUDENT, coffee_room, coffee_cup, dispense>
- Automaton: S0 → 1 → 2 → 3 → SF, advancing on each occurrence of e1 and self-looping on ¬e1

“A student tries to access the IBM machine in the server room”
- e1 ≡ <u ∈ STUDENT, server_room, *, entry>
- e2 ≡ <ū, server_room, *, exit>
- e3 ≡ <ū, server_room, IBM-mc, login-attempt>
- Automaton: S0 → 1 on e1; state 1 self-loops on ¬(e3 ∨ e2), returns to S0 on e2, and reaches SF on e3
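A minimal sketch of the first template's automaton: states S0, 1, 2, 3, SF, advancing on each matching event e1 and implicitly self-looping otherwise. The class and helper names are hypothetical.

```python
def matches_e1(event, user):
    """e1 = <u in STUDENT, coffee_room, coffee_cup, dispense>."""
    return event == (user, "coffee_room", "coffee_cup", "dispense")

class CupAutomaton:
    FINAL = 4                            # SF is reached on the 4th cup

    def __init__(self, user):
        self.user, self.state = user, 0  # start in S0

    def step(self, event):
        if matches_e1(event, self.user):
            self.state += 1              # advance on e1; else self-loop
        return self.state >= self.FINAL  # rule fires once SF is reached

a = CupAutomaton("Tom")
fired = False
for e in [("Tom", "coffee_room", "coffee_cup", "dispense")] * 4:
    fired = a.step(e)
print(fired)   # True: Tom drank more than 3 cups
```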
46
System architecture & adversary
[Architecture: secure sensor nodes (SSNs) send events to the server, which maintains the rules DB and encrypted state information (automaton instances)]

Basic assumptions about SSNs:
- Trusted hardware (sensors are tamper-proof)
- Secure data capture and generation of basic events by the SSN
- Limited computation and storage capacity: can carry out encryption/decryption with a secret key common to all SSNs, and automaton transitions
Thin trusted middleware to obfuscate origin of events
47
Privacy goal & Adversary’s knowledge
Minimum requirement to ensure anonymity: state information (automatons) is always kept encrypted on the server
Ensure k-anonymity for each individual
- (k-anonymity is achieved when each individual is indistinguishable from at least k-1 other individuals associated with the space)
Passive adversary (A): a server-side snooper who wants to deduce the identity of the individual associated with a basic event
- A knows all rules of the space and the automaton structures
- A can observe all server-side activities
- A has unlimited computational power
48
Basic protocol
Protocol between a secure sensor node (SSN) and the server:
1. SSN: generate basic event e
2. SSN → server: encrypted query for automatons that make a transition on e
3. Server → SSN: return automatons that (possibly) match e (encrypted match)
4. SSN: decrypt the automatons, advance their states if necessary, and associate an encrypted label with each new state
5. SSN → server: write back the encrypted automatons
6. Server: store the updated automatons

Question: Does encryption ensure anonymity? NO! The pattern of automaton accesses may reveal identity.
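A minimal sketch of one round of the loop above, assuming a Fernet key shared by all SSNs; the state layout and helper names are hypothetical. The server stores and returns only ciphertext and opaque labels.

```python
import json, secrets
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # common to all SSNs, never leaves them
cipher = Fernet(key)

server_store = {}                # server: opaque label -> encrypted state

def ssn_handle(event, candidate_labels):
    """One round: fetch, decrypt, advance, re-encrypt, write back."""
    for label in candidate_labels:            # returned by server's match
        state = json.loads(cipher.decrypt(server_store.pop(label)))
        if state["user"] == event["user"]:    # transition applies
            state["count"] += 1               # advance automaton state
        # Re-encrypt under a fresh random label whether or not the state
        # changed, so the server cannot track individual automatons
        server_store[secrets.token_hex(8)] = \
            cipher.encrypt(json.dumps(state).encode())
```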
49
Example
Example rules, each triggered by “U enters kitchen” followed by a second event:
- R1: U enters kitchen; U takes coffee
- R2: U enters kitchen; U opens fridge
- R3: U enters kitchen; U opens microwave

R1, R2, and R3 all apply to Tom: when Tom enters the kitchen, 3 automatons fire.
Only R1 and R2 apply to Bill: when Bill enters the kitchen, 2 automatons fire.

On an event, the number of rows retrieved from the state table can disclose the identity of the individual.
50
Characteristic access patterns of automatons
The characteristic access patterns of rows can potentially reveal the identity of the automaton in spite of encryption
[Figure: three automatons x, y, and z track the rules applicable to Tom, driven by events such as “Tom enters kitchen”, “Tom takes coffee”, “Tom opens fridge”, “Tom leaves coffee pot empty”, and “Tom leaves fridge open”]

Characteristic patterns of x:
- P1: {x,y,z} → {x,y}
Characteristic patterns of y:
- P2: {x,y,z} → {x,y} → {y}
- P3: {x,y,z} → {y,z} → {y}
Characteristic patterns of z:
- P4: {x,y,z} → {y,z}
The set of rules applicable to an individual may be unique, and can therefore potentially identify the individual.
51
Solution scheme
Formalized the notion of indistinguishability of automatons in terms of their access patterns
Identified “event clustering” as a mechanism for inducing indistinguishability for achieving k-anonymity
Proved the difficulty of checking for k-anonymity Characterized the class of event-clustering schemes that achieve k-
anonymity
Proposed an efficient clustering algorithm to minimize average execution overhead for protocol
Implemented a prototype system
Challenges: Designing a truly secure sensing-infrastructure is challenging Key management issues Are there other interesting notions of privacy in pervasive space?