
Page 1: Privacy, Confidentiality and Ethics...•Relevance –data satisfy user needs •Accessibility –access to data is user friendly •Interpretability –documentation; meta-data •Granularity

Privacy, Confidentiality and Ethics

With thanks to more people than I can count, but especially Rayid Ghani, Arthur Kennickell, Frauke Kreuter, and George Putnam


Key questions

• What are the legal requirements?

• What are the rules of engagement?

• What are the best ways to provide access while also protecting confidentiality?

• Are there reasonable mechanisms to compensate citizens for privacy loss?

• How can we build trustworthy curators?

• What do we (need to) know about the data generating process?

• How can we increase linkage without increasing risks?


Key ideas

• New types of data => enormous opportunity for public good

• Three issues for privacy research
  - Access is critical for measurement and policy
  - Understand utility, risk and tradeoff
  - Protect output

• Important and difficult agenda


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Traditional
  - Differential Privacy
  - Secure Remote Access

• Vision


New types of data


Core Mission


Guiding principles


Motivation


Can Differentially Privatized Data be Used for Redistricting?

Andrew A. Beveridge, Queens College and Graduate Center CUNY and Social Explorer

Association of Public Data Users, Annual Conference, 2019. Key Bridge Marriott, Arlington VA


Data Products

Table P1 – Race (with multiple race, 71 items)

Table P2 – Hispanic or Latino, and not Hispanic or Latino by Race (with multiple races, 73 items)

Table P3 – Race for the Population 18 Years and Over (with multiple race, 71 items)

Table P4 – Hispanic or Latino, and not Hispanic or Latino by Race for the Population 18 and Over (with multiple race, 73 items)

Table H1 – Occupancy Status (Housing) (3 items)

Table P5 – Group Quarters Population by Group Quarters Type (New Table) (10 items)

• Multiple geographies including census block (some 78)

• Group quarters is total population only, no demographic breakdown

• Final 2020 P.L. 94-171 Redistricting Data File design expected summer of 2019

Based upon the 2010 Census and the End-to-End test there will be 78 Summary Levels by some 299 items. This will result in about three billion table cells.

Can Differential Privacy Result in Data Accurate Enough to do Redistricting?


Geographic Products

• Shapefiles - geographic information system geometry files

• Maps (PDF only) - County Block; Voting District/State Legislative District; Tract; School District

• Block Assignment Files - tables identifying the blocks used to build different geographic entities

• Block to Block Relationship Files - crosswalk of 2010 blocks to 2020 blocks

All these products are in service of redrawing districts, which needs to be done in time for the 2022 primaries, which generally occur in early summer 2022. Redistricting must be done by Spring 2022. (VA and NJ have earlier deadlines for state elections in 2021.)


White flag raised

https://www.census.gov/newsroom/blogs/random-samplings/2019/07/boost-safeguards.html


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Historical
  - Differential Privacy
  - Secure Remote Access

• Vision


What is …

Privacy

• includes the famous “right to be left alone,” and the ability to share information selectively but not publicly (White House 2014)

Confidentiality

• means “preserving authorized restrictions on information access and disclosure, including means for protecting personal privacy and proprietary information” (McCallister, Grance, and Scarfone 2010).


Why confidentiality is important

• Promise to respondents

• Ethical requirement

• Legal requirement

• Practical implications


Why access is important

• Data are dirty

• Datasets are not well-defined entities

• Linkages can be wrong

• Outliers are where the action is


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Historical
  - Differential Privacy
  - Secure Remote Access

• Vision


What is Risk?

• Reidentification
  - Noisy linkages
  - Sample vs. Census

• Harm at record level
  - Individual
  - Group
  - Discounted

• Harm at item level


Utility measures (reminder)

• Completeness – data rich enough

• Timeliness – data deliveries adhere to schedules

• Relevance – data satisfy user needs

• Accessibility – access to data is user friendly

• Interpretability – documentation; meta-data

• Granularity – data detailed enough

• Value – number of ways data are put to use

• Cost-effectiveness – value for money

Source: Biemer, P. (2017), Errors and Inference, Chapter 10 in Big Data and Social Science


Measurement


Why it matters


“As part of Child Fatality Review, department heads in Baltimore City government get together once a month. We review every child death that happened in the city since the previous meeting. We ask what more we might have done to prevent that tragedy. In many cases, each of us has a file on the child or the family at least an inch thick. It’s tragic to compare notes after the child has died—what more could we have done when the child was alive?”

DR. LEANA WEN, COMMISSIONER OF HEALTH, CITY OF BALTIMORE


Data Infrastructure


Legal framework

• Data controlled by statistical agencies
  - Title 26
  - Title 13
  - CIPSEA

• Other frameworks
  - HIPAA
  - FERPA

• Twin pillars of anonymization and consent


In Big Data Era

• Most data no longer collected by the government (internet search logs, Twitter, supermarket scanners…)

• The question of how to share collected information without violating privacy guarantees becomes more relevant


Additional Problems

• What is the legal framework when the ownership of data is unclear?

• Collection and analysis often no longer within same entity. Ownership of data less clear.

• Who has the legal authority to make decisions about permission, access and dissemination and under what circumstances?

• The challenge in the case of big data is that data sources are often combined, or collected for one purpose and used for another, and users often have no good understanding of this or of how their data will be used.


=> Concepts Out of Date

Notification is either comprehensive or comprehensible, but not both. (Nissenbaum 2011)

Understanding of the nature of harm has diffused over time.

Consumers value their own privacy in variously flawed ways. (Acquisti 2014)


Case Study: Issues with Consent

Opt-in vs. opt-out wording

Gain vs. loss framing

Front vs. back placement


Case Studies: Issues with Anonymization

• Identity disclosure
  - linkage with externally available data

• Attribute disclosure

• Inferential disclosure
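
Identity disclosure through linkage can be illustrated by counting how many records share each combination of quasi-identifiers, since a record that is unique on those fields can be matched to an external file that carries names. The sketch below is only an illustration: the records are made up, the field names and the choice of quasi-identifiers are assumptions, and the k-anonymity measure it reports is a standard concept that the slides themselves do not name.

```python
from collections import Counter

# Hypothetical, made-up microdata: names already removed ("anonymized").
records = [
    {"zip": "21201", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "21201", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "21201", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "21230", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
]

# Assumed quasi-identifiers: fields an outsider could find in another file.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_records(rows, keys):
    """Return rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


sizes = group_sizes(records, QUASI_IDENTIFIERS)
k = min(sizes.values())  # the file is k-anonymous for this k
at_risk = unique_records(records, QUASI_IDENTIFIERS)
print(f"k = {k}; {len(at_risk)} of {len(records)} records are unique")
```

Records that are unique on the quasi-identifiers are the ones most exposed to identity disclosure. Attribute disclosure can still occur in larger groups when everyone in a group shares the same sensitive value.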


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Historical
  - Differential Privacy
  - Secure Remote Access

• Vision


Practicalities of disclosure control

The aim of disclosure control is to ensure that no unauthorised individual, technically competent with public data and private information, could:

1) identify any information not already public knowledge with a reasonable degree of confidence, and

2) associate that information with the supplier of the information


What is disclosure?

There are three types of disclosure: identity, attribute and residual.

• Identity disclosure occurs when an individual can be identified from the released output, leading to information being provided about that identified subject.

• Attribute disclosure occurs when confidential information is revealed and can be attributed to an individual. It is not necessary for a specific individual to be identified or for a specific value to be given for attribute disclosure to occur. For example, publishing a narrow range for the salary of persons exercising a particular profession in one region may constitute a disclosure.

• Residual disclosure can occur when released information can be combined to obtain confidential data. Care must be taken to examine all output to be released. While a table on its own might not disclose confidential information, disclosure can occur by combining information from several sources, including external ones (e.g., suppressed data in one table can be derived from other tables).

• Source: Guide for Researchers under Agreement with Statistics Canada, October 2005
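
The residual disclosure case (a suppressed cell derived from other tables) can be shown with simple subtraction. This is a minimal sketch with made-up counts; the table layout, the use of `None` for a suppressed cell, and the function name `recover_suppressed` are all assumptions for illustration.

```python
# Made-up published table: counts by region and income band.
# None marks a cell suppressed by the data provider.
published = {
    "North": {"low": 40, "middle": 25, "high": None},
    "South": {"low": 35, "middle": 30, "high": 12},
}
# Row totals published in a separate table.
row_totals = {"North": 68, "South": 77}


def recover_suppressed(table, totals):
    """Fill in any row with exactly one suppressed cell by subtraction."""
    recovered = {}
    for row, cells in table.items():
        missing = [col for col, value in cells.items() if value is None]
        if len(missing) == 1:
            known = sum(v for v in cells.values() if v is not None)
            recovered[(row, missing[0])] = totals[row] - known
    return recovered


print(recover_suppressed(published, row_totals))
```

Here the suppressed North/high cell is recovered as 68 - (40 + 25) = 3, which is why output checking has to consider everything that has been or will be released, not each table in isolation.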


Historical approach

1. Aggregated tabular data

2. Public use files

3. Licensing

4. Synthetic Data

5. Research Data Centers


Examples

• Traditional approaches – microdata
  - local suppression
  - global recoding
  - top coding
  - sampling
  - rounding
  - swapping
  - added noise
  - data shuffling
  - …

• Traditional approaches – tables
  - cell suppression
  - controlled tabular adjustment
  - rounding
  - cell perturbation
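
Three of the techniques listed above are simple enough to sketch in a few lines. The example below is illustrative only: the cap, rounding base and suppression threshold are hypothetical values, not rules used by any particular agency, and the data are made up.

```python
def top_code(values, cap):
    """Top coding: replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]


def round_to_base(count, base=5):
    """Rounding: report a count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


def suppress_small_cells(table, threshold=5):
    """Cell suppression: hide counts below `threshold` (None = suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in table.items()}


incomes = [32_000, 48_000, 51_000, 950_000]  # made-up microdata
print(top_code(incomes, cap=150_000))

counts = {"A": 3, "B": 12, "C": 48}          # made-up table cells
print({cell: round_to_base(n) for cell, n in counts.items()})
print(suppress_small_cells(counts))
```

Each method trades detail for protection: top coding hides the outlier income, rounding blurs small differences between cells, and suppression removes small cells entirely. Suppressing only the small cells is usually not sufficient when marginal totals are also published, which is why complementary suppression is applied in practice.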


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Historical
  - Differential Privacy
  - Secure Remote Access

• Vision


Differential Privacy

• Differential privacy is a rigorous mathematical definition of privacy

• An algorithm is said to be differentially private if, by looking at the output, one cannot tell whether any individual's data was included in the original dataset or not.

• The guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset

• This guarantee holds for any individual and any dataset
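
The bullets above can be written as a formula. The statement below is the standard definition of ε-differential privacy from the literature, not notation taken from the slides: a randomized algorithm M is ε-differentially private if, for every pair of datasets D and D' that differ in one individual's record and for every set S of possible outputs,

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

With ε close to 0 the two output distributions are almost identical, so the output reveals almost nothing about whether any one person's data was used. Larger ε allows the distributions to differ more.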


What is the DP guarantee?

Differential Privacy can be used to address John’s concerns


How Do We Add Randomness

Epsilon: privacy loss parameter

Captures deviation between opt-out and real world scenario

The effect of each individual’s information on the output of the analysis

A smaller value means more privacy (0 = opt-out scenario)
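
One common way to add the randomness is the Laplace mechanism, sketched below for a single count. The slides do not say which mechanism is meant, so treat this as an assumption made for illustration: the function names, the seed and the count of 120 are hypothetical, and plain floating-point sampling like this is not safe for production use.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so this scale gives epsilon-differential privacy.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
true_count = 120        # made-up block population
for epsilon in (0.1, 1.0, 10.0):
    releases = [private_count(true_count, epsilon, rng) for _ in range(10_000)]
    mean_abs_error = sum(abs(r - true_count) for r in releases) / len(releases)
    print(f"epsilon={epsilon:>4}: mean absolute error ~ {mean_abs_error:.2f}")
```

The mean absolute error is roughly 1/ε, so a tenfold smaller ε means roughly tenfold noisier counts. The opt-out scenario (ε = 0) corresponds to the limit of unbounded noise and cannot be passed to this function directly.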


Can Differentially Privatized Data be Used for Redistricting?

Andrew A. Beveridge, Queens College and Graduate Center CUNY and Social Explorer

Association of Public Data Users, Annual Conference, 2019. Key Bridge Marriott, Arlington VA


Redistricting Data: First Out and Most Important

• Redistricting data (PL94-171) will be released for every state by March 31, 2021

• Laws and court cases require population equality, which is measured based upon exact population counts

• Distribution of race and Hispanic status of adults and total population is released down to the roughly 11 million Census Blocks. About 6 million have some population

• Differential privacy was applied to the Providence End-to-End test

• The resulting data are unusable, since high levels of privacy were applied.


Newly Released Data Demonstrates Impact on Accuracy of Data for Simple but Relevant Tabulations

• Only a few example tabulations

• No direct measure yet of margins of inaccuracy or fuzziness

• Raise serious questions about the use of differential privacy for redistricting files

• May lead to serious legal challenges and jeopardy to redistricting plans that are based on “fuzzy data”

• May make enforcement of Voting Rights Act difficult

• No transparency regarding effects of various “privacy budgets” on accuracy


Implications

• Need to understand whether differentially private data will be accurate enough for redistricting

• Use of results from limited variables from 1940 not encouraging

• Need to understand level of error for various tabulations that are used in redistricting
  - Relation of tabulations to nesting and totals
  - Relations of categories to totals

• Serious possibility of legal intervention in contested redistricting cases.

• Privacy Budget Implications beyond PL94-171.
  - SF1 and SF2 (if there is going to be one) will be affected
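
Why a privacy budget spent on one product affects the others can be sketched with basic sequential composition: the ε values of separate releases from the same data add up, so a fixed total has to be shared. The example below is a simplified illustration, not the Census Bureau's actual algorithm or parameters. The total budget, the weights and the assumption of Laplace noise on each count are all hypothetical.

```python
def split_budget(total_epsilon, weights):
    """Divide a total privacy-loss budget across products by weight.

    Under basic sequential composition, the epsilons of separate releases
    on the same data add up, so the shares must sum to the total.
    """
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}


def expected_abs_error(epsilon, sensitivity=1.0):
    """Mean absolute error of Laplace noise with scale sensitivity/epsilon."""
    return sensitivity / epsilon


# Hypothetical total budget and weights, chosen only for illustration.
shares = split_budget(total_epsilon=1.0, weights={"PL94-171": 2, "SF1": 1, "SF2": 1})
for product, eps in shares.items():
    error = expected_abs_error(eps)
    print(f"{product}: epsilon={eps:.2f}, expected |error| per count ~ {error:.1f}")
```

Giving the redistricting file a larger share lowers its error but raises the error of every other product, which is the tradeoff behind the bullet on implications beyond PL94-171.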


Outline

• Context

• Framework

• Risk and Utility

• Three Approaches
  - Historical
  - Differential Privacy
  - Secure Remote Access

• Vision


Approach: Safe Data Strategy

● Safe People
○ Approved and trained researchers

● Safe Projects
○ Approved projects, consistent with agency mission

● Safe Settings
○ Secure environment, GovCloud, FedRAMP Moderate

● Safe Data
○ Deidentified data

● Safe Outputs
○ Disclosure reviews and export controls

= SAFE USE


How to build in this decade?


The ADRF approach

[Diagram: the ADRF connects data users, data producers and data stewards through the modules below]

• Security Module: FedRAMP security certified

• Data in cloud (alternative: local servers)

• Training Module

• Documentation Module: Explorer links metadata, codes, tools, publications

• Collaboration Module: interactive chat and code sharing

• Workspace and tools: data analysis, code, collaboration

• Stewardship Module: approval workflow, monitoring, reporting

• Usage feedback

• ADRF for INEXDA proposed by Julia Lane (New York University)


Data Security

• Approaches that are community led and that build value

• Frameworks that establish adoptable approaches for the secure handling of essential data

• Processes that are built on an overarching strategy


Also

• Homomorphic Encryption https://www.youtube.com/watch?v=vUtyuw7YLVM

• Testimony to Commission on Evidence-Based Policymaking https://www.cep.gov/

• Recent Gates Foundation funded workshop (in The ANNALS of the American Academy of Political and Social Science)

• http://policydatainfrastructure.com/author-contributions.html

• The modernization of statistical disclosure limitation at the U.S. Census Bureau; Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Leclerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd

• An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices; John M. Abowd and Ian M. Schmutte


Key ideas

• New types of data => enormous opportunity for public good

• Three issues for privacy research
  - Access is critical for measurement and policy
  - Understand utility, risk and tradeoff
  - Protect output

• Important and difficult agenda