Sharing Confidential Data in ICPSR

Sharing Confidential Data George Alter University of Michigan

Upload: australiannationaldataservice

Post on 08-Apr-2017

TRANSCRIPT

Page 1: Sharing Confidential Data in ICPSR

Sharing Confidential Data

George Alter, University of Michigan

Page 2: Sharing Confidential Data in ICPSR

Disclosure: Risk & Harm

• What do we promise when we conduct research about people?
  – That benefits (usually to society) outweigh risk of harm (usually to individual)
  – That we will protect confidentiality
• Why is confidentiality so important?
  – Because people may reveal information to us that could cause them harm.
  – Examples: criminal activity, antisocial activity, medical conditions...

Page 3: Sharing Confidential Data in ICPSR

Who are We Afraid of?

• Parents trying to find out if their child had an abortion or uses drugs
• Spouse seeking hidden income or infidelity in a divorce
• Insurance companies seeking to eliminate risky individuals
• Other criminals and nuisances
• NSA, CIA, FBI, KGB, SABOT, SBL, SMERSH, KAOS, etc...

Page 4: Sharing Confidential Data in ICPSR

What are We Afraid of...

• Direct Identifiers
  – Inadvertent release of unnecessary information (Name, phone number, SSN…)
  – Direct identifiers required for analysis (location, genetic characteristics,…)
• Indirect Identifiers
  – Characteristics that identify a subject when combined (sex, race, age, education, occupation)

Page 5: Sharing Confidential Data in ICPSR

Deductive Disclosure

• A combination of characteristics could allow an intruder to re-identify an individual in a survey “deductively,” even if direct identifiers are removed.
• Dependent on
  – Knowing someone in the survey
  – Matching cases to a database
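A minimal sketch of how deductive disclosure risk might be screened: count how many records share each combination of indirect identifiers and list the cases that are unique. The records, field names, and the helper `unique_cases` are hypothetical illustrations, not part of any ICPSR tool.

```python
from collections import Counter

# Hypothetical survey records with direct identifiers already removed.
records = [
    {"sex": "F", "age": 34, "education": "BA", "occupation": "nurse"},
    {"sex": "F", "age": 34, "education": "BA", "occupation": "nurse"},
    {"sex": "M", "age": 34, "education": "BA", "occupation": "nurse"},
    {"sex": "M", "age": 91, "education": "PhD", "occupation": "judge"},
]

# Assumed set of indirect identifiers; a real study would choose its own.
INDIRECT_IDENTIFIERS = ("sex", "age", "education", "occupation")


def unique_cases(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


risky = unique_cases(records, INDIRECT_IDENTIFIERS)
for row in risky:
    print("unique combination:", row)
```

A unique combination is not proof of re-identification; it only marks cases that an intruder who knows the person, or who holds a matching database, could single out.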

Page 6: Sharing Confidential Data in ICPSR

Deductive Disclosure

• Contextual data increases the risk of disclosure
  – Some attributes can be known by an outsider (age, race)
  – Individuals are more identifiable in smaller populations
• The more specific the geography, the more attention must be paid to disclosure risk.

Page 7: Sharing Confidential Data in ICPSR

Contextual data in social science research

Geographic context
• Neighborhood characteristics, economic conditions, health services, distance to resources, etc.

Institutional context
• School
• Hospital
• Prison

Page 8: Sharing Confidential Data in ICPSR

Current Survey Designs Increase the Risks of Disclosing Subjects’ Identities

• Geographically referenced data
• Longitudinal data
• Multi-level data:
  – Student, teacher, school, school district
  – Patient, clinic, community

Page 9: Sharing Confidential Data in ICPSR

Protecting Confidential Data

• Safe data: Modify the data to reduce the risk of re-identification

• Safe projects: Reviewing research designs

• Safe settings: Physical isolation and secure technologies

• Safe people: Training and Data use agreements

• Safe outputs: Results are reviewed before being released to researchers

Page 10: Sharing Confidential Data in ICPSR

Safe data

Disclosure risks can be reduced by:
• Multiple sites rather than single locations
• Keeping sampling locations secret
  – Releasing characteristics of contexts without providing locations
• Oversampling rare characteristics

Page 11: Sharing Confidential Data in ICPSR

Safe Data

Data masking
• Grouping values
• Top-coding
• Aggregating geographic areas
• Swapping values
• Suppressing unique cases
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
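Three of the masking techniques listed above (top-coding, grouping values, adding noise) can be sketched in a few lines. The function names, the cap of 85, the ten-year bands, and the noise scale are illustrative assumptions, not recommended settings.

```python
import random


def top_code(value, cap):
    """Top-coding: report every value above `cap` as `cap`."""
    return min(value, cap)


def group_age(age, width=10):
    """Grouping values: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def add_noise(value, scale, rng):
    """Adding noise: perturb a value with zero-mean Gaussian noise."""
    return value + rng.gauss(0, scale)


rng = random.Random(42)  # seeded only so the sketch is repeatable
ages = [12, 37, 51, 95]

masked = [
    {
        "age_top_coded": top_code(age, cap=85),
        "age_band": group_age(top_code(age, cap=85)),
        "age_noisy": add_noise(age, scale=2.0, rng=rng),
    }
    for age in ages
]

for row in masked:
    print(row)
```

Each technique trades analytic precision for lower re-identification risk, so the parameters would need to be chosen for the specific dataset.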

Page 12: Sharing Confidential Data in ICPSR

Safe Projects

• Research plans are reviewed before access is approved
• Levels of project review
  1. Does the research plan require confidential data?
  2. Would the research plan identify individual subjects?
  3. Is the research scientifically sound? Does it “serve the public good”?
     • Scientific review requires standards and expertise

Page 13: Sharing Confidential Data in ICPSR

Safe Settings

• Data protection plans
• Remote submission and execution
• Virtual data enclave
• Physical enclave

Page 14: Sharing Confidential Data in ICPSR

Safe Settings

Data Protection Plans should address risks:
• unauthorized use of account on computer
• computer break-in by exploiting vulnerability
• hijacking of computer by malware or botware
• interception of network traffic between computers
• loss of computer or media
• theft of computer or media
• eavesdropping of electronic output on computer screen
• unauthorized viewing of paper output

We often focus too much on technology and not enough on risk.

Page 15: Sharing Confidential Data in ICPSR

Improving Data Security Plans

• Problems
  – PIs lack technical expertise
  – Requirements are inconsistent and confusing
  – Monitoring compliance is expensive
• An alternative: Institution-level data security protocols
  – Tiered guidelines for different levels of risk
  – Focus on mitigating risks not specifying technologies
  – Certification of researchers
  – Institutional oversight

Page 16: Sharing Confidential Data in ICPSR

Safe Settings

• Remote submission and execution
  – User submits program code or scripts, which are executed in a controlled environment
• Virtual data enclave
  – Remote desktop technology prevents moving data to user’s local computer
  – Requires a data use agreement
• Physical enclave
  – Users must travel to the data

Page 17: Sharing Confidential Data in ICPSR

Virtual Data Enclave

Page 18: Sharing Confidential Data in ICPSR

The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.

Page 19: Sharing Confidential Data in ICPSR

Safe people

• Data use agreements
• Training

Page 20: Sharing Confidential Data in ICPSR

Safe people

• Parts of a data use agreement at ICPSR
  – Research plan
  – IRB approval
  – Data protection plan
  – Behavior rules
  – Security pledge
  – Institutional signature

Page 21: Sharing Confidential Data in ICPSR

[Diagram: data flow among Interview, Data producer, Data archive, Researcher, and Institution, with the labels Informed Consent, Data Dissemination Agreement, and Data Use Agreement (Research Plan, IRB Approval, Data Protection Plan).]

Page 22: Sharing Confidential Data in ICPSR

Data Use Agreement: Behavior rules

To avoid inadvertent disclosure of persons, families, households, neighborhoods, schools or health services by using the following guidelines in the release of statistics derived from the dataset.

1. In no table should all cases in any row or column be found in a single cell.
2. In no case should the total for a row or column of a cross-tabulation be fewer than ten.
3. In no case should a quantity figure be based on fewer than ten cases.
4. In no case should a quantity figure be published if one case contributes more than 60 percent of the amount.
5. In no case should data on an identifiable case, or any of the kinds of data listed in preceding items 1-3, be derivable through subtraction or other calculation from the combination of tables released.
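Rules 1 through 4 lend themselves to a simple automated screen, sketched below. The function names and the input shapes (a cross-tabulation as a list of rows of counts, a quantity figure as a list of per-case contributions) are assumptions made for illustration; rule 5, which concerns derivation across several released tables, is not covered.

```python
MIN_CASES = 10    # threshold from rules 2 and 3
MAX_SHARE = 0.60  # threshold from rule 4


def crosstab_violations(table):
    """Check rules 1 and 2 on a cross-tabulation of non-negative counts."""
    problems = []
    columns = [list(col) for col in zip(*table)]
    for kind, lines in (("row", table), ("column", columns)):
        for index, cells in enumerate(lines):
            total = sum(cells)
            if total > 0 and max(cells) == total:
                problems.append(f"{kind} {index}: all cases in a single cell (rule 1)")
            if total < MIN_CASES:
                problems.append(f"{kind} {index}: total {total} below {MIN_CASES} (rule 2)")
    return problems


def quantity_violations(contributions):
    """Check rules 3 and 4 on the per-case amounts behind one quantity figure."""
    problems = []
    if len(contributions) < MIN_CASES:
        problems.append(f"based on {len(contributions)} cases (rule 3)")
    total = sum(contributions)
    if total > 0 and max(contributions) > MAX_SHARE * total:
        problems.append("one case contributes more than 60 percent (rule 4)")
    return problems


print(crosstab_violations([[12, 0], [8, 15]]))
print(quantity_violations([100, 5, 5, 5, 5, 5, 5, 5, 5, 5]))
```

A screen like this only flags candidates for review; it does not replace a reviewer's judgment about what a combination of tables could reveal.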

Page 23: Sharing Confidential Data in ICPSR

Data Use Agreement

The Recipient Institution will treat allegations, by NAHDAP/ICPSR or other parties, of violations of this agreement as allegations of violations of its policies and procedures on scientific integrity and misconduct. If the allegations are confirmed, the Recipient Institution will treat the violations as it would violations of the explicit terms of its policies on scientific integrity and misconduct.

Page 24: Sharing Confidential Data in ICPSR

Problems with DUAs

• DUAs are issued by project.
  – Every PI gets a new DUA, even if the Institution has already signed the DUA for someone else
• Language and conditions in DUAs are not standard
  – Frequent negotiations and lawyering

Page 25: Sharing Confidential Data in ICPSR

Reducing the costs of DUAs

• Institution-wide agreements
  – One agreement per institution, not per project
  – A designated “data steward” adds qualified researchers to the agreement
  – Example: Databrary Agreement
    • Covers informed consent, data sharing, data use
• Researcher certification covering multiple datasets

Page 26: Sharing Confidential Data in ICPSR

Safe People: Disclosure risk online tutorial

Disclosure: Graph with extreme values example

Data were collected for a sample of 104 people in a county. Among the variables collected were age, gender, and whether the person was arrested within the last year. Box plots show the distribution of age, one plot for those arrested and one for those who were not. The number labels are case numbers in the dataset. The potential identifiability represented by outlying values is compounded here by an unusual combination that could probably be identified using public records for a county in the U.S.: someone approximately 90 years old in the sample was arrested. Including extreme values is a disclosure risk when they are combined with other variables in the dataset.

[Figure: box plots of age by “Arrested in last year?” (no / yes), with outlying cases labeled by case number.]

N: 104
min age: 12
max age: 95
mean age: 51
std dev: 15
% female: 5.2
% arrested: 5.8

Page 27: Sharing Confidential Data in ICPSR

Safe outputs

• Controlled environments allow review of outputs
  o Remote submission and execution
  o Virtual data enclaves
  o Physical enclaves
• Disclosure checks may be automated, but manual review is usually necessary

Page 28: Sharing Confidential Data in ICPSR

Weighing Costs and Benefits

• Data protection has costs
  – Modifying data affects analysis
  – Access restrictions impose burdens on researchers
• Protection measures should be proportional to risks
  – Probability that an individual can be (re-)identified
  – Severity of harm resulting from re-identification

Page 29: Sharing Confidential Data in ICPSR

Gradient of Risk & Restriction

[Figure: access tiers plotted against two axes, Severity of Harm and Probability of Disclosure.]

• Tiny Risk: Web Access
  – Simple Data: minimal harm & very low chance of disclosure
• Some Risk: Data Use Agreement
  – Complex Data: low harm & low probability of disclosure
• Moderate Risk: Strong DUA & Technology Rules
  – Complex data: moderate harm & re-identifiable with difficulty
• High Risk: Enclosed Data Center
  – High severity of harm & highly identifiable

Page 30: Sharing Confidential Data in ICPSR

Thank you

George Alter
University of Michigan
[email protected]

Page 31: Sharing Confidential Data in ICPSR

Secure Multi-Party Computing

What if databases could send data to a trusted third party, who would compute statistics?

[Diagram: Database 1 and Database 2, with encryption.]

MPC does this without the third party.

Page 32: Sharing Confidential Data in ICPSR

Average Income

Three people have true salaries S1, S2, S3, which they never reveal. Each computes random numbers Rij (sent from i to j). Each reports their salary plus their own random numbers minus those received, i.e.,

  X1 = S1 + (R12 + R13) - (R21 + R31)
  X2 = S2 + (R21 + R23) - (R12 + R32)
+ X3 = S3 + (R31 + R32) - (R13 + R23)
   Σ = S1 + S2 + S3

Every Rij is added once and subtracted once, so the random numbers cancel in the sum and the average income is Σ / 3.

Example from Daniel Goroff, Alfred P. Sloan Foundation
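The salary-averaging scheme above can be sketched as a toy simulation. The function name `masked_reports`, the salary figures, and the random-number bound are hypothetical; this is an illustration of why the masks cancel, not a secure implementation (real protocols typically work in modular arithmetic and over private channels).

```python
import random


def masked_reports(salaries, rng, bound=10**6):
    """Return each party's report X_i = S_i + (sent masks) - (received masks)."""
    n = len(salaries)
    # r[i][j] is the random number party i sends to party j; no self-sends.
    r = [[rng.randrange(bound) if i != j else 0 for j in range(n)] for i in range(n)]
    reports = []
    for i in range(n):
        sent = sum(r[i][j] for j in range(n))
        received = sum(r[j][i] for j in range(n))
        reports.append(salaries[i] + sent - received)
    return reports


salaries = [52000, 61000, 47000]  # hypothetical true salaries, never shared
reports = masked_reports(salaries, random.Random(0))

# Individual reports look random, but their sum equals the sum of salaries.
average = sum(reports) / len(reports)
print("reports:", reports)
print("average income:", average)
```

Only the reports X1, X2, X3 are pooled, so the average can be computed while each individual salary stays hidden behind its random masks.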

Homomorphic Encryption