sharing health research data
Post on 21-Jun-2015
SHARING HEALTH RESEARCH DATA
De-identification METHODS & EXPERIENCES
Dr. Khaled El Emam, Electronic Health Information Laboratory
Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
Motivations for De-identification
• Obtaining patient consent/authorization – not practical for large databases and introduces bias
• Compliance with regulations / legislation
• Contractual obligations
• Maintain public / consumer / client trust
• Costs of breach notification
A Balance
Definition of De-identified Data
Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
Re-identification Attacks
• Just to clear this issue up at the beginning
• There are some claims that health data is easy to re-identify
• Often examples are used to support that argument
• The evidence does not support these claims
– When data are de-identified properly the probability of a successful re-identification attack is very small
• Let’s consider a few highly publicized examples
AOL
• AOL releases search queries, replacing usernames with pseudonyms
• New York Times reporters re-identify one user, number 4417749
• Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”
• Thelma Arnold, a widow living in Lilburn, Ga.; she has three dogs
AOL ?
• It is well known that a large percentage of individuals run ‘vanity’ searches that include their names – Thelma Arnold did
• It is also known that location information can be determined from an individual’s search queries
• Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified
Weld
• Governor Weld of Massachusetts was unwell during a public appearance – the story was covered in the media
• Semi-publicly available insurance claims data were matched with voter registration lists
• It was possible to determine which claims records belonged to the Governor
Weld ?
• This re-identification attack was done before HIPAA came into effect – the insurance claims data would not pass any of the HIPAA de-identification standards
• A recent analysis indicated that Weld was likely re-identified because he was a famous person and there was already a lot of information about him in the media (his admission date, his diagnosis, his discharge date) – the voter registration list was arguably not necessary
• The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete
Netflix
• Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm
• Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)
• This resulted in the cancellation of a second competition, and litigation was started against Netflix for exposing personal information
Netflix ?
• The re-identifications were not actually verified by Netflix
• The authors of the attack admit that the Netflix data was not de-identified (usernames were only replaced with pseudonyms)
• The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071
Attribute vs Identity Disclosure
• Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
• Identity disclosure: determine which record in the database belongs to a particular individual (for example, determine that record number 7 belongs to Bob Smith – that is identity disclosure)
• HIPAA only cares about identity disclosure
Attribute vs Identity Disclosure
            HPV Vaccinated   NOT HPV Vaccinated
Religion A  5                40
Religion B  40               5
Statistically significant relationship (chi-square, p<0.05)
High risk of attribute disclosure
Attribute vs Identity Disclosure
After suppression:
            HPV Vaccinated   NOT HPV Vaccinated
Religion A  5                6
Religion B  6                5
Not statistically significant relationship (chi-square)
Low risk of attribute disclosure
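The two tables above can be checked with a short calculation. This is a minimal sketch, assuming a Pearson chi-square test without continuity correction (the slides do not say which variant was used); the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    Assumption: no continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

before = [[5, 40], [40, 5]]  # Religion A / B by vaccinated / not vaccinated
after = [[5, 6], [6, 5]]     # the same table after suppression

for label, table in (("before", before), ("after", after)):
    statistic, p_value = chi_square_2x2(table)
    print(f"{label}: chi-square = {statistic:.2f}, significant at 0.05: {p_value < 0.05}")
```

Under these assumptions the first table gives a statistic of about 54.4 and the second about 0.18, against a 5% critical value of 3.84 for one degree of freedom.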
Stigmatizing Analytics
Definition of De-identified Data
Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual
Direct Identifiers
• Fields that would uniquely identify individuals in a database
• Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
Dealing with Direct Identifiers
• Defensible approaches:
– Remove those fields
– Convert them to one-time or persistent pseudonyms
– Randomize the values
• These approaches will ensure, if done properly, that the probability of recovering the original value is very small
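The difference between persistent and one-time pseudonyms can be sketched as follows. The slides do not name a technique, so the keyed hash (HMAC-SHA256), the key handling, and the function names here are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

def persistent_pseudonym(identifier: str, key: bytes) -> str:
    """Same identifier and key always give the same pseudonym (allows record linkage)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def one_time_pseudonym() -> str:
    """Random value with no relationship to the original identifier."""
    return secrets.token_hex(8)

# Hypothetical key; in practice it would be generated once and stored securely.
key = secrets.token_bytes(32)

mrn = "MRN-0012345"  # made-up medical record number
print(persistent_pseudonym(mrn, key))  # stable for this key
print(persistent_pseudonym(mrn, key))  # same value again
print(one_time_pseudonym())            # different on every call
```

A keyed hash is used rather than a plain hash because, without the key, anyone who can enumerate candidate identifiers could recompute the pseudonyms.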
Quasi-Identifiers
• sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
Re-identification Risk Measurement
• Risk measurement will depend on:
– Granularity of quasi-identifiers
– Region of the country we are talking about
– Risk metric used (e.g., uniqueness or groups of 5)
– Threshold for what is acceptable risk
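How the metric and the granularity of the quasi-identifiers change the measured risk can be illustrated with a small sketch. It assumes that records sharing the same quasi-identifier values form a group (an equivalence class) and that a record counts as at risk when its group has fewer than k members; the function names and the sample records are made up.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def proportion_in_small_classes(records, quasi_identifiers, k):
    """Fraction of records whose equivalence class has fewer than k members."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    at_risk = sum(size for size in sizes.values() if size < k)
    return at_risk / len(records)

# Made-up records for illustration only.
records = [
    {"sex": "M", "year_of_birth": 1960, "region": "K1H"},
    {"sex": "M", "year_of_birth": 1960, "region": "K1H"},
    {"sex": "F", "year_of_birth": 1962, "region": "K1H"},
    {"sex": "F", "year_of_birth": 1975, "region": "K2P"},
]

fine = ["sex", "year_of_birth", "region"]
coarse = ["sex"]
print(proportion_in_small_classes(records, fine, k=2))    # uniqueness metric
print(proportion_in_small_classes(records, coarse, k=2))  # coarser quasi-identifiers
print(proportion_in_small_classes(records, fine, k=5))    # "groups of 5" metric
```

With k=2 the function reports the share of unique records; with k=5 it reports the share of records in groups smaller than five, so the same data can look acceptable under one metric and not under another.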
De-identification Standards
• The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):
– Safe Harbor
– Statistical method (also known as the expert statistician method)
HIPAA Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
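Items 2 and 3 of the list are the two that generalize a field rather than remove it. The sketch below shows only those two transformations on a made-up record; the field names are hypothetical, the remaining items would still have to be removed, and the regulation attaches further conditions that the slide does not list and that are not modeled here.

```python
from datetime import date

def safe_harbor_generalize(record):
    """Apply the two generalization rules from the list: ZIP3 and year-only dates.

    Assumption: this covers only items 2 and 3; it is not a full Safe Harbor check.
    """
    generalized = dict(record)
    generalized["zip"] = record["zip"][:3]
    generalized["admission_date"] = record["admission_date"].year
    return generalized

# Made-up record; field names are hypothetical.
record = {"zip": "11201", "admission_date": date(2007, 3, 14), "gender": "M"}
print(safe_harbor_generalize(record))
```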
Two Problems with Safe Harbor
• May be removing too much information on the ZIP Code and date fields – these fields are useful for many analytical purposes
• Does not provide adequate protection – it is easy to have a Safe Harbor compliant data set with a high risk of re-identification
High Risk Safe Harbor Data - I
• If the adversary knows that Bob, a 55-year-old male, is in the database, only one record below matches him
Gender  Age  ZIP  Lab Test
M       55   112  Albumin, Serum
F       53   114  Alkaline Phosphatase
M       24   134  Creatine Kinase
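The matching step an adversary would perform on the table above is a simple filter. This is a minimal sketch using the three records from the slide; the dictionary keys and the function name `matching_records` are hypothetical.

```python
# The three records shown on the slide (ZIP already truncated to three digits).
records = [
    {"gender": "M", "age": 55, "zip3": "112", "lab_test": "Albumin, Serum"},
    {"gender": "F", "age": 53, "zip3": "114", "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "age": 24, "zip3": "134", "lab_test": "Creatine Kinase"},
]

def matching_records(records, **known):
    """Return records consistent with what the adversary knows about the target."""
    return [r for r in records if all(r[field] == value for field, value in known.items())]

# Background knowledge about Bob: 55-year-old male.
candidates = matching_records(records, gender="M", age=55)
print(len(candidates))            # number of records that could be Bob's
print(candidates[0]["lab_test"])  # what is learned if there is exactly one
```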
High Risk Safe Harbor Data - II
• 2.24m visits, 1.6m patients, NY discharge data for 2007
• Compliant with Safe Harbor
Fields                                       % of patients unique
age, gender, ZIP3                            2.54%
age, gender, ZIP3, LOS (length of stay)      21.49%
Statistical Method Conditions
• A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
II. Documents the methods and results of the analysis that justify such determination
Re-identification Risk Spectrum
Overall Risk
Managing Re-identification Risk
Different Types of Data Releases
• The same data set can be disclosed with different thresholds:
– Public data set
– Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions to all sub-contractors
– The more conditions the higher quality the data set
Example – CA Hospital Discharges
• Context: data release to a researcher who will sign a data use agreement, good practices for managing sensitive health information
• There were ~2.1m patients who had ~3m visits
• Risk threshold = 0.2; use average risk across all patients
• Variables:
– Year of birth
– Gender
– Year of admission
– Days since last visit
– Length of stay
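The comparison of average risk against the 0.2 threshold can be sketched as below. The slides do not define the per-record risk, so this assumes the common definition of 1 divided by the size of the record's equivalence class on the listed variables; the field names, the sample rows, and the use of "less than or equal" for the threshold test are also assumptions.

```python
from collections import Counter

def average_risk(records, quasi_identifiers):
    """Mean over records of 1 / (size of the record's equivalence class)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1 / sizes[key] for key in keys) / len(records)

QUASI_IDENTIFIERS = ["year_of_birth", "gender", "year_of_admission",
                     "days_since_last_visit", "length_of_stay"]
THRESHOLD = 0.2  # from the example

# Made-up rows: 19 identical records plus one outlier.
rows = (
    [{"year_of_birth": 1950, "gender": "F", "year_of_admission": 2007,
      "days_since_last_visit": 0, "length_of_stay": 2}] * 19
    + [{"year_of_birth": 1931, "gender": "M", "year_of_admission": 2007,
        "days_since_last_visit": 40, "length_of_stay": 17}]
)

risk = average_risk(rows, QUASI_IDENTIFIERS)
print(f"average risk = {risk:.2f}, acceptable: {risk <= THRESHOLD}")
```

Under this definition the average risk equals the number of distinct equivalence classes divided by the number of records, which is why two classes over twenty rows give 0.1.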
Risk Level
Hierarchy
De-identified Data
Key Practical Considerations
• Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data
• Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data
• Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data
• Transparency: important to ensure that methods have received peer and regulator scrutiny
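The automation point, detecting outliers and suppressing selectively, might look like the following in its simplest form. This is only a sketch under assumptions: outliers are taken to be records in equivalence classes smaller than k, suppression is modeled as replacing the quasi-identifier values with `None`, and the function name and sample rows are made up.

```python
from collections import Counter

def suppress_small_classes(records, quasi_identifiers, k):
    """Blank the quasi-identifiers of records in equivalence classes smaller than k."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    result = []
    for record, key in zip(records, keys):
        cleaned = dict(record)  # copy so the input is left unchanged
        if sizes[key] < k:
            for q in quasi_identifiers:
                cleaned[q] = None  # suppressed value
        result.append(cleaned)
    return result

rows = [
    {"gender": "F", "length_of_stay": 2, "diagnosis": "A"},
    {"gender": "F", "length_of_stay": 2, "diagnosis": "B"},
    {"gender": "M", "length_of_stay": 97, "diagnosis": "C"},  # outlier
]
for row in suppress_small_classes(rows, ["gender", "length_of_stay"], k=2):
    print(row)
```

Only the outlier's quasi-identifiers are blanked, so the other records keep their full detail; a real tool would also need to check that the suppressed records do not themselves form a small, recognizable group.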
Contact
kelemam@ehealthinformation.ca
@kelemam
www.ehealthinformation.ca