  • Panel: Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me?

    Hercules Dalianis Clinical Text Mining Group Department of Computer and Systems Sciences (DSV) [email protected]

  • Background

    •  Starting 2007
    •  Karolinska University Hospital, Stockholm
    •  Greater Stockholm (City Council): 2 million inhabitants
    •  1800 beds/inpatients
    •  550 clinical units

    Hercules Dalianis, MEDINFO 2013 2

  • TakeCare EPR system

    •  Swedish electronic patient record system, now owned by CompuGroup Medical

    •  Centralized, text-file based
    •  Built on the APL programming language
    •  Data transferred to a MySQL database to make it manageable (Intelligence)

  • Ethical permission

    •  What type of research will be carried out
    •  How it will be carried out
    •  No social security numbers
    •  No personal names
    •  Safeguarding of data

  • Encryption and safeguards

    •  Encrypted server
    •  Password protected
    •  Locked in an alarmed room
    •  Server locked to a rack
    •  No Internet connection
    •  Few people have access to this server, and they have to sign a security agreement

    => Probably safer than at the hospital

  • Trust, Trust and more Trust

    •  Good contacts with hospital management
    •  They decide for the whole hospital/all clinical units
    •  No psychiatric or venereal diseases, no undocumented refugees

  • Stockholm EPR Corpus

    •  We obtained 1 million patient records from 550 clinical units from the years 2006-2010
    •  Delivered in several extracts, which are still continuing
    •  Each patient has a unique social security number, from birth to death; it was replaced by a serial number
    •  All patient names removed
    •  The rest, including sensitive text, is present

  • DEID work

    •  Yes, we did it, partly to obtain an overview of the problems that may occur

    •  We followed HIPAA*) but adapted it to Swedish conditions

    *) Health Insurance Portability and Accountability Act

  • The Stockholm EPR PHI*) corpus

    •  100 electronic patient records (EPRs) in Swedish
    •  Five clinics: Neurology, Orthopaedics, Infection, Dental Surgery and Nutrition
    •  20 patients from each clinic, 50% men, 50% women
    •  380 000 tokens
    •  Three annotators annotated the whole corpus

    *) Protected Health Information

  • 28 PHI-classes

    •  Account_Number, Age, Age_Over_89, Biometric_Identifier, Date_Part, Full_Date, Year, First_Name, Last_Name, Patient_First_Name, Patient_Last_Name, Relative_First_Name, Relative_Last_Name, Clinician_First_Name, Clinician_Last_Name, Location, Country, Municipality, Organization, Street_Address, Town, Health_Care_Unit, Device_Identifier_and_Serial_Number, Ethnicity, Fax_Number, Phone_Number, Relation, Uncertain

  • Consensus eight annotation classes

    •  Age
    •  Date_Part
    •  Full_Date
    •  First_Name
    •  Last_Name
    •  Health_Care_Unit
    •  Location
    •  Phone_Number

  • Annotation classes and instances

    •  Age: 56
    •  Full date: 710
    •  Date part: 500
    •  First name: 923
    •  Last name: 928
    •  Location: 1 021
    •  Health care unit: 148
    •  Phone number: 135

    Sum: 4 421

  • 380 000 tokens, 4 421 sensitive instances: ~ 1 percent sensitive information
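    The "~ 1 percent" figure can be checked directly from the class counts and the token count given on the slides. The sketch below is only a rough check: it treats each annotated instance as a single token, which is an assumption, so instances spanning several tokens would make the true token share somewhat higher.

```python
# Counts per annotation class, as listed on the slide.
instances = {
    "Age": 56,
    "Full date": 710,
    "Date part": 500,
    "First name": 923,
    "Last name": 928,
    "Location": 1021,
    "Health care unit": 148,
    "Phone number": 135,
}
tokens = 380_000

total_instances = sum(instances.values())

# Assumption: one token per instance (a lower bound on the sensitive share).
share = total_instances / tokens

print(total_instances)
print(f"{share:.2%}")
```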

  • Eight annotation classes: training and testing using Stanford NER-CRF

  • Conditional Random Fields à la Stanford NER

    •  0.95-0.74 precision
    •  0.83-0.36 recall
    •  0.90-0.49 F-score
    •  The 8 annotation classes and the words
    •  The rest is a black box
       –  Window breadth
       –  Distance between words, etc.

  • Research on Stockholm EPR Corpus

    •  DEID and resynthesis
    •  Factuality level detection of diagnoses
    •  Negation detection
    •  Detecting the number of hospital-acquired infections (HAI)
    •  Detection of adverse drug events
    •  Comorbidities

  • Conclusion

    •  Preferable to work on the original data
    •  Too costly and difficult to de-identify data
    •  Not safe enough
    •  De-identification makes the data too noisy

  • References

    •  Velupillai, S., H. Dalianis, M. Hassel and G. H. Nilsson. 2009. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. International Journal of Medical Informatics (2009), doi:10.1016/j.ijmedinf.2009.04.005

    •  Dalianis, H. and S. Velupillai. 2010. De-identifying Swedish Clinical Text - Refinement of a Gold Standard and Experiments with Conditional Random Fields. Journal of Biomedical Semantics 2010, 1:6 (12 April 2010)

    •  Alfalahi, A., S. Brissman and H. Dalianis. 2012. Pseudonymisation of person names and other PHIs in an annotated clinical Swedish corpus. In the Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), held in conjunction with LREC 2012, May 26, Istanbul, pp 49-54

  • Comorbidities in Comorbidity-view

    •  Which ICD-10 codes co-occur with which other ones

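    The co-occurrence question above can be sketched as counting, per patient record, every unordered pair of distinct ICD-10 codes. This is only an illustration and not the Comorbidity View implementation: the function name and the sample records are hypothetical (the codes I110 and I509 are borrowed from the example record in this talk, J189 is added for illustration).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often each unordered pair of ICD-10 codes shares a record."""
    counts = Counter()
    for codes in records:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical records: one list of ICD-10 codes per patient.
records = [
    ["I110", "I509"],
    ["I509", "I110", "J189"],
    ["I509"],
]

counts = cooccurrence_counts(records)
for pair, n in counts.most_common():
    print(pair, n)
```

    Sorting the codes before pairing keeps (I110, I509) and (I509, I110) in the same bucket, which is what a symmetric comorbidity count needs.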

  • Comorbidity View

  • Example record (Anonymized manually)

    123 H - IVA 322916614D 2008-08-21 9:12 1944 Kvinna Anamnesis Kvinna med hjärtsvikt, förmaksflimmer, angina pectoris. Ensamstående änka. Tidigare CVL med sequelae högersidig hemipares och afasi. Tidigare vårdad för krampanfall misstänkt apoplektisk. Inkommer nu efter att ha blivit hittad på en stol och sannolikt suttit så över natten. Inkommer nu för utredning. Sonen Johan är med.

    123 H - IVA 322916614D 2008-08-21 10:54 1944 Kvinna Bedömning Grav hjärtsvikt efter hjärtinfarkt x 2 inklusive episod med asystoli och HLR. EF 20-25%. Neurologisk påverkan med högersidig svaghet. Blodprov. Odlingar tas i blod och urin. Remiss skickas pulm-rtg enl dr Svenssons anteckning. Atelektaser. Pneumoni, I110. Hjärtinsufficiens, ospecificerad, I509

  • Example record (English translation)

    123 H - IVA 322916614D 2008-08-21 9:12 1944 Woman Anamnesis

    Woman with heart failure, atrial fibrillation, and angina pectoris. Single widow. Previous CVL with sequelae: right hemiparesis and aphasia. Prior hospital care for seizures, suspected to be apoplectic. Arrives at the hospital after being found in a chair, having probably sat there overnight. Arrives for further investigation and care. Accompanied by her son Johan.

    123 H - IVA 322916614D 2008-08-21 10:54 1944 Woman Assessment/Plan

    Severe heart failure after heart infarction x 2, including an episode of cardiac arrest (asystole) and CPR. Ejection fraction (EF) 20-25%. Neurological symptoms with right-sided weakness. Blood samples. Cultures taken from blood and urine. Referral sent for pulmonary x-ray according to dr Svensson’s notes. Atelectases. Pneumonia, I110. Heart failure, unspecified, I509.

  • Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me?

    Stéphane M. Meystre, Biomedical Informatics, University of Utah, USA
    Hercules Dalianis, Computer and Systems Sciences, Stockholm University, Sweden
    Pierre Zweigenbaum, ILES, LIMSI-CNRS, France

    Medinfo 2013, Copenhagen, August 23, 2013

  • De-identification

    Privacy and confidentiality of clinical data

    In the U.S., the HIPAA (Health Insurance Portability and Accountability Act) protects the confidentiality of patient data. The Common Rule protects the confidentiality of research subjects. These laws typically require the informed consent of the patient and approval of the IRB to use data for research purposes, but these requirements are waived if data are de-identified.

    De-identification means that explicit identifiers are hidden or removed. The term is often used interchangeably with anonymization, but the latter implies that the data cannot be linked to identify the patient (i.e., de-identified is often far from anonymous). Scrubbing is also sometimes used as a synonym of de-identification.

  • De-identification (cont.)

    According to the HIPAA, the Safe Harbor Methodology requires the following PHI to be removed:

    1. Names
    2. All geo-subdivisions smaller than a State
    3. All elements of dates (except year)
    4. Phone numbers
    5. Fax numbers
    6. Electronic mail addresses
    7. Social Security numbers
    8. Medical record numbers
    9. Health plan beneficiary numbers
    10. Account numbers
    11. Certificate/license numbers
    12. Vehicle identifiers and serial numbers
    13. Device identifiers and serial numbers
    14. Web Universal Resource Locators
    15. Internet Protocol address numbers
    16. Biometric identifiers, including finger and voice prints
    17. Full face photographic images and any comparable images
    18. Any other unique identifying number, characteristic, or code
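    As an illustration of item 3 (all elements of dates except the year), the sketch below reduces ISO-style dates to their year. It is a minimal example under stated assumptions: the pattern only covers the `YYYY-MM-DD` format seen in the example record of this talk, and the function name is hypothetical. Real clinical text contains many more date formats, and Safe Harbor has further conditions that are not modelled here.

```python
import re

# Assumption: dates are written as YYYY-MM-DD, as in the example record.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def keep_year_only(text):
    """Replace each ISO date with its year, dropping month and day."""
    return ISO_DATE.sub(lambda match: match.group(1), text)


print(keep_year_only("Admitted 2008-08-21 10:54"))
```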

  • De-identification (cont.)

    Manual text de-identification is a lengthy and costly process (about 90 s per document).

    NLP can be used to automatically de-identify electronic clinical documents.

    Several NLP-based applications have been developed for clinical text de-identification, but:

    •  they are developed for one or a few clinical note types,
    •  in a specific institution or specialty,
    •  to detect and remove/hide certain categories of PHI only...

    Overall, their generalizability is a problem, but one that can be addressed.

  • Presenters

    Hercules Dalianis, PhD
    Professor in Computer and Systems Sciences at Stockholm University, Sweden.
    De-identifying Swedish health records

    Pierre Zweigenbaum, PhD
    Director of Research at the CNRS, in the LIMSI, Orsay, France.
    De-identification of French clinical records

    Stéphane Meystre, MD, PhD
    Assistant Professor in Biomedical Informatics at the University of Utah, USA.
    De-identification of clinical documents at the U.S. VHA, and issues related to de-identification (impact, risk of re-identification)

  • Automatic VHA Clinical Text De-Identification

    Stéphane M. Meystre, Biomedical Informatics, University of Utah

    Medinfo 2013, Copenhagen, August 23, 2013

  • VA clinical data de-identification

    VA Center for Healthcare Informatics Research (CHIR) de-identification project:

    National project to advance the methodology for automated de-identification of patient data with a systematic approach of evaluating existing de-identification systems, exploring innovative methods and techniques for de-identification, and combining the best-performing ones in a best-of-breed application.

    Also includes the evaluation of the level of anonymity of de-identified clinical notes, and the impact of text de-identification on subsequent uses of the clinical notes.

  • Existing data de-identification evaluation

    Literature review of related publications:

    •  Large variety of PHI categories detected
    •  Large variety of methods used

  • Existing data de-identification evaluation: "Out-of-the-box" evaluation

    Text de-identification systems

    - Rule-based systems:
      •  HMS Scrubber (Beckwith et al., 2006);
      •  MeDS (Friedlin and McDonald, 2008); and
      •  MIT deid system (Neamatullah et al., 2008).

    - Machine learning-based systems:
      •  MITRE Identification Scrubber Toolkit (MIST) (Aberdeen et al., 2010)
      •  Health Information DE-identification (HIDE) system (Gardner and Xiong, 2009).

    Traditional NER system

      •  Stanford NER system (Finkel et al., 2005)

  • Existing data de-identification evaluation: "Out-of-the-box" evaluation (cont.)

    Training:

    - Rule-based systems run "out-of-the-box"
    - Machine learning-based systems trained with a separate corpus of 225 randomly selected VHA clinical documents, manually annotated for PHI (names).
    - Stanford NER system run with the trained models available with its distribution.

    Testing with a corpus of 50 randomly selected VHA clinical documents, manually annotated for PHI (names).

  • Existing data de-identification evaluation: "Out-of-the-box" evaluation results

    System         Precision   Recall   F2-measure
    HMS Scrubber   0.150       0.675    0.397
    MeDS           0.149       0.768    0.419
    MIT deid       0.636       0.893    0.826
    MIST           0.865       0.319    0.356
    HIDE           0.975       0.376    0.429
    Stanford NER   0.692       0.723    0.716
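    The F2-measure weights recall more heavily than precision, which fits de-identification, where a missed identifier is costlier than a spurious one. The sketch below uses the standard F-beta definition, F = (1 + beta^2) * P * R / (beta^2 * P + R); the function name is ours. Recomputing from the rounded precision and recall values reproduces most rows of the "out-of-the-box" results to within rounding; the MIST row comes out at about 0.365 rather than the 0.356 shown, which may reflect rounding of the inputs or a transcription slip on the slide.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favours recall over precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# (precision, recall) pairs as reported for the out-of-the-box evaluation.
systems = {
    "HMS Scrubber": (0.150, 0.675),
    "MeDS": (0.149, 0.768),
    "MIT deid": (0.636, 0.893),
    "MIST": (0.865, 0.319),
    "HIDE": (0.975, 0.376),
    "Stanford NER": (0.692, 0.723),
}

for name, (precision, recall) in systems.items():
    print(f"{name:13s} F2 = {f_beta(precision, recall, beta=2):.3f}")
```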

  • Our "best-of-breed" approach (BoB)

    •  Pre-processing
    •  High-sensitivity extraction component
    •  False positives filtering component

    Ferrandez O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. JAMIA. 2012 Sep 4.

  • Our "best-of-breed" approach: NLP pre-processing

    •  Sentence segmentation (adapted from OpenNLP, with models retrained on VHA clinical text)
    •  Tokenization
    •  Part-of-speech tagging (adapted from OpenNLP and cTAKES trained models)
    •  Phrase chunking (adapted from OpenNLP and cTAKES trained models)
    •  LVG normalization (NLM development)

  • Our "best-of-breed" approach: High-sensitivity extraction component

    Mostly based on rules (context keywords and regex patterns) and dictionary lookups (Lucene with common English words and frequently occurring names from the 1990 U.S. Census).

    Dependent on the quality of the patterns and on dictionary completeness: PHI formats and instances that are not supported will be missed!

    Add machine learning based on sequence labeling (traditional NER tasks): Stanford CoreNLP library (CRF) trained to recognize person names (using our VHA training corpus).

    The goal is to maximize recall, even at the cost of precision.
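    The rule and dictionary side of such a high-sensitivity extractor can be sketched as follows. This is not BoB's code: the tiny name dictionary, the patterns, the context rule and all function names are hypothetical, and a plain Python set stands in for the Lucene lookup. The date pattern includes the VHA-style `09/09/09@1200` format mentioned in this talk.

```python
import re

# Hypothetical mini-dictionary standing in for the census name lists.
NAME_DICTIONARY = {"johan", "svensson", "kelly"}

# Hypothetical regex rules, one per PHI category.
PATTERNS = {
    "Date": re.compile(
        r"\b\d{4}-\d{2}-\d{2}\b|\b\d{2}/\d{2}/\d{2}(?:@\d{4})?\b"
    ),
}

# Context keyword rule: a capitalized word right after "dr" is a name candidate.
CONTEXT_NAME = re.compile(r"\b[Dd]r\.?\s+([A-Z][a-z]+)")
WORD = re.compile(r"[A-Za-z]+")


def extract_candidates(text):
    """Return sorted (start, end, category, method) candidate annotations."""
    candidates = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            candidates.add((match.start(), match.end(), label, "regex"))
    for match in CONTEXT_NAME.finditer(text):
        candidates.add((match.start(1), match.end(1), "Person Name", "context"))
    for match in WORD.finditer(text):
        if match.group().lower() in NAME_DICTIONARY:
            candidates.add((match.start(), match.end(), "Person Name", "dictionary"))
    return sorted(candidates)


text = "Seen by dr Svensson on 09/09/09@1200. Kelly clamp used."
candidates = extract_candidates(text)
for start, end, label, method in candidates:
    print(text[start:end], label, method)
```

    In this example the dictionary lookup also flags "Kelly" in "Kelly clamp", a device eponym rather than a person. That is the kind of false positive a recall-oriented stage accepts and leaves to a later filtering step.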

  • Our "best-of-breed" approach: False positives filtering component

    Based on machine learning classifiers:

    •  Classifies candidate annotations as true or false positives
    •  Support Vector Machine classifier (LIBSVM, RBF kernel) with various features (lexical, morphological, syntactic, and the method used to detect the PHI (name) candidate)

    Training model (person names):

    •  Positive training examples: correct annotations derived from the high-sensitivity extraction component
    •  Negative training examples: incorrect annotations derived from the high-sensitivity extraction component
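    How the positive and negative training examples might be assembled can be sketched as below. This only illustrates the labelling step, not the SVM itself: the feature names, the exact-span matching criterion and the sample text are assumptions, and syntactic features are omitted. The resulting (features, label) pairs would then be vectorized and passed to a classifier such as an RBF-kernel SVM.

```python
def candidate_features(text, start, end, method):
    """A few simple lexical and morphological features for one candidate."""
    token = text[start:end]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "suffix3": token[-3:].lower(),
        "length": len(token),
        "method": method,  # which extraction method proposed the candidate
    }


def build_training_examples(text, candidates, reference_spans):
    """Label each candidate 1 if it matches a reference span exactly, else 0."""
    reference = set(reference_spans)
    examples = []
    for start, end, method in candidates:
        label = 1 if (start, end) in reference else 0
        examples.append((candidate_features(text, start, end, method), label))
    return examples


text = "Seen by dr Svensson. Kelly clamp used."
name_start = text.index("Svensson")
eponym_start = text.index("Kelly")

# Candidates as (start, end, method), as a high-sensitivity stage might emit them.
candidates = [
    (name_start, name_start + len("Svensson"), "context"),
    (eponym_start, eponym_start + len("Kelly"), "dictionary"),
]
# Hypothetical reference standard: only the clinician name is true PHI.
reference_spans = [(name_start, name_start + len("Svensson"))]

examples = build_training_examples(text, candidates, reference_spans)
for features, label in examples:
    print(label, features)
```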

  • Evaluation of BoB

    Summative evaluation with reference standard of 800 VHA clinical notes (500 training, 300 testing). R = recall, P = precision.

    PHI categories             MIT     Rules   CRF     Rules+CRF   BoB full   BoB full
                               (R)     (R)     (R)     (R)         (R)        (P)
    Patient Name               0.590   0.972   0.953   0.992       0.980      0.707
    Relative Name              0.600   0.960   0.960   0.960       0.920      0.707
    Healthcare Provider Name   0.319   0.920   0.898   0.963       0.943      0.707
    Other Person Name          0.111   1       0.667   1           0.888      0.707
    Street City                0.828   0.962   0.872   0.974       0.943      0.679
    State Country              0.689   0.953   0.757   0.973       0.878      0.751
    Deployment                 0.057   1       --      1           0.887      0.859
    ZIP Code                   1       1       --      1           1          1
    Healthcare Units           0.008   0.832   0.755   0.914       0.811      0.836
    Other Organizations        0.033   0.824   0.549   0.912       0.725      0.578
    Date                       0.399   0.963   0.917   0.977       0.971      0.934
    Age > 89                   0.250   1       --      1           1          0.8
    Phone Number               0.494   0.989   --      0.989       0.956      1
    Electronic Address         1       1       --      1           1          1
    SSN                        1       1       --      1           1          0.964
    Other ID Number            0.117   0.978   --      0.978       0.917      0.831
    Overall macro-averaged     0.468   0.960   --      0.977       0.926      0.841

    Overall micro-averaged     MIT     Rules   CRF     Rules+CRF   BoB full
    Precision                  0.311   0.362   --      0.346       0.836
    Recall                     0.350   0.928   --      0.961       0.922
    F1-measure                 0.329   0.521   --      0.509       0.877
    F2-measure                 0.341   0.707   --      0.709       0.904

    •  MIT: some PHI categories have very low recall because of missing rules/patterns or dictionary entries.
    •  Rules: rules/patterns and dictionary entries specific to VHA clinical notes were required (e.g., date pattern for formats like ‘09/09/09@1200’), and dictionary fuzzy-matches were also added.
    •  CRF: CRFs allowed detecting PHI missing in rules/patterns or dictionaries, but added significant noise.
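    The "overall macro-averaged" row is the unweighted mean of the per-category values, so rare categories count as much as frequent ones. The micro-averaged figures instead pool all PHI instances before computing the measure, which requires per-category instance counts that are not given on the slides. The sketch below recomputes the macro-averaged recall of the full BoB system from the per-category recall values reported in the evaluation.

```python
from statistics import mean

# Per-category recall of the full BoB system, as reported in the evaluation.
bob_full_recall = {
    "Patient Name": 0.980,
    "Relative Name": 0.920,
    "Healthcare Provider Name": 0.943,
    "Other Person Name": 0.888,
    "Street City": 0.943,
    "State Country": 0.878,
    "Deployment": 0.887,
    "ZIP Code": 1.0,
    "Healthcare Units": 0.811,
    "Other Organizations": 0.725,
    "Date": 0.971,
    "Age > 89": 1.0,
    "Phone Number": 0.956,
    "Electronic Address": 1.0,
    "SSN": 1.0,
    "Other ID Number": 0.917,
}

# Macro-average: every PHI category has the same weight.
macro_recall = mean(bob_full_recall.values())
print(f"{macro_recall:.3f}")
```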

  • Acknowledgments

    Oscar Ferrandez Escamez (University of Utah, now Nuance)
    Brett South (University of Utah and SLC VA)
    Shuying Shen (University of Utah and SLC VA)
    Jeffrey Friedlin (Regenstrief Institute)
    Matthew Maw (SLC VA)
    Matthew Samore (University of Utah and SLC VA)

    Funding by VA HSR&D (CHIR; HIR 08-374)

    Questions and comments: [email protected]

    Thank you!

  • Quality of De-Identification, and Impact on Clinical Information

    Stéphane M. Meystre, Biomedical Informatics, University of Utah

    Medinfo 2013, Copenhagen, August 23, 2013

  • Generalizability of de-identification

    PHI content varies significantly between various clinical corpora.

  • Generalizability of de-identification

    De-identification applications tested “out-of-the-box” with our VHA corpus: low performance!

    •  Rule-based systems reach 32-26% recall and 14-42% precision (fully-contained matches, one overall PHI category)
    •  Machine learning-based systems reach 28-30% recall and 56-58% precision (trained with the i2b2 deid corpus)

  • Generalizability evaluation

    The VHA training and testing corpora

    •  Variety of clinical notes (stratified random sample)
    •  Annotated for all HIPAA categories, some VHA-specific categories (deployment locations, units), and eponyms
    •  500 documents for training, 300 documents for testing

    The 2006 i2b2 de-identification challenge corpus

    •  Discharge summaries from Partners Healthcare, de-identified and PHI resynthesized with "± realistic" surrogates
    •  Selection of PHI categories is a subset of HIPAA (Patient, Doctor, Hospital, IDs, Dates, Phone numbers, Ages)
    •  669 documents for training, 220 documents for testing

  • Generalizability evaluation (cont.)

    Applications training and testing:

    •  Train: VHA, test: VHA (all / some / no dictionaries*)
    •  Train: i2b2, test: i2b2 (no dictionaries*)
    •  Cross-corpus: VHA and i2b2, one for training and the other for testing (no dictionaries*)

    *Dictionaries used by MIST and HIDE

  • Generalizability evaluation (cont.): Results (VHA corpus)

                                          MIST*   HIDE**   BoB
    Overall micro-avg (PHI level)
      Precision                           0.926   0.933    0.836
      Recall                              0.888   0.863    0.922
      F1-measure                          0.907   0.897    0.877
      F2-measure                          0.895   0.877    0.904
    Overall macro-avg (PHI-type level)
      Recall                              0.737   0.729    0.926

    * Best MIST configuration, with no dictionaries
    ** Best HIDE configuration, with selected dictionaries

  • Generalizability evaluation (cont.): Results (VHA/i2b2 corpora)

    Training with our VHA corpus, and testing with the i2b2 corpus (MIST and HIDE with no dictionaries)

                                          MIST    HIDE    BoB
    Overall micro-avg (PHI level)
      Precision                           0.705   0.712   0.691
      Recall                              0.749   0.576   0.820
      F1-measure                          0.726   0.637   0.750
      F2-measure                          0.740   0.599   0.790
    Overall macro-avg (PHI-type level)
      Recall                              0.610   0.461   0.664

  • Generalizability evaluation (cont.)

    i2b2 corpus and combination with VHA evaluation:

    •  Training and testing with the i2b2 corpora allows for good performance, even if dictionaries are less useful (BoB's CRF-based NER helped here).
    •  Generalizability remains an issue for all systems when training with one corpus type and testing with another. No system achieved good results (overall macro-averaged recall 46-66%).
    •  BoB’s design still reaches our goal, with the highest recall among the three systems and similar precision.

  • Impact on Clinical Information

    Some clinical information is more likely to be mistakenly considered as PHI.

    Eponyms, for example, could easily be considered as person names. In our corpus, they represent various categories of clinical information:

    •  Procedures and signs (40% of eponyms): Hartmann, Nissen, Roux, Whipple, Apgar, Babinski, etc.
    •  Diseases (36%): Alzheimer, Addison, Asperger, Basedow, Crohn, Cushing, Graves, Hodgkin, Parkinson, Raynaud, etc.
    •  Devices (18%): Adson, Foley, Kelly, Swan-Ganz, etc.
    •  Anatomical structures (6%): Achilles, His, Langerhans, etc.

  • Impact on Clinical Information (cont.)

    Overlap of 2010 i2b2 challenge concepts and BoB PHI annotations: 849 concepts overlapped partly with PHI annotations, reaching an average of 1.78% of all concept annotations.

    Partial overlap:

    i2b2 categories   i2b2 annot.   PHI overlap #   Eponyms   Overlap [%]
    Problem           19667         187             18        0.95
    Test              13833         180             41        1.30
    Treatment         14185         482             53        3.40
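    The overlap percentages follow directly from the annotation and overlap counts in the partial-overlap figures: each is the number of overlapping concepts divided by the number of i2b2 annotations in that category, and the overall 1.78% is the same ratio over all three categories pooled. A short check (variable names are ours):

```python
# (i2b2 annotations, concepts partly overlapping PHI) per i2b2 category.
rows = {
    "Problem": (19667, 187),
    "Test": (13833, 180),
    "Treatment": (14185, 482),
}

overlap_pct = {
    category: 100 * overlaps / annotations
    for category, (annotations, overlaps) in rows.items()
}

total_annotations = sum(annotations for annotations, _ in rows.values())
total_overlaps = sum(overlaps for _, overlaps in rows.values())
overall_pct = 100 * total_overlaps / total_annotations

for category, pct in overlap_pct.items():
    print(f"{category:10s} {pct:.2f}%")
print(f"Overall    {overall_pct:.2f}% ({total_overlaps} of {total_annotations})")
```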

  • Partial overlap details:Problem Test Treatment No match

    Clinical Eponyms 18 41 53 156Person Names 162 103 383 3074Street or City 2 1 3 433State or Country 12 12 18 905Deployment 0ZIP code 0Healthcare Unit Name 17 53 1289Other Organization Name 9 15 196Date 4 20 1 5436Age > 89 13Phone Number 153Electronic Address 0SSN 0Other ID Number 7 18 9 919Total matches 187 180 482 0No match 19466 13626 13675

    Impact on Clinical Information (cont.)

  • Partial overlap details (cont.): most overlap happened with Person Names annotations.

    Most frequent overlap examples:

    • Person Names - Treatment: Colace, Lopressor, Senna, Contin...
    • Person Names - Problem: MR, E.Coli, Pseudomonas, Addison...
    • Person Names - Test: Apgars, Papanicolaou, SP Stickney, Hct...

    PHI category    i2b2 category   Overlap [%]
    Person Names    Treatment             45.11
    Person Names    Problem               19.08
    Person Names    Test                  12.13

    76.33% overall

    Impact on Clinical Information (cont.)

  • Even an efficient text de-identification system can mistakenly consider clinical information as PHI.

    • This overlap is only 1.78%, even when partial matches are counted.

    • Another study, by Deleger et al., compared automated medication extraction from clinical text before and after de-identification. They found no significant difference.

    • When comparing SNOMED-CT concept annotations by cTAKES before and after de-identification, we found 1.2-3% of concepts lost, depending on de-identification accuracy (partly significant difference). Most concepts “lost” were false positives (e.g., “VA” recognized as “vertebral artery”).

    Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. JAMIA. 2012 Aug 2.

    Impact on Clinical Information (cont.)

  • All methods to assess the risk of re-identification were applied to a small number of structured and coded data (demographics, location), not to narrative text (work by Malin, El Emam, etc.).

    • Clinical notes are rich in clinical and social information that can be unique and could be used to re-identify a patient.

    • This risk is significant (23% of 2010 i2b2 corpus documents have unique ICD-9-CM or CPT codes), but limited by access to other identified data sets with clinical codes.

    How to limit this risk?

    • Exclude narrative text from de-identified data sets?
    • Require controlled access and data use agreements?
    • Apply anonymization techniques to non-PHI content?

    Risk of re-identification
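
    One way to read the "unique codes" figure on this slide is as the share of documents whose set of clinical codes is shared with no other document. The slides do not define the measure, so that reading is an assumption, and the toy code lists below are illustrative rather than taken from the i2b2 corpus.

```python
from collections import Counter


def unique_fraction(documents):
    """Share of documents whose code set matches no other document's code set."""
    signatures = [frozenset(codes) for codes in documents]
    frequency = Counter(signatures)
    unique = sum(1 for s in signatures if frequency[s] == 1)
    return unique / len(signatures)


# Hypothetical toy data: one list of assigned codes per document.
docs = [
    ["428.0", "250.00"],
    ["250.00", "428.0"],  # same code set as the first document
    ["410.71"],
    ["585.9", "403.90"],
]
print(unique_fraction(docs))  # 0.5
```

    A document with a unique code set can only be re-identified by someone who also holds an identified data set carrying the same codes, which is the limiting factor noted on the slide.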

  • De-identification of French Clinical Texts: The LIMSI Experiments

    Cyril Grouin, Pierre Zweigenbaum

    LIMSI-CNRS, Orsay, France

    MEDINFO 2013 Panel on De-Identification, Copenhagen, 23/8/2013

  • De-Identification of French Clinical Texts: Previous Work

    • Ruch et al. (2000)
    • Grouin (2002)
    • Grouin et al. (2009)
    • Proux et al. (2009)

  • LIMSI Experiments in De-Identification

    Expert-based methods
    • Localization of DE-ID to process French
    • MEDINA

    Machine-Learning methods
    • CRF-based entity detection

    Cross-corpus experiments
    • Cardiology (discharge reports, hospital 1)
    • Fetopathology (multiple report types, OCR’ed, hospital 2)
    • Mixed (multiple report types, hospital 3)

  • Expert-based Methods: Localization of De-id (Neamatullah et al., 2008)

    • Starting from DE-ID, which de-identifies English clinical texts
        • Lexicons
        • Patterns
    • Translated the lexicons
    • Started to translate the patterns, but
        • too much dependence on language (word order, etc.)
        • program not written with localization in mind
    • Decided to stop and to develop a new system

  • Expert-based Methods: MEDINA (Grouin et al., 2009)

    • Lexicons
        • General lexicon: inflected forms, lemma, POS
        • Specific lexicons: towns, first names, last names
        • Apply through exact match
    • Patterns
        • Character properties
        • Trigger words
        • Neighborhood of already (de-)identified entities
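
    The combination of exact lexicon match, character properties and trigger words listed on this slide can be sketched as follows. This is not MEDINA's code: the miniature lexicons, the trigger list, the tag names and the helper functions are all hypothetical, chosen only to show how the three kinds of evidence fit together.

```python
import re

# Hypothetical miniature lexicons; the real lexicons are not reproduced here.
FIRST_NAMES = {"Pierre", "Marie"}
LAST_NAMES = {"Dupont", "Martin"}
TOWNS = {"Orsay", "Paris"}
TRIGGERS = {"Dr", "Pr", "Mme"}  # assumed trigger words that precede a name


def previous_word(tokens, i):
    """Return the closest preceding token that is not punctuation, if any."""
    j = i - 1
    while j >= 0 and not tokens[j][0].isalnum():
        j -= 1
    return tokens[j] if j >= 0 else None


def tag(text):
    """Label each token using lexicons, character properties and triggers."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    result = []
    for i, tok in enumerate(tokens):
        if tok in FIRST_NAMES:                 # exact lexicon match
            label = "FIRST_NAME"
        elif tok in LAST_NAMES:
            label = "LAST_NAME"
        elif tok in TOWNS:
            label = "TOWN"
        elif re.fullmatch(r"\d{5}", tok):      # character property: 5 digits
            label = "ZIP"
        elif tok[0].isupper() and previous_word(tokens, i) in TRIGGERS:
            label = "LAST_NAME"                # trigger word before a capital
        else:
            label = "O"
        result.append((tok, label))
    return result


print(tag("Vu par le Dr. Durand à Orsay."))
```

    In this sketch "Durand" is absent from the lexicons and is caught only through the trigger "Dr", which illustrates why patterns are needed alongside lexicons that are never complete.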

  • Machine-learning Methods: Conditional Random Fields (see Grouin, MEDINFO 2013)

    • Linear-chain CRF
        • Wapiti (Lavergne et al., 2010)
        • http://wapiti.limsi.fr/
    • Features:
        • surface features: token, capitalization, digit, punctuation, length
        • morpho-syntactic: POS via TreeTagger
        • semantic types: lexicon, CUI via UMLS
        • distributional analysis: clustering via Brown et al.’s (1992) algorithm
    • Automatic feature selection: L1 regularization
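
    The surface features named on this slide are simple functions of the token string. The sketch below computes them as a plain dictionary per token; the feature names are illustrative, the output is not Wapiti's input format, and the POS, UMLS and Brown-cluster features are left out because they need external resources.

```python
import string


def surface_features(token):
    """Surface features of one token: form, capitalization, digits, length."""
    return {
        "token": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "is_punct": all(c in string.punctuation for c in token),
        "length": len(token),
    }


for tok in ["Dupont", "23/8/2013", ",", "CHU"]:
    print(tok, surface_features(tok))
```

    A CRF toolkit would combine such features over neighbouring positions; with L1 regularization, features that do not help receive a zero weight and drop out of the model.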

  • Evaluation: Cardiology and Fetopathology

    Cardiology Corpus

                     P       R       F       Confidence
    Rule-based       0.855   0.830   0.843   [0.821, 0.864]
    CRF              0.909   0.858   0.883   [0.864, 0.901]

    Fetopathology Corpus (OCR’ed, no adaptation)

                     P       R       F       Confidence
    Rule-based       0.678   0.684   0.681   [0.633, 0.729]
    CRF              0.732   0.565   0.638   [0.585, 0.692]
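
    The F column in these tables is consistent with the balanced F-measure, the harmonic mean of precision and recall. Assuming that definition (the slides do not state it), the reported values can be rechecked from P and R; the dictionary keys below are illustrative labels.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the two evaluation tables on this slide.
reported = {
    ("Cardiology", "Rule-based"): (0.855, 0.830, 0.843),
    ("Cardiology", "CRF"): (0.909, 0.858, 0.883),
    ("Fetopathology", "Rule-based"): (0.678, 0.684, 0.681),
    ("Fetopathology", "CRF"): (0.732, 0.565, 0.638),
}

for key, (p, r, f) in reported.items():
    print(key, "recomputed:", round(f_measure(p, r), 3), "reported:", f)
```

    Because P and R are themselves rounded to three decimals, a recomputed value can differ from the reported one in the last digit.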

  • Cardiology Corpus (details)

                        Rule-based               CRF
                        P       R       F        P       R       F
    Dates (238)         0.920   0.874   0.897    0.987   0.946   0.966
    Last names (205)    0.903   0.907   0.905    0.892   0.883   0.887
    First names (109)   0.777   0.927   0.845    0.822   0.890   0.855
    Hospital (43)       0.500   0.372   0.427    0.931   0.628   0.750
    Town (22)           0.688   0.500   0.579    0.632   0.545   0.585
    Zip codes (8)       1.000   1.000   1.000    1.000   0.750   0.857
    Phone (8)           1.000   1.000   1.000    0.857   0.750   0.800

  • Cardiology vs New, Varied Corpus

                                  P       R       F
    MEDINA-Rules     Detection    0.862   0.825   0.846
                     Typing       0.846   0.804   0.824
    CRF-other        Detection    0.929   0.798   0.858
                     Typing       0.529   0.428   0.473
    CRF-test 10×cv   Detection    0.991   0.934   0.962
                     Typing       0.959   0.876   0.916

  • Limitations

    • Size of annotated corpora
        • More precisely, number of training examples
    • Should handle “boilerplate” material differently
        • Address in header
        • Signature in footer
    • Lexicons are always incomplete
        • Lexicon features may however receive high confidence
        • which may prevent the classifier from learning features with better generalization power

  • Types of features: Generalization power

    • Current token is a clue: learn specific names, locations, etc.
        Smith

    • Current token is in a lexical class: lexicons of names, locations, etc.
        Michael|Paul|Laura|...

    • Context of current token is a clue:
        Dr. xxxxx
        xxxxx, Ph.D.
        xxxxx has undergone

    • Current token belongs to a class:
        xxxxx/Capitalized
        xxxxx/NNP
        xxxxx/drug (see also lexicon)

    • Context of current token belongs to a class:
        xxxxx/NNP xxxxx

  • De-Identification and Loss of Information

    • A recurring comment / question during presentations
        • Does de-identification remove information?
    • Removing identifying pieces of information
        • Pseudonymization
        • Date shifting
    • Different goals for de-identification
        • Perform Natural Language Processing research
        • Publish case report
        • ...
    • Inside hospital information system
        • Extracted information should be handled as other structured information
        • Apply standard procedures for structured data
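
    Pseudonymization and date shifting, mentioned on this slide, replace identifiers instead of deleting them, so less information is lost. The sketch below is one common way to do both and is not the LIMSI implementation: the secret key, the function names, the 365-day range and the surrogate list are all assumptions made for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; in practice it would stay inside the hospital.
SECRET_KEY = b"replace-with-a-secret-kept-inside-the-hospital"


def _keyed_number(message):
    """Deterministic integer derived from the secret key and a message."""
    digest = hmac.new(SECRET_KEY, message.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def patient_offset(patient_id, max_days=365):
    """Fixed per-patient shift, in days, between -max_days and +max_days."""
    return _keyed_number(patient_id) % (2 * max_days + 1) - max_days


def shift_date(patient_id, d):
    """Shift a date by the patient's offset; intervals between dates are kept."""
    return d + timedelta(days=patient_offset(patient_id))


def pseudonym(patient_id, name, surrogates):
    """Replace a name with a surrogate, always the same one for a given patient."""
    return surrogates[_keyed_number(patient_id + "|" + name) % len(surrogates)]


admission = shift_date("patient-1", date(2013, 8, 1))
discharge = shift_date("patient-1", date(2013, 8, 23))
print(discharge - admission)  # 22 days, 0:00:00
print(pseudonym("patient-1", "Dupont", ["Bernard", "Petit", "Moreau"]))
```

    Using one offset per patient keeps the length of stay and the order of events intact, which matters when the de-identified text is used for clinical research.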

  • Thank you
