natural language processing for extracting knowledge from
TRANSCRIPT
1
Natural Language Processing for Extracting Knowledge from Free-Text
Aurélie Névéol
Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur (LIMSI UPR 3251)
Natural Language Processing
can extract
Biomedical Knowledge
from free-text
2
Treating
patients
requires
Biomedical
Knowledge
3
Sources of
Biomedical
Knowledge
4
Large thrombus burden in
the proximal LAD artery
Obstruction totale des
artères segmentaires
Health Records Raw Research Data
Publications
Research
Protocol
What
can
NLP
do?
5
De-identification
6
Needed in the absence of informed consent.
Entity and relationship identification
7
Entity and relationship identification
8
Modality
9
Clinical NLP for EHRs
• Need to address several languages
– Access information about a variety of patient cohorts
– Validate the generalizability of algorithms
• NLP for French clinical text still in its infancy
10
NLP for French clinical text
11
• Terminologies – UMLF [Zweigenbaum et al., 2005] [Merabti et al. 2010]
• De-identification – Open source tool suite [Grouin et al., 2012]
• Entity recognition – Medication [Deléger et al. 2010] patient history [Burgun et al., 2012]
• Modality – Negation [Deléger and Grouin, 2010]
• Entity Normalization [Névéol et al., 2006] [Pereira et al., 2008]
• Part of Speech tagging
Benefit of
CT venography?
Thrombo-embolic Diseases
• Advances of imaging technology
– CT angiography and CT venography used for the diagnosis of thrombo-embolic diseases
– Detection of incidental clinically relevant findings
12
CTV did not show any difference in the
detection of isolated DVT in high-risk patients.
Krishan et al. (AJR 2011)
The improvement in diagnostic yied is too
limited to justify the added radiation risk.
Perrier and Bounameaux (NEJM 2006)
the added yield is marginal and CT
venography should be used selectively. Remy-
Jardin et al (Radiology 2007)
In severely ill inpatients and intensive care
unit patients, CT venography seems
appropriate.
Goodman et al. (Radiology 2009)
CTV may be useful in patients with a high
probability of pulmonary embolism, including those
with a history of VTE and possible malignancy.
Hunsaker et al (AJR 2008)
the addition of indirect CT venography increased
the diagnosis of VTE in 27.4% of patients.
Ghaye et al (Radiology 2006)
Clinical Questions
• Benefit of CT venography in the diagnosis of Pulmonary Embolism and Deep Venous Thrombosis?
• Prevalence of Incidental Findings?
13
Data from a large French hospital
Research Protocol (with IRB approval)
14
HEGP
i2b2
Radiology
reports
Gold standard
Physician
assessment
+
−
?
PE DVT Icd
MEDINA
Automatic Processing
Naïve Bayes
Classifier
PE DVT Icd
Selection of radiography
reports with indication of
thromboembolic disease
De-identification
Classification
+
−
?
+
−
?
+
−
+
− +
−
Validation
Radiology
reports
De-identification
• MEDINA [Grouin et al., 2012] customized with HEGP patient data
15
Evaluation Last names First names Overall
Recall 95,6% 90,0% 94,2%
F1-score 95,5% 86,5% 93,3%
De-identification protocol
16
• Validate customized rule-based de-identification
on a small set of records
• Train a statistical model (CRF)
• Apply to more records
Gold standard: physician assesment
on 615 de-identified reports
17
ANGIOSCANNER THORACIQUE ET PHLEBOSCANNER
INDICATION :
Suspicion d' embolie pulmonaire .
Antécédent de phlébite , sous traitement anticoagulant .
RESULTATS :
1 ) Thorax :
Analyse des artères jusuq'au niveau sous segmentaire .
Présence d' embols :
-à droite : apical du lobe supérieur et du lobe inférieur , end distalité du
segment postéro médial
-à gauche : artère lingulaire et tronc commun paracardiaque - antérobasal .
Le tronc de l' artère pulmonaire est de calibre augmentée mesuré à 33
millimètres .
Petite dilatation des cavités cardiaques droites aec un rapport VD/VG = 1 .
Masse tissulaire , prenant le contraste , de 38 par 22mm de grand axe
transverse , sous pleurale dans le segment apical du lobe inférieur gauche .
Infiltration péri hilaire gauche avec épaississement de toutes les parois
bronchiques .
présence également d' un épaississement péribronchovasculaire périhilaire
gauche prédominant nettement dans le lobe inférieur ou il s' associe à un
épaississement des lignes septales en faveur d' une lymphangite
carcinomateuse .
Adénomégalies médiastinales et hilaires bilatétrales avec en particulier
une adénopathie de 34 mm de la loge de Barety .
Epanchement pleural gauche de faible abondance .
Epaississement de la scissure gauche .
Pas d' épanchement péricardique .
2 ) Abdomen , pelvis et membres inférieurs :
Thrombose des deux veines poplitées , deux fémorales superficielles ,
saphène droite , iliaque externe droite .
Pas de thrombose veineuse profonde objectivée à l' étage
abdominopelvien .
Pas d' anomalie patente du foie , des voies biliaires , du pancréas , de la
rate , des surrénales .
Encoche rénale droite , séquellaire de néphrite .
Pas d' épanchement liquidien intra-abdominal .
Athérome aotique calcifié diffus .
CONCLUSION :
Embolie pulmonaire bilatérale , au maximum lobaire ( lingula ) avec
dilatation des cavités droties ( VD/VG=1 )
Thrombose des deux veines poplitées , deux fémorales superficielles ,
saphène droite , iliaque externe droite .
Masse tissulaire du lobe inférieur droit de 4 cm avec aspect de
lymphangite carcinomateuse et adénomégalies médiastinales
bilatérales .
Conclusive CTA-CTV exam
Suspected PE
CT Angiography: PE
CT Venography: DVT
Incentaloma
Gold standard: physician assesment
on 615 de-identified reports
18
N N
Incidentalomas 93 Indication
Suspected PE 602
Other 13
Post-partum 3
Interpretation quality of exam
Conclusive 506
Artefacts on CT Angiography 70
Artefacts on CT Venography 28
Artefacts on CTA-CTV 11
CT Angiography
CT Venography Negative Positive Total
Negative 417 52 469
Positive 30 74 104
Total 447 126 573
Modelling radiology findings
19
Disorders • Thromboembolic disease
• Incidentaloma
Anatomy Exams
Location_of Reveals
Positive
Negative
Hypothetical
Historical
Incidental
Report Annotation
• Automatic pre-annotation with a lexicon – Terms from MeSH, FMA and MedCode
– Covered entities only
• Manual revision – By a phycisian and a computational linguist
– Created annotations for modalities, relations
20
Annotation Strategy
• Full annotation – Revise entire report
– 20 minutes per report
• Light annotation – Focus on conclusion
– 7 minutes per report
21
Annotation Tools
• BRAT Rapid Annotation Tool
– Free from http://brat.nlplab.org/
– Easy to install & use for entity, modalities, relations
– Supports pre-annotations
– Inter-annotator agreement software available from
https://bitbucket.org/nicta_biomed/brateval
22
Annotation Interface
23
24
25
• Modality is important (only 42% positive)
• Many Location_of, few Reveals
Distribution of Annotations
26
Entity N Relation N Modality N
Anatomy 9,702 Location_of 2,507 Negative 1,759
Thromboembolic
disease (TD) 3,116 Reveals 42 Positive 1,653
Exam 1,582 History 293
Incidentalomas (K) 1,478 Hypothetical 123
Post-Partum 3 Incidental 118
Inter-annotator agreement
27
Relations Exact Match Inexact Match
Anatomy Location_of K 80 80
Anatomy Location_of TD 50 66.7
Entities Exact Match Inexact Match
Anatomy 73.5 90.9
Thromboembolic disease (TD) 95.7 99.1
Exam 89.3 89.3
Incidentalomas (K) 78.4 78.4
( F-score )
Automatic Report Classification
28
• Naïve Bayes Classifier
– Inbalanced dataset
Whole Corpus
– Features • Text words (with n-gram and stopword filters)
• Annotations
• Report structure
TEST
TRAINING
+ –
Duplication of positive
examples
Random selection of
100 reports
– + – +
Classification Results
29
Precision Recall F1-score
Pulmonary Embolism (PE)
Text words 0.778 0.955 0.857
Text words + annotations 0.987 0.974 0.981
Deep Veinous Thrombosis (DVT)
Text words 0.421 0.889 0.571
Text words + annotations 0.727 0.889 0.800
EP and/or TVP
Text words 0.767 0.852 0.807
Text words + annotations 0.920 0.852 0.885 Incidentaloma
Text words 0.320 0.286 0.302
Text words + annotations 0.667 0.500 0.571
+ Structure 0.500 0.750 0.667
• Up to 90% improvement using annotations
Contributions of this work
• Good classification results, comparable to similar work on English [Chapman et al., 2011]
• Development of annotated corpus of radiology
reports in French – Potential for terminology enrichment – training data for automatic extraction of entities,
modalities and relations
• Validates the contribution of annotations to
document classification
30
Limitations
• Annotations are obtained manually – On-going work to automate the process – Preliminary results are promising
• Incidental findings classification results – Lower than PE, DVT classification – But potential generalization to other radiology reports
• Corpus size – Plan to scale up
31
On-going work
• Framework for clinical NLP
– Comprehensive annotation scheme building on UMLS and SHARP
– Annotated corpus for French
– Tools for French, including de-identification
32
Comprehensive Annotation Scheme: 18 entities, 19 relations
33
Pilot annotation study
• Goals:
– Validate the annotation scheme
• Coverage
• Applicability
– Assess the feasibility of corpus annotation
• Across domains
• Across document types
34
Study Set-up
• Corpus: 35 de-identified clinical documents – 5 preliminary
– 15 foetopathology
– 15 varia
• Annotators: 4 NLP/bioNLP experts with annotation experience
• Detailed guidelines – Include examples
– Advocate using UMLS/PTS reference for entities
35
Overall annotation process
36
Sample annotated text
37
EHR
Foetopathology
Annotation Distribution
38
Foetopathology EHR
Tokens 4240 3976
Annotated tokens 2941 (69%) 2052 (52%)
Annotated entities 1924 1168
Annotation Distribution
39
05
1015202530354045
An
ato
my
Mea
sure
men
t
Dis
ord
er
Co
nce
pt
Idea
Med
ical
Pro
ced
ure
Bio
logi
calP
roce
ssO
rF…
Mo
dal
ityA
nch
or
Livi
ngB
ein
gs
Du
rati
on
Ch
emic
als
Dru
gs
Sign
OrS
ymp
tom
Gen
es P
rote
ins
Dat
e
Freq
uen
cy
Dev
ices
Do
sage
Stre
ngt
h
Dru
gFo
rm
Ad
min
istr
atio
nR
ou
te
EHR
Foetopathology
%
Results
• Inter annotator agreement progress – From 50% to 81% for entities
• Inter-annotator agreement variation – Per annotator pair (7% for entities, 20% for relations)) – Per entity/relation type – Per document type
• Pre-annotations – Precision higher than recall (84% vs. 57% on FP corpus) – Found helpful by annotators
40
Annotation lessons learned
• Annotating is hard!
• Annotation scheme is appropriate – Further tuning in process
• Homogeneous documents help with IAA – Consider for selection of documents for larger corpus
• Annotate relations from entity consensus
41
Natural Language Processing
can extract
Biomedical Knowledge
from free-text
42
→ Towards a framework for clinical NLP applied to French texts
Thank you!
Collaborators: Louise Deléger Anne-Dominique Pham
Cyril Grouin Guy Meyer, Olivier Clément
Thomas Lavergne Anita Burgun
Anne-Laure Ligozat Pierre Zweigembaum
Funding:
CABeRneT ANR-13-JS02-0009-01
43