natural language processing for extracting knowledge from

1

Natural Language Processing for Extracting Knowledge from Free-Text

Aurélie Névéol

Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur (LIMSI UPR 3251)

Natural Language Processing

can extract

Biomedical Knowledge

from free-text

2

Treating

patients

requires

Biomedical

Knowledge

3

Sources of

Biomedical

Knowledge

4

Large thrombus burden in

the proximal LAD artery

Obstruction totale des

artères segmentaires

Health Records Raw Research Data

Publications

Research

Protocol

What

can

NLP

do?

5

De-identification

6

Needed in the absence of informed consent.

Entity and relationship identification

7

Entity and relationship identification

8

Modality

9

Clinical NLP for EHRs

• Need to address several languages

– Access information about a variety of patient cohorts

– Validate the generalizability of algorithms

• NLP for French clinical text still in its infancy

10

NLP for French clinical text

11

• Terminologies – UMLF [Zweigenbaum et al., 2005] [Merabti et al. 2010]

• De-identification – Open source tool suite [Grouin et al., 2012]

• Entity recognition – Medication [Deléger et al. 2010] patient history [Burgun et al., 2012]

• Modality – Negation [Deléger and Grouin, 2010]

• Entity Normalization [Névéol et al., 2006] [Pereira et al., 2008]

• Part of Speech tagging

Benefit of

CT venography?

Thrombo-embolic Diseases

• Advances of imaging technology

– CT angiography and CT venography used for the diagnosis of thrombo-embolic diseases

– Detection of incidental clinically relevant findings

12

CTV did not show any difference in the

detection of isolated DVT in high-risk patients.

Krishan et al. (AJR 2011)

The improvement in diagnostic yied is too

limited to justify the added radiation risk.

Perrier and Bounameaux (NEJM 2006)

the added yield is marginal and CT

venography should be used selectively. Remy-

Jardin et al (Radiology 2007)

In severely ill inpatients and intensive care

unit patients, CT venography seems

appropriate.

Goodman et al. (Radiology 2009)

CTV may be useful in patients with a high

probability of pulmonary embolism, including those

with a history of VTE and possible malignancy.

Hunsaker et al (AJR 2008)

the addition of indirect CT venography increased

the diagnosis of VTE in 27.4% of patients.

Ghaye et al (Radiology 2006)

Clinical Questions

• Benefit of CT venography in the diagnosis of Pulmonary Embolism and Deep Venous Thrombosis?

• Prevalence of Incidental Findings?

13

Data from a large French hospital

Research Protocol (with IRB approval)

14

HEGP

i2b2

Radiology

reports

Gold standard

Physician

assessment

+

−

?

PE DVT Icd

MEDINA

Automatic Processing

Naïve Bayes

Classifier

PE DVT Icd

Selection of radiography

reports with indication of

thromboembolic disease

De-identification

Classification

+

−

?

+

−

?

+

−

+

− +

−

Validation

Radiology

reports

De-identification

• MEDINA [Grouin et al., 2012] customized with HEGP patient data

15

Evaluation Last names First names Overall

Recall 95,6% 90,0% 94,2%

F1-score 95,5% 86,5% 93,3%

De-identification protocol

16

• Validate customized rule-based de-identification

on a small set of records

• Train a statistical model (CRF)

• Apply to more records

Gold standard: physician assesment

on 615 de-identified reports

17

ANGIOSCANNER THORACIQUE ET PHLEBOSCANNER

INDICATION :

Suspicion d' embolie pulmonaire .

Antécédent de phlébite , sous traitement anticoagulant .

RESULTATS :

1 ) Thorax :

Analyse des artères jusuq'au niveau sous segmentaire .

Présence d' embols :

-à droite : apical du lobe supérieur et du lobe inférieur , end distalité du

segment postéro médial

-à gauche : artère lingulaire et tronc commun paracardiaque - antérobasal .

Le tronc de l' artère pulmonaire est de calibre augmentée mesuré à 33

millimètres .

Petite dilatation des cavités cardiaques droites aec un rapport VD/VG = 1 .

Masse tissulaire , prenant le contraste , de 38 par 22mm de grand axe

transverse , sous pleurale dans le segment apical du lobe inférieur gauche .

Infiltration péri hilaire gauche avec épaississement de toutes les parois

bronchiques .

présence également d' un épaississement péribronchovasculaire périhilaire

gauche prédominant nettement dans le lobe inférieur ou il s' associe à un

épaississement des lignes septales en faveur d' une lymphangite

carcinomateuse .

Adénomégalies médiastinales et hilaires bilatétrales avec en particulier

une adénopathie de 34 mm de la loge de Barety .

Epanchement pleural gauche de faible abondance .

Epaississement de la scissure gauche .

Pas d' épanchement péricardique .

2 ) Abdomen , pelvis et membres inférieurs :

Thrombose des deux veines poplitées , deux fémorales superficielles ,

saphène droite , iliaque externe droite .

Pas de thrombose veineuse profonde objectivée à l' étage

abdominopelvien .

Pas d' anomalie patente du foie , des voies biliaires , du pancréas , de la

rate , des surrénales .

Encoche rénale droite , séquellaire de néphrite .

Pas d' épanchement liquidien intra-abdominal .

Athérome aotique calcifié diffus .

CONCLUSION :

Embolie pulmonaire bilatérale , au maximum lobaire ( lingula ) avec

dilatation des cavités droties ( VD/VG=1 )

Thrombose des deux veines poplitées , deux fémorales superficielles ,

saphène droite , iliaque externe droite .

Masse tissulaire du lobe inférieur droit de 4 cm avec aspect de

lymphangite carcinomateuse et adénomégalies médiastinales

bilatérales .

Conclusive CTA-CTV exam

Suspected PE

CT Angiography: PE

CT Venography: DVT

Incentaloma

Gold standard: physician assesment

on 615 de-identified reports

18

N N

Incidentalomas 93 Indication

Suspected PE 602

Other 13

Post-partum 3

Interpretation quality of exam

Conclusive 506

Artefacts on CT Angiography 70

Artefacts on CT Venography 28

Artefacts on CTA-CTV 11

CT Angiography

CT Venography Negative Positive Total

Negative 417 52 469

Positive 30 74 104

Total 447 126 573

Modelling radiology findings

19

Disorders • Thromboembolic disease

• Incidentaloma

Anatomy Exams

Location_of Reveals

Positive

Negative

Hypothetical

Historical

Incidental

Report Annotation

• Automatic pre-annotation with a lexicon – Terms from MeSH, FMA and MedCode

– Covered entities only

• Manual revision – By a phycisian and a computational linguist

– Created annotations for modalities, relations

20

Annotation Strategy

• Full annotation – Revise entire report

– 20 minutes per report

• Light annotation – Focus on conclusion

– 7 minutes per report

21

Annotation Tools

• BRAT Rapid Annotation Tool

– Free from http://brat.nlplab.org/

– Easy to install & use for entity, modalities, relations

– Supports pre-annotations

– Inter-annotator agreement software available from

https://bitbucket.org/nicta_biomed/brateval

22

http://brat.nlplab.org/





Annotation Interface

23

• Modality is important (only 42% positive)

• Many Location_of, few Reveals

Distribution of Annotations

26

Entity N Relation N Modality N

Anatomy 9,702 Location_of 2,507 Negative 1,759

Thromboembolic

disease (TD) 3,116 Reveals 42 Positive 1,653

Exam 1,582 History 293

Incidentalomas (K) 1,478 Hypothetical 123

Post-Partum 3 Incidental 118

Inter-annotator agreement

27

Relations Exact Match Inexact Match

Anatomy Location_of K 80 80

Anatomy Location_of TD 50 66.7

Entities Exact Match Inexact Match

Anatomy 73.5 90.9

Thromboembolic disease (TD) 95.7 99.1

Exam 89.3 89.3

Incidentalomas (K) 78.4 78.4

( F-score )

Automatic Report Classification

28

• Naïve Bayes Classifier

– Inbalanced dataset

Whole Corpus

– Features • Text words (with n-gram and stopword filters)

• Annotations

• Report structure

TEST

TRAINING

+ –

Duplication of positive

examples

Random selection of

100 reports

– + – +

Classification Results

29

Precision Recall F1-score

Pulmonary Embolism (PE)

Text words 0.778 0.955 0.857

Text words + annotations 0.987 0.974 0.981

Deep Veinous Thrombosis (DVT)

Text words 0.421 0.889 0.571


EP and/or TVP

Text words 0.767 0.852 0.807

Text words + annotations 0.920 0.852 0.885 Incidentaloma

Text words 0.320 0.286 0.302


+ Structure 0.500 0.750 0.667

• Up to 90% improvement using annotations

Contributions of this work

• Good classification results, comparable to similar work on English [Chapman et al., 2011]

• Development of annotated corpus of radiology

reports in French – Potential for terminology enrichment – training data for automatic extraction of entities,

modalities and relations

• Validates the contribution of annotations to

document classification

30

Limitations

• Annotations are obtained manually – On-going work to automate the process – Preliminary results are promising

• Incidental findings classification results – Lower than PE, DVT classification – But potential generalization to other radiology reports

• Corpus size – Plan to scale up

31

On-going work

• Framework for clinical NLP

– Comprehensive annotation scheme building on UMLS and SHARP

– Annotated corpus for French

– Tools for French, including de-identification

32

Comprehensive Annotation Scheme: 18 entities, 19 relations

33

Pilot annotation study

• Goals:

– Validate the annotation scheme

• Coverage

• Applicability

– Assess the feasibility of corpus annotation

• Across domains

• Across document types

34

Study Set-up

• Corpus: 35 de-identified clinical documents – 5 preliminary

– 15 foetopathology

– 15 varia

• Annotators: 4 NLP/bioNLP experts with annotation experience

• Detailed guidelines – Include examples

– Advocate using UMLS/PTS reference for entities

35

Overall annotation process

36

Sample annotated text

37

EHR

Foetopathology

Annotation Distribution

38

Foetopathology EHR

Tokens 4240 3976

Annotated tokens 2941 (69%) 2052 (52%)

Annotated entities 1924 1168

Annotation Distribution

39

05

1015202530354045

An

ato

my

Mea

sure

men

t

Dis

ord

er

Co

nce

pt

Idea

Med

ical

Pro

ced

ure

Bio

logi

calP

roce

ssO

rF…

Mo

dal

ityA

nch

or

Livi

ngB

ein

gs

Du

rati

on

Ch

emic

als

Dru

gs

Sign

OrS

ymp

tom

Gen

es P

rote

ins

Dat

e

Freq

uen

cy

Dev

ices

Do

sage

Stre

ngt

h

Dru

gFo

rm

Ad

min

istr

atio

nR

ou

te

EHR

Foetopathology

%

Results

• Inter annotator agreement progress – From 50% to 81% for entities

• Inter-annotator agreement variation – Per annotator pair (7% for entities, 20% for relations)) – Per entity/relation type – Per document type

• Pre-annotations – Precision higher than recall (84% vs. 57% on FP corpus) – Found helpful by annotators

40

Annotation lessons learned

• Annotating is hard!

• Annotation scheme is appropriate – Further tuning in process

• Homogeneous documents help with IAA – Consider for selection of documents for larger corpus

• Annotate relations from entity consensus

41

Natural Language Processing

can extract

Biomedical Knowledge

from free-text

42

→ Towards a framework for clinical NLP applied to French texts

Thank you!

[email protected]

Collaborators: Louise Deléger Anne-Dominique Pham

Cyril Grouin Guy Meyer, Olivier Clément

Thomas Lavergne Anita Burgun

Anne-Laure Ligozat Pierre Zweigembaum

Funding:

CABeRneT ANR-13-JS02-0009-01

43

mailto:[email protected]

natural language processing for extracting knowledge from

Documents