


Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study


Maria Skeppstedt a,*, Maria Kvist a,b,c, Gunnar H. Nilsson d, Hercules Dalianis a

a Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, SE-164 40 Kista, Sweden
b Department of Clinical Immunology and Transfusion Medicine, Karolinska University Hospital, Stockholm, Sweden
c Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Sweden
d Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden

* Corresponding author. Fax: +46 (0)8 703 90 25. E-mail address: [email protected] (M. Skeppstedt).

Article history: Received 10 July 2013; Accepted 23 January 2014; Available online xxxx

Keywords: Named entity recognition; Corpora development; Clinical text processing; Disorder; Finding; Swedish

Abstract

Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages.

This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding.

Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder + Finding.

The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations.

© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction

Electronic health records contain valuable information in the form of symptom descriptions, documentation of examinations, diagnostic reasoning and motivations for treatment decisions. Automatic extraction of this information makes it possible to improve applications for patient care documentation, and enables secondary usage of the information in the form of medical knowledge mining.

While a subset of the health record information, e.g. medication lists and diagnosis coding, is documented in a structured format, much important information is only available as free text [1]. An automatic summary of the free text part is therefore called for, providing health personnel with the possibility of forming a quick overview of the patient [2,3]. The information contained in health records can also be used for clinical text mining, i.e. to generate new medical knowledge from a large corpus of electronic health records. Syndromic surveillance [4], comorbidity studies [5] and automatic detection of adverse drug reactions [6] are examples of clinical text mining applications.

An important component in information extraction from health record text is named entity recognition (NER) of relevant entities mentioned in the text, i.e. the automatic detection of spans of text referring to entities of certain semantic categories [7]. This study focuses on recognition of four entity categories that are particularly relevant for constructing a patient overview as well as for studies of co-morbidity [5,8], disorder and finding co-occurrences [9] and adverse drug reactions [6], namely the categories: Disorder, Finding, Pharmaceutical Drug and Body Structure.

There are previous studies on the recognition of clinical entities in English text, but very few studies have been carried out on clinical text written in other languages. The present study was performed on Swedish clinical text, and although both Swedish and English are Germanic languages, NER in Swedish poses additional challenges, as Swedish is more inflective and compounding of words occurs frequently. In addition, medical terminological resources are less extensive for Swedish than for English.

Moreover, previous annotation and NER studies have typically combined the two more granular entity categories Disorder and Finding into one more general category, e.g. called Condition or Medical Problem [10–12], or have focused only on the entity category Disorder [13,14]. To the best of our knowledge, there is only one previous corpus [15] in which the categories Disorder and Finding are annotated as two separate entity categories. The study describing the creation of this corpus does not, however, investigate the effect of this more granular division.

The present study therefore has two specific research questions:

• To what extent is it possible to use the NER methods, which have been successful for English clinical texts, on health record texts written in Swedish?
• To what extent is it possible to separate the more general entity category Medical Problem into the two more granular entity categories Disorder and Finding?

2. Related research

There are several studies that describe the creation of corpora annotated for named entities and that measure inter-annotator agreement scores between annotators (Table 1). There are also a number of studies in which the created corpora are used for training and/or evaluating NER systems (Table 2).

The annotation study by Chapman et al. [16] showed that detailed guidelines for annotating Clinical Conditions resulted in a substantially higher F-score than less detailed guidelines, but no significant differences in inter-annotator agreement between pairs of physicians and between physicians and lay people were found (but lay people required more training and had a lower ability to retain their annotation skills over time). Within the CLinical E-Science Framework, Roberts et al. [17] observed a higher F-score for lay people (a biologist/linguist and a computational linguist) than for a clinician, when measuring agreement to a constructed consensus set containing annotations for Condition (symptom and diagnosis), Drug or Device and Locus (e.g. anatomical structure or location). Wang [18] measured the inter-annotator agreement between two computational linguists annotating the categories Finding (corresponding to Medical Problem), Substance and Body, while Ogren et al. [19] measured the average agreement between four clinical data retrieval experts for annotating identical spans of text denoting the category Disorder. For the i2b2 medication challenge [20,21], inter-annotator agreement was calculated for annotations of Medication Names on pre-annotated data. No statistically significant differences were observed between pairs of NLP community annotators, pairs of expert annotators, or pairs of experts annotating raw text. One participating group [22] annotated an additional subset of the development data provided for the challenge. Pre-annotation was also applied when annotating the MiPACQ corpus [15] for entity categories including Disorder, Anatomy, Sign or Symptom, and Chemical and Drug.

The created corpora have been used as training and evaluation data for machine learning-based NER systems and as evaluation data for rule- and terminology-based systems. An SVM (support vector machine) with uneven margins was trained on a subset of the CLinical E-Science Framework corpus [24], and two studies have been performed on the corpus created by Wang; one using the CRF (conditional random fields) package CRF++ [18] and one combining output from CRF++ with an SVM and an ME (maximum entropy) classifier [10]. All but one of the best performing systems in the i2b2/VA challenge on concepts, assertions, and relations used CRF for concept recognition [23,25]. The best performing system (by de Bruijn et al. [11]) instead used semi-Markov HMM. The second best (by Jiang et al. [12]) found that CRF (CRF++) outperformed SVM, and also managed to improve the results with a rule-based post-processing module. In the i2b2 medication challenge, on the other hand, which included the identification of Medication Names, a majority of the ten top-ranked systems were rule-based [20]. The best performing system (by Patrick and Li [22]) did, however, use CRF++, while the second best (by Doan et al. [26]) was built on terminology matching and a spell checker developed for drug names. This rule-based system was later employed by Doan et al. [27] in an ensemble classifier, together with an SVM and a CRF++ system. On the Ogren et al. [19] corpus, a terminology-based method for recognising disorders by matching to SNOMED CT has been evaluated [14,28], and there is also a terminology-based study for recognising diseases and drugs in Swedish discharge summaries, described by Kokkinakis and Thurin [13], in which the MeSH terminology was used.

Typical features used for training the machine learning models were the tokens (sometimes in a stemmed form), orthographics (e.g. number, word, capitalisation), prefixes and suffixes, part-of-speech information, as well as the output of terminology matching, which had a large positive effect in many studies (e.g. [10,18]). Most studies used features extracted from the current and the two preceding and two following tokens, while Roberts et al. [24] used a window size of ±1. The best performing system in the i2b2/VA concepts challenge used a very large feature set with a window size of ±4, also including character n-grams, word bi/tri/quad-grams and skip-n-grams, as well as sentence, section and document features (e.g. sentence and document length and section headings). In addition, features from semi-supervised learning methods were incorporated, in the form of hierarchical word clusters constructed on unlabelled data [11].

3. Methods

A corpus was first annotated and evaluated. Thereafter, suitable features for the NER task were evaluated and a model was trained using these features and evaluated on held-out data. Finally, an error analysis was carried out.

The IOB-encoding [7, pp. 763–764] of the annotated entities was used, as exemplified in Fig. 1. As machine learning algorithm, the CRF (conditional random fields) [29] implementation CRF++ [30] was chosen, which has been used in many previous clinical NER studies. CRF++ was used as linear chain CRF, in which each output variable is dependent on the previous and subsequent output variable.
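To make the encoding concrete, the following is a minimal sketch (our own illustration, not the authors' code) of IOB-encoding annotated entity spans over a tokenised sentence; the example tokens and span are hypothetical.

```python
# IOB-encode annotated entity spans: B- marks the first token of an
# entity, I- the following tokens, and O everything outside entities.
def iob_encode(tokens, spans):
    """tokens: list of str; spans: list of (start, end, category),
    with end exclusive, over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, category in spans:
        tags[start] = "B-" + category
        for i in range(start + 1, end):
            tags[i] = "I-" + category
    return tags

# Toy sentence mirroring Fig. 1: "pat has chestpain"
print(iob_encode(["pat", "has", "chestpain"], [(2, 3, "Finding")]))
# ['O', 'O', 'B-Finding']
```

These IOB tags are the output labels that the linear chain CRF predicts, one per token.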


Table 1
Inter-annotator agreement in previous annotation studies. The data shown is the inter-annotator agreement (F-score), types of clinical text and annotation classes. The entity types Clinical Condition, Condition and Finding are fairly equivalent, including what is referred to as the category Medical Problem or Disorder + Finding in our study.

Chapman et al. [16]; Emergency department reports
  Clinical Condition: 0.92
CLinical E-Science Framework [17]; (1) Narratives: To GP, Discharge letter and Case note. (2) Imaging reports. (3) Histopathology reports
  Condition: 0.81, 0.77, 0.67
  Drug or Device: 0.84, 0.32, 0.59
  Locus: 0.78, 0.75, 0.71
Wang [10]; Intensive care service progress notes
  Finding: 0.91
  Substance: 0.95
  Body: 0.85
Ogren et al. [19]; Outpatient notes, discharge summaries, inpatient service notes
  Disorder: 0.76
i2b2/VA challenge on concepts ... [23]; Discharge summaries, some progress notes
  Medical Problem, Test and Treatment: Not given
i2b2 medication challenge [21]; Discharge summaries
  Medication Name: 0.86–0.91
Previous Swedish study [13]; Discharge summaries
  Disease and Drug: Not given
MiPACQ [15]; Clinical narratives
  UMLS semantic groups: 0.70


3.1. Corpus annotation

To study clinical narratives with a variety of disorders, free text sections from clinical notes from an internal medicine emergency unit (from Karolinska University Hospital) were compiled into a corpus.^1 Texts with the Assessment sub-heading were chosen as these contain reasoning about findings as well as diagnostic speculations, which means that they contain many mentions of disorders and findings and are particularly suited for a study of these two entity categories. Assessment fields are also interesting in that they form a prototypical example of the difficulties associated with conducting text processing on clinical text, as they are written using a highly telegraphic language, containing many abbreviations and few full sentences. The compiled corpus consisted of 1,148 randomly selected Assessment fields that were extracted from the health record database Stockholm EPR Corpus [31], which contains patient records written in Swedish from the years 2006 to 2008.

The definition of the four annotated entity categories can be summarised as follows: (1) A Disorder is a disease or abnormal condition that is not momentary and that has an underlying pathological process. (2) A Finding is a symptom reported by the patient, an observation made by the physician or the result of a medical examination of the patient. This includes non-pathological findings with medical relevance.^2 (3) A Pharmaceutical Drug is a medical drug that is either mentioned with a generic name or trade name, or with other expressions denoting drugs, e.g. drugs expressed by their effect, such as painkiller or sleeping pill. Narcotic drugs used outside of medical care were excluded. (4) A Body Structure is an anatomically defined body part, excluding body fluids and expressions indicating positions on the body.

Annotation guidelines were developed by a senior physician with previous experience of annotating clinical text (PH1) and a computational linguist without previous annotation experience (CL). A test annotation of 664 Assessment fields (not included in the 1,148 fields compiled for the final corpus) was performed by PH1, for annotator training as well as for development of the annotation guidelines. Whenever there was an issue not covered by the annotation guidelines, it was discussed and the guidelines were updated accordingly. At first, there were many modifications of the guidelines, but the need for modifications gradually decreased. The final version of the guidelines was reviewed by a second physician, also with previous experience of annotating clinical text (PH2).

^1 This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2012/834-31/5.
^2 The definition for Disorders and Findings is a summary of the definition in the SNOMED CT Style Guide [32].

The following are the most important points of the guidelines^3: The shortest possible expression that still fully describes the entity was annotated. Modifiers that, for example, describe severity were therefore excluded, while modifiers describing the type of an entity were included. In the example: The patient experiences a strong stabbing pain in left knee,^4 the words strong and left were therefore not annotated, whereas stabbing pain was annotated as a Finding, and knee as a Body Structure. All mentions of any of the four selected classes were annotated, regardless of whether, for example, a Disorder was referred to with an abbreviation or acronym or in a negated or a speculative context, or the person experiencing the Finding was someone other than the patient. The guidelines also included rules for handling the frequent occurrence of compound words. Compound words were not split up into substrings, and therefore, for example, diabetes in diabetesclinic^5 was not annotated as a Disorder, whereas the word heartdisease,^6 which is a compound denoting a Disorder, was annotated as such. A compound including a treatment with a pharmaceutical drug was, however, classified as belonging to the entity category Pharmaceutical Drug.

The definition of Finding was broader than the definition used in other annotation studies, e.g. i2b2 [25], and more closely followed the definition in SNOMED CT, as also non-pathological, medically relevant findings were included. The guidelines did, however, agree with the i2b2 guidelines [25] in that only findings that were explicitly stated (e.g. high blood pressure) were included, whereas test measures (e.g. blood pressure 145/95) were not annotated.

^3 The complete guidelines are available at http://dsv.su.se/health/guidelines/.
^4 In Swedish: Patienten känner en kraftigt huggande smärta i vänster knä.
^5 In Swedish: diabetesklinik.
^6 In Swedish: hjärtsjukdom.


Table 2
NER results for previous studies. n is the number of training instances.

CLinical E-Science Framework: Roberts et al. [24]; SVM (10-fold)
  Condition         n = 739     Precision 0.82   Recall 0.65
  Drug or Device    n = 272     Precision 0.83   Recall 0.59
  Locus             n = 490     Precision 0.80   Recall 0.62
Wang: Wang [18]; CRF (10-fold)
  Finding           n = 4741    Precision 0.83   Recall 0.79
  Substance         n = 2249    Precision 0.92   Recall 0.89
  Body              n = 735     Precision 0.72   Recall 0.64
Wang: Wang and Patrick [10]; Combining CRF, SVM and ME (10-fold)
  Finding           n = 4741    Precision 0.84   Recall 0.82
  Substance         n = 2249    Precision 0.92   Recall 0.88
  Body              n = 735     Precision 0.76   Recall 0.66
i2b2/VA challenge on concepts: Jiang et al. [12]; Combining 4 CRF models (Separate evaluation set)
  Medical Problem   n = 11,968  Precision 0.87   Recall 0.84
i2b2/VA challenge on concepts: de Bruijn et al. [11]; Semi-Markov HMM (Separate evaluation set)
  Medical concepts (including Problem)   n = 27,837   Precision 0.87   Recall 0.84
i2b2 Medication challenge: Patrick and Li [22]; CRF
  Medication Names  n = –       Precision 0.91   Recall 0.86
i2b2 Medication challenge: Doan et al. [27]; Terminology matching
  Medication Names  n = –       Precision 0.85   Recall 0.87
i2b2 Medication challenge: Doan et al. [26]; CRF, SVM, terminology
  Medication Names  n = –       Precision 0.94   Recall 0.90
Previous Swedish study: Kokkinakis and Thurin [13]; Terminology matching
  Disease           n = –       Precision 0.98   Recall 0.87
  Drug              n = –       Precision 0.95   Recall 0.93
Ogren et al.: Savova et al. [14]; Terminology matching
  Disorder          n = –       Precision 0.80   Recall 0.65

Fig. 1. Used features shown in a constructed example sentence. Each row contains one token and each column contains feature values corresponding to that token. The information marked with boldface is used for predicting the label for the word chestpain. Chestpain (bröstsmärta) is here shown as one compound word in English to mimic the Swedish equivalent. Two separate columns are used for the compound constituents. The last column, Category, shows the IOB-encodings of the annotated entities, which are the desired output labels that the CRF model aims at learning.


PH1 had the role of main annotator, annotating all notes included in the study. A subset of the notes was independently annotated by PH2, and yet another subset was independently annotated by CL (Table 3). To become familiar with the annotation task, PH2 and CL carried out a test annotation on 50 notes. Neither these test annotations, nor the texts annotated by PH1 in the guideline development phase, were included in the constructed corpus. The annotation tool Knowtator [33], a plug-in to Protégé, was used for all annotations. The doubly annotated notes were used for measuring inter-annotator agreement (and thereby the reliability of the annotations [34]), as well as for constructing a reference standard to use in the final evaluation.

Inter-annotator agreement was measured in terms of F-score, as e.g. the frequently used inter-annotator agreement measure kappa cannot be applied to this task, which lacks well-defined negative cases [35]. Disagreements were: entities annotated by only one of the annotators, differences in choice of entity category, and differences in the length of annotated text spans.
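A minimal sketch of such an agreement measure follows (our own illustration, assuming exact span-and-category matching; the study additionally categorised partial disagreements as listed above). The toy annotations are hypothetical.

```python
def agreement_f_score(spans_a, spans_b):
    """spans_*: sets of (start, end, category) annotations; a pair
    agrees only on an identical span with an identical category."""
    if not spans_a or not spans_b:
        return 0.0
    matched = len(spans_a & spans_b)
    precision = matched / len(spans_b)   # B's annotations against A's
    recall = matched / len(spans_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy annotations: (token start, token end, category).
a = {(0, 2, "Finding"), (5, 6, "Disorder")}
b = {(0, 2, "Finding"), (5, 7, "Disorder")}  # span-length disagreement
print(round(agreement_f_score(a, b), 2))     # 0.5
```

Unlike kappa, this measure needs no count of true negatives, which is why it suits span annotation.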


Out of a large subset (25,370 tokens) of the doubly annotated notes shown in Table 3, a sub-corpus was created to use in the evaluation phase of the machine learning study (the Final Evaluation subset). This sub-corpus was compiled by PH1, who resolved each conflicting annotation in the doubly annotated data. A program for presenting and resolving annotations was developed, which presented pairs of conflicting annotations on a sentence level, without revealing who had produced which annotation. PH1 could thereby select one of the presented annotations without knowing who had produced it, thereby minimising bias.

The rest of the annotated corpus (the Development subset) was used for feature selection and as training data for the final model. The properties of each of the two subsets are presented in Table 4.

3.2. Feature selection and evaluation

The most frequently used features among previous studies on English clinical text were evaluated, including terminology matching.


Table 3
Data used for measuring inter-annotator agreement by main annotator (PH1), other physician (PH2) and computational linguist (CL). 90% of this data was also used for constructing the Final Evaluation subset. Entity types is the number of types of entities that were annotated.

Entity category    PH1: entities (types)   PH2: entities (types)   CL: entities (types)
Disorder           766 (354)               329 (174)               355 (214)
Finding            1327 (715)              631 (466)               686 (461)
Pharmaceuticals    636 (249)               282 (119)               262 (143)
Body Structure     275 (112)               117 (67)                101 (65)

Table 4
Number of annotated entities in the Development subset and the Final Evaluation subset. Entity types is the number of types of entities that were annotated.

Entity category    Development: entities (types)   Final evaluation: entities
Disorder           1317 (607)                      681
Finding            2540 (1353)                     1282
Pharmaceuticals    959 (350)                       580
Body Structure     497 (197)                       253
Tokens in corpus   45,482                          25,370


MeSH, ICD-10 and SNOMED CT are available in Swedish, as well as FASS, which includes a list of pharmaceutical drugs used in Sweden [36]. One of the challenges of clinical NER in Swedish is, however, that medical terminologies are less extensive for Swedish than for English, as e.g. SNOMED CT only contains the preferred term for each concept and lacks synonyms. We have previously developed terminology-based systems for Swedish clinical text, detecting the entities Disorder, Finding and Body Structure [37] as well as Pharmaceutical Drug [38]. For the present study, these systems were extended by adding a more extensive version of MeSH [39] as well as a vocabulary list of general Swedish, extracted from the Swedish Parole corpus, containing non-medical language compiled from e.g. newspaper texts and fiction [40]. All semantic classes in SNOMED CT and MeSH were used as feature values, and tokens matching the FASS vocabulary [36] were given the feature value Pharmaceutical Drug. Tokens matching (the descriptions of) ICD-10 codes in chapters 1–17 and 19 (except codes T36–T62.9, which list substances) were assigned the feature value ICD-10-disorder, and tokens matching codes in chapter 18 (listing symptoms and clinical findings) were assigned the value ICD-10-finding. Tokens not found in any of the medical resources, but in the vocabulary from Parole, were assigned the feature value Parole, and tokens not found in any terminology were given the value Unknown. A token matching several semantic classes was assigned a feature value according to the priority stated in the annotation guidelines (e.g. the SNOMED CT category Qualifier having the highest priority, followed by Body Structure, Disorder, Finding and Pharmaceutical). As described in the previous terminology matching study [37], the SNOMED CT list of body structures was stop word filtered.
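A minimal sketch of this priority-based feature assignment follows (our own illustration; the toy lexicon contents are hypothetical stand-ins for SNOMED CT, MeSH, FASS, ICD-10 and Parole).

```python
# Hypothetical toy lexicons standing in for the terminologies used.
LEXICONS = {
    "Qualifier": {"left", "acute"},
    "Body Structure": {"knee", "liver"},
    "Disorder": {"diabetes", "copd"},
    "Finding": {"pain", "dizziness"},
    "Pharmaceutical Drug": {"waran", "cortisone"},
    "Parole": {"patient", "has"},        # general (non-medical) Swedish
}

# Priority used when a token matches several semantic classes.
PRIORITY = ["Qualifier", "Body Structure", "Disorder", "Finding",
            "Pharmaceutical Drug", "Parole"]

def terminology_feature(token):
    token = token.lower()
    for semantic_class in PRIORITY:
        if token in LEXICONS[semantic_class]:
            return semantic_class
    return "Unknown"                     # not found in any resource

print(terminology_feature("knee"))       # Body Structure
print(terminology_feature("chestpain"))  # Unknown -> compound splitting
```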

Another particular challenge for NER in Swedish text compared to English text is the frequent occurrence of compound words. Compound word splitting was therefore added as an additional feature not used in previous clinical NER studies. We implemented a simple compound splitting, dividing words into a maximum of two word constituents if at least one of the constituents was found in one of the vocabulary lists and the other constituent was found when applying fuzzy matching. The fuzzy matching, which used a maximum string distance of one, was motivated by the fact that Swedish compounds are sometimes constructed with an 's' binding the two constituents, and sometimes constructed by the removal or change of the last vowel in the first constituent. The compound splitting favoured two equally long constituents over one short and one long constituent. The minimum length of a constituent was set to four letters. Compound splitting was applied on tokens not found in any of the vocabulary lists used. If it was possible to divide a token, the constituents as well as the semantic classes found by the terminology matching for the constituents were used as features.
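A minimal sketch of such a splitter follows (our own illustration, not the authors' implementation; the toy vocabulary and the simplified end-of-word edit-distance check are assumptions, and a full Levenshtein computation would cover edits at any position).

```python
VOCABULARY = {"hjärta", "sjukdom", "diabetes", "klinik"}

def within_one_edit(word, vocab):
    # Crude distance-one check: exact match, a trailing character
    # added or removed (e.g. the binding 's'), or the last character
    # substituted (e.g. a changed final vowel).
    return any(word == v or word[:-1] == v or word == v[:-1] or
               (len(word) == len(v) and word[:-1] == v[:-1])
               for v in vocab)

def split_compound(token):
    candidates = []
    for i in range(4, len(token) - 3):        # both halves >= 4 letters
        head, tail = token[:i], token[i:]
        one_exact = head in VOCABULARY or tail in VOCABULARY
        both_found = (within_one_edit(head, VOCABULARY) and
                      within_one_edit(tail, VOCABULARY))
        if one_exact and both_found:
            # Prefer the most balanced split.
            candidates.append((abs(len(head) - len(tail)), head, tail))
    return min(candidates)[1:] if candidates else None

print(split_compound("hjärtsjukdom"))         # ('hjärt', 'sjukdom')
```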

A third challenge is that Swedish is more inflective than English, increasing the importance of morphologic normalisation. The Swedish lemmatiser Granska [41] was therefore applied on the texts to obtain lemma forms of tokens to use as features for the machine learning, as well as for being able to use normalised forms for the terminology match. Granska was also used for obtaining part-of-speech information.

The following features were added one by one, in the following order: (1) The current token. (2) The lemma form of the current token. (3) The lemma form of surrounding tokens. (4) Part-of-speech of the current token and surrounding tokens. (5) The output of the terminology-based system for the current token and surrounding tokens. (6) Compound splitting for the current token. (7) Orthographic features, i.e. initial upper case, all upper case or no upper case for the current token (Fig. 1).
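As an illustration of how such features can be serialised for CRF++, the following sketch (an assumption about the data layout, not the authors' code) writes one training row per token; CRF++ reads whitespace-separated feature columns with the output label last, while which columns and relative positions are used is declared separately in a template file.

```python
def case_feature(token):
    # Orthographic feature (7): all upper, initial upper, or no upper.
    if token.isupper():
        return "ALL_UPPER"
    if token[:1].isupper():
        return "INIT_UPPER"
    return "NO_UPPER"

def crf_row(token, lemma, pos, term_class, constituents, iob_label):
    # One CRF++ training line: token, lemma, part-of-speech,
    # terminology class, the two compound constituents (or NONE),
    # the case feature, and finally the IOB label. Window offsets
    # such as %x[-1,1] (previous token's lemma column) then go in
    # the template file.
    head, tail = constituents if constituents else ("NONE", "NONE")
    return " ".join([token, lemma, pos, term_class, head, tail,
                     case_feature(token), iob_label])

# Hypothetical token from Fig. 1's constructed example.
print(crf_row("bröstsmärta", "bröstsmärta", "NN", "Unknown",
              ("bröst", "smärta"), "B-Finding"))
```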

The use of an increasingly larger window size was evaluated for features (2)–(5), as well as fuzzy terminology matching. Features and window sizes that led to an improved result were retained, whereas the others were not used. For the best feature combination, the CRF++ regularisation hyper-parameter, for balancing between under- and over-fitting, was also varied. For all experiments, L2-regularisation was used.
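A minimal sketch of such a hyper-parameter sweep, assuming CRF++'s crf_learn/crf_test command line tools and hypothetical file names (the authors' exact pipeline is not described at this level):

```python
import subprocess

# Train one CRF++ model per integer value of the regularisation
# hyper-parameter C (with L2 regularisation), then tag the held-out
# fold; each tagged output is subsequently scored for F-score.
for c in range(1, 12):
    subprocess.run(["crf_learn", "-a", "CRF-L2", "-c", str(c),
                    "template", "train.data", f"model_c{c}"],
                   check=True)
    with open(f"output_c{c}.txt", "w") as out:
        subprocess.run(["crf_test", "-m", f"model_c{c}", "test.data"],
                       stdout=out, check=True)
```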

We chose to use 30-fold cross validation on the Development subset for selecting features. This is more time-consuming than using the more standard approach of 10-fold cross validation, but has the advantage that a larger proportion of the Development subset is used as training data in each fold, thereby more closely resembling the situation that is being optimised for, that the entire Development subset is available as training data.

As the features were added incrementally, with each feature that improved the results being retained, figures from this evaluation could not determine how much each individual feature contributed to the results. Therefore, models were constructed in which one feature type at a time was removed while retaining all other features. 30-fold cross validation was also used for the evaluation of these models, which were constructed with the best settings and all the best features, except the feature type whose contribution was to be evaluated.

3.3. Final evaluation on held-out data

In order to obtain the final results, we evaluated how well a model trained using the best features would perform in a deployment setting. A CRF model was therefore trained on the entire Development subset using the best features (i.e. the features for which the best results were achieved in the feature selection process), and this model was then evaluated on the previously unused Final Evaluation subset.


Table 5
Inter-annotator agreement scores. PH2 shows the inter-annotator agreement between PH1 and PH2. CL shows the inter-annotator agreement between PH1 and CL. Disorder + Finding shows the results when merging the two classes Disorder and Finding into one class, and – num. val. means the agreement when removing numerical values from annotations for the class Finding. The average results were also measured.

                                  PH2 F-score   CL F-score   Average F-score
Disorder                          0.77          0.80         0.79
Finding                           0.58          0.73         0.66
Pharmaceutical Drug               0.88          0.92         0.90
Body Structure                    0.80          0.80         0.80
Disorder + Finding                0.72          0.84         0.78
Finding – num. val.               0.61          0.73         0.67
Disorder + Finding – num. val.    0.75          0.84         0.80

3.4. Error analysis

As a final step, an error analysis was carried out. The error analysis was performed on NER classifications obtained when applying 30-fold cross-validation with the best features on the Development subset. Using the Development subset means that it is also possible to treat the Final Evaluation subset as unseen data in future studies. A manual error analysis was carried out by PH1 for the classes Disorder, Drug and Body Structure, while only errors due to incorrect span and to confusion between classes were measured for Finding (due to the large number of instances for this class).

Table 6
The average results for all four categories, compared to the average results for terminology match (macro average over the four classes Disorder, Finding, Pharmaceutical Drug and Body Structure). The results for removing one feature type at a time, as well as for using the inflected tokens instead of the lemmatised forms, are also shown.

Evaluated configuration                    Precision   Recall   F-score
Baseline: Terminology match                0.700       0.582    0.623
Best average CRF settings                  0.832       0.759    0.794
Without lemmatisation                      0.823       0.741    0.780
Best settings – current lemma              0.758       0.633    0.688
Best settings – previous lemma             0.826       0.758    0.790
Best settings – part-of-speech tagging     0.830       0.751    0.788
Best settings – terminology match          0.848       0.688    0.759
Best settings – compound splitting         0.837       0.742    0.786
Best settings – orthographics              0.832       0.755    0.791

4. Results

The results consist of inter-annotator agreement scores for the annotated corpus, as well as results for the feature selection and for the NER model evaluated on held-out data. The results were measured using precision, recall and F-score, all calculated with the CoNLL 2000 script [42].
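For reference, the CoNLL-2000 evaluation script scores entity-level exact matches from a file with one token per line, the gold IOB tag and the predicted tag last; a minimal sketch of preparing such input (with hypothetical tokens) follows.

```python
# Write one token per line: token, gold IOB tag, predicted IOB tag,
# whitespace-separated; a blank line ends a sentence. The file can
# then be scored with the CoNLL-2000 script, e.g.:
#   perl conlleval.pl < eval_input.txt
rows = [
    ("pat", "O", "O"),
    ("has", "O", "O"),
    ("chestpain", "B-Finding", "B-Finding"),
]
with open("eval_input.txt", "w", encoding="utf-8") as f:
    for token, gold, predicted in rows:
        f.write(f"{token} {gold} {predicted}\n")
    f.write("\n")
```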

4.1. Corpus annotation

Inter-annotator agreement between the physicians (PH1 and PH2), between the main physician annotator and the computational linguist (PH1 and CL), and the average results are shown in Table 5 for the four categories, as well as for Disorder and Finding merged into one class. Annotations for the entity Pharmaceutical Drug had the highest agreement for both pairs of annotators, and there was also a relatively high agreement for annotations of Body Structure and Disorder, while the agreement for Finding was lower. Finding was also the only category for which there was a large difference between the two pairs of annotators, with higher agreement between PH1 and CL than between PH1 and PH2. When automatically removing all annotated Findings containing a number for PH2 and CL (to make them better comply with the guidelines), the agreement between the two physicians increased from 0.58 to 0.61, but still remained lower than the agreement between PH1 and CL.

In the Development subset, there were a total of 27 unique entity types sometimes annotated as a Disorder and sometimes as a Finding: this was 3% of the total number of unique entity types among annotated Disorders and Findings. Among these, some were equally frequently annotated as a Disorder and as a Finding (e.g. stress and muscle pain), whereas some were more often classified as a Finding (e.g. dizziness/vertigo and tachycardia). In accordance with the annotation guidelines, these entities were annotated as a Finding when they were symptoms of another disorder, and as a Disorder when they were the main medical problem described in the Assessment field.

4.2. Feature selection and evaluation

Feature selection was performed through 30-fold cross-validation on the Development subset, and by selecting features that maximised the macro-averaged F-score over the four entity categories. The following features and window sizes gave the best average results: (1) Lemma forms for the current token and the previous token. (2) Part-of-speech for the current token, the following token and the two previous tokens. (3) Terminology match for the current token and the previous token. (4) The compound splitting features for the current token. (5) Orthographic features for the current token.

The features used when obtaining the best results are illustrated with boldface in Fig. 1. Fuzzy terminology matching, using a Levenshtein distance of one, as well as larger window sizes for lemma, part-of-speech and terminology match, were evaluated, which, however, gave slightly lower results.

The best average results using the selected features are shown in boldface in Table 6, while the boldfaced figures in Tables 7–10 show the results for each individual category. Results were high above the baseline for all categories, with the smallest difference for Pharmaceutical Drug, which had the best baseline. Tables 6–10 also show the extent to which each individual feature type contributed to the best results, by presenting results when one feature type at a time was removed, while all other features were retained. That is, e.g. Best settings – orthographics shows the results when all the best features except orthographics were used. Without lemmatisation shows the results when using the current and previous token instead of the current and previous lemma. The evaluation showed that the current lemma was the most important feature for all categories, while terminology matching was the second most important feature, except for the category Finding, for which lemmatisation was more important. All other features played a minor role.

Results are presented for a CRF++ hyper-parameter of 6, for which the best results were achieved when giving the parameter integer values between 1 and 11 (with the best average F-score of 0.794 at 6 and the lowest F-score of 0.779 at 1).

4.3. Final evaluation on held-out data

The best parameter and feature settings were also used when training a model on the entire Development subset. The results when evaluating this model on the held-out Final Evaluation subset are shown in Table 11. All categories were recognised with an F-score that is in line with the average inter-annotator agreement scores, and the final results were also very close to those achieved on the Development subset during feature selection, which shows that the results obtained during development generalise well on unseen data.

4.4. Error analysis

The results of the error analysis are shown in Tables 12 and 13. Confusions between entity categories (especially between Disorder and Finding), manual annotation errors, an incorrect span and borderline cases (i.e. when it was not evident whether the CRF classifier or the human annotator was correct) were identified error types among false positives as well as among false negatives. An entity span that was longer than the annotated span was counted as a false positive, whereas a too short span was classified as a false negative. Abbreviations were also a source of both false positives and negatives, as were compound words. A compound word that was incorrectly classified as belonging to an entity category had either as one of its constituents a word belonging to that category (e.g. lyme-disease-test^7 and liver-values^8), or had as one of its constituents a word that frequently occurred in compound words of this category (e.g. COPD-treatment,^9 which was misclassified as a Drug because of the constituent treatment). Compound words were also frequent among false negatives, both compounds of full-length words and compounds with an abbreviation as one of their constituents. Other false negatives were inflected words, which the lemmatiser had failed to lemmatise, misspellings or spelling variants and entities expressed with jargon or non-standard language. Among the false negatives for the category Drug grouped into Incorrect: No other reason were expressions for groups of medicines, for instance Cortisone.

^7 In Swedish: borreliaprov.
^8 In Swedish: levervärden.
^9 In Swedish: KOL-behandling.

5. Discussion

A comparison between the final results of the created NER model and results of previous clinical NER studies answers the research question of whether previous approaches for English are applicable on Swedish clinical text. The achieved NER results, as well as an analysis of the confusion between categories for different annotators and for the NER classifier, also answer the research question as to what extent it is possible to separate the category Medical Problem into the two more granular entity categories Disorder and Finding.

5.1. Application of NER methods on Swedish clinical text

Apart from the fact that guidelines for handling compound words are needed, entity annotation of Swedish clinical text does not pose any evident challenges additional to those posed by entity annotation of English text. It is, therefore, not surprising that there are no systematic differences between the inter-annotator agreement figures reported here and the results from previous English studies. Our reported inter-annotator agreement for Pharmaceutical Drug (average F-score 0.90) was better than the average agreements for the category Drug or Device reported by Roberts et al. (F-scores 0.84, 0.32 and 0.59 for three different document types) but slightly lower than the agreement reported by Wang and Patrick for the comparable category Substance (F-score 0.95). Also for Body Structure (average F-score 0.80), the agreement presented here was slightly higher than the figures reported for the category Locus by Roberts et al. (F-scores 0.78, 0.75 and 0.71), but lower than the agreement for Body presented by Wang and Patrick (F-score 0.85). For the combined category Disorder + Finding, our agreement (F-score 0.78, or 0.80 when numerical values were removed) was lower than the agreement for the corresponding categories Clinical Condition reported by Chapman et al. (F-score 0.92) and Finding reported by Wang and Patrick (F-score 0.91). The agreement figures for Condition, reported by Roberts et al. (F-scores 0.81, 0.77, 0.67), however, closely match or are lower than our agreement for Disorder + Finding. Also our inter-annotator agreement for the separate category Disorder (F-score of 0.79) was slightly higher than the agreement reported by Ogren et al. for Disorder (F-score 0.76). Finally, it can be noted that the similarity in inter-annotator agreement figures between the pairs of physicians, and the physician versus the computational linguist, is also in accordance with results from previous studies [16,17].

For NER of the annotated entities, on the other hand, differences between English and Swedish and between available terminological resources might affect the results. Fig. 2 gives an overview of NER results from a number of previous studies and compares them to the results achieved here. The results achieved in these previous studies vary, which e.g. could be attributed to variations in the exact definitions of entities, to the type of clinical text that is studied, as well as to the size of the used training data. The best results among these studies for recognising the categories Disease and Drug were achieved by Kokkinakis and Thurin [13] with a rule- and terminology-based method using a restricted vocabulary list. These results, which are better than the results obtained here, can probably be partly explained by their study being carried out on discharge summaries, which are the opposite in terms of style to e.g. Assessment fields. The other rule- and terminology-based system evaluated on discharge summaries (by Doan et al. [26]) obtained slightly lower results for Medication Names than we obtained for the entity Pharmaceutical Drug, and the rule- and terminology-based system evaluated on the corpus created by Ogren et al. (which contains several different clinical text types [14,28]) achieved lower results for Disorder than what was obtained here.

There is also a machine learning study by Patrick and Li [22], obtaining results for Medication Names that are very similar to the results we achieved for Pharmaceutical Drug, despite the fact that the study by Patrick and Li was conducted solely on discharge summaries. Doan et al. [26] showed, however, that better results can be achieved using the same corpus; results that exceed those achieved here. Also the i2b2 2010 challenge corpus consists to a large extent of discharge summaries, and the entity Medical Problem was recognised by Jiang et al. [12] with higher precision and recall than we obtained for the comparable category combination Disorder + Finding. The fact that the i2b2 2010 challenge corpus contains a considerably larger set of annotated entities probably had additional positive effects on the results.

The two English NER studies that are most similar to the present study, in terms of number of available annotated entities and also in terms of clinical text types, were conducted by Wang and Patrick [10] and by Roberts et al. [24]; Wang and Patrick using intensive care service progress notes and Roberts et al. using a number of different clinical text types. For entity categories similar to Disorder + Finding and Pharmaceutical Drug, Wang and Patrick had a somewhat larger set of annotated entities and also achieved somewhat better results than presented here, while the opposite holds true for Roberts et al., who used a smaller set of annotated data and achieved lower results. For Body Structure, however, both Wang and Patrick and Roberts et al. present lower results than those achieved here, for Wang and Patrick possibly since nested annotations were allowed, making the Body entity more difficult to recognise.

It can also be noted that, similar to previous studies, terminology matching proved to be an important feature, while in contrast to most previous studies, a small window size gave the best results.


Table 7
The results for the category Disorder for the CRF model settings giving the best average results, compared to the Disorder results for the baseline obtained by terminology matching. The results for removing one feature type at a time, as well as for using the inflected tokens instead of the lemmatised forms, are also shown.

Evaluated configuration                    Precision   Recall   F-score
Baseline: Terminology match                0.677       0.574    0.621
Best average CRF settings                  0.853       0.762    0.805
Without lemmatisation                      0.841       0.743    0.789
Best settings – current lemma              0.739       0.640    0.686
Best settings – previous lemma             0.852       0.765    0.806
Best settings – part-of-speech tagging     0.858       0.763    0.808
Best settings – terminology match          0.884       0.658    0.755
Best settings – compound splitting         0.860       0.746    0.799
Best settings – orthographics              0.855       0.762    0.806

Table 8
The results for the category Finding for the CRF model settings giving the best average results, compared to the Finding results for the baseline obtained by terminology matching. The results for removing one feature type at a time, as well as for using the inflected tokens instead of the lemmatised forms, are also shown.

Evaluated configuration                    Precision   Recall   F-score
Baseline: Terminology match                0.574       0.312    0.404
Best average CRF settings                  0.737       0.636    0.683
Without lemmatisation                      0.720       0.608    0.659
Best settings – current lemma              0.610       0.401    0.484
Best settings – previous lemma             0.734       0.633    0.680
Best settings – part-of-speech tagging     0.736       0.615    0.670
Best settings – terminology match          0.735       0.611    0.668
Best settings – compound splitting         0.737       0.615    0.671
Best settings – orthographics              0.736       0.632    0.680

Table 9
The results for the category Drug for the CRF model settings giving the best average results, compared to the Drug results for the baseline obtained by terminology matching. The results for removing one feature type at a time, as well as for using the inflected tokens instead of the lemmatised forms, are also shown.

Evaluated configuration                    Precision   Recall   F-score
Baseline: Terminology match                0.918       0.702    0.795
Best average CRF settings                  0.900       0.832    0.865
Without lemmatisation                      0.898       0.822    0.858
Best settings – current lemma              0.879       0.782    0.828
Best settings – previous lemma             0.906       0.818    0.860
Best settings – part-of-speech tagging     0.902       0.827    0.863
Best settings – terminology match          0.891       0.795    0.840
Best settings – compound splitting         0.904       0.810    0.855
Best settings – orthographics              0.902       0.821    0.860

Table 10
The results for the category Body Structure for the CRF model settings giving the best average results, compared to the Body Structure results for the baseline obtained by terminology matching. The results for removing one feature type at a time, as well as for using the inflected tokens instead of the lemmatised forms, are also shown.

Evaluated configuration                    Precision   Recall   F-score
Baseline: Terminology match                0.618       0.739    0.673
Best average CRF settings                  0.836       0.808    0.822
Without lemmatisation                      0.832       0.794    0.812
Best settings – current lemma              0.805       0.711    0.755
Best settings – previous lemma             0.812       0.816    0.814
Best settings – part-of-speech tagging     0.824       0.798    0.811
Best settings – terminology match          0.881       0.688    0.773
Best settings – compound splitting         0.849       0.796    0.821
Best settings – orthographics              0.836       0.804    0.819

Table 11
The final results of the NER model. The model was trained on the Development subset, using the best features, and thereafter evaluated on the Final Evaluation subset. Disorder + Finding shows the result for the two categories Finding and Disorder merged into one category. A 95% confidence interval is calculated [43, pp. 91–92, 94–96].

                     Precision      Recall         F-score
Disorder             0.80 (±0.03)   0.82 (±0.03)   0.81
Finding              0.72 (±0.03)   0.65 (±0.03)   0.69
Drug                 0.95 (±0.02)   0.83 (±0.03)   0.88
Body Structure       0.88 (±0.04)   0.82 (±0.05)   0.85
Disorder + Finding   0.80 (±0.02)   0.76 (±0.02)   0.78
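The exact interval computation follows [43]; a normal-approximation interval for a proportion, which reproduces the magnitude of the reported half-widths, can be sketched as follows (the denominator is illustrative, here the 681 annotated Disorders from Table 4 as the recall denominator).

```python
import math

def ci95(p, n):
    """95% normal-approximation confidence interval for a proportion
    p observed over n trials (e.g. recall over the gold entities)."""
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Recall 0.82 over 681 gold Disorder entities: half-width ~0.03,
# matching the ±0.03 reported in Table 11.
print(tuple(round(x, 3) for x in ci95(0.82, 681)))  # (0.791, 0.849)
```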


That a small window size was optimal might be explained by the training data not being large enough for the number of features that is generated when many feature types with a large number of possible values are included.

In summary, the results presented here are in general in line with or slightly lower than the results for the previous studies presented. This difference is, however, probably not primarily caused by the study being conducted on Swedish texts, but could be attributed to the fact that we used a wide definition of entities (as non-pathological findings and pharmaceuticals expressed in general terms were included), a clinical text type for which a highly telegraphic language is used and in which a large variety of disorders and findings occur, and that we had a smaller number of entities in our training data compared to some previous studies. The error analysis shows, however, that some of the identified errors belong to error types that are unlikely to occur in English text. Some of the false negatives caused by entities being expressed with jargon, spelling variants and abbreviations might have been avoided with more extensive terminology resources, which are available for English. There were also some errors caused by the lemmatiser failing to lemmatise inflected words, which indicates that an adaptation of the lemmatiser to the medical domain is needed. Compound words were frequent among false negatives, including compounds with an abbreviation as one of their constituents. Compound splitting improved the average recall by 1.7 percentage points, but the implemented compound splitter was dependent on finding a constituent in either the Parole corpus or in a medical terminology, which had the effect that compounds containing abbreviations or medical terms not included in a terminology were not split. Also the heuristic of having a minimum length of four letters for a constituent prevented compound splitting of words containing an abbreviation. However, compound splitting also reduced the average precision by 0.5 percentage points, indicating that an improved splitting is not enough, but has to be combined with, e.g., a larger training set. The false positives for compound words also show that the entity recognition task defined here might be somewhat more difficult than the task defined for previous English studies, as e.g. the word liver in liver-values would be defined as a Body Structure in previous English studies, whereas we did not define constituents of compound words as belonging to an entity category.

5.2. Division into the two categories Disorder and Finding

The inter-annotator agreement scores, as well as the results from the NER model, show that the two categories Disorder and Finding are more difficult to differentiate than the other categories. The agreement for the category Finding was low for both pairs of annotators, 0.58 for PH1-PH2 and 0.73 for PH1-CL, while the agreement for Disorder was higher, with an F-score of 0.77 for PH1-PH2 and 0.80 for PH1-CL. Merging the two classes Disorder and Finding resulted in higher agreement between the two physicians (F-score 0.72) than would be the case for a weighted average of the two classes (F-score 0.65). For the physician and the computational linguist, the result of merging the classes had an even larger effect, with an F-score of 0.84. These figures show that disagreements between annotators were often with regard to which of these two categories to use. However, in the Development subset, annotated by the main annotator, only 3% of the unique entity types among annotated Disorders and Findings were annotated as Disorders in some contexts and as Findings in other contexts, which supports the meaningfulness of this more granular division.
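The pairwise agreement figures can be thought of as an F-score with one annotator treated as the gold standard [35]. A minimal sketch, assuming annotations are (start, end, label) spans, of how merging Disorder and Finding raises agreement:

def f_score_agreement(gold, other, merge=frozenset()):
    """Pairwise agreement as F-score, optionally merging a set of labels."""
    def norm(span):
        start, end, label = span
        return (start, end, "Merged" if label in merge else label)
    gold_set = {norm(s) for s in gold}
    other_set = {norm(s) for s in other}
    tp = len(gold_set & other_set)
    precision, recall = tp / len(other_set), tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

ann1 = [(0, 4, "Disorder"), (10, 17, "Finding"), (20, 26, "Drug")]
ann2 = [(0, 4, "Finding"), (10, 17, "Finding"), (20, 26, "Drug")]
print(f_score_agreement(ann1, ann2))                                 # 0.67: one label disagreement
print(f_score_agreement(ann1, ann2, merge={"Disorder", "Finding"}))  # 1.0: disagreement disappears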



Table 12
False positives. The first four classes are confusions between the different entity types. Span error means that the model classified a longer span than what had been annotated. Manual annotation error is an evident case of a mistake made by the human annotator, whereas Borderline case is a case for which it is not evident if the model or the annotator is correct. Abbreviation is an abbreviated word as a false positive. Compound word with one relevant stem is a compound word that does not belong to the category, but which includes a stem belonging to the category or that is a common stem in words belonging to that category. Location is a physical location, such as the abbreviated word infection for department of infection, and the remaining false positives were classified as Incorrect, no other reason. nm stands for not measured.

Error type                                         Classified by the model as:
                                                   Disorder  Finding  Drug  Body Structure
Annotated as disorder                              –         23%      5%    14%
Annotated as finding                               48%       –        7%    15%
Annotated as drug                                  1%        3%       –     1%
Annotated as body structure                        1%        1%       0%    –

Manual annotation error                            8%        nm       49%   56%
Span error                                         12%       13%      0%    1%
Borderline case                                    2%        nm       3%    3%

Incorrect: Abbreviation                            4%        nm       11%   0%
Incorrect: Compound word with one relevant stem    11%       nm       8%    7%
Incorrect: Location                                5%        nm       0%    0%
Incorrect: No other reason                         9%        nm       17%   1%

Total number of false positives
(classification instances)                         138       389      75    71

Table 13
False negatives. The first four classes are confusions between the different entity types. Span error means that the model classified a shorter span than what had been annotated. Manual annotation error is an evident case of a mistake made by the human annotator, whereas Borderline case is a case for which it is not evident if the model or the annotator is correct. Abbreviation is an abbreviated word that was not recognised, Compound word is a compound word that was not recognised and Compound word with an abbreviation is a compound word for which one of its constituents is an abbreviation. Lemmatisation failed is when the word was inflected and a probable reason for it not being detected is that the lemmatiser failed to lemmatise it correctly, Misspelling or spelling variant is when an alternative spelling of a term was used or when a term was misspelled, Jargon is when a non-standard version of a term was used and the remaining false negatives were classified as Incorrect, no other reason. nm stands for not measured.

Error type                                       Annotated as:
                                                 Disorder  Finding  Drug  Body Structure
Incorrectly classified as disorder               –         9%       4%    2%
Incorrectly classified as finding                31%       –        8%    9%
Incorrectly classified as drug                   1%        1%       –     0%
Incorrectly classified as body structure         3%        1%       1%    –

Manual annotation error                          4%        nm       0%    2%
Span error                                       10%       18%      5%    5%
Borderline case                                  3%        nm       3%    0%

Incorrect: Abbreviation                          5%        nm       7%    2%
Incorrect: Compound word                         7%        nm       35%   27%
Incorrect: Compound word with an abbreviation    7%        nm       8%    0%
Incorrect: Lemmatisation failed                  6%        nm       2%    6%
Incorrect: Misspelling or spelling variant       5%        nm       2%    2%
Incorrect: Jargon                                8%        nm       3%    1%
Incorrect: No other reason                       9%        nm       22%   43%

Total number of false negatives
(annotation instances)                           292       808      155   93



The error analysis of the NER model shows that in many of the cases for which the system failed to detect a Disorder, it had instead classified the entity as a Finding (31% of the false negatives for Disorder), and vice versa (9% of the false negatives for Finding). Whether it is meaningful to divide Medical Problem into the two more granular categories Disorder and Finding depends on the intended application of the NER system. An automatically generated high-level patient summary might, for instance, place high demands on recall for the category Disorder. This means that recognised entities classified as belonging to the category Finding (or entities for which the model is uncertain whether to classify them as a Finding or a Disorder) also ought to be included in the summary, thereby boosting recall for included Disorders. The results achieved, i.e. recognising the category Disorder with a precision of 80% and Finding with a precision of 72%, are, however, likely to be high enough for it to be meaningful to make a distinction between these categories when mining for new medical knowledge (e.g. since known co-morbidities have been successfully extracted from structured data that contains inaccuracies [8,44]). Having access to a NER system that distinguishes between disorders and findings makes it possible to separately mine for co-morbidity relations and disorder-finding relations.
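As a purely illustrative sketch of such mining (not part of the present study), co-morbidity candidates could be ranked by counting how often pairs of recognised disorders co-occur in the same record, in the spirit of [8,9]; the input format and the example terms are assumed:

from collections import Counter
from itertools import combinations

# Assumed input: one set of recognised Disorder entities per patient record.
records = [
    {"diabetes", "hypertoni"},
    {"diabetes", "hypertoni", "KOL"},
    {"KOL"},
]

comorbidity = Counter()
for disorders in records:
    for pair in combinations(sorted(disorders), 2):
        comorbidity[pair] += 1

print(comorbidity.most_common(1))  # [(('diabetes', 'hypertoni'), 2)]

Replacing one of the two entity sets with recognised Findings would, analogously, yield disorder-finding co-occurrence counts.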

5.3. Limitations

The inter-annotator agreement shows how reliable the evaluation data is for measuring the performance of the evaluated system, and gives an indication of the difficulty of recognising the four entities, especially of their relative difficulty. The agreement scores cannot, however, be used as an absolute upper ceiling for the performance of the NER system, since the system mimics the behaviour of one single annotator, which might be easier than agreeing on how to annotate given the annotation guidelines.



[Fig. 2 appears here in the original: F-score (y-axis, 0.7–0.9) plotted against the number of training instances (x-axis, 0–12,000), with points for Disorder, Disorder + Finding, Pharmaceutical Drug and Body Structure from the present study, alongside points for Savova et al., Kokkinakis and Thurin, Doan et al., Patrick and Li, Jiang et al., Wang and Patrick, and Roberts et al., including rule-based systems and systems with an unknown number of training instances.]

Fig. 2. Comparison to a number of previous clinical NER studies. Results from the same study are connected with a line: a solid line for the present study and dashed lines for previous studies. The first column from the left shows the results of three rule-based studies, and the second column shows the results of two machine learning studies for which the number of training instances used was not reported. The rest of the diagram shows the results of a number of machine learning studies for which the number of training instances was reported. The entity names of the present study are used for denoting comparable entity types in previous studies.



Another limitation concerns the involvement of the authors of this paper in the annotation and feature experiments. PH1 and PH2 might have been biased towards producing annotations which they suspected would be easier for the NER system to detect. However, as neither of these two annotators was involved in the development of the NER system, it is unlikely that they knew enough about the system for this possible bias to have any large effect on the results. CL, on the other hand, who annotated half of the evaluation data, was the main person responsible for the NER experiments. However, as only PH1 was involved in the construction of the final version of the evaluation data, the risk that CL biased the evaluation data is likely to be a minor one.

5.4. Future directions

Future work includes further improving the NER system (e.g. by further adapting it to clinical Swedish through improved compound splitting and lemmatisation), as well as adapting the system to other clinical text types. As the relatively small size of the training data might have influenced the results, a possible future direction could be to provide more annotated data. However, since annotating data is costly, a more important contribution would be to study how the results can be improved, or how the constructed model can be applied to another clinical domain, with a minimum of additional data. This could, for instance, be achieved by using features from unsupervised methods [11] or by cleverly selecting the data to annotate using active learning [45].
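A minimal sketch of the least-confidence flavour of active learning [45]; the confidence function here is a toy stand-in for the probability a trained CRF assigns to its best label sequence:

def select_for_annotation(sentences, confidence, k=2):
    """Return the k sentences the model is least confident about."""
    return sorted(sentences, key=confidence)[:k]

# Toy stand-in scorer: shorter sentences get higher (fake) confidence.
toy_confidence = lambda sentence: 1.0 / (1.0 + len(sentence.split()))

pool = ["pat u.a.", "krea stigande sedan igår", "besvärsfri",
        "ont i vä arm och dyspné vid ansträngning"]
print(select_for_annotation(pool, toy_confidence))  # the two longest sentences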

6. Conclusion

This study has shown that clinical NER methods previously applied to English are successful also on Swedish clinical text. The category Disorder was recognised with an F-score of 0.81; Finding with an F-score of 0.69; Drug with an F-score of 0.88; Body Structure with an F-score of 0.85; and the combination Disorder + Finding with an F-score of 0.78. These results are in line with, or slightly lower than, published results for similar studies on English texts. The slightly lower results achieved here can probably to a large extent be explained by the smaller size of the training data, by the fact that our study was conducted on clinical text extracted from Assessment fields, and by the wide definitions of the entity categories that were used. A small proportion of the errors made by the NER system were, however, errors less likely to occur in an English text, such as the lemmatiser failing to lemmatise medical terms, and errors caused by compounding and by compounds containing an abbreviated constituent. The smaller size of the available Swedish vocabularies might also have affected the results.

The study has also shown that a distinction between the two more granular categories Disorder and Finding is sometimes difficult to make, but that the NER results for the two separate categories are high enough for this separation to be meaningful for some applications. A NER system separating disorders and findings could, for instance, be used for knowledge mining of co-morbidity relations and of disorder-finding relations.

Authors’ contributions

MS was responsible for the overall design of the study and for the NER part, while MK was responsible for the annotation part. MK developed the annotation guidelines with assistance from MS, and GN revised them. MK was the main annotator, while MS and GN annotated subsets of the data. MK also carried out the error analysis. MS designed and carried out the NER experiments, with feedback from HD. HD drafted parts of the background, while MS drafted the rest of the manuscript with feedback from the other authors. All authors read and approved the final manuscript.

Acknowledgments

This work was partly supported by the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection (Ref. No. IIS11-0053) at Stockholm University, Sweden. It was also partly supported by the Vårdal Foundation. We are very grateful to the reviewers for their many detailed and constructive comments, and we would like to thank Aron Henriksson and Magnus Ahltorp for fruitful discussions on the design of the study.

References

[1] Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform 2008:128–44.

[2] Hallett C, Power R, Scott D. Summarisation and visualisation of e-health data repositories. In: UK E-science all-hands meeting. Nottingham, UK.

[3] Kvist M, Skeppstedt M, Velupillai S, Dalianis H. Modeling human comprehension of Swedish medical records for intelligent access and summarization systems – future vision, a physician's perspective. In: Proceedings of SHI 2011, Scandinavian health informatics meeting. p. 31–5.

[4] Chapman WW, Christensen LM, Wagner MM, Haug PJ, Ivanov O, Dowling JN, et al. Classifying free-text triage chief complaints into syndromic categories with natural language processing. Artif Intell Med 2005;33:31–40.

[5] Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, Hansen T, et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol 2011;7.

[6] Eriksson R, Jensen PB, Frankild S, Jensen LJ, Brunak S. Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text. J Am Med Inform Assoc 2013.

[7] Jurafsky D, Martin JH. Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall; 2008.

[8] Tanushi H, Dalianis H, Nilsson G. Calculating prevalence of comorbidity and comorbidity combinations with diabetes in hospital care in Sweden using a health care record database. In: LOUHI, third international workshop on health document text mining and information analysis. p. 59–65.

[9] Cao H, Markatou M, Melton GB, Chiang MF, Hripcsak G. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. AMIA Annu Symp Proc 2005:106–10.

[10] Wang Y, Patrick J. Cascading classifiers for named entity recognition in clinical notes. In: Proceedings of the workshop on biomedical information extraction. p. 42–9.

[11] de Bruijn B, Cherry C, Kiritchenko S, Martin JD, Zhu X. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. J Am Med Inform Assoc 2011;18:557–62.

[12] Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc 2011.

[13] Kokkinakis D, Thurin A. Identification of entity references in hospital discharge letters. In: Proceedings of the 16th Nordic conference of computational linguistics (NODALIDA). Estonia. p. 329–32.

[14] Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

[15] Albright D, Lanfranchi A, Fredriksen A, Styler 4th WF, Warner C, Hwang JD, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013;20:922–30. http://dx.doi.org/10.1136/amiajnl-2012-001317. Epub 2013 Jan 25.

[16] Chapman WW, Dowling JN, Hripcsak G. Evaluation of training with an annotation schema for manual annotation of clinical conditions from emergency department reports. Int J Med Inform 2008;77:107–13. Epub 2007 Feb 20.

[17] Roberts A, Gaizauskas R, Hepple M, Demetriou G, Guo Y, Roberts I, et al. Building a semantically annotated corpus of clinical texts. J Biomed Inform 2009;42:950–66.

[18] Wang Y. Annotating and recognising named entities in clinical notes. In: Proceedings of the ACL-IJCNLP 2009 student research workshop, ACLstudent '09. Stroudsburg (PA), USA: Association for Computational Linguistics; 2009. p. 18–26.

[19] Ogren P, Savova G, Chute C. Constructing evaluation corpora for automated clinical named entity recognition. In: Proceedings of the sixth international language resources and evaluation (LREC'08). Marrakech, Morocco: European Language Resources Association (ELRA); 2008. p. 3143–9.

[20] Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc 2010;17:514–8.

[21] Uzuner Ö, Solti I, Xia F, Cadag E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. J Am Med Inform Assoc 2010;17:519–23.

[22] Patrick J, Li M. High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. J Am Med Inform Assoc 2010;17:524–7.

[23] Uzuner Ö, South B, Shen S, DuVall S. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

[24] Roberts A, Gaizauskas R, Hepple M, Guo Y. Combining terminology resources and statistical methods for entity recognition: an evaluation. In: Proceedings of the sixth international conference on language resources and evaluation (LREC'08). Marrakech, Morocco: European Language Resources Association (ELRA); 2008. p. 2974–9.

[25] i2b2/VA. 2010 i2b2/VA challenge evaluation, concept annotation guidelines; 2010 <https://www.i2b2.org/NLP/Relations/assets/Concept%20Annotation%20Guideline.pdf>.

[26] Doan S, Bastarache L, Klimkowski S, Denny JC, Xu H. Integrating existing natural language processing tools for medication extraction from discharge summaries. J Am Med Inform Assoc 2010;17:528–31.

[27] Doan S, Collier N, Xu H, Pham HD, Tu MP. Recognition of medication information from discharge summaries using ensembles of classifiers. BMC Med Inform Decis Mak 2012;12:36.

[28] Kipper-Schuler K, Kaggal V, Masanz JJ, Ogren PV, Savova GK. System evaluation on a named entity corpus from clinical notes. In: Proceedings of the sixth international conference on language resources and evaluation (LREC'08). Marrakech, Morocco. p. 3007–11.

[29] Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning. San Francisco, CA: Morgan Kaufmann; 2001. p. 282–9.

[30] Kudo T. CRF++: yet another CRF toolkit; 2012 <http://crfpp.sourceforge.net/> [accessed 15.06.12].

[31] Dalianis H, Hassel M, Velupillai S. The Stockholm EPR corpus – characteristics and some initial findings. In: Proceedings of ISHIMR 2009, the 14th international symposium for health information management research: evaluation and implementation of e-health and health information initiatives: international perspectives. Kalmar, Sweden. p. 243–9.

[32] International Health Terminology Standards Development Organisation, IHTSDO. SNOMED CT style guide: clinical findings; 2008 <http://www.ihtsdo.org> [accessed 24.01.11].

[33] Ogren PV. Knowtator: a Protégé plug-in for annotated corpus construction. In: Proceedings of the 2006 conference of the North American chapter of the Association for Computational Linguistics on human language technology. Morristown (NJ), USA: Association for Computational Linguistics; 2006. p. 273–5.

[34] Artstein R, Poesio M. Inter-coder agreement for computational linguistics. Comput Linguist 2008;34:555–96.

[35] Hripcsak G, Rothschild AS. Technical brief: agreement, the F-measure, and reliability in information retrieval. J Am Med Inform Assoc 2005;12:296–8.

[36] FASS. Fass.se; 2012 <http://www.fass.se> [accessed 27.08.12].

[37] Skeppstedt M, Kvist M, Dalianis H. Rule-based entity recognition and coverage of SNOMED CT in Swedish clinical text. In: Calzolari N, Choukri K, Declerck T, Dogan MU, Maegaard B, Mariani J, et al., editors. Proceedings of the eighth international conference on language resources and evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA); 2012. p. 1250–7.

[38] ul Muntaha S, Skeppstedt M, Kvist M, Dalianis H. Entity recognition of pharmaceutical drugs in Swedish clinical text. In: Proceedings of SLTC 2012, the fourth Swedish language technology conference. p. 77–8.

[39] Karolinska Institutet. Hur man använder den svenska MeSHen (in Swedish, translated as: How to use the Swedish MeSH); 2012 <http://mesh.kib.ki.se/swemesh/manual_se.html> [accessed 10.03.12].

[40] Gellerstam M, Cederholm Y, Rasmark T. The bank of Swedish. In: The 2nd international conference on language resources and evaluation, LREC 2000. Athens, Greece. p. 329–33.

[41] Carlberger J, Kann V. Implementing an efficient part-of-speech tagger. Softw Pract Exper 1999;29:815–32.

[42] CoNLL-2000; 2000 <http://www.cnts.ua.ac.be/conll2000/chunking/>.

[43] Campbell MJ, Machin D, Walters SJ. Medical statistics: a textbook for the health sciences. 4th ed. Chichester: Wiley; 2007.

[44] Socialstyrelsen. Diagnosgranskningar utförda i Sverige 1997–2005 samt råd inför granskning (in Swedish, translated as: Diagnosis audits performed in Sweden 1997–2005 and advice for auditing); 2006.

[45] Settles B, Craven M. An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the conference on empirical methods in natural language processing, EMNLP '08. Stroudsburg (PA), USA: Association for Computational Linguistics; 2008. p. 1070–9.
