A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP

Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh, Neda Alipanah and Hyeon-Eui Kim. Division of Biomedical Informatics, UC San Diego. AMIA 2013, Washington DC, 11/18/2013


Page 1: A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP

Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh, Neda Alipanah and Hyeon-Eui Kim

Division of Biomedical Informatics, UC San Diego

AMIA 2013, Washington DC, 11/18/2013

Page 2: Roadmap

• Background
  o dbGaP
  o Challenges in using dbGaP
  o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation

Page 3: NCBI’s database of Genotypes and Phenotypes (dbGaP)

“dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype”

As of 11/14/2013:

- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses

Page 4: Which topics are dbGaP users focusing on?

Most frequent phenotype topics from the dbGaP data requests:

• Genetic Disease / Congenital Abnormality (8.6%)
• Cardiovascular Disease (8.1%)

Source: 14,287 data requests from the dbGaP website

Page 5

http://www.ncbi.nlm.nih.gov/gap

Page 6

http://www.ncbi.nlm.nih.gov/gap

Page 7

http://www.ncbi.nlm.nih.gov/gap

Page 8: pFINDR (phenotype Finding IN Data Repositories)

• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving the accuracy and completeness of search returns
  – Standardized existing phenotype variables
  – Searchable study-related information

Page 9: Related work

PhenX: Defined the 287 most important phenotypes and manually mapped them to 16 dbGaP studies

eMERGE: Standardizes EMR phenotypes through a semi-automatic process: phenotypes are first automatically mapped to standardized vocabularies, then the outputs are returned to and curated by users

Page 10: Our approach

• Using NLP to standardize phenotype variables in dbGaP
• Integrating NLP components into a new phenotype search tool for dbGaP

[Diagram labels. dbGaP side: free text search; structured (advanced) search; unsorted, flat list results; data user. PhenDisco side: Study Description Annotator; Phenotype Variable Annotator; Ontology Mapper; Query Parser; Ranking Algorithms; standardization & annotation; sdGaP; free text search; structured search; ranked results.]

sdGaP contains standardized phenotype variables

Page 11: Phenotype Variable Standardization Pipeline

Phenotype variables → Tagger → Categorizer

• Tagger: Identify topic and subject of information
• Categorizer: Identify semantic category of phenotypes

Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics

Page 12: Phenotype Variable Standardization Pipeline

• Topic: Main theme of phenotype variables
• Subject of information: Bearer of the variable


Page 14: Phenotype Variable Standardization Pipeline

Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization

Tagger:
• Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters
• MetaMap Processing: generate CUIs, concept names, and semantic types
• Semantic Role Assignment: semantic type and keyword-based role identification
• Topic Filtering: keep concepts that match SNOMED-CT clinical findings; remove problematic concepts

Categorizer:
• Variable Categorization: semantic type and keyword-based categorization

Two domain experts selected 15 semantic categories based on semantic types from MetaMap, e.g., Demographics, Medical History, Clinical Attribute, Medication, Lab Tests

Page 15: Creating an abbreviation list from dbGaP

Abbreviation | Expansion
bp | blood pressure
bmi | body mass index
bpm | beats per minute
bw | body weight
dbp | diastolic blood pressure
hbp | high blood pressure
htn | hypertension
hr | heart rate
Ht | height
lb | pounds
rr | respiration rate
sbp | systolic blood pressure
temp | temperature
TPR | temperature, pulse, respiration
wt | weight
yr | year
vs | vital signs

We compiled and reviewed a list of abbreviations in dbGaP; the original list contained 50 abbreviations, and the latest version contains 520.
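The abbreviation list can be applied as a simple dictionary lookup during normalization. The sketch below is only an illustration under assumptions: the dictionary holds a handful of entries from the slide, and the function name `expand_abbreviations` and the case-insensitive whole-word matching are choices made here, not details given in the talk.

```python
import re

# A few entries taken from the slide; the full dbGaP list has 520 abbreviations.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "dbp": "diastolic blood pressure",
    "sbp": "systolic blood pressure",
    "hr": "heart rate",
    "ht": "height",
    "wt": "weight",
    "yr": "year",
}


def expand_abbreviations(description: str) -> str:
    """Replace whole alphabetic tokens found in the list (case-insensitive)."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Tokens that are not in the list are returned unchanged.
        return ABBREVIATIONS.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", replace, description)


print(expand_abbreviations("SBP and DBP at exam 1"))
```

Matching whole alphabetic tokens keeps a short entry such as "hr" from firing inside a longer word like "three".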

Page 16: Remove numbers, with exceptions

Rule | Example
1. If a number follows "type", keep the number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If a number follows "grade", keep the number | grade 1 Dupuytren's disease
3. If a number follows "stage", keep the number | stage 1 chronic kidney disease
4. If a number follows "bipolar", keep the number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If a number follows "class", keep the number | class I and II Newcastle disease; class 1 Newcastle disease
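The number-removal rule with its five exceptions can be sketched as a token filter. This is a minimal illustration, assuming whitespace tokenization and Arabic digits only (Roman numerals such as "type I" are not digits, so they pass through untouched); the function name `drop_numbers` is hypothetical.

```python
# Keywords from the slide after which a number must be kept.
KEEP_AFTER = {"type", "grade", "stage", "bipolar", "class"}


def drop_numbers(description: str) -> str:
    """Drop standalone digit tokens unless the preceding token is a keyword."""
    kept = []
    previous = ""
    for token in description.split():
        # Keep non-numeric tokens, and numbers that directly follow a keyword.
        if not token.isdigit() or previous.lower() in KEEP_AFTER:
            kept.append(token)
        previous = token
    return " ".join(kept)


print(drop_numbers("77 age mother diagnosed stroke"))
print(drop_numbers("12 stage 1 chronic kidney disease"))
```

The first call drops the leading question number; the second drops "12" but keeps the "1" that follows "stage".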

Page 17: Rules for tagger and categorizer: examples

Rule for Tagger:
IF CandidatePreferred contains "gender" or "sex" AND SemType = organism attribute THEN Topic = Gender

Rule for Categorizer:
IF Topic concepts = Pharmacologic Substance OR Variable description contains the word “medication” THEN Type = Medication

Two domain experts reviewed and created rules from 300 randomly selected unique phenotype variables
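The two example rules translate almost directly into code. The sketch below encodes only these two rules; the `Concept` record, its field names, and the function names are assumptions made for illustration and do not reflect the actual PhenDisco implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    """One MetaMap-style candidate; the field names here are illustrative."""
    preferred_name: str
    semantic_type: str


def tag_topic(concept: Concept) -> Optional[str]:
    """Tagger rule: name contains gender/sex AND type is organism attribute."""
    name = concept.preferred_name.lower()
    is_attribute = concept.semantic_type.lower() == "organism attribute"
    if ("gender" in name or "sex" in name) and is_attribute:
        return "Gender"
    return None


def categorize(topic_concepts: List[Concept], description: str) -> Optional[str]:
    """Categorizer rule: a pharmacologic substance OR the word 'medication'."""
    has_drug = any(
        c.semantic_type.lower() == "pharmacologic substance" for c in topic_concepts
    )
    words = re.findall(r"[a-z]+", description.lower())
    if has_drug or "medication" in words:
        return "Medication"
    return None


print(tag_topic(Concept("Gender", "Organism Attribute")))
print(categorize([], "Current medication, exam 2"))
```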

Page 18: Example of tagging

Variable description: 77 age mom diagnosed–stroke (tia)

Normalized: age mother diagnosed stroke (tia)

MetaMap output:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’

Tagging result:
• Topic: Age (C0001779)
• Subject of Information: Mother (C0026591)
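To make the tagging example concrete, the sketch below assigns roles from semantic types for the three concepts shown. The role table is a hypothetical one chosen only because it reproduces this single example (Organism Attribute as topic, Family Group as subject of information); the actual rule set used in the system is not given in the slides.

```python
from typing import Dict, List, NamedTuple, Optional


class Concept(NamedTuple):
    cui: str
    name: str
    semantic_type: str


# Hypothetical mapping from semantic type to role, for illustration only.
ROLE_BY_SEMANTIC_TYPE = {
    "Organism Attribute": "topic",
    "Family Group": "subject",
}


def assign_roles(concepts: List[Concept]) -> Dict[str, Optional[Concept]]:
    """Give each role to the first concept whose semantic type maps to it."""
    roles: Dict[str, Optional[Concept]] = {"topic": None, "subject": None}
    for concept in concepts:
        role = ROLE_BY_SEMANTIC_TYPE.get(concept.semantic_type)
        if role is not None and roles[role] is None:
            roles[role] = concept
    return roles


concepts = [
    Concept("C0001779", "Age", "Organism Attribute"),
    Concept("C0026591", "Mother", "Family Group"),
    Concept("C0038454", "Stroke", "Disease or Syndrome"),
]
roles = assign_roles(concepts)
print(roles["topic"].name, roles["subject"].name)
```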

Page 19: Phenotype Variable Standardization Pipeline

Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization

Results:
• 135,608 variables
• 116,957 phenotypes mapped to Topic
• 104,172 phenotypes mapped to Category

Evaluation:
- Random sample of 500 unique phenotypes
- Reviewed by 3 domain experts
- 73% accuracy for topic
- 71% accuracy for category
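The counts on this slide imply coverage rates that are not stated explicitly. The short calculation below assumes that the 135,608 variables are the denominator for both mapped counts; the slide does not say so directly.

```python
# Counts from the slide; treating 135,608 as the denominator is an assumption.
total_variables = 135_608
mapped_to_topic = 116_957
mapped_to_category = 104_172

topic_coverage = mapped_to_topic / total_variables
category_coverage = mapped_to_category / total_variables

print(f"Topic coverage: {topic_coverage:.1%}")        # 86.2%
print(f"Category coverage: {category_coverage:.1%}")  # 76.8%
```

Coverage (how many variables received a mapping) is distinct from the 73% and 71% accuracy figures, which measure how often a mapping was judged correct on the 500-variable sample.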

Page 20: Semantic category of phenotypes in dbGaP

(as of July 1, 2013)

Page 21: Mapping Failures

a. Unprocessed by MetaMap. Examples: 14 c first arm othervessel max l lat obliq

b. Lexical problems: items with incorrect lexical forms, including typos. Examples: hba1 c collection date month 057 fateat gm

c. ID variables or some administrative variables. Examples: form numberf124 documentation used form

d. Not covered by our rules. Examples: years treated pet for fleas; what is your first language

Page 22: PhenDisco: Put-it-all-together

[Diagram labels: Free text; Query parser; NLP tools + MetaMap; Information model mapping; dbGaP; Standardized phenotypes; sdGaP; BM25 ranking algorithm; Relevant studies; Ranked studies]

Doan S, Lin KW, Conway M, Ohno-Machado L, et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
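The slide names BM25 as the ranking algorithm. The sketch below shows the standard Okapi BM25 scoring function on toy data; the parameter values `k1 = 1.5` and `b = 0.75`, the smoothed IDF variant, and the three toy "studies" are all assumptions for illustration, not PhenDisco's configuration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Toy study descriptions (hypothetical, already tokenized).
studies = {
    "study_a": "copd lung function smoking cohort".split(),
    "study_b": "breast cancer breast density".split(),
    "study_c": "smoking cessation trial".split(),
}
scores = bm25_scores(["copd", "smoking"], studies)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here `study_a` matches both query terms and ranks first, `study_c` matches one, and `study_b` matches none and scores zero.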

Page 23: PhenDisco system

Search
o Term auto-complete
o Synonym expansion

Display
o Keyword highlighting
o Ranking by relevance
o Filter by study metadata
o Cross-link related studies

Export to Excel
o Selected study metadata
o Selected phenotype variables

Search by titles, platform, study

Page 24: Advanced Search

Page 25: PhenDisco vs dbGaP Entrez

Basic search query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%

Average F-measure: dbGaP 0.64, PhenDisco 0.89

(as of July 7, 2013)
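The reported average F-measures can be reproduced from the average recall and precision in the table, assuming the balanced F-measure (harmonic mean of precision and recall). The slide does not state which F variant was used, so this is a consistency check rather than the authors' stated calculation.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average recall and precision taken from the table.
dbgap_f = f_measure(recall=1.0, precision=0.4661)
phendisco_f = f_measure(recall=0.8333, precision=0.9571)

print(round(dbgap_f, 2), round(phendisco_f, 2))  # 0.64 0.89
```

dbGaP Entrez returns every relevant study but with many false positives, so its low precision pulls the F-measure down; PhenDisco trades some recall for much higher precision.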

Page 26: Summary & Future work

• A rule-based approach is a simple yet efficient way to standardize phenotype variables in dbGaP
• Integration with machine learning methods will be investigated
• Identification of similar variables is in progress!

Page 27: Acknowledgements

• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
  o Mike Conway
  o Alex Hsieh
  o Stephanie Feudjido Feupe
  o Asher Garland
  o Mindy Ross
  o Xiaoqian Jiang
  o Jing Zhang
• Early contributors
  o Wendy Chapman
  o Melissa Tharp
  o Jihoon Kim
• Collaborator:
  o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH

Page 28: Questions?

Project Homepage: http://pfindr.net

PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1

Contact:

[email protected]

[email protected]

The source code and database of PhenDisco are publicly available