A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP
Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh,
Neda Alipanah and Hyeon-Eui Kim
Division of Biomedical Informatics, UC San Diego
AMIA 2013, Washington DC, 11/18/2013
Roadmap
• Background
  o dbGaP
  o Challenges in using dbGaP
  o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation
NCBI’s database of Genotypes and Phenotypes (dbGaP)
“dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype”
As of 11/14/2013:
- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses
What topics are dbGaP users focusing on?
Most frequent phenotype topics in dbGaP data requests (14,287 data requests from the dbGaP website):
• Genetic Disease / Congenital Abnormality (8.6%)
• Cardiovascular Disease (8.1%)
http://www.ncbi.nlm.nih.gov/gap
pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving accuracy and completeness of search returns
  – Standardized existing phenotype variables
  – Searchable study-related information
Related work
• PhenX: defined the 287 most important phenotypes and manually mapped them to 16 dbGaP studies
• eMERGE: standardizes EMR phenotypes through a semi-automatic process. Phenotypes are first mapped automatically to standardized vocabularies, then the outputs are returned to users and curated by them
Our approach
• Using NLP to standardize phenotype variables in dbGaP
• Integrating NLP components into a new phenotype search tool for dbGaP
[Diagram: dbGaP offers the data user free text search and structured (advanced) search, and returns unsorted, flat list results. PhenDisco adds standardization & annotation components (Study Description Annotator, Phenotype Variable Annotator, Ontology Mapper) that populate sdGaP, plus a Query Parser and Ranking Algorithms that support free text search and structured search and return ranked results. sdGaP contains standardized phenotype variables.]
Phenotype Variable Standardization Pipeline
Phenotype variables → Tagger → Categorizer
• Tagger: identify topic and subject of information
• Categorizer: identify semantic category of phenotypes
Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics
• Topic: main theme of the phenotype variable
• Subject of information: bearer of the variable
Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
These stages make up the Tagger and the Categorizer.
Normalization
• Spell out abbreviations and shorthand expressions
• Drop question numbers and other unimportant characters
MetaMap Processing
• Generate CUIs, concept names, semantic types
Semantic Role Assignment
• Role identification based on semantic types and keywords
Topic Filtering
• Keep concepts that match SNOMED-CT clinical findings
• Remove problematic concepts
Variable Categorization
• Categorization based on semantic types and keywords
• Two domain experts selected 15 semantic categories based on MetaMap semantic types, including Demographics, Medical History, Clinical Attribute, Medication and Lab Tests
Creating an abbreviation list from dbGaP
bp: blood pressure
bmi: body mass index
bpm: beats per minute
bw: body weight
dbp: diastolic blood pressure
hbp: high blood pressure
htn: hypertension
hr: heart rate
Ht: height
lb: pounds
rr: respiration rate
sbp: systolic blood pressure
temp: temperature
TPR: temperature, pulse, respiration
wt: weight
yr: year
vs: vital signs
We compiled and reviewed a list of abbreviations found in dbGaP. The original list contained 50 abbreviations; the latest version contains 520.
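The abbreviation step can be illustrated with a minimal sketch. It assumes a plain dictionary lookup with whole-word, case-insensitive matching; the function name `expand_abbreviations` and the matching strategy are illustrative assumptions, not the authors' implementation, and only a few entries from the list above are included.

```python
import re

# A handful of entries from the slide's abbreviation list (the full list has 520).
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "sbp": "systolic blood pressure",
    "dbp": "diastolic blood pressure",
    "hr": "heart rate",
    "wt": "weight",
    "yr": "year",
}

# Hypothetical matching strategy: whole words only, ignoring case.
_ABBREV_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)

def expand_abbreviations(description):
    """Replace each known abbreviation in a variable description with its long form."""
    return _ABBREV_PATTERN.sub(
        lambda m: ABBREVIATIONS[m.group(1).lower()], description
    )

print(expand_abbreviations("SBP at exam 1"))
print(expand_abbreviations("BMI (kg/m2)"))
```

Whole-word matching keeps "bp" from firing inside "sbp"; ambiguous abbreviations would still need the kind of manual review the slide describes.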
Remove numbers, with exceptions
Rule | Examples
1. If a number follows "type", keep it | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If a number follows "grade", keep it | grade 1 Dupuytren's disease
3. If a number follows "stage", keep it | stage 1 chronic kidney disease
4. If a number follows "bipolar", keep it | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If a number follows "class", keep it | class I and II Newcastle disease; class 1 Newcastle disease
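A minimal sketch of this number-removal rule, assuming whitespace tokenization and that only Arabic digits are candidates for removal (Roman numerals are simply left untouched). The names `KEEP_AFTER` and `remove_numbers` are hypothetical and chosen for illustration.

```python
# Keywords after which a number carries meaning and must be kept (from the rule table).
KEEP_AFTER = {"type", "grade", "stage", "bipolar", "class"}

def remove_numbers(description):
    """Drop standalone digit tokens unless they directly follow a protected keyword."""
    tokens = description.split()
    kept = []
    for i, tok in enumerate(tokens):
        is_number = tok.isdigit()
        protected = i > 0 and tokens[i - 1].lower() in KEEP_AFTER
        if is_number and not protected:
            continue  # e.g. a leading question number
        kept.append(tok)
    return " ".join(kept)

print(remove_numbers("77 age mom diagnosed stroke"))
print(remove_numbers("14 stage 1 chronic kidney disease"))
```

The first call drops the leading question number; the second drops "14" but keeps the "1" that follows "stage".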
Rules for tagger and categorizer: examples
Rule for Tagger:
IF CandidatePreferred contains "gender" or "sex" AND SemType = organism attribute THEN Topic = Gender
Rule for Categorizer:
IF Topic concepts = Pharmacologic Substance OR Variable description contains the word “medication” THEN Type = Medication
Two domain experts reviewed 300 randomly selected unique phenotype variables and created the rules from them
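The two example rules can be written as small functions. This is a sketch under assumptions: the `Concept` class and its field names are hypothetical stand-ins for a MetaMap candidate (preferred name plus semantic type), the CUI in the usage lines is a placeholder, and the real system presumably applies many more rules than these two.

```python
import re
from dataclasses import dataclass

@dataclass
class Concept:
    """Hypothetical container for one MetaMap candidate concept."""
    cui: str
    preferred: str  # corresponds to CandidatePreferred in the rule
    sem_type: str   # corresponds to SemType in the rule

def tag_topic(concepts):
    """Tagger rule: a gender/sex concept typed as an organism attribute gives Topic = Gender."""
    for c in concepts:
        name = c.preferred.lower()
        if ("gender" in name or "sex" in name) and c.sem_type.lower() == "organism attribute":
            return "Gender"
    return None  # no rule fired

def categorize(description, topic_concepts):
    """Categorizer rule: a drug concept or the word 'medication' gives Medication."""
    has_drug = any(c.sem_type.lower() == "pharmacologic substance" for c in topic_concepts)
    has_keyword = re.search(r"\bmedication\b", description.lower()) is not None
    return "Medication" if has_drug or has_keyword else None

# Placeholder CUI, used only for illustration.
gender = Concept("C0000000", "Gender", "Organism Attribute")
print(tag_topic([gender]))
print(categorize("Current medication use", []))
```

Returning `None` when no rule fires leaves room for the mapping failures that a rule set built from a 300-variable sample cannot cover.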
Example of tagging
Variable description: 77 age mom diagnosed–stroke (tia)
After normalization: age mother diagnosed stroke (tia)
MetaMap:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’
Result:
• Topic: Age (C0001779)
• Subject of Information: Mother (C0026591)
Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
• 135,608 variables
• 116,957 phenotypes mapped to Topic
• 104,172 phenotypes mapped to Category
Evaluation:
• Random sample of 500 unique phenotypes
• Reviewed by 3 domain experts
• 73% accuracy for topic, 71% accuracy for category
Semantic category of phenotypes in dbGaP
(as of July 1, 2013)
Mapping Failures
a. Unprocessed by MetaMap. Example: 14 c first arm othervessel max l lat obliq
b. Lexical problems: items with incorrect lexical forms, including typos. Example: hba1 c collection date month 057 fateat gm
c. ID variables or other administrative variables. Examples: form number; f124 documentation used form
d. Not covered by our rules. Examples: years treated pet for fleas; what is your first language
PhenDisco: Put-it-all-together
[Diagram: dbGaP content passes through NLP tools + MetaMap and information model mapping into sdGaP as standardized phenotypes. A free text query goes through the query parser, relevant studies are retrieved from sdGaP, and the BM25 ranking algorithm returns ranked studies.]
Doan S, Lin KW, Conway M, Ohno-Machado L, et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
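The slide names BM25 as the ranking algorithm without giving details. The sketch below is the textbook Okapi BM25 formula, not PhenDisco's code; the parameter values `k1=1.5` and `b=0.75`, the pre-tokenized input, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumes `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy "study descriptions", invented for this example.
studies = [
    ["copd", "lung", "function", "study"],
    ["breast", "cancer", "density"],
    ["copd", "copd", "smoking", "cohort"],
]
scores = bm25_scores(["copd"], studies)
ranking = sorted(range(len(studies)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Here the third study ranks first because it mentions the query term twice at the same document length, and the study without the term scores zero.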
PhenDisco system
Search
o Term auto-complete
o Synonym expansion
Display
o Keyword highlighting
o Ranking by relevance
o Filter by study metadata
o Cross-link related studies
Export to Excel
o Selected study metadata
o Selected phenotype variables
Advanced Search
o Search by titles, platform, study
PhenDisco vs dbGaP Entrez
Basic search (as of July 7, 2013)
Query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
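The slide does not say how the average F-measure was derived. Assuming the balanced F-measure (harmonic mean of recall and precision) applied to the averaged recall and precision, the reported figures can be reproduced; the function name `f_measure` is illustrative.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Average recall and precision as reported in the comparison table.
dbgap_f = f_measure(recall=1.0, precision=0.4661)
phendisco_f = f_measure(recall=0.8333, precision=0.9571)
print(round(dbgap_f, 2), round(phendisco_f, 2))  # 0.64 0.89
```

Under this assumption dbGaP's perfect recall is offset by low precision, while PhenDisco gives up some recall for much higher precision.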
Summary & Future work
• A rule-based approach is a simple yet efficient way to standardize phenotype variables in dbGaP
• Integration with machine learning methods will be investigated
• Identification of similar variables is in progress!
Acknowledgements
• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
  o Mike Conway
  o Alex Hsieh
  o Stephanie Feudjido Feupe
  o Asher Garland
  o Mindy Ross
  o Xiaoqian Jiang
  o Jing Zhang
• Early contributors
  o Wendy Chapman
  o Melissa Tharp
  o Jihoon Kim
• Collaborator:
  o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH
Questions?
Project Homepage: http://pfindr.net
PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1
Contact:
Source code and database of PhenDisco are publicly available