Square wheels: electronic medical records for discovery research in rheumatoid arthritis
Robert M. Plenge, M.D., Ph.D.
October 30, 2009
NCRR sponsored "Using EHR Data for Discovery Research"
HARVARD MEDICAL SCHOOL
Key questions
• What are the regulatory obstacles impacting your work?
• What are the resource needs required to replicate your work at other institutions?
• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Key questions
How can I implement your approach, and how much
better is it?
genotype
phenotype
clinical care
bottleneck
Raychaudhuri et al in press Nature Genetics
October 2009: >30 RA risk loci
1978: HLA-DR4
1987: “shared epitope” hypothesis
2003: PADI4
2004: PTPN22
2005: CTLA4
2007: TNFAIP3, STAT4, TRAF1-C5, IL2-IL21
2008: CD40, CCL21, CD244, IL2RB, TNFRSF14, PRKCQ, PIP4K2C, IL2RA, AFF3
2009: REL, BLK, TAGAP, CD28, TRAF6, PTPRC, FCGR2A, PRDM1, CD2-CD58
Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples: >10 new loci
Together explain ~35% of the genetic burden of disease
genotype
phenotype
clinical care
bottleneck
Genetic predictors of response to anti-TNF therapy in RA
PTPRC/CD45 allele
n=1,283 patients
P=0.0001
Submitted to Arth & Rheum
How can we collect DNA and detailed clinical data on >20,000 RA patients?
What are the options for collecting clinical
data and DNA for genetic studies?
Options for clinical + DNA
design          clinical data   DNA   sample size   cost
clinical trial  +++             +++   +             $$$
registry        ++              +++   ++            $$
claims data     +               n/a   +++           $
EMR             ++              +++   +++           $
Content of EMRs
• Narrative data = free-form written text
  – info about symptoms, medical history, medications, exam, impression/plan
• Codified data = structured format
  – age, demographics, and billing codes
EMRs are increasingly utilized!
This is not a new idea…
Gabriel (1994) Arthritis and Rheumatism
…but EMR data are “dirty”
Sens: 89%, PPV: 57%
Gabriel (1994) Arthritis and Rheumatism
Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis.
Partners HealthCare: 4 million patients
Partners HealthCare: linked by EMR
Partners HealthCare: organized by i2b2
4 million patients
  ↓ ICD9 RA and/or CCP checked (goal = high sensitivity)
31,171 patients
  ↓ Classification algorithm (goal = high PPV)
3,585 RA patients
  → Clinical subsets
  → Discarded blood for DNA
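The two-stage funnel above (a broad screen tuned for sensitivity, then a stricter classifier tuned for PPV) can be sketched roughly as follows. This is only an illustration: the `Patient` record, its field names, the ICD9 prefix `714` (the slides say only "ICD9 RA"), and the threshold of 0.5 are all assumptions, not details taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional

RA_ICD9_PREFIX = "714"  # assumption: ICD9 714.x; the slides only say "ICD9 RA"


@dataclass
class Patient:
    patient_id: str
    icd9_codes: List[str]
    ccp_checked: bool
    ra_probability: Optional[float] = None  # would be filled in by a classifier


def screen(patients):
    """Stage 1 (high sensitivity): any RA ICD9 code and/or a CCP test on record."""
    return [
        p for p in patients
        if p.ccp_checked
        or any(code.startswith(RA_ICD9_PREFIX) for code in p.icd9_codes)
    ]


def classify(candidates, threshold):
    """Stage 2 (high PPV): keep candidates whose probability clears the threshold."""
    return [
        p for p in candidates
        if p.ra_probability is not None and p.ra_probability >= threshold
    ]


# Toy records, invented for illustration only.
patients = [
    Patient("a", ["714.0", "714.0"], True, 0.92),
    Patient("b", ["714.0"], False, 0.20),
    Patient("c", ["401.9"], True, 0.05),
    Patient("d", ["250.00"], False, None),
]
candidates = screen(patients)
cases = classify(candidates, threshold=0.5)  # hypothetical threshold
print([p.patient_id for p in candidates])  # ['a', 'b', 'c']
print([p.patient_id for p in cases])       # ['a']
```

Keeping the two stages separate means the cheap screen can stay deliberately loose, while all of the precision work is concentrated in the second step.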
Our library of RA phenotypes
• Natural language processing (NLP)
  – disease terms (e.g., RA, lupus)
  – medications (e.g., methotrexate)
  – autoantibodies (e.g., CCP, RF)
  – radiographic erosions
• Codified data
  – ICD9 disease codes
  – prescription medications
  – laboratory autoantibodies
Qing Zeng
Concept/term          Accuracy of concept
presence of erosion   88%
seropositive          96%
CCP positive          98.7%
RF positive           99.3%
etanercept            100%
methotrexate          100%
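To make "concepts extracted from narrative text" concrete, here is a deliberately naive sketch that counts keyword mentions in a note. It is far simpler than a real clinical NLP system (for example, it has no negation handling, so "no erosions" would still count), and the term lists and the sample note are invented for illustration rather than taken from the talk.

```python
import re
from collections import Counter

# Hypothetical term lists; the actual lexicon is not given in the slides.
CONCEPT_TERMS = {
    "ra": [r"rheumatoid arthritis", r"\bRA\b"],
    "methotrexate": [r"methotrexate", r"\bMTX\b"],
    "etanercept": [r"etanercept", r"enbrel"],
    "erosion": [r"erosions?"],
}


def count_concepts(note):
    """Count case-insensitive pattern matches per concept in one note."""
    counts = Counter()
    for concept, patterns in CONCEPT_TERMS.items():
        for pattern in patterns:
            counts[concept] += len(re.findall(pattern, note, flags=re.IGNORECASE))
    return counts


note = "Pt with rheumatoid arthritis (RA) on MTX; x-ray shows erosions. Started Enbrel."
counts = count_concepts(note)
print(dict(counts))
```

Per-patient counts like these are the kind of narrative-derived features that can sit alongside codified data in a classification model.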
Our library of RA phenotypes
• Natural language processing (NLP)
  – disease terms (e.g., RA, lupus)
  – medications (e.g., methotrexate)
  – autoantibodies (e.g., CCP, RF)
  – radiographic erosions
• Codified data
  – ICD9 disease codes
  – prescription medications
  – laboratory autoantibodies
Shawn Murphy
‘Optimal’ algorithm to classify RA:
NLP + codified data
Regression model with a penalty parameter (to avoid over-fitting)
Codified data NLP data
Tianxi Cai, Kat Liao
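As a rough illustration of "a regression model with a penalty parameter", the sketch below fits a logistic regression with an L2 (ridge) penalty by plain gradient descent. The slide does not say which penalty or fitting method was used, so the L2 choice is an assumption and may differ from the actual model. The three binary features and the toy labels are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_penalized_logistic(X, y, penalty, lr=0.1, steps=2000):
    """Gradient descent on mean log-loss + (penalty / 2) * sum(w_j ** 2)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        # The penalty pulls each weight toward zero; the intercept is not penalized.
        w = [wj - lr * (gj + penalty * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical columns: [has ICD9 RA code, NLP mention of RA, NLP mention of methotrexate]
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # toy labels standing in for chart-review results

w_weak, b_weak = fit_penalized_logistic(X, y, penalty=0.01)
w_strong, b_strong = fit_penalized_logistic(X, y, penalty=1.0)
print("weights, weak penalty:  ", [round(v, 2) for v in w_weak])
print("weights, strong penalty:", [round(v, 2) for v in w_strong])
```

On this toy set the labels are perfectly separable, so an unpenalized fit would keep inflating the weights. The penalty keeps them bounded, which is the over-fitting protection the slide refers to.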
High PPV with adequate sensitivity
392 out of 400 (98%) had definite or possible RA!
This means more patients!
~25% more subjects with the complete algorithm:
3,585 subjects (3,334 with true RA) vs. 3,046 subjects (2,680 with true RA)
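The counts on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the slide does not name the comparison algorithm, so it is simply called "comparison" here.

```python
def ppv(true_positives, flagged):
    """Positive predictive value: share of flagged subjects who truly have RA."""
    return true_positives / flagged


complete = {"flagged": 3585, "true_ra": 3334}    # complete algorithm
comparison = {"flagged": 3046, "true_ra": 2680}  # comparison algorithm (not named on the slide)

ppv_complete = ppv(complete["true_ra"], complete["flagged"])
ppv_comparison = ppv(comparison["true_ra"], comparison["flagged"])
gain_true_ra = complete["true_ra"] / comparison["true_ra"] - 1
gain_flagged = complete["flagged"] / comparison["flagged"] - 1
chart_review = 392 / 400

print(f"PPV, complete algorithm:   {ppv_complete:.1%}")
print(f"PPV, comparison algorithm: {ppv_comparison:.1%}")
print(f"Gain in true RA subjects:  {gain_true_ra:.1%}")
print(f"Gain in flagged subjects:  {gain_flagged:.1%}")
print(f"Chart review, definite or possible RA: {chart_review:.0%}")
```

On these figures the "~25% more" corresponds to the gain in true RA subjects (about 24%), while the number of flagged subjects rises by about 18%.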
Linking the Datamart-Crimson
[Diagram: NLP data and codified data]
Status of i2b2 Crimson collection
• Over 3,000 samples collected to date
  – cost = $10 per sample
• DNA extracted on >2,400 buffy coats
  – cost = $20 per sample
  – >90% had ≥1 ug of DNA
  – >99% had ≥5 ug of DNA after WGA
Genotyping of 384 SNPs (RA risk alleles, AIMs, other) is ongoing at the Broad Institute
Status of i2b2 Crimson collection
• Measured autoantibodies from plasma
  – 5 autoantibodies in ~380 RA patients
  – ~85% are CCP+, ~35% ANA+, ~15% TPO+
• Question: are non-RA autoantibodies present at increased frequency in RA patients vs matched controls?
stay tuned…more data soon!
Key questions
How can I implement your approach, and how much
better is it?
Key questions
• What are the regulatory obstacles impacting your work?
• What are the resource needs required to replicate your work at other institutions?
• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Regulatory obstacles
• IRB approval
• De-identified vs truly anonymous
• Open question: sharing of genetic data
Key questions
• What are the regulatory obstacles impacting your work?
• What are the resource needs required to replicate your work at other institutions?
• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Resources required
• Building a research DataMart
  – clinical EMR ≠ research EMR
  – multiple FTEs to build/maintain
• NLP expertise
  – open-source software available
  – iterative process for fine-tuning
• Clinical expertise
  – understand nature of clinical data
Resources required (cont.)
• Statistical expertise
  – simple algorithm is not sufficient
  – prepare for the unexpected!
  – true for narrative and codified
• Biospecimen collection, DNA extraction
  – varies by institution
  – Crimson – Broad Institute
Key questions
• What are the regulatory obstacles impacting your work?
• What are the resource needs required to replicate your work at other institutions?
• What are the priority short term "translational" questions in your field that would represent the most rapid payoff on investment?
Clinical features of patients
Characteristics     i2b2 RA        CORRONA
Total number        3,585          7,971
Mean age (SD)       57.5 (17.5)    58.9 (13.4)
Female (%)          79.9           74.5
Anti-CCP (%)        63             N/A
RF (%)              74.4           72.1
Erosions (%)        59.2           59.7
MTX (%)             59.5           52.8
Anti-TNF (%)        32.6           22.6
CCP has an OR = 1.5 for predicting erosions
Subset patients in clinically meaningful ways: causes of mortality
NLP + codified data, together with statistical modeling, to define cardiovascular disease
Non-responder to anti-TNF therapy
NLP + codified data, together with statistical modeling, to define treatment response
Responder to anti-TNF therapy
NLP + codified data, together with statistical modeling, to define treatment response
Post-marketing surveillance of adverse events
NLP + codified data, together with statistical modeling, to define treatment response
pharmacovigilance
Conclusions
Options for clinical + DNA
design          clinical data   DNA   sample size   cost
clinical trial  +++             +++   +             $$$
registry        ++              +++   ++            $$
claims data     +               n/a   +++           $
EMR             ++              +++   +++           $
Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.
Conclusion: We can collect DNA and plasma in a high-throughput manner.
Conclusion: The cost is reasonable...even for >20,000 RA patients!
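A back-of-envelope check of the cost claim, using the per-sample figures quoted earlier in the talk ($10 to collect a sample, $20 to extract DNA). The sketch assumes that every collected sample is also extracted and that per-sample costs stay flat at scale. It leaves out genotyping, staff time, and DataMart costs, none of which are quantified in the slides.

```python
COLLECTION_COST_PER_SAMPLE = 10  # USD, from the Crimson status slide
EXTRACTION_COST_PER_SAMPLE = 20  # USD, from the Crimson status slide


def biospecimen_cost(n_patients):
    """Collection plus DNA extraction cost, assuming one extracted sample per patient."""
    return n_patients * (COLLECTION_COST_PER_SAMPLE + EXTRACTION_COST_PER_SAMPLE)


total = biospecimen_cost(20_000)
print(f"${total:,}")  # $600,000
```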
genotype
phenotype
clinical care
Acknowledgments
Zak Kohane, Susanne Churchill, Vivian Gainer, Kat Liao, Tianxi Cai, Shawn Murphy, Qing Zeng, Soumya Raychaudhuri, Beth Karlson, Pete Szolovits, Lee-Jen Wei, Lynn Bry (Crimson), Sergey Goryachev, Barbara Mawn & many others!
Namaste!
Narrative data (NLP text extractions)
Codified data (ICD9 codes, etc)
Run specific queries
Visualize results in a timeline
Identifying RA patients in our i2b2 RA DataMart
1993 2008
Signs and symptoms
Diseases that mimic RA
Medications specific to RA
Notes (including whether seen by a rheumatologist)
diagnostic codes for RA
Shawn Murphy, Vivian Gainer, others
signs and symptoms c/w RA
RA without other diseases
Specific RA meds, including MTX
Seen by rheumatology
Many diagnostic codes for RA
1993 2008
Identifying RA patients in our i2b2 RA DataMart
Probability of RA: all 31K subjects
[Histogram: frequency vs. probability of RA; regions labeled "not RA" and "RA (n=3,585)"]
ROC curves for algorithms
[Plot: sensitivity vs. 1 - specificity for three algorithms (codified + NLP, NLP only, codified only), with 97% specificity marked]
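The ROC slide marks an operating point at 97% specificity. One way such a point could be chosen is to scan candidate thresholds on a chart-reviewed training set and take the lowest one that meets the specificity target, then read off the sensitivity there. The sketch below assumes that approach. The scores and labels are invented toy values, and the function names are illustrative.

```python
def specificity(scores, labels, threshold):
    """Share of true negatives scored below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


def sensitivity(scores, labels, threshold):
    """Share of true positives scored at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def threshold_for_specificity(scores, labels, target=0.97):
    """Lowest observed score that, used as a cutoff, meets the specificity target."""
    for t in sorted(set(scores)):
        if specificity(scores, labels, t) >= target:
            return t
    return float("inf")  # no cutoff among the observed scores reaches the target


# Toy predicted probabilities and chart-review labels (1 = RA, 0 = not RA).
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

t = threshold_for_specificity(scores, labels, target=0.97)
print("threshold:", t)
print("specificity:", specificity(scores, labels, t))
print("sensitivity:", sensitivity(scores, labels, t))
```

Fixing specificity first and accepting whatever sensitivity results matches the stated goal of this stage, which is high PPV rather than maximum case capture.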
Other algorithms to classify RA
NLP only, codified only
Portability!
Classification of RA cases (and not RA)
[Plot: probability of RA (0.00 to 1.00) by category (Not RA, possible, Yes RA), with the classification threshold marked]
0.29
???
Diagnosis = Ankylosing Spondylitis
(but many RA codes)
A few signs and symptoms c/w RA
NLP with few mentions of RA Specific meds
Visits to BWH/MGH
diagnostic codes for RA
Probability RA = 0.78
Diagnosis = JRA (but many RA codes)
signs and symptoms c/w RA
NLP with “RA” and “JRA”
Specific meds
Visits to the RA Center at BWH
Many diagnostic codes for RA
Probability RA = 0.33
Diagnosis not clear initially…
signs and symptoms c/w RA
NLP without much “RA”, few specific meds (MTX x 1)
…and few diagnostic codes for RA, despite multiple LMR notes, including visits to the BWH Arthritis Center
Now the false negatives…
Diagnosed in 1992, little follow-up
For some reason few RA diagnostic codes
Probability RA = 0.11
Medications: codified data vs. NLP
Enbrel (etanercept)
codified: 1,628
NLP: 3,796
overlap: 1,612 (99%)
Note: review of 50 NLP occurrences shows that 38 out of 50 were actively on Enbrel
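The overlap figures on this slide can be reproduced with simple arithmetic. The sketch below uses only the counts quoted above. Reading the 99% as overlap divided by the codified count is an inference from the numbers, not something the slide states.

```python
codified_n = 1628   # patients with a codified etanercept record
nlp_n = 3796        # patients with an NLP mention of etanercept
overlap_n = 1612    # patients found by both

overlap_of_codified = overlap_n / codified_n  # share of codified patients also found by NLP
overlap_of_nlp = overlap_n / nlp_n            # share of NLP patients also in codified data
nlp_only_n = nlp_n - overlap_n                # found by NLP alone

reviewed, active = 50, 38
active_fraction = active / reviewed           # share of reviewed NLP hits actively on drug

print(f"overlap / codified: {overlap_of_codified:.1%}")
print(f"overlap / NLP:      {overlap_of_nlp:.1%}")
print(f"NLP-only patients:  {nlp_only_n:,}")
print(f"actively on Enbrel in review sample: {active_fraction:.0%}")
```

So NLP recovers nearly every codified patient and adds over two thousand more, but the review sample suggests roughly a quarter of NLP mentions are not active use.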