extracting medical attributes of published clinical trials
Extracting Medical Attributes of Published Clinical Trials: Knowledge Representation of Textual Data
Sanghamitra Deb, [email protected], @sangha_deb
Accenture Technology Laboratory
Motivation
Currently used text Extraction Techniques
DeepDive
Results
Working with DeepDive
Acknowledgements: I would like to thank Professor Chris Re, Jaeho Shin, Jason Fries and Theodoros Rekatsinas for personally assisting me in using DeepDive for knowledge extraction. This work is funded and supported by Accenture Technology Laboratory. I would also like to thank Joshua Neyland and Hyon S Chu for supporting the project.
Conclusions
➢ Manually extracting data from documents is common practice in many domains, such as pharma, legal investigations, and financial transactions. This study concentrates on clinical trials.
➢ Understanding relationships between attributes is an important part of clinical trials. This process takes a skilled researcher several tens of hours. The goal of this research is to automate manual knowledge extraction for
➢ Examples of metadata extraction: “disease studied”, “drug X cures disease Y”, “drug X contains compound Y”, “age group of participants”, “dosage”, “gender”, …
❑ DeepDive creates structure (SQL databases) from unstructured data (text).
❑ It is a data management system that tackles extraction, integration, and prediction problems in a single system.
❑ DeepDive asks the developer to think about features, not algorithms. In DeepDive's joint-inference-based approach, the user only specifies the necessary signals or features.
❑ Users can write simple rules based on domain knowledge that inform the inference (learning) process and provide feedback to improve predictions.
❑ Tedious training for each prediction is not required; distant supervision with a small training set works well.
Model / Process / Use Case

Model: Rule Based + Manual Curating
Process: Parse and tokenize; composite rules; filtering rules; manual inspection.
Use Case: Semi-structured reports/articles (financial reports, medication labels, etc.). Small number of documents (<100). Medium-level precision is enough.

Model: Supervised ML: Logistic Regression/SVM/Naive Bayes/RF/Classification…
Process: Transform the business problem into a prediction variable. Generate features (bag of words, n-grams, vectorization, WordNet, …). Get training data. Feed into the ML pipeline.
Use Case: Converting unstructured text to structured features and prediction variables is simple. Training data is available. Example: sentiments associated with products from reviews, with training data from ratings.

Model: Unsupervised ML: Clustering, Topic Modeling, Word2vec
Process: The parsed data is used as input to unsupervised techniques. Most unsupervised techniques are used to extract hidden facts in data.
Use Case: Training data is not available. Exact precision measurements are not important. Results are coherent themes or synonymous ideas in the corpus.
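The "generate features" step of the supervised approach can be illustrated with a minimal sketch. It uses only the Python standard library; the function names `tokenize` and `ngram_features` are hypothetical and not part of any tool named on this poster. It counts unigrams (bag of words) and bigrams for one sentence.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(text, max_n=2):
    """Count every n-gram of length 1..max_n (bag of words plus bigrams)."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical input sentence, shortened from the ibuprofen example.
features = ngram_features(
    "Ibuprofen is used for treating pain, fever, and inflammation"
)
```

A count dictionary like `features` is the kind of sparse vector that would then be fed to a classifier such as logistic regression.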
Factor Graphs
Related Attributes
Methylphenidate, sold under the trade names Ritalin …
Candidate Tagging
Supervision
Learning and Inference
Drug Treats Disease
Ritalin ADHD
Tylenol fever
Aspirin migraine
Feedback
Human Feedback
Knowledgebase: Drugs - Disease
Ibuprofen, from isobutylphenylpropanoic acid, is a nonsteroidal anti-inflammatory drug (NSAID) used for treating pain, fever, and inflammation
Unstructured Information
Candidates (Rule Based: High Recall)
drug disease
ibuprofen pain
ibuprofen fever
ibuprofen inflammation
ibuprofen renal failure
ritalin cancer
ritalin ADHD
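A rule-based, high-recall candidate generator of the kind listed above might look like the following sketch. It is an illustration only, not DeepDive's own extractor code: the dictionaries `DRUGS` and `DISEASES` and the helper names are assumptions, and a real run would load much larger vocabularies. Every drug mention in a sentence is paired with every disease mention, so wrong pairs are expected and are left for the later inference step to reject.

```python
import itertools
import re

# Hypothetical mini dictionaries; a real run would load far larger lists.
DRUGS = {"ibuprofen", "ritalin"}
DISEASES = {"pain", "fever", "inflammation", "renal failure", "adhd", "cancer"}


def find_mentions(sentence, vocabulary):
    """Return the dictionary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def candidates(sentence):
    """Pair every drug mention with every disease mention (high recall)."""
    drugs = find_mentions(sentence, DRUGS)
    diseases = find_mentions(sentence, DISEASES)
    return list(itertools.product(drugs, diseases))


sentence = ("Ibuprofen, from isobutylphenylpropanoic acid, is a nonsteroidal "
            "anti-inflammatory drug (NSAID) used for treating pain, fever, "
            "and inflammation")
pairs = candidates(sentence)
```

The whole-word match keeps "anti-inflammatory" from counting as a mention of "inflammation", while the cross product keeps recall high at the cost of precision.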
drug disease
ibuprofen pain
ibuprofen fever
Ritalin ADHD
Training Data from FDA and Manual Curation
Input Data
User Created Databases
Sentences: POS, NER, etc.
Ibuprofen, from isobutylphenylpropanoic acid, is a nonsteroidal anti-inflammatory drug (NSAID) used for treating pain.
Methylphenidate, sold under the trade names Ritalin …
NLP Parsed Sentences
Distant Supervision
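Distant supervision can be sketched as a lookup of each candidate against known pairs. The sets below are illustrative assumptions: `KNOWN_POSITIVES` stands in for training data from the FDA and manual curation, and `KNOWN_NEGATIVES` is a hypothetical list of pairs known to be unrelated. Candidates found in neither set stay unlabeled and are left to inference.

```python
# Assumed stand-ins for curated training data; not the actual FDA data set.
KNOWN_POSITIVES = {("ibuprofen", "pain"), ("ibuprofen", "fever"), ("ritalin", "adhd")}
KNOWN_NEGATIVES = {("ibuprofen", "renal failure")}


def distant_label(candidate, positives=KNOWN_POSITIVES, negatives=KNOWN_NEGATIVES):
    """Return True/False when the pair is known, or None when it is unlabeled."""
    pair = (candidate[0].lower(), candidate[1].lower())
    if pair in positives:
        return True
    if pair in negatives:
        return False
    return None


# Candidate pairs as produced by the high-recall rules.
candidate_pairs = [
    ("ibuprofen", "pain"),
    ("ibuprofen", "fever"),
    ("ibuprofen", "inflammation"),
    ("ibuprofen", "renal failure"),
    ("ritalin", "cancer"),
    ("Ritalin", "ADHD"),
]
labels = {pair: distant_label(pair) for pair in candidate_pairs}
```

Only a small labeled subset is needed; the unlabeled candidates receive probabilities from learning and inference rather than from this lookup.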
References
➢ We have successfully extracted attributes and relations between attributes.
➢ DeepDive is well suited for this purpose since it learns by collecting evidence from the entire corpus and is able to infer complex relationships in data.
➢ The final result is a structured database that has been created from hundreds of gigabytes of text from journal articles.
Mindbender: browses input data loaded into DeepDive
Mindtagger: assists with data labeling tasks to quickly assess the precision and/or recall of the extraction
❑ Extract data from sentences and create user-defined functions (rules, heuristic schemes) to extract mentions of drugs, compounds or diseases.
❑ Extract features from the data set based on domain knowledge and the DeepDive-guided generic feature set.
❑ Write inference rules, weights, calibration and holdout parameters.
❑ Provide feedback from calibration plots and Mindtagger outputs, and repeat the steps above.
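The calibration check mentioned above can be illustrated with a minimal sketch. This is not DeepDive's own calibration code: the function name `calibration_table` and the held-out predictions are hypothetical. Predictions are grouped into probability buckets, and the fraction of truly related pairs is reported per bucket; for a well-calibrated model that fraction stays close to the bucket's probability range.

```python
def calibration_table(predictions, n_buckets=5):
    """Group (probability, true_label) pairs into equal-width buckets.

    Returns one row per bucket: (low, high, count, fraction_true), where
    fraction_true is None for an empty bucket.
    """
    buckets = [[0, 0] for _ in range(n_buckets)]  # [count, number true]
    for prob, label in predictions:
        idx = min(int(prob * n_buckets), n_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += 1 if label else 0
    rows = []
    for i, (count, n_true) in enumerate(buckets):
        low, high = i / n_buckets, (i + 1) / n_buckets
        fraction = n_true / count if count else None
        rows.append((low, high, count, fraction))
    return rows


# Hypothetical held-out predictions: (predicted probability, true label).
holdout = [
    (0.99, True), (0.95, True), (0.85, False),
    (0.55, True), (0.45, False),
    (0.10, False), (0.05, False),
]
rows = calibration_table(holdout)
```

Buckets whose fraction of true pairs falls far from their probability range point to features or rules that need revisiting before the next iteration.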
❑ DeepDive is a project led by Christopher Ré at Stanford University. Current group members include: Michael Cafarella, Xiao Cheng, Raphael Hoffman, Dan Iter, Thomas Palomares, Alex Ratner, Theodoros Rekatsinas, Zifei Shan, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. All materials are found at http://deepdive.stanford.edu/
❑ The Artificial Intelligence group at Accenture Technology Laboratory is collaborating with Professor Chris Re’s group to incorporate intelligent language understanding to facilitate client delivery. https://www.accenture.com/us-en/accenture-technology-labs-index
❑ DeepDive: A Data Management System for Automatic Knowledge Base Construction. Ce Zhang. Ph.D. Dissertation, University of Wisconsin-Madison, 2015.
❑ Incremental Knowledge Base Construction Using DeepDive. Sen Wu, Ce Zhang, Christopher De Sa, Jaeho Shin, Feiran Wang, and C. Ré. VLDB, 2015.
Learning & Inference
drug        disease         related
ibuprofen   pain            T
ibuprofen   fever           T
ibuprofen   inflammation    T
ibuprofen   renal failure   F
ritalin     cancer          F
DeepDive builds a factor graph and performs inference using Gibbs sampling. The joint inference process ensures that priors are not accepted as ground truth: the uncertainty of one event influences other events.
A probability of 0.99 implies a 99% chance that the drug and the compound are related.
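How Gibbs sampling turns a factor graph into marginal probabilities can be shown on a toy example. The sketch below is not DeepDive's sampler: the two Boolean variables, the factor weights `w_a`, `w_b`, `w_ab`, and the function names are all assumptions chosen for illustration. Each variable has one unary factor, and a shared agreement factor couples them, so evidence for one relation shifts the probability of the other.

```python
import math
import random


def score(a, b, w_a=2.0, w_b=-0.5, w_ab=1.0):
    """Log-linear score: one unary factor per variable plus an agreement factor."""
    return w_a * a + w_b * b + w_ab * (1 if a == b else 0)


def gibbs_marginals(n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(a=1) and P(b=1) by resampling one variable at a time."""
    rng = random.Random(seed)
    a, b = 0, 0
    count_a = count_b = 0
    for step in range(burn_in + n_samples):
        # Resample a from P(a | b), then b from P(b | a).
        p_a = 1.0 / (1.0 + math.exp(score(0, b) - score(1, b)))
        a = 1 if rng.random() < p_a else 0
        p_b = 1.0 / (1.0 + math.exp(score(a, 0) - score(a, 1)))
        b = 1 if rng.random() < p_b else 0
        if step >= burn_in:
            count_a += a
            count_b += b
    return count_a / n_samples, count_b / n_samples


def exact_marginals():
    """Enumerate all four states to get the exact marginals for comparison."""
    weights = {(a, b): math.exp(score(a, b)) for a in (0, 1) for b in (0, 1)}
    z = sum(weights.values())
    p_a = sum(w for (a, _), w in weights.items() if a == 1) / z
    p_b = sum(w for (_, b), w in weights.items() if b == 1) / z
    return p_a, p_b


estimated = gibbs_marginals()
exact = exact_marginals()
```

Enumeration is only feasible here because the graph has four states; on a corpus-sized factor graph the sampled estimate is the practical route, which is why the comparison is included only as a sanity check.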
Methylphenidate, sold under the trade names Ritalin among others, is a central nervous system (CNS) stimulant of the phenethylamine and piperidine classes that is used in the treatment of attention deficit hyperactivity disorder (ADHD) and narcolepsy. Methylphenidate has been studied and researched for over 50 years and has a very good efficacy and safety record for the treatment of ADHD.
Brand name
Compound
Disease