ii-sdv 2012 making knowledge discoverable: the role of agile text mining

36
Making Knowledge Discoverable: The Role of Agile Text Mining David Milward Linguamatics II-SDV, Nice, April 2012

Upload: dr-haxel-congress-and-event-management-gmbh

Post on 28-Nov-2014

205 views

Category:

Internet


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Making Knowledge Discoverable:

The Role of Agile Text Mining

David Milward

Linguamatics

II-SDV, Nice, April 2012

Page 2: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Search vs. Text Mining

• Agile Text Mining

– Linguamatics I2E

• Relationship Extraction: Text Mining + Search

• Finding the Most Relevant Documents: Search + Text Mining

• Accelerating a Search Strategy

• Example Results from

– multiple documents

– within complex documents

• Reproducible Workflows

2

Overview

Page 3: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleSearch vs. Text Mining

Text MiningSearch Engine

Filter to find

most relevant

documents,

then read

News Feeds

Manipulate

the text to

discover

what is there

company activity company

Sanofi bid Aventis

Roche partner Antisoma

Scientific Literature Patents Internal Reports Clinical Trials

3

Natural Language

Processing (NLP) to

understand meaning

Statistics to provide trends

Page 4: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Text mining provides ability to discover

– but typically queries have to be programmed in, and processing is slow

• Search provides ability to filter quickly to relevant documents

– but poor at answering open questions e.g. “what are biomarkers for breast cancer”

• Combine text mining with search to discover within specific contexts e.g.

What is a risk factor for diabetes

4

Agile Text Mining

Filter to the context of interestDiscover what is available

Page 5: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Linguamatics I2E first adopted in pharma/biotech, including 9 of top 10

• Used to answer a wide range of questions e.g.

– Dose durations of follow-on clinical trials

– Therapeutic usages of recombinant proteins

– Cofactors for Nuclear Receptors

• Wide range of application areas across the drug pipeline e.g.

– Target Prioritization, Safety/Toxicity, Clinical Trial Design

– Competitive Intelligence, Marketing

• Are the Life Sciences special?

– Particularly knowledge intensive, so high demand

– A lot of complex, ambiguous terminology, balanced by good resources

• I2E is a generic platform

– Now being used in chemicals, consumer products, health …

5

Technology Adoption

Page 6: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

Scientific LiteratureAbstracts e.g. MEDLINE

Full text journal articles

e.g. via Quosa

Patents

News

Conference Abstracts

Internal documents

Social MediaTwitter

6

Example Data Sources

Competitive Intelligence,

R&D, Marketing

Clinical Trial Design, Safety,

Relative Efficacy of Drugs

Clinical Trials

FDA Drug Label Inserts

Electronic Health Records

MEDLINE (20 million abstracts)

Patent Full Text

USPTO, WIPO, EPO …

Clinical Trials …

Cloud Based Service

Page 7: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

Actionable

InformationAd hoc queries Smart queries Multi queries Batch queries

NLP Class, conceptRegular

Expression

Chemical

Structure

Agile NLP Querying

Linguamatics I2E: An Agile Text Mining Platform

Internal/External

Sources

Decision Support

Documents

Domain knowledge – ontologies, thesauri, dictionariesOntologies

Flexible, Highly Scalable Rich IndexesIndexing

Structured Results

(Actionable

Information)

7

Page 8: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Using I2E to Extract Relationships

8

Page 9: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleImprove Recall Using Terminologies

Which genes are known to affect breast cancer?

(ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect Breast Cancer

(ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect

(“Breast Cancer” OR “Breast Carcinoma” OR “Cancer of Breast” OR ...)

Could be 10,000s of terms

Could be 100s of terms

9

Any Gene/Protein Relationship Breast Cancer

Page 10: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleExtracting Relationships using NLP: Biomarkers

Gene(from Entrez)

Complex linguistic relationship

Disease(from MedDRA)

Relevant sentence extracted with terms highlighted

Link to source document

Page 11: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleExtracting Numerical Data in Context: Safety

CompoundPotential safety issues

In this organ At this dosage

11

Page 12: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleStandardizing/Clustering to Save Review

• Go directly to answers, e.g. find all the genes associated with a specific disease

• Highlighted evidence and link to the document

• Save time in review:

– Rather than reading all 470 documents for ERBB2, just read enough to check relationship exists

– Can then concentrate on the longer tail of less well-known genes/proteins

• Customer reports of order of magnitude speed-up

12

Page 13: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Finding the Most Relevant Documents

13

Page 14: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleFinding Relevant Documents

• More comprehensive results

– Terminologies

– Regular Expressions

• Precise expressions e.g. for miRNA (simplified)

» let-?\d+.*

» mirn?a?-?\d+.*

– High throughput searches

• lists of 500 genes, chemicals etc.

– Chemical substructure and similarity searching

• Reduced noise in results

– Use of linguistics rather than distance e.g. n words

– Regions

• abstract, methods, claims, claim, table

– Local negation e.g. “dead” for death, but not “dead time”

14

Page 15: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleMore Comprehensive Results: Terminologies

• Re-use terminologies with 10s of thousands of concepts, and 100s of thousands of synonyms

• If we are interested in genes associated with cancer, we don’t just want synonyms of “cancer” e.g. “malignant neoplasm”, but also any specific cancer

Malignant neoplasm

Malignant tumor

Malignancy

Carcinoma

….

Cancer Leukaemia

Hamartoma

Paraneoplastic Syndromes

Adamintonoma

Peutz-Jeghers Syndromes

Hydatiform Mole

Plus 1000s more…

Ways of Expressing concept of “Cancer”

Different kinds of Cancer

15

Page 16: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleChemical Text Mining

• Ability to efficiently answer precise queries e.g.

– What chemicals with this substructure act as inhibitors

16

Page 17: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Accelerating a Search Strategy

17

Page 18: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

Looking for words before or after the word of interest

18

Contextual Landscaping

Results from 100K

USPTO Patents

Linguistics to reduce noise, to find types of chocolate

Page 19: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• In search we can often restrict to a particular region of a

document

• In text mining we can first check all the regions that the term

appears in

– E.g. look for the regions that IC50 appears in

• See what you want and what you might lose

– We then have better evidence to restrict to particular regions

19

Which Regions Does this Term Appear In

Results from 100K 2011 Patents

Page 20: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Find most frequent codes in patents where a company or set

of companies (e.g. pharma) is the Applicant or Assignee

20

What IPC Codes are Assigned to a Company’s Patents

Results from recent USPTO Patents

Page 21: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Summarizing Results from Multiple

Documents

for Efficient Review and Integration

Page 22: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Find and extract patterns

to find a particular

concept

• For example: how do we

distinguish people:

– willing to get a vaccine

– not willing

• If we can partition the

two populations we can

then see how they are

influenced

Semantics: 1000s of ways of saying the same thing

22

Examples from Twitter

Concept of getting a vaccine

Page 23: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Terminologies can be used to link individual synonyms to a semantic concept

• Concept is uniquely identified by the ontology and the node identifier (and, typically, by the preferred term)

• This allows better clustering of results, better statistics, and allows us to connect results from text mining with other databases, or the semantic web

Semantics: Standardized Identifiers for Concepts

23

Page 24: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Extract the same relationship, even if it is expressed very differently e.g.

– SRC phosphorylates EGFR

– phosphorylation of EGFR by SRC

– EGFR is phosphorylated by SRC

• Establish the direction of the relationship

– SRC phosphorylates EGFR, not EGFR phosphorylates SRC

Semantics: Directed Relationships

24

Page 25: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Finding Information Across Multiple

Documents

Page 26: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Categorizing Patents according to Disease Area

– Linguistics and ontologies provide better context, greater insight

26

Document Categorization

Freemind

400K Patents

Page 27: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleTell me everything about X

• Search provides most relevant documents mentioning X

• Text mining can summarize distinct properties of X by

clustering facts extracted from all documents

27

Page 28: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

Visualization via Cytoscape

Linking Knowledge: Indirect Relationships

28

Thalidomide to a Gene/Protein

A Gene/Protein to Angiogenesis

Page 29: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Within a Document

Page 30: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleLinking Within Complex Patent Documents

• Linking information in one part of a patent to another e.g.

– Finding compounds with a particular substructure where a value is

reported

30

Page 31: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleFrom One Claim to the Next

• For information in claims, often want to work back along

the chain of claims

31

Page 32: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Reproducible Workflows

32

Page 33: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Queries can be run regularly as

documents get updated

• Integrate text mining with workflow

tools for

– up-to-date dashboards

– alerts

– integration with other structured data

sources

– web portals

• Provide a range of analytics and

visualization

33

Automation

Pipeline Pilot

Page 34: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results

Intervention

Using Pipeline Pilot Visualization

34

Page 35: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results

Dates

Colours indicate Phase

Using Pipeline Pilot visualization

35

Page 36: II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

Click to edit Master title styleClick to edit Master title style

• Agile Text Mining provides query power and flexibility

– Can address the long tail of less predictable questions

– Allows “queries of arbitrary complexity” combining

• NLP, ontologies, regular expressions, regions, numerical expressions,

chemical substructure/similarity, disambiguation

– Uses the data itself to

• inform search strategies

• build terminologies

– Provides flexible, structured output to

• allow integration with existing structured data

• fit into existing workflows

• This enables very wide application of text mining: wherever

there is a need to search, extract, or categorize information

Conclusions

36