![Page 1: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/1.jpg)
Text Mining Tools: Instruments for Scientific Discovery
Marti Hearst, UC Berkeley SIMS
IMA Text Mining Workshop, April 17, 2000
![Page 2: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/2.jpg)
Marti Hearst -- IMA TM Workshop
Outline
What knowledge can we discover from text?
How is knowledge discovered from other kinds of data?
A proposal: let’s make a new kind of scientific instrument/tool.
Note: this talk shares some material and themes with another of my talks, “Untangling Text Data Mining”.
![Page 3: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/3.jpg)
What is Knowledge Discovery from Text?
![Page 4: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/4.jpg)
What is Knowledge Discovery from Text?
Finding a document?
Finding a person’s name in a document?
This information is already known to the author at least.
Needles in Haystacks
Needlestacks
![Page 5: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/5.jpg)
What to Discover from Text?
What news events happened last year?
Which researchers most influenced a field?
Which inventions led to other inventions?
Historical, Retrospective
![Page 6: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/6.jpg)
What to Discover from Text?
What are the most common topics discussed in this set of documents?
How connected is the Web?
What words best characterize this set of documents’ topics?
Which words are good triggers for a topic classifier/filter?
Summaries of the data itself
Features used in algorithms
![Page 7: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/7.jpg)
Classifying Application Types
| | Patterns | Non-Novel Nuggets | Novel Nuggets |
|---|---|---|---|
| Non-textual data | Standard data mining | Database queries | AI Discovery Systems |
| Textual data | Computational linguistics | Information retrieval | Real text data mining |
![Page 8: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/8.jpg)
The Quandary
How do we use text to both
– Find new information not known to the author of the text
– Find information that is not about the text itself?
![Page 9: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/9.jpg)
Idea: Exploratory Data Analysis
Use large text collections to gather evidence to support (or refute) hypotheses
– Not known to author: Make links across many texts
– Not self-referential: Work within the text domain
![Page 10: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/10.jpg)
The Process of Scientific Discovery
Four main steps (Langley et al. 87):
– Gathering data
– Finding good descriptions of data
– Formulating explanatory hypotheses
– Testing the hypotheses
My Claim: We can do this with text as the data!
![Page 11: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/11.jpg)
Scientific Breakthroughs
New scientific instruments lead to revolutions in discovery
– CAT scans, fMRI
– Scanning tunneling electron microscope
– Hubble telescope
Idea: Make a New Scientific Instrument!
![Page 12: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/12.jpg)
How Has Knowledge been Discovered in Non-Textual Data?
Discovery from databases involves finding patterns across the data in the records
– Classification
  » Fraud vs. non-fraud
– Conditional dependencies
  » People who buy X are likely to also buy Y with probability P
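A conditional dependency of this kind can be estimated by simple counting. The following is a minimal sketch, assuming transactions are stored as sets of item names; the function name `rule_confidence` and the toy baskets are invented for illustration and are not from the talk.

```python
def rule_confidence(transactions, x, y):
    """Estimate P(y in basket | x in basket) from a list of item sets."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return None  # x never occurs, so the conditional is undefined
    return sum(1 for t in with_x if y in t) / len(with_x)


# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]

confidence = rule_confidence(baskets, "bread", "butter")
print(f"P(butter | bread) is estimated as {confidence:.2f}")
```

Real data mining systems also track how often X and Y occur at all (support), so that a high P computed from a handful of records is not over-interpreted.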
![Page 13: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/13.jpg)
How Has Knowledge been Discovered in Non-Textual Data?
Old AI work (early 80’s):
– AM/Eurisko (Lenat)
– BACON, STAHL, etc. (Langley et al.)
– Expert Systems
A Commonality:
– Start with propositions
– Try to make inferences from these
Problem:
– Where do the propositions come from?
![Page 14: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/14.jpg)
Intensional vs. Extensional
Database structure:
– Intensional: The schema
– Extensional: The records that instantiate the schema
Current data mining efforts make inferences from the records
Old AI work made inferences from what would have been the schemata
– employees have salaries and addresses
– products have prices and part numbers
![Page 15: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/15.jpg)
Goal: Extract Propositions from Text and Make Inferences
![Page 16: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/16.jpg)
Why Extract Propositions from Text?
Text is how knowledge at the propositional level is communicated
Text is continually being created and updated by the outside world
– So knowledge base won’t get stale
![Page 17: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/17.jpg)
Example: Etiology
Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise
find causal links among titles
– symptoms
– drugs
– results
![Page 18: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/18.jpg)
Swanson Example (1991)
Problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability
All extracted from medical journal titles
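The linking step behind this example can be sketched as a search for concepts that co-occur with both endpoints when the endpoints themselves are never linked directly. This is only an illustration, not Swanson's actual procedure: the pairs are a hand-encoded reading of the eight bullets above, and the function names are assumptions of this sketch.

```python
from collections import defaultdict

# Each pair is a hand-encoded reading of one bullet above: the two concepts
# that a single title connects. The encoding is an assumption of this sketch.
links = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]


def build_graph(pairs):
    """Undirected adjacency sets built from concept pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def bridging_concepts(pairs, start, target):
    """Concepts linked to both start and target when no direct link exists."""
    graph = build_graph(pairs)
    if target in graph[start]:
        return set()  # already directly connected, nothing to discover
    return graph[start] & graph[target]


bridges = bridging_concepts(links, "migraine", "magnesium")
print(sorted(bridges))
```

No single title mentions migraine and magnesium together; the hypothesis only appears once the links are followed across titles.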
![Page 19: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/19.jpg)
Gathering Evidence
[Diagram; node labels: migraine, stress, CCB, PA, SCD, with a magnesium node beside each of the four intermediate concepts]
![Page 20: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/20.jpg)
Gathering Evidence
[Diagram; node labels: migraine, magnesium, stress, CCB, PA, SCD]
![Page 21: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/21.jpg)
Swanson’s TDM
Two of his hypotheses have received some experimental verification.
His technique
– Only partially automated
– Required medical expertise
Few people are working on this.
![Page 22: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/22.jpg)
One Approach: The LINDI Project
Linking Information for New Discoveries
Three main components:
– Search UI for building and reusing hypothesis-seeking strategies.
– Statistical language analysis techniques for extracting propositions from text.
– Probabilistic ontological representation and reasoning techniques
![Page 23: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/23.jpg)
LINDI
First use category labels to retrieve candidate documents,
Then use language analysis to detect causal relationships between concepts,
Represent relationships probabilistically, within a known ontology,
The (expert) user
– Builds up representations
– Formulates hypotheses
– Tests hypotheses outside of the text system.
![Page 24: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/24.jpg)
Objections
Objection:
– This is GOF NLP, which doesn’t work
Response:
– GOF NLP required hand-entering of knowledge
– Now we have statistical techniques and very large corpora
![Page 25: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/25.jpg)
Objections
Objection:
– Reasoning with propositions is brittle
Response:
– Yes, but now we have mature probabilistic reasoning tools, which support
  » Representation of uncertainty and degrees of belief
  » Simultaneously conflicting information
  » Different levels of granularity of information
![Page 26: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/26.jpg)
Objections
Objection:
– Automated reasoning doesn’t work
Response:
– We are not trying to automate all reasoning, rather we are building new powerful tools for
  » Gathering data
  » Formulating hypotheses
![Page 27: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/27.jpg)
Objections
Objection:
– Isn’t this just information extraction?
Response:
– IE is a useful tool that can be used in this endeavor, however
  » It is currently used to instantiate pre-specified templates
  » I am advocating coming up with entirely new, unforeseen “templates”
![Page 28: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/28.jpg)
Traditional Semantic Grammars
Reshape syntactic grammars to serve the needs of semantic processing.
Example (Burton & Brown 79)
– Interpreting “What is the current thru the CC when the VC is 1.0?”
  <request> := <simple/request> when <setting/change>
  <simple/request> := what is <measurement>
  <measurement> := <meas/quant> <prep> <part>
  <setting/change> := <control> is <control/value>
  <control> := VC
– Resulting semantic form is:
  (RESETCONTROL (STQ VC 1.0) (MEASURE CURRENT CC))
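To make the mapping from grammar rules to semantic form concrete, here is a toy recogniser for this one sentence shape. It is a sketch only, not Burton & Brown's system: the regular expression collapses the five rules into a single pattern, and the names `REQUEST` and `interpret`, as well as accepting "through" alongside "thru", are assumptions made for illustration.

```python
import re

# One regular expression standing in for the grammar rules on this slide.
# Each named group corresponds to a nonterminal, noted on the right.
REQUEST = re.compile(
    r"what is the (?P<quantity>\w+) "   # <meas/quant>
    r"(?:thru|through) "                # <prep>
    r"the (?P<part>\w+) "               # <part>
    r"when the (?P<control>\w+) "       # <control>
    r"is (?P<value>[\d.]+)\??$",        # <control/value>
    re.IGNORECASE,
)


def interpret(question):
    """Return the semantic form for a matching request, else None."""
    match = REQUEST.match(question.strip())
    if match is None:
        return None
    quantity = match.group("quantity").upper()
    part = match.group("part").upper()
    control = match.group("control").upper()
    value = match.group("value")
    return f"(RESETCONTROL (STQ {control} {value}) (MEASURE {quantity} {part}))"


form = interpret("What is the current thru the CC when the VC is 1.0?")
print(form)
```

The brittleness is easy to see: any rewording that falls outside the pattern returns no interpretation at all.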
![Page 29: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/29.jpg)
Statistical Semantic Grammars
Empirical NLP has made great strides
– But mainly applied to syntactic structure
Semantic grammars are powerful, but
– Brittle
– Time-consuming to construct
Idea:
– Use what we now know about statistical NLP to build up a probabilistic grammar
![Page 30: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/30.jpg)
Example: Statistical Semantic Grammar
To detect causal relationships between medical concepts
– Title: Magnesium deficiency implicated in increased stress levels.
– Interpretation: <nutrient><reduction> related-to <increase><symptom>
– Inference:
  » Increase(stress, decrease(mg))
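The title-to-inference step can be illustrated with a small class-tagging sketch. It uses a hand-written lexicon where the proposal calls for classes learned statistically from a corpus, so it shows the shape of the mapping rather than the method; every lexicon entry and function name here is an assumption made for this one title.

```python
# Hand-written lexicon standing in for classes that a statistical model would
# assign. Each word maps to (semantic class, value used in the inference).
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "implicated": ("related-to", None),
    "increased": ("increase", "Increase"),
    "stress": ("symptom", "stress"),
}

# The class sequence from the Interpretation line on this slide.
CAUSAL_PATTERN = ["nutrient", "reduction", "related-to", "increase", "symptom"]


def tag(title):
    """Return (class, value) pairs for the words the lexicon knows."""
    words = title.lower().rstrip(".").split()
    return [LEXICON[w] for w in words if w in LEXICON]


def infer(title):
    """Build the inference string if the title matches the causal pattern."""
    tagged = tag(title)
    if [cls for cls, _ in tagged] != CAUSAL_PATTERN:
        return None
    nutrient, reduction, _, increase, symptom = (value for _, value in tagged)
    return f"{increase}({symptom}, {reduction}({nutrient}))"


inference = infer("Magnesium deficiency implicated in increased stress levels.")
print(inference)
```

A probabilistic grammar would replace the exact pattern match with scored alternatives, so that unseen wordings still receive a most likely interpretation.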
![Page 31: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/31.jpg)
Example: Using Semantics + Ontologies
acute migraine treatment
intra-nasal migraine treatment
![Page 32: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/32.jpg)
Example: Using Semantics + Ontologies
[acute migraine] treatment
intra-nasal [migraine treatment]
We also want to know the meaning of the attachments, not just which way the attachments go.
![Page 33: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/33.jpg)
Example: Using Semantics + Ontologies
acute migraine treatment
<severity> <disease> <treatment>
intra-nasal migraine treatment
<Drug Admin Routes> <disease> <treatment>
![Page 34: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/34.jpg)
Example: Using Semantics + Ontologies
acute migraine treatment
<severity> <disease> <treatment>
<severity> <Cerebrovascular Disorders> <treatment>
intra-nasal migraine treatment
<Drug Admin Routes> <disease> <treatment>
<Administration, Intranasal> <disease> <treatment>
Problem: which level(s) of the ontology should be used?
We are taking an information-theoretic approach.
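The slide does not say which information-theoretic measure is used. One common candidate is the information content of a class, the negative log of its relative frequency, which is low for general classes and high for specific ones. The sketch below assumes that measure; the ancestor chain follows the labels on this slide, while the counts are invented for illustration and do not come from the talk.

```python
import math

# Ancestor chain from specific to general, using the labels on this slide.
CHAIN = ["migraine", "Cerebrovascular Disorders", "disease"]

# Hypothetical corpus counts, invented for illustration only.
COUNTS = {"migraine": 50, "Cerebrovascular Disorders": 400, "disease": 10_000}
TOTAL = 20_000


def information_content(label):
    """Negative log2 of the relative frequency of an ontology class."""
    return -math.log2(COUNTS[label] / TOTAL)


scores = {label: information_content(label) for label in CHAIN}
for label in CHAIN:
    print(f"{label}: {scores[label]:.2f} bits")
```

Choosing a level then becomes a trade-off: a very general class matches many titles but says little, while a very specific class says a lot but has few examples to learn from.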
![Page 35: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/35.jpg)
The User Interface
A general search interface should support
– History
– Context
– Comparison
– Operator Reuse
– Intersection, Union, Slicing
– Visualization (where appropriate)
We are developing such an interface as part of a general search UI project.
![Page 36: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/36.jpg)
Summary
Let’s get serious about discovering new knowledge from text
This will build on existing technologies
This also requires new technologies
![Page 37: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/37.jpg)
Summary
Let’s get serious about discovering new knowledge from text
– We can build a new kind of scientific instrument to facilitate a whole new set of scientific discoveries
– Technique: linking propositions across texts (Jensen, Harabagiu)
![Page 38: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/38.jpg)
Summary
This will build on existing technologies
– Information extraction (Riloff et al., Hobbs et al.)
– Bootstrapping training examples (Riloff et al.)
– Probabilistic reasoning
![Page 39: Text Mining Tools: Instruments for Scientific Discovery](https://reader035.vdocuments.mx/reader035/viewer/2022081512/56815884550346895dc5e668/html5/thumbnails/39.jpg)
Summary
This also requires new technologies
– Statistical semantic grammars
– Dynamic ontology adjustment
– Flexible search UIs