Whitney St.Charles, Research Alliance in Math and Science 2007. Mentors: Yu (Cathy) Jiao, Ph.D.; Robert Patton, Ph.D. Computational Sciences and Engineering Division. Using TF-IDF Anomalies to Cluster Documents on Subject Matter. Natural Language Processing and Computational Linguistics: An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies



Whitney St.Charles
Research Alliance in Math and Science 2007

Mentors:
Yu (Cathy) Jiao, Ph.D.
Robert Patton, Ph.D.

Computational Sciences and Engineering Division

Using TF-IDF Anomalies to Cluster Documents on Subject Matter

Natural Language Processing And Computational Linguistics

An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies


OAK RIDGE NATIONAL LABORATORY, U.S. DEPARTMENT OF ENERGY

Purposes of document clustering

• Data overabundance
– YouTube generates 200 terabytes of data per day
– How do we sift through those kinds of quantities?
• Searching
– Reduces the set tremendously
• Document clustering
– Is a knowledge discovery technique
– Categorizes results into meaningful groups
– Allows the user to browse quickly to the target


Document clustering users

• Financial analysts
– Identify certain trends to develop forecasts about a particular company
• Business intelligence
– Identify products that are associated with or dependent upon one another
• Military
– Identify terrorist cells from blog activity and movement of materials
• You!
– Narrow down hundreds of thousands of internet search results to find the kinds of sites you want


Current document clustering technique

• A word-by-word comparison of each document is made to determine similarity
• Unfortunately, this method…
– Does not handle context very well
– Compares several hundred/several thousand words for each document
– Is very computationally expensive
– Requires expensive SIMD machines
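To make the word-by-word comparison concrete, here is a minimal Python sketch. The slides do not say which similarity measure is used, so cosine similarity over raw word counts is an assumption, and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Count every lowercase, whitespace-separated word in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    # A Counter returns 0 for missing words, so unshared words add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
doc_a = term_vector("the airline cancelled the flight")
doc_b = term_vector("the airline delayed the flight")
doc_c = term_vector("the chef cooked a good meal")

print(cosine_similarity(doc_a, doc_b))  # high: most words are shared
print(cosine_similarity(doc_a, doc_c))  # low: only "the" is shared
```

Every distinct word becomes a vector dimension, which is why the comparison grows expensive for long documents and large collections.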


Contributions to the field

• Identify only those words which are more indicative of the subject matter
– If "airline" occurs 20% more than is "normal," it has something to do with the subject
• Examine both simple and complex noun phrases to address the context of the document
• Generate much smaller vectors, containing an average of 82% fewer terms!
• Cluster more accurately because only "important" words are chosen
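The "20% more than normal" idea can be sketched as a simple threshold test. This is an illustration only: the baseline rates, the sample document, and the choice to skip terms without a baseline are all assumptions, not details taken from the slides.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Share of the document taken up by each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def anomalous_terms(doc_tokens, baseline, threshold=0.20):
    """Terms whose rate in the document exceeds the baseline rate by more than threshold."""
    flagged = {}
    for term, rate in relative_frequencies(doc_tokens).items():
        normal = baseline.get(term)
        if normal is None:
            continue  # assumption: terms with no baseline are ignored
        if rate > normal * (1 + threshold):
            flagged[term] = rate / normal  # how many times "normal" it occurs
    return flagged

# Hypothetical baseline rates and document.
baseline = {"the": 0.40, "airline": 0.05, "flight": 0.20}
doc = ["the", "airline", "airline", "flight", "the",
       "the", "airline", "late", "the", "flight"]

print(sorted(anomalous_terms(doc, baseline)))  # ['airline']
```

Only the flagged terms would go into the document's vector, which is how the vectors end up much smaller than a full word list.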


Our method


Establishing the baseline

• Train the program to recognize what is "normal" for a given term
– Need an entire English language corpus
• Corpus: a large, structured set of texts compiled to be representative of a language
– Uses hundreds of thousands of words in every allowable way
• Using a corpus, the program can
– Establish usage statistics
– Learn linguistic rules
• Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm
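As a rough illustration of "establishing usage statistics", the sketch below derives two baseline figures per term from a tokenized corpus: its overall rate and the number of texts it appears in. The three-text corpus is a hypothetical stand-in; a real baseline would come from a full corpus such as the Brown Corpus.

```python
from collections import Counter

def build_baseline(corpus_documents):
    """Return (rate of each term across the corpus, number of texts containing each term)."""
    term_counts = Counter()
    doc_counts = Counter()
    for tokens in corpus_documents:
        term_counts.update(tokens)      # every occurrence counts
        doc_counts.update(set(tokens))  # each text counts once per term
    total = sum(term_counts.values())
    rates = {term: n / total for term, n in term_counts.items()}
    return rates, dict(doc_counts)

# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    ["the", "plane", "landed"],
    ["the", "chef", "cooked"],
    ["the", "plane", "left", "the", "gate"],
]
rates, doc_counts = build_baseline(toy_corpus)
print(doc_counts["the"], doc_counts["plane"])  # 3 2
```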


Extracting words and phrases


Part-of-speech tagging

• Tags every word in the sentence with the correct part-of-speech
• Achieves an accuracy of 97.24%
• Is necessary because token extraction methods are each dependent upon correct tagging
• Passes the tagged sentence to the token extractor


Token extractor

• Extracts
– Words
– Simple noun phrases
– Complex noun phrases


Word extraction

• Uses POS-tagged data to identify only adjectives, verbs, and nouns
• Uses the Porter stemmer to identify unique words
– Cuts common suffixes such as -ing, -tion, -e, -es, -s
– Example: "recreation" and "recreational" are both identified as "recreat"
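A minimal sketch of this step, under stated assumptions: the input is already tagged with Penn Treebank-style tags (the slides do not name the tag set), and `toy_stem` is a deliberately crude suffix stripper standing in for the real Porter stemmer, which applies a much longer, ordered set of rules.

```python
# Assumed Penn Treebank-style prefixes: NN* nouns, VB* verbs, JJ* adjectives.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")

# Hypothetical suffix list, checked in this order; NOT the Porter rule set.
TOY_SUFFIXES = ("ional", "ion", "ing", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def extract_words(tagged_tokens):
    """Keep nouns, verbs, and adjectives, lowercased and stemmed."""
    return [
        toy_stem(word.lower())
        for word, tag in tagged_tokens
        if tag.startswith(CONTENT_TAG_PREFIXES)
    ]

tagged = [("The", "DT"), ("recreational", "JJ"), ("center", "NN"),
          ("offers", "VBZ"), ("recreation", "NN")]
print(extract_words(tagged))  # ['recreat', 'center', 'offer', 'recreat']
```

Because both "recreational" and "recreation" collapse to the same stem, they count as one term when frequencies are tallied.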


Why nouns?

• Are named entities
• Answer the question "What"
• Are less ambiguous than verbs
– Example: "cook up a good meal" or "cook up a new solution"


Simple noun phrase extraction

• Accepts only consecutive nouns
– Example: summer intern, union representative
• Provides a set of short, highly descriptive phrases
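The consecutive-noun rule is simple enough to sketch directly. This version assumes Penn Treebank-style tags (noun tags start with "NN") and treats only runs of two or more nouns as phrases, since single nouns are already covered by word extraction; both choices are assumptions.

```python
def simple_noun_phrases(tagged_tokens):
    """Collect every run of two or more consecutive nouns as one phrase."""
    phrases, run = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            run.append(word)
            continue
        if len(run) >= 2:
            phrases.append(" ".join(run))
        run = []  # a non-noun ends the current run
    if len(run) >= 2:  # flush a run that reaches the end of the sentence
        phrases.append(" ".join(run))
    return phrases

tagged = [("The", "DT"), ("summer", "NN"), ("intern", "NN"), ("met", "VBD"),
          ("the", "DT"), ("union", "NN"), ("representative", "NN")]
print(simple_noun_phrases(tagged))  # ['summer intern', 'union representative']
```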


Complex noun phrase extraction techniques

• Static rule-based / finite state automata
– Rely on the aptitude of the linguist formulating the rule set
• Machine learning
– Rely on the "completeness" of the training set


Static rule-based extraction

• Establishes a list of linguistic rules
– A determiner preceding a noun marks the beginning of a noun phrase
– A determiner may not precede a noun phrase

[Diagram: noun phrase structure, with slots labeled determiner/adjective; noun/pronoun; adjective; relative clause/prepositional phrase/noun; noun/pronoun/determiner]
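One way to picture a static rule set is as a small finite-state scan over the tags. The sketch below hard-codes a single hypothetical pattern (optional determiner, any adjectives, one or more nouns) with Penn Treebank-style tags assumed; it is far smaller than a linguist's real rule list and is not the rule set used in this work.

```python
def chunk_noun_phrases(tagged_tokens):
    """Match the pattern: optional determiner, any adjectives, one or more nouns."""
    phrases = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        start = i
        if tagged_tokens[i][1] == "DT":  # a determiner may open a phrase
            i += 1
        while i < n and tagged_tokens[i][1].startswith("JJ"):
            i += 1
        noun_start = i
        while i < n and tagged_tokens[i][1].startswith("NN"):
            i += 1
        if i > noun_start:  # at least one noun was found
            phrases.append(" ".join(word for word, _ in tagged_tokens[start:i]))
        elif i == start:    # token fits no slot; skip it
            i += 1
        # otherwise a determiner/adjective had no noun; rescan from i
    return phrases

tagged = [("The", "DT"), ("man", "NN"), ("whose", "WP$"), ("red", "JJ"),
          ("hat", "NN"), ("I", "PRP"), ("borrowed", "VBD"), ("yesterday", "NN"),
          ("lives", "VBZ"), ("next", "JJ"), ("door", "NN")]
print(chunk_noun_phrases(tagged))
# ['The man', 'red hat', 'yesterday', 'next door']
```

Any construction the pattern does not anticipate is silently missed, which is the main weakness of a fixed rule list.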


Static extraction shortcomings

• Unanticipated rules
• The subjective nature of language
• Difficulty finding non-recursive, base NPs
– [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
– [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
• Structural ambiguity


Structural ambiguity example

"I saw the man with the telescope."


Machine learning extraction

• Is all about
• Uses a corpus
• Is based on statistics
– The more it sees a particular occurrence, the more likely it is to prefer it
• Makes better educated guesses about structural ambiguity
• Discovers thousands of unanticipated rules


Transformation-based complex noun phrase extraction

An 'error-driven' approach for learning an ordered set of rules

1. Generate all rules that correct at least one error.
2. For each rule:
(a) Apply to a copy of the most recent state of the training set.
(b) Score result.
3. Select rule with best score.
4. Update training set by applying selected rule.
5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
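The loop above can be sketched on a toy tagging task. Everything specific here is an assumption made for illustration: the single rule template ("change tag A to B when the previous tag is Z"), the scoring (net number of errors removed), and the sample data. The sketch also checks the stopping test before applying a rule, so a rule scoring below T is never applied.

```python
def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def apply_rule(rule, tags):
    """Return a changed copy of tags; the input list is left untouched."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags

def candidate_rules(tags, gold):
    """Step 1: every rule that would correct at least one current error."""
    return {
        (tags[i], gold[i], tags[i - 1])
        for i in range(1, len(tags))
        if tags[i] != gold[i]
    }

def learn_rules(initial_tags, gold, threshold=1):
    tags, learned = list(initial_tags), []
    while True:
        best_rule, best_score = None, 0
        for rule in sorted(candidate_rules(tags, gold)):   # step 2
            trial = apply_rule(rule, tags)                 # step 2(a)
            score = count_errors(tags, gold) - count_errors(trial, gold)  # 2(b)
            if score > best_score:                         # step 3
                best_rule, best_score = rule, score
        if best_rule is None or best_score < threshold:    # step 5
            return learned, tags
        tags = apply_rule(best_rule, tags)                 # step 4
        learned.append(best_rule)

# Hypothetical data: "can" is first guessed as a modal (MD) everywhere.
gold = ["DT", "NN", "VB", "PRP", "MD", "VB", "DT", "NN", "VB"]
initial = ["DT", "MD", "VB", "PRP", "MD", "VB", "DT", "MD", "VB"]
rules, final_tags = learn_rules(initial, gold)
print(rules)  # [('MD', 'NN', 'DT')]
```

The learned rules form an ordered list: applying them in the same order to new text reproduces the corrections found during training.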


Determining anomaly sets

• TF-IDF: Term Frequency – Inverse Document Frequency
– Number of local occurrences of term multiplied by uniqueness measure of term in document set
• TF-ICF: Term Frequency – Inverse Corpus Frequency
– Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
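The two weightings can be sketched as follows. The slides give no formulas, so the ones used here are assumptions: `tf_idf` uses the common variant tf × log(N / df), and `tf_icf` swaps in counts from a reference corpus with add-one smoothing so unseen terms do not divide by zero. The sample documents and corpus counts are hypothetical.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, document_set):
    """Weight each term by tf * log(N / df) within the document set.

    Assumes doc_tokens is one of the documents in document_set, so df >= 1.
    """
    n_docs = len(document_set)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for doc in document_set if term in doc)
        weights[term] = count * math.log(n_docs / df)
    return weights

def tf_icf(doc_tokens, corpus_doc_counts, corpus_size):
    """Same idea, but the uniqueness measure comes from a reference corpus."""
    return {
        term: count * math.log((corpus_size + 1) / (corpus_doc_counts.get(term, 0) + 1))
        for term, count in Counter(doc_tokens).items()
    }

docs = [["airline", "flight", "airline"], ["flight", "delay"], ["chef", "meal"]]
weights = tf_idf(docs[0], docs)
print(weights["airline"] > weights["flight"])  # True: "airline" is rarer in the set

# Hypothetical corpus statistics: "the" appears in all 99 corpus texts.
icf = tf_icf(["the", "airline"], {"the": 99}, corpus_size=99)
print(icf["the"])  # 0.0
```

A practical difference is that the corpus-based weight needs only the single document plus precomputed corpus counts, not the whole document set being clustered.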


Each document has its own anomaly vector


Clustering the data

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
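UPGMA is average-linkage agglomerative clustering: it repeatedly merges the two clusters whose members have the smallest mean pairwise distance. Below is a small pure-Python sketch; the distance matrix is hypothetical, and stopping at a fixed number of clusters is an assumption, since the slides do not say how the hierarchy is cut.

```python
from itertools import combinations

def average_distance(cluster_a, cluster_b, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[a][b] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def upgma(dist, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [(i,) for i in range(len(dist))]  # start with one document each
    while len(clusters) > num_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return sorted(clusters)

# Hypothetical symmetric distances between four documents.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(upgma(dist, 2))  # [(0, 1), (2, 3)]
```

This version recomputes every average on each pass for clarity; a production implementation would update a distance table incrementally instead.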



Performance Metrics Used

$$\text{Precision } (P) = \frac{\text{number of correct responses}}{\text{number of responses}}$$

$$\text{Recall } (R) = \frac{\text{number of correct responses}}{\text{number correct in key}}$$

$$\text{F-measure} = \frac{2RP}{R + P}$$
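These three metrics translate directly into code. In the sketch below the responses and the key are modeled as sets of hypothetical document IDs, and returning 0.0 when a denominator is zero is an assumption.

```python
def precision_recall_f(responses, key):
    """Return (precision, recall, F-measure) for a set of responses against a key."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# Hypothetical example: 4 documents returned, 3 in the key, 2 in common.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```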



Cluster Results using Vector Space Model

Cluster Results using modified Vector Space Model with anomaly sets


Future Work

• Determine clustering results for both simple and complex noun phrases
• Apply the approach to other clustering techniques, such as swarming


Acknowledgements

The Research Alliance in Math and Science program

Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy.

Dr. Cathy Jiao

Dr. Robert Patton

Dr. Thomas Potok
