
Page 1:

CS 8520: Artificial Intelligence

Natural Language Processing Introduction

Paula Matuszek

Spring, 2010

Page 2:

Natural Language Processing
• speech recognition
• natural language understanding
• computational linguistics
• psycholinguistics
• information extraction
• information retrieval
• inference
• natural language generation
• speech synthesis
• language evolution

Page 3:

Applied NLP
• Machine translation
• Spelling/grammar correction
• Information retrieval
• Data mining
• Document classification
• Question answering, conversational agents

Page 4:

Natural Language Understanding

[Pipeline diagram: sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation]

Page 5:

Sounds, Symbols, Sense

Natural Language Understanding

[Same pipeline diagram: sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation]

Page 6:

Where are the words?
• “How to recognize speech, not to wreck a nice beach”
• “The cat scares all the birds away”
• “The cat’s cares are few”

- pauses in speech bear little relation to word breaks
+ intonation offers additional clues to meaning

Page 7:

Dissecting words/sentences
• “The dealer sold the merchant a dog”
• “I saw the Golden Gate bridge flying into San Francisco”
• Word creation:
  establish
  establishment: the church of England as the official state church
  disestablishment
  antidisestablishment
  antidisestablishmentarian
  antidisestablishmentarianism: a political philosophy that is opposed to the separation of church and state

Page 8:

What does it mean?
• “I saw Pathfinder on Mars with a telescope”
• “Pathfinder photographed Mars”
• “The Pathfinder photograph from Ford has arrived”
• “When a Pathfinder fords a river it sometimes mars its paint job.”

Page 9:

What does it mean?
• “Jack went to the store. He found the milk in aisle 3. He paid for it and left.”
• “Q: Did you read the report?
  A: I read Bob’s email.”

Page 10:

The steps in NLP (Cont.)
• Morphology: concerns the way words are built up from smaller meaning-bearing units
• Syntax: concerns how words are put together to form correct sentences and what structural role each word has
• Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meanings

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Page 11:

The steps in NLP (Cont.)
• Pragmatics: concerns how sentences are used in different situations and how use affects the interpretation of the sentence
• Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Page 12:

Some of the Tools
• Regular expressions and finite state automata
• Part-of-speech taggers
• N-grams
• Grammars
• Parsers
• Semantic analysis

Page 13:

Parsing (Syntactic Analysis)
• Assigning a syntactic and logical form to an input sentence
• Uses knowledge about words and word meanings (lexicon)
• Uses a set of rules defining legal structures (grammar)

• Paula ate the apple.
  (S (NP (NAME Paula))
     (VP (V ate)
         (NP (ART the) (N apple))))

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
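This parse can be reproduced with a toy grammar; a minimal sketch using NLTK (assuming the nltk package is installed), where the tiny grammar and lexicon are invented for just this sentence:

```python
# A minimal parsing sketch with NLTK; the grammar below is a toy
# lexicon/grammar covering only "Paula ate the apple".
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> NAME | ART N
    VP -> V NP
    NAME -> 'Paula'
    ART -> 'the'
    N -> 'apple'
    V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Paula', 'ate', 'the', 'apple']):
    print(tree)  # (S (NP (NAME Paula)) (VP (V ate) (NP (ART the) (N apple))))
```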

Page 14:

Word Sense Resolution
• Many words have many meanings or senses
• We need to resolve which of the senses of an ambiguous word is invoked in a particular use of the word
• I made her duck. (made her a bird for lunch, or made her move her head quickly downwards?)

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
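For a concrete (if rough) illustration, NLTK ships a simplified Lesk disambiguator; a minimal sketch, assuming nltk and its WordNet corpus are available:

```python
# A minimal word-sense sketch using NLTK's simplified Lesk algorithm
# (run nltk.download('wordnet') first). Lesk picks the WordNet sense
# whose dictionary gloss best overlaps the context words; it is a
# rough heuristic, not a definitive disambiguator.
from nltk.wsd import lesk

sense = lesk("I made her duck".split(), 'duck')
print(sense, '-', sense.definition())
```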

Page 15:

Reference Resolution
• Domain knowledge (registration transaction)
• Discourse knowledge
• World knowledge

• U: I would like to register in a CSC course.
• S: Which number?
• U: Make it 8520.
• S: Which section?
• U: Which section is in the evening?
• S: Section 1.
• U: Then make it that section.

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Page 16:

Stages of NLP

[Flow diagram: surface form → morphological analysis (stems, using the lexicon) → syntactic analysis (parse tree) → semantic analysis (internal representation) → discourse analysis (resolve references) → pragmatic analysis (perform action for the user)]

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Page 17:

Human Languages
• You know ~50,000 words of your primary language, each with several meanings
• A six-year-old knows ~13,000 words
• For the first 16 years we learn 1 word every 90 minutes of waking time
• Mental grammar generates sentences; virtually every sentence is novel
• Three-year-olds already have 90% of grammar
• ~6,000 human languages, none of them simple!

Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil.Trans. Royal Society London

Page 18:

Human Spoken Language
• Most complicated mechanical motion of the human body
  – Movements must be accurate to within millimeters
  – Synchronized within hundredths of a second
• We can understand up to 50 phonemes/sec (normal speech is 10-15 phonemes/sec)
  – But if a sound is repeated 20 times/sec we hear a continuous buzz!
• All aspects of language processing are involved and manage to keep pace

Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil.Trans. Royal Society London

Page 19:

Why Language is Hard
• NLP is AI-complete
• Abstract concepts are difficult to represent
• LOTS of possible relationships among concepts
• Many ways to represent similar concepts
• Hundreds or thousands of features/dimensions

Page 20:

Why Language is Easy
• Highly redundant
• Many relatively crude methods provide fairly good results

Page 21:

History of NLP
• Prehistory (1940s, 1950s)
  – automata theory, formal language theory, Markov processes (Turing, McCulloch & Pitts, Chomsky)
  – information theory and probabilistic algorithms (Shannon)
  – Turing test: can machines think?
• Early work:
  – symbolic approach
    • generative syntax, e.g. Transformations and Discourse Analysis Project (TDAP, Harris)
    • AI: pattern matching, logic-based, special-purpose systems
    • Eliza, the Rogerian therapist: http://www.manifestation.com/neurotoys/eliza.php3
  – stochastic
    • Bayesian methods
• Early successes brought $$$$ grants! By 1966 the US government had spent $20 million on machine translation alone.
• Critics:
  – Bar-Hillel: “no way to disambiguate without deep understanding”
  – Pierce, NSF 1966 report: “no way to justify work in terms of practical output”

Page 22:

History of NLP
• The middle ages (1970-1990)
  – stochastic
    • speech recognition and synthesis (Bell Labs)
  – logic-based
    • compositional semantics (Montague)
    • definite clause grammars (Pereira & Warren)
  – ad hoc AI-based NLU systems
    • SHRDLU robot in blocks world (Winograd)
    • knowledge representation systems at Yale (Schank)
  – discourse modeling
    • anaphora
    • focus/topic (Grosz et al.)
    • conversational implicature (Grice)

Page 23:

History of NLP
• NLP Renaissance (1990-2000)
  Lessons from phonology & morphology successes:
  – finite-state models are very powerful
  – probabilistic models pervasive
  – Web creates new opportunities and challenges
  – practical applications driving the field again
• 21st Century NLP
  The web changes everything:
  – much greater use for NLP
  – much more data available

Page 24:

Information Retrieval

Page 25:

Finding Out About
• There are many large corpora of information that people use. The web is the obvious example. Others include:
  – scientific journals
  – patent databases
  – Medline
  – Usenet groups
• People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
• Information Retrieval, or IR, is the process of answering that information need.
• Simplest approach:
  – Knowledge is organized into chunks (pages or documents)
  – Goal is to return appropriate chunks

Page 26:

Information Retrieval Systems
• Goal of an information retrieval system is to return appropriate chunks
• Steps involved include:
  – asking a question
  – finding answers
  – evaluating answers
  – presenting answers
• Value of an IR tool depends on how well it does on all of these.
• Web search engines are the IR tools most familiar to most people.

Page 27:

Asking a Question
• Queries reflect some information need
• Query syntax needs to allow the information need to be expressed
  – Keywords
  – Combining terms
    • Simple: “required”, NOT (+ and -)
    • Boolean expressions with and/or/not and nested parentheses
    • Variations: strings, NEAR, capitalization
  – Simplest syntax that works
  – Typically more acceptable if predictable
• Another set of problems when information isn’t text: graphics, music

Page 28:

Finding the Information
• Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so IR systems index pages.
• Two basic approaches:
  – Index and classify by hand
  – Automate
• For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue.
• Many IR tools like search engines provide both.

Page 29:

IR Basics
• A retriever collects a page or chunk. This may involve spidering web pages, extracting documents from a DB, etc.
• A parser processes each chunk and extracts individual words.
• An indexer creates/updates a hash table which connects words with documents.
• A searcher uses the hash table to retrieve documents based on words.
• A ranking system decides the order in which to present the documents: their relevance.

Page 30:

How Good Is The IR?
• Information retrieval systems are evaluated with two basic metrics:
  – Precision: what percent of the documents returned are actually relevant to the information need
  – Recall: what percent of the documents relevant to the information need are returned
• Can’t typically measure these exactly; usually based on test sets, as in the sketch below.

Page 31:

Selecting Relevant Documents
• Assume:
  – we already have a corpus of documents defined
  – goal is to return a subset of those documents
  – individual documents have been separated into individual files
• Remaining components must parse, index, find, and rank documents.
• Traditional approach is based on the words in the documents (predates the web).

Page 32:

Extracting Lexical Features
• Process a string of characters
  – assemble characters into tokens (tokenizer)
  – choose tokens to index
• Standard lexical analysis problem
• Lexical analyser generator, such as lex

Page 33:

Lexical Analyser
• Basic idea is a finite state machine
• Triples of (input state, transition token, output state)
• Must be very efficient; gets used a LOT

[State diagram: three states (0, 1, 2) with transitions labeled blank, A-Z, and blank/EOF]
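A minimal sketch of that machine in Python; the transition table is an assumption reconstructed from the slide's three-state diagram (skip blanks, accumulate A-Z, end on blank/EOF):

```python
# Three-state tokenizer: state 0 skips blanks, state 1 accumulates
# A-Z characters, and end of input plays the role of final state 2.
def tokenize(text):
    tokens, current, state = [], [], 0
    for ch in text.upper() + ' ':        # trailing blank stands in for EOF
        if state == 0 and 'A' <= ch <= 'Z':
            current, state = [ch], 1     # start a new token
        elif state == 1:
            if 'A' <= ch <= 'Z':
                current.append(ch)       # stay in state 1
            else:
                tokens.append(''.join(current))  # blank ends the token
                state = 0
    return tokens

print(tokenize("the cat scares all the birds away"))
# ['THE', 'CAT', 'SCARES', 'ALL', 'THE', 'BIRDS', 'AWAY']
```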

Page 34:

Design Issues for Lexical Analyser
• Punctuation
  – treat as whitespace?
  – treat as characters?
  – treat specially?
• Case
  – fold?
• Digits
  – assemble into numbers?
  – treat as characters?
  – treat as punctuation?

Page 35:

Lexical Analyser
• Output of the lexical analyser is a string of tokens
• Remaining operations are all on these tokens
• We have already thrown away some information; this makes processing more efficient, but limits somewhat the power of our search

Page 36:

Stemming
• Additional processing at the token level
  – We covered this earlier this semester
• Turn words into a canonical form:
  – “cars” into “car”
  – “children” into “child”
  – “walked” into “walk”
• Decreases the total number of different tokens to be processed
• Decreases the precision of a search, but increases its recall
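A minimal stemming sketch with NLTK's Porter stemmer (assuming nltk is installed); note that a rule-based stemmer handles regular forms like "cars" and "walked", but an irregular form like "children" actually needs a lemmatizer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cars", "children", "walked"]:
    print(word, "->", stemmer.stem(word))
# cars -> car, children -> children (unchanged), walked -> walk
```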

Page 37:

Noise Words (Stop Words)
• Function words that contribute little or nothing to meaning
• Very frequent words
  – If a word occurs in every document, it is not useful in choosing among documents
  – However, need to be careful, because this is corpus-dependent
• Often implemented as a discrete list
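A minimal sketch of such a discrete list in use; the stop list here is a tiny hand-picked assumption (real lists are longer and corpus-dependent, as the slide warns):

```python
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def remove_stop_words(tokens):
    # keep only tokens that are not on the stop list
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat scares all the birds away".split()))
# ['cat', 'scares', 'all', 'birds', 'away']
```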

Page 38:

Example Corpora
• We are assuming a fixed corpus. Some sample corpora:
  – Medline abstracts
  – Email. Anyone’s email.
  – Reuters corpus
  – Brown corpus
• Will contain textual fields, maybe structured attributes
  – Textual: free, unformatted, no meta-information. NLP mostly needed here.
  – Structured: additional information beyond the content

Page 39:

Structured Attributes for Medline

• Pubmed ID

• Author

• Year

• Keywords

• Journal

Page 40:

Textual Fields for Medline
• Abstract
  – Reasonably complete standard academic English
  – Captures the basic meaning of the document
• Title
  – Short, formalized
  – Captures most critical part of meaning
  – Proxy for abstract

Page 41:

Structured Fields for Email
• To, From, Cc, Bcc

• Dates

• Content type

• Status

• Content length

• Subject (partially)

Page 42:

Text Fields for Email
• Subject
  – Format is structured, content is arbitrary
  – Captures most critical part of content
  – Proxy for content, but may be inaccurate
• Body of email
  – Highly irregular, informal English
  – Entire document, not summary
  – Spelling and grammar irregularities
  – Structure and length vary

Page 43:

Indexing
• We have a tokenized, stemmed sequence of words
• Next step is to parse the document, extracting index terms
  – Assume that each token is a word and we don’t want to recognize any more complex structures than single words
• When all documents are processed, create the index

Page 44:

Basic Indexing Algorithm
• For each document in the corpus:
  – Get the next token
  – Create or update an entry in a list (doc ID, frequency)
• For each token found in the corpus:
  – calculate #docs, total frequency
  – sort by frequency
• Often called a “reverse index” (an inverted index), because it reverses the “words in a document” index to be a “documents containing words” index
• May be built on the fly or created after indexing; a minimal sketch follows below
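A minimal sketch of the algorithm, building the reverse (inverted) index as a dictionary from token to {doc ID: frequency}:

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: dict mapping doc_id -> list of tokenized, stemmed words."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in corpus.items():
        for token in tokens:
            index[token][doc_id] += 1    # doc ID, frequency
    return index

corpus = {1: ["cat", "scares", "birds"],
          2: ["cat", "cares"]}
index = build_index(corpus)
print(dict(index["cat"]))   # {1: 1, 2: 1}: documents containing "cat"
```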

Page 45:

Fine Points
• Dynamic corpora (e.g., the web) require incremental algorithms
• Higher-resolution data (e.g., character position)
  – Supports highlighting
  – Supports phrase searching
  – Useful in relevance ranking
• Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
• Document-type-specific processing
  – In HTML, want to ignore tags
  – In email, maybe want to ignore quoted material

Page 46:

Choosing Keywords
• Don’t necessarily want to index on every word
  – Takes more space for the index
  – Takes more processing time
  – May not improve our resolving power
• How do we choose keywords?
  – Manually
  – Statistically
• Exhaustivity vs. specificity

Page 47:

Manually Choosing Keywords
• Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants
  – “best” match
  – captures new terms easily
  – easiest for the person choosing keywords
• Constrained vocabulary: hand-crafted ontologies
  – can include hierarchical and other relations
  – more consistent
  – easier for searching; possible “magic bullet” search

Page 48:

Examples of Constrained Vocabularies
• ACM headings (www.acm.org/class/1998)
  – H: Information Systems
    – H3: Information Storage and Retrieval
      – H3.3: Information Search and Retrieval
        » Clustering
        » Query formulation
        » Relevance feedback
        » Search process, etc.
• Medline headings (www.nlm.nih.gov/mesh/meshhome.html)
  – L: Information Science
    – L01: Information Science
      – L01.700: Medical Informatics
        – L01.700.508: Medical Informatics Applications
          – L01.700.508.280: Information Storage and Retrieval
            » Grateful Med [L01.700.508.280.400]

Page 49:

Automated Vocabulary Selection
• Frequency: Zipf’s Law
  – Pn = 1/n^a, where Pn is the frequency of occurrence of the nth ranked item and a is close to 1
  – Within one corpus, words with middle frequencies are typically “best”
• Document-oriented representation bias: lots of keywords per document
• Query-oriented representation bias: only the “most typical” words. Assumes that we are comparing across documents.
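A minimal numeric sketch of the law with a = 1 and a hypothetical top-word count, showing how quickly expected frequency falls with rank:

```python
# Zipf's law: Pn is proportional to 1/n^a with a close to 1, so the
# nth-ranked word occurs roughly 1/n as often as the top-ranked word.
a = 1.0
top_freq = 60000   # hypothetical count of the most frequent word

for n in [1, 2, 3, 10, 100]:
    print(f"rank {n:>3}: expected frequency ~{top_freq / n ** a:,.0f}")
# rank 1: 60,000   rank 2: 30,000   rank 10: 6,000   rank 100: 600
```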

Page 50:

Choosing Keywords
• “Best” depends on actual use; if a word only occurs in one document, it may be very good for retrieving that document, but not very effective overall.
• Words which have no resolving power within a corpus may be the best choices across corpora.
• Not very important for web searching; more relevant for some text mining.

Page 51:

Keyword Choice for WWW
• We don’t have a fixed corpus of documents
• New terms appear fairly regularly, and are likely to be common search terms
• Queries that people want to make are wide-ranging and unpredictable
• Therefore: can’t limit keywords, except possibly to eliminate stop words
• Even stop words are language-dependent, so determine language first

Page 52:

Comparing and Ranking Documents
• Once our IR system has retrieved a set of documents, we may want to:
  – Rank them by relevance: which are the best fit to my query? This involves determining what the query is about and how well the document answers it.
  – Compare them: show me more like this. This involves determining what the document is about.

Page 53:

Determining Relevance by Keyword
• The typical document retrieval query consists entirely of keywords.
• Retrieval can be binary: present or absent.
• More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
• Simple strategies:
  – How many times does the word occur in the document?
  – How close to the head of the document?
  – If multiple keywords, how close together?

Page 54:

Keywords for Relevance Ranking
• Count: repetition is an indication of emphasis
  – Very fast (usually in the index)
  – Reasonable heuristic
  – Unduly influenced by document length
  – Can be "stuffed" by web designers
• Position: lead paragraphs summarize content
  – Requires more computation
  – Also a reasonable heuristic
  – Less influenced by document length

Page 55:

Keywords for Relevance Ranking
• Proximity for multiple keywords
  – Requires even more computation
  – Obviously relevant only if we have multiple keywords
  – Effectiveness of the heuristic varies with information need; typically either excellent or not very helpful at all
• All keyword methods:
  – are computationally simple and adequately fast
  – are effective heuristics
  – typically perform as well as in-depth natural language methods for standard IR

Page 56:

Comparing Documents
• "Find me more like this one" really means that we are using the document as a query.
• This requires that we have some conception of what a document is about overall.
• Depends on the context of the query. We need to:
  – Characterize the entire content of this document
  – Discriminate between this document and others in the corpus

Page 57:

Characterizing a Document: Term Frequency
• A document can be treated as a sequence of words.
• Each word characterizes that document to some extent.
• When we have eliminated stop words, the most frequent words tend to be what the document is about.
• Therefore: fkd (the number of occurrences of word k in document d) will be an important measure.
• Also called the term frequency.

Page 58:

Characterizing a Document: Document Frequency
• What makes this document distinct from others in the corpus?
• The terms which discriminate best are not those which occur with high frequency!
• Therefore: Dk (the number of documents in which word k occurs) will also be an important measure.
• Also called the document frequency.

Page 59:

TF*IDF
• This can all be summarized as: words are best discriminators when they
  – occur often in this document (term frequency)
  – don’t occur in a lot of documents (document frequency)
• One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency
• There are multiple formulas for actually computing this. The underlying concept is the same in all of them; one variant is sketched below.
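A minimal sketch of one common variant, tfidf(k, d) = fkd * log(N / Dk), built on a reverse index like the one sketched earlier (other formulas differ in weighting and normalization):

```python
import math

def tf_idf(token, doc_id, index, n_docs):
    postings = index.get(token, {})
    tf = postings.get(doc_id, 0)       # fkd: occurrences of token in doc
    df = len(postings)                 # Dk: documents containing token
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)  # high tf, low df -> high score

index = {"cat": {1: 2, 2: 1},          # token -> {doc_id: frequency}
         "scares": {1: 1},
         "dog": {3: 1}}
print(round(tf_idf("cat", 1, index, 3), 2))     # 2 * ln(3/2) = 0.81
print(round(tf_idf("scares", 1, index, 3), 2))  # 1 * ln(3/1) = 1.10
```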

Page 60:

Describing an Entire Document
• So what is a document about?
• TF*IDF: can simply list keywords in order of their TF*IDF values
• The document is about all of them to some degree: it is at some point in some vector space of meaning

Page 61:

Vector Space
• Any corpus has a defined set of terms (the index)
• These terms define a knowledge space
• Every document is somewhere in that knowledge space: it is or is not about each of those terms
• Consider each term as a vector. Then:
  – we have an n-dimensional vector space
  – where n is the number of terms (very large!)
  – each document is a point in that vector space
• The document’s position in this vector space can be treated as what the document is about.

Page 62:

Similarity Between Documents
• How similar are two documents?
• Measures of association
  – How much do the feature sets overlap?
  – Modified for length: DICE coefficient
    • DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) )
    • compares the shared terms to the total terms in each document
  – Simple matching coefficient: takes exclusions into account
• Cosine similarity
  – similarity of the angle between the two document vectors
  – not sensitive to vector length
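A minimal sketch of both measures: DICE over term sets, cosine over term-frequency vectors (the toy documents are hypothetical):

```python
import math

def dice(x, y):
    """x, y: sets of terms. DICE(x,y) = 2|x & y| / (|x| + |y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(u, v):
    """u, v: dicts mapping term -> frequency (sparse vectors)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

d1 = {"cat": 2, "scares": 1, "birds": 1}
d2 = {"cat": 1, "cares": 1}
print(round(dice(set(d1), set(d2)), 2))  # 2*1/(3+2) = 0.4
print(round(cosine(d1, d2), 2))          # 2/(sqrt(6)*sqrt(2)) ~ 0.58
```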

Page 63:

Bag of Words
• All of these techniques are what are known as bag of words approaches.
• Keywords are treated in isolation.
• The difference between "man bites dog" and "dog bites man" is non-existent.
• If better discrimination is needed, IR systems can add semantic tools:
  – use POS tags
  – parse into basic NP VP structure
  – requires that the query be more complex

Page 64:

Improvements
• The two big problems with short queries are:
  – Synonymy: poor recall results from missing documents that contain synonyms of the search terms, but not the terms themselves
  – Polysemy/homonymy: poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents

Martin: www.cs.colorado.edu/~martin/csci5832.html

Page 65:

Query Expansion
• Find a way to expand a user’s query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall
  – Use a dictionary/thesaurus
  – Use relevance feedback

Martin: www.cs.colorado.edu/~martin/csci5832.html

Page 66:

Dictionary/Thesaurus Example
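The example itself did not survive transcription; as a stand-in, here is a minimal thesaurus-expansion sketch using WordNet through NLTK (assuming nltk is installed and the wordnet corpus downloaded):

```python
from nltk.corpus import wordnet

def expand(term):
    # collect synonyms from every WordNet synset containing the term
    synonyms = {term}
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return synonyms

print(expand('car'))   # includes 'auto', 'automobile', 'machine', ...
```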

Page 67:

Relevance Feedback
• Ask the user to identify a few documents which appear to be related to their information need
• Extract terms from those documents and add them to the original query
• Run the new query and present those results to the user
• Typically converges quickly

Based on Martin: www.cs.colorado.edu/~martin/csci5832.html
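A minimal sketch of the term-extraction step: count the terms in the user-flagged documents and append the most frequent new ones to the query (Rocchio-style vector updating is the classic formalization; this count-based version is a simplification):

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, top_n=3):
    """relevant_docs: token lists for the documents the user flagged."""
    counts = Counter(t for doc in relevant_docs for t in doc
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(top_n)]

flagged = [["python", "pandas", "dataframe", "tutorial"],
           ["pandas", "dataframe", "index"]]
print(expand_query(["pandas"], flagged))
# ['pandas', 'dataframe', 'python', 'tutorial'] (ties broken by order seen)
```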

Page 68:

Blind Feedback
• Assume that the first few documents returned are the most relevant, rather than having users identify them
• Proceed as for relevance feedback
• Tends to improve recall at the expense of precision

Based on Martin: www.cs.colorado.edu/~martin/csci5832.html

Page 69:

Post-Hoc Analyses
• When a set of documents has been returned, they can be analyzed to improve usefulness in addressing the information need:
  – grouped by meaning for polysemic queries (using N-gram-type approaches)
  – grouped by extracted information (named entities, for instance)
  – grouped into an existing hierarchy if structured fields are available
  – filtering (e.g., eliminate spam)

Page 70:

Additional IR Issues
• In addition to improved relevance, we can improve overall information retrieval with some other factors:
  – eliminate duplicate documents
  – provide good context
  – use ontologies to provide synonym lists
• For the web:
  – eliminate multiple documents from one site
  – clearly identify paid links

Page 71:

Summary
• Information Retrieval is the process of returning documents to meet a user’s information need based on a query
• Typical methods are BOW (bag of words), which rely on keyword indexing with little semantic processing
• NLP techniques used include tokenizing, stemming, and some parsing
• Results can be improved by adding semantic information (such as thesauri) and by filtering and other post-hoc analyses