TRANSCRIPT
CSE 574 – Artificial Intelligence II (NLP)
EE 517 – Statistical Language Processing
Prof. Luke Zettlemoyer (CSE)
Prof. Mari Ostendorf (EE)
[Numerous slides adapted from Regina Barzilay]
3 Jan -- Overview
• Course structure
• Natural language processing (NLP)
• Syllabus overview
• Focus on statistical (data-driven) methods
• Issues in corpus-based work
CSE/EE Course Combination
• Why the course merger?
– We’re accidentally teaching the same topics
– Want to develop one course that can be cross-listed
• Complication:
– EE course has 4 units, CSE has 3
– Solution: extra 1 hr per week of paper discussion (required for EE, optional for CSE)
• Grading and project advising will be handled by a faculty member in your dept
Course Info
• Web page: https://catalyst.uw.edu/workspace/lsz/18191/
• Schedule
– MW 1:30-2:50 lecture, T 4:30-5:20 discussion
– Finals week: project presentation
• Goals:
– Understand the theoretical foundation of key algorithms
– Gain practical experience with system & experiment design trade-offs
– Build technical communication skills related to NLP
• Book – several resources provided; not required but highly recommended
Course Info (cont.)
• Expectations:
– Computer labs: 40% CSE, 35% EE
• 2 competitive labs on common data (language modeling, text classification)
• 1 project-related lab (demonstrate feasibility)
– Project: 60% CSE, 55% EE
• Project proposal – week 4
• Written report – week 10
• Presentation – finals week
– Paper discussions: 10% EE
What is NLP? (from Google)
• Natural Language Processing
– the branch of information science that deals with natural language information (wordnetweb.princeton.edu/perl/webwn)
– instead of using Boolean logic, the user simply can type in a question as a query… (www.microsoft.com/enterprisesearch/en/us/search-glossary.aspx)
– a range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive ... (library.ahima.org/xpedio/groups/public/documents/ahima/bok1_025042.hcsp)
• Ignoring… Neuro-Linguistic Programming, National Labor Party, Nonlinear Programming, No-Longer Polymers, …
What is NLP? (cont)
Computer processing of human language
(Diagram: Language ↔ Computer, Language ↔ Language)
NL Info Extraction & Understanding
NL Generation
NL Transformation(translation, paraphrasing)
What is NLP? (cont)
• May be for a variety of needs
– Human-computer interaction
– Computer-mediated human-human interaction
– Information management & mining
– Computer-based education and training
• Different levels of work
– Core sub-problems (linguistic analysis)
– Applications (summarization, question answering, translation, dialog systems, …)
NLP is AI-complete
• To solve every possible NLP problem
– need to solve all of artificial intelligence
– basis for the Turing test:
Turing (1950): “I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”
– luckily, we don’t need to solve it all in one step...
Information Extraction
10TH DEGREE is a full service advertising agency specializing in direct and interactive marketing. Located in Irvine CA, 10TH DEGREE is looking for an Assistant Account Manager to help manage and coordinate interactive marketing initiatives for a marquee automotive account. Experience in online marketing, automotive and/or the advertising field is a plus. Assistant Account Manager Responsibilities Ensures smooth implementation of programs and initiatives Helps manage the delivery of projects and key client deliverables ... Compensation: $50,000-$80,000
Goal: Build database entries from text
INDUSTRY: Advertising
POSITION: Assistant Account Manager
LOCATION: Irvine, CA
COMPANY: 10th Degree
Question Answering
Goal: Provide structured answers to user queries
Find answers to general comprehension questions in a document collection
Machine Translation
One of the oldest NLP problems, started with code breaking techniques in the 1950s
Dialog
Goal: Participate in (goal-driven) conversations
One of the early NLP applications for AI researchers (SHRDLU: Winograd, 72)
Deciphering lost languages: Ugaritic
Knowledge engineering bottleneck
Requires elaborate, manually encoded knowledge representation
We need:
• Knowledge about language
• Knowledge about the world
Possible solutions:
• Manual engineering approach: encode all the required information into the computer
• Statistical / ML approach: infer language properties from examples of language use
NLP History: pre-statistical / ML
“(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) had ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.” (Chomsky 1957)
1970s and 1980s: non-statistical NLP
• emphasis on deeper models, syntax
• toy domains / manual grammars (SHRDLU, etc.)
• weak empirical evaluation
NLP: statistical / ML approaches
“Whenever I fire a linguist our system performance improves.” (Jelinek 1988)
1990s: The Empirical Revolution
• Corpus-based methods yield the first generation of widely used NL tools (syntax, MT, ASR)
• Deep analysis is often traded for robust approximations
• Empirical evaluation is crucial
2000s: Richer linguistic representations embedded in the statistical framework
Topics Covered
• High level:
– Foundational material
– Important general statistical models (no linguistics required)
– Core NLP sub-problems
– Selected NLP applications
Foundational Material
• Issues in corpus-based work (this week)
• Mathematical background: on your own!
– Basic probability (pre-req, see also M&S 2.1)
– Estimation & detection theory (key results reviewed in general model section)
– Information theory (key concepts in M&S 2.2)
• Basic linguistics (later)
Important General Methods
• Handling variable-length sequences– N-grams & extensions– Bag-of-words & extensions– HMMs, CRFs, …
• Models for vector observations– SVMs, log-linear models– Classification vs. reranking
• Intro to different learning strategies
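As a concrete illustration of the n-gram idea above, here is a minimal sketch (not course code) that estimates bigram probabilities by maximum likelihood from a toy corpus; the corpus and function name are invented for illustration.

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) by maximum likelihood from a token list."""
    unigrams = Counter(tokens[:-1])                    # counts of history words
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))    # counts of adjacent pairs
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = "the chicken crossed the road and the chicken stopped".split()
probs = bigram_mle(corpus)
print(probs[("the", "chicken")])  # 2 of the 3 bigrams starting with "the"
```

Real language models add smoothing to handle unseen n-grams, which is exactly the sparsity issue discussed later in the course.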
Core NLP Sub-Problems
• Part of speech tagging
• Sentiment Classification
• Grammars and Parsing
• Formal Semantics
Selected NLP Applications
• Dialog systems
• Translation
Why the Statistical Approach?
• Ambiguity in language (one text can have different meanings)
• Variability in language (one meaning can be expressed with different words)
Need for “Ignorance Modeling”
Why are these funny?
• Iraqi Head Seeks Arms
• Ban on Nude Dancing on Governor’s Desk
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Stolen Painting Found by Tree
• Kids Make Nutritious Snacks
• Local HS Dropout Cut in Half
• Hospitals Are Sued by 7 Foot Doctors
Ambiguity (example from Jurafsky & Martin)
• I made her duck.
• Possible interpretations of the text out of context:
– I cooked waterfowl for her.
– I cooked the waterfowl that is on the plate in front of her.
– I created a toy (or decorative) waterfowl for her.
– I caused her to quickly lower her head.
• Possible variations in spoken forms:
– I made HER duck. vs. I made her DUCK.
– I made her duck? (doubt, disbelief) vs. statement form
– Ai made her duck. (where “Ai” is a name of a person)
– A maid heard “uck”.
Ambiguity (example from Lillian Lee)
• “Finally a computer that understands you like your mother”
– Possible interpretations?
– Different syntax:
• understands [(that) you like your mother]
• understands [you] [like your mother (does)]
– Different word senses:
• female parent; a source or origin; slimy substance added to cider or wine to make vinegar
– Overall statement is vague / requires knowledge to understand
• Does your mother understand you well, or poorly?
Variation
• Different ways of saying the same thing
– The chicken crossed the road.
– The road was crossed by the chicken.
– The chicken has traversed the road.
– Across the road went the chicken.
– The daughter of the rooster made it to the other side of the street.
– A chick- uh I mean the chicken you know like crossed the the road.
• Variation can involve syntactic or word choices
• Depends on modality, genre, topic, author, …
Variation: Reading Level
• While the Portuguese Man o' War resembles a jellyfish, it is in fact a siphonophore - a colony of four kinds of minute, highly modified individuals, which are specialized polyps and medusoids. Each such zooid in these pelagic colonial hydroids or hydrozoans has a high degree of specialization and, although structurally similar to other cnidarians, are all attached to each other and physiologically integrated rather than living independently.
• The Portuguese Man o' War looks like a jellyfish, but it is really not. It is a siphonophore. This is a colony of four kinds of zooids. Zooids are very small, highly modified individuals. These zooids are structurally similar to other solitary animals, but the zooids do not live by themselves. Instead, they are attached to each other.
Variation – Translation
• Mr. Chang Jun Hsung, chairman of the Executive branch, expressed during a ceremony to celebrate the forming of the multi-party Association For … at the Legislative branch, that his idea is to use cooperation instead of confrontation because the political culture of confrontation and opposition of the past has cost a lot of people dearly.
• Mr. Chang Jun Hsung, chairman of the Executive branch, announced today, during the celebration of the inaugural meeting of the multi-party Association For … at the Legislative branch, that he is replacing confrontation with cooperation because in the past the political culture of confrontation has cost a lot people.
• Chairman Chang Jun Hsung of the Executive branch, while attending a ceremony to celebrate the forming of the multi-party Association For … at the Legislative branch, expressed that his idea is to replace confrontation with cooperation because the past political culture of confrontation and opposition has cost a lot people dearly.
Observation
• Variation impacts performance evaluation as well as system design
• For many problems, there may be more than one “correct” answer.
Ignorance Modeling
• The basic idea:
– Acknowledge that you don’t yet have rules that account for all sources of variability/ambiguity
– Allow for different alternatives in the model; use data-driven learning
• Examples:
– From speech recognition: Gaussian mixture models for observation distributions can represent a range of pronunciations for a given word.
– From language processing:
• Probabilistic grammars do better than deterministic grammars at handling disfluencies
• Grammar checking -- consider determiner case study, next
Case Study: Determiner Placement
Task: Automatically place determiners: “a”, “the”, or null
Scientists in United States have found way of turning lazy monkeys into workaholics using gene therapy. Usually monkeys work hard only when they know reward is coming, but animals given this treatment did their best all time. Researchers at National Institute of Mental Health nearWashington DC, led by Dr Barry Richmond, have now developed genetic treatment which changes their work ethic markedly. "Monkeys under influence of treatment don't procrastinate," Dr Richmond says. Treatment consists of anti-sense DNA - mirror image of piece of one of our genes - and basically prevents that gene from working. But for rest of us, day when such treatments fall into hands of our bosses may be one we would prefer to put off.
How do we choose a determiner?
Largely determined by:
– Type of noun (countable, uncountable)
– Uniqueness of reference
– Information value (given, new)
– Number (singular, plural)
However, there are many exceptions and special cases:
– The definite article is used with newspaper titles (The Times), but zero article in names of magazines and journals (Time)
– Highway names vary by region: I-5 vs. the I-5
Hard to manually encode this information!
A Simple Statistical Approach
• Collect a large collection of texts relevant to your domain (e.g. newspaper text)
• For each noun seen during training, compute its probability to take a certain determiner
• Given a new noun, select a determiner with the highest likelihood as estimated on the training corpus
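The three-step recipe above can be sketched in a few lines of Python; the (noun, determiner) training pairs below are invented for illustration, not drawn from any actual corpus.

```python
from collections import Counter, defaultdict

# Toy (noun, determiner) pairs, as they might be extracted from training text
train = [("FBI", "the"), ("FBI", "the"), ("way", "a"), ("way", "the"),
         ("monkeys", "null"), ("treatment", "a"), ("treatment", "the"),
         ("treatment", "the")]

counts = defaultdict(Counter)
for noun, det in train:
    counts[noun][det] += 1

def predict(noun, default="the"):
    """Pick the determiner seen most often with this noun in training."""
    if noun in counts:
        return counts[noun].most_common(1)[0][0]
    return default  # back off for nouns never seen in training

print(predict("treatment"))  # "the": 2 of its 3 training examples
print(predict("quark"))      # unseen noun -> falls back to the default
```

Note the built-in weakness: any noun not seen in training gets the same default answer, which motivates the feature-based classifier on the next slide.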
A Classification Approach
• Predict: {“the”, “a”, null}
• Define a problem representation (features):
– plural? (yes/no)
– first appearance? (yes/no)
– head word token
Goal: Learn a classification function that can predict unseen examples
Plural?  First?  Word       Determiner
N        Y       defendant  a
Y        N       cars       null
N        N       FBI        the
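One simple way to realize such a classifier, using the table's rows plus one invented extra example, is majority vote per feature tuple with backoff for unseen combinations; this is only a sketch of the idea, since a real system would use a learned model (e.g., a log-linear classifier) over these features.

```python
from collections import Counter, defaultdict

# (plural?, first?, head word) -> determiner, as in the table above
train = [(("N", "Y", "defendant"), "a"),
         (("Y", "N", "cars"), "null"),
         (("N", "N", "FBI"), "the"),
         (("N", "N", "defendant"), "the")]  # invented extra row

full = defaultdict(Counter)     # votes per exact feature tuple
backoff = defaultdict(Counter)  # votes ignoring the word feature
overall = Counter()             # global class distribution
for (plural, first, word), det in train:
    full[(plural, first, word)][det] += 1
    backoff[(plural, first)][det] += 1
    overall[det] += 1

def classify(plural, first, word):
    """Majority vote with backoff: exact tuple -> (plural, first) -> global."""
    for table, key in ((full, (plural, first, word)), (backoff, (plural, first))):
        if key in table:
            return table[key].most_common(1)[0][0]
    return overall.most_common(1)[0][0]

print(classify("N", "N", "FBI"))    # seen exactly in training -> "the"
print(classify("N", "N", "quark"))  # unseen word -> backs off on (N, N)
```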
How well does it work?
• Implementation details:
– Training: first 21 sections of the Wall Street Journal corpus; testing: the 23rd section
– Prediction accuracy: 71.5%
• The results are not great, but surprisingly high for such a simple method
– A large fraction of nouns in this corpus always appear with the same determiner
– For example: “the FBI”, “the defendant”
Limitations of Data Alone
怎么老是你 How come it’s always you?
MT output from 3 online data-driven systems:
– How old are you?
– How to keep your
– How always you
Too many possible combinations to rely simply on counts in a corpus
e.g. V = 100k → V^5 = 10^25
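The slide's arithmetic (with the exponents restored) can be checked directly:

```python
V = 100_000              # vocabulary of 100k word types
n = 5                    # length of the word tuple
combos = V ** n          # number of distinct 5-word sequences
print(f"{combos:.0e}")   # far more 5-grams than any corpus could ever cover
```

Even a trillion-word corpus could attest at most 10^12 of these 10^25 sequences, so almost every grammatical 5-gram has a raw count of zero.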
Never Enough Data…
• Language has “lopsided sparsity” – infrequent events happen frequently
• Larger units (e.g. word tuples vs. words) require more data
• Counts from one corpus may not generalize to another domain, e.g.
– Web is not representative of children’s articles
– Newswire has very few “uh” and “um” and relatively few “I” and “you”, compared to conversational speech
The NLP Cycle
• Gather / find a corpus
• Build a baseline
• Repeat:
– Analyze most common errors
– Think of ways to fix them
– Modify the model:
‣ Add new features
‣ Change the structure of the model
‣ Use a new learning method
Issues for Corpus-Based Methods
• Different types of data
• Honest estimates of performance
• Text pre-processing
Types of Data
• Different sources:
– Documents, audio recordings, dictionaries, the Web, …
• Different forms:
– Text: newswire, blogs, email, chat
– Speech: talk shows, speeches, call centers, hearings
– Multimodal: speech & video, text & images, …
• Different units on which to base quantity:
– words for language modeling
– sentences for parsing
– sentence pairs for translation
– articles for text classification
– article collections for multi-doc summarization
Text Corpora
Antique corpus:
• Rosetta Stone
Examples of corpora used today:
• Penn Treebank: 1M words of parsed text
• Brown Corpus: 1M words of tagged text
• North American News: 300M words
• English Gigaword: 3.5B words
• The Web
Corpus for MTPairs of parallel sentences. For example, one sentence from the Europarl corpus (Koehn, 2005):
Danish: det er næsten en personlig rekord for mig dette efterar .
German: das ist fur mich fast personlicher rekord in diesem herbst .
Greek:
English: that is almost a personal record for me this autumn !
Spanish: es la mejor marca que he alcanzado este otono .
Finnish: se on melkein minun ennatykseni tana syksyna !
French: c ’ est pratiquement un record personnel pour moi , cet automne !
Italian: e ’ quasi il mio record personale dell ’ autunno .
Dutch: dit is haast een persoonlijk record deze herfst .
Portuguese: e quase o meu recorde pessoal deste semestre !
Swedish: det ar nastan personligt rekord for mig denna host !
Figure 2: One sentence aligned across 11 languages
Corpus for Parsing
From the Penn Treebank:
Canadian Utilities had 1988 revenue of $ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers .
Honest Evaluation
• Data is usually divided into 3 or more sets:
– Training/learning set
– Development test/tuning/validation set
– Evaluation test set
• To have an unbiased performance estimate, need to test on data not used in training
• Most models have parameters that can be tuned → requires even more independent data
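A minimal sketch of the three-way split described above; the 80/10/10 proportions and function name are a common convention, not something prescribed by the course.

```python
import random

def three_way_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off test and dev; the rest is training."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

train, dev, test = three_way_split(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```

The point of the separate evaluation set is that it is touched exactly once, after all tuning on the dev set is finished.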
Never Enough Data… Revisited
• Language has “lopsided sparsity”…
• Larger units (e.g. sentences vs. word tuples) require more data
• Hand-annotated data can be expensive, especially for larger units
• Workarounds:
– Cross-validation in learning/tuning (in lab 2)
– Learning with some unlabeled data (later lecture)
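The cross-validation workaround above can be sketched as a generic k-fold splitter (an illustrative sketch, not the lab's actual harness): each example is held out exactly once, so all of the labeled data contributes to both training and evaluation.

```python
def k_fold(examples, k=5):
    """Yield (train, heldout) pairs; each example is held out exactly once."""
    items = list(examples)
    for i in range(k):
        heldout = items[i::k]  # every k-th item, starting at offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, heldout

for train, heldout in k_fold(range(10), k=5):
    print(len(train), len(heldout))  # 8 train / 2 held out on each fold
```

Averaging a score over the k held-out folds gives a lower-variance performance estimate than a single small dev set, which is why it helps when labeled data is scarce.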