Introduction & Tokenization
Ling570: Shallow Processing Techniques for NLP
September 28, 2011
Roadmap
Course Overview
Tokenization
Homework #1
Course Overview
Course Information
Course web page: http://courses.washington.edu/ling570
Syllabus: schedule and readings
Links to other readings, slides, and class recordings
Slides posted before class, but may be revised
Catalyst tools:
  GoPost discussion board for class issues
  CollectIt Dropbox for homework submission and TA comments
  Gradebook for viewing all grades
GoPost Discussion Board
Main venue for course-related questions and discussion
What not to post: personal or confidential questions; homework solutions
What to post: almost anything else course-related
  "Can someone explain...?" "Is this really supposed to take this long to run?"
Key location for class participation: post questions or answers
Your discussion space: Sanghoun & I will not jump in often
GoPost
Emily's 5-minute rule:
  If you've been stuck on a problem for more than 5 minutes, post to the GoPost!
Mechanics:
  Please use your UW NetID as your user id
  Please post early and often! Don't wait until the last minute
Notifications: decide how you want to receive GoPost postings
Email
Should be used only for personal or confidential issues:
  Grading issues, extended absences, other problems
General questions/comments go on GoPost
Please send email from your UW account
Include Ling570 in the subject
If you don't receive a reply in 24 hours, please follow up
Homework Submission
All homework should be submitted through CollectIt, packaged with tar:
  tar cvf hw1.tar hw1_dir
Homework due 11:45 Wednesdays
Late homework receives 10%/day penalty (incremental)
Most major programming languages accepted: C/C++/C#, Java, Python, Perl, Ruby
If you want to use something else, please check first
Please follow naming, organization guidelines in HW
Expect to spend 10-20 hours/week, including HW docs
Grading
Assignments: 90%
Class participation: 10%
No midterm or final exams
Grades in Catalyst Gradebook
TA feedback returned through CollectIt
Incomplete: only if all work completed up to the last two weeks (UW policy)
Recordings
All classes will be recorded
Links to recordings appear in the syllabus
Available to all students, DL and in class
Please remind me to:
  Record the meeting (look for the red dot)
  Repeat in-class questions
Note: the instructor's screen is projected in class
Assume that the chat window is always public
Contact Info
Gina:
  Email: [email protected]
  Office hour: Fridays 10-11 (before Treehouse meeting)
  Location: Padelford B-201, or by arrangement
  Available by Skype or Adobe Connect
  All DL students should arrange a short online meeting
TA: Sanghoun Song
  Email: [email protected]
  Office hour and location: TBD, see GoPost
Online Option
Please check you are registered for the correct section:
  CLMS online: Section A
  State-funded: Section B
  CLMS in-class: Section C
  NLT/SCE online (or in-class): Section D
Online attendance for in-class students:
  Not more than 3 times per term (e.g., missed bus, ice)
  Please enter the meeting room 5-10 minutes before the start of class
  Try to stay online throughout class
Online Tip
If you see:
  "You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. You do not have a Connect account but need to have one. For UWEO students: If you have just created your UW NetID or just enrolled in a course....."
Clear your cache, then close and restart your browser
Course Description
Course Prerequisites
Programming languages: Java/C++/Python/Perl/...
Operating systems: basic Unix/Linux
CS 326 (Data Structures) or equivalent:
  Lists, trees, queues, stacks, hash tables, ...
  Sorting, searching, dynamic programming, ...
  Automata, regular expressions, ...
Stat 391 (Probability and Statistics): random variables, conditional probability, Bayes' rule, ...
Textbook
Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008
Available from the UW Bookstore, Amazon, etc.
Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing
Topics in Ling570
Unit #1: Formal Languages and Automata (2-3 weeks)
  Formal languages
  Finite-state automata
  Finite-state transducers
  Morphological analysis
Unit #2: Ngram Language Models and HMMs
  Ngram language models and smoothing
  Part-of-speech (POS) tagging: HMM, Ngram
Unit #3: Classification (2-3 weeks)
  Intro to classification
  POS tagging with classifiers
  Chunking
  Named Entity (NE) recognition
Other topics (2 weeks)
  Intro, tokenization
  Clustering
  Information extraction
  Summary
Roadmap
Motivation: applications
Language and thought
Knowledge of language
Cross-cutting themes: ambiguity, evaluation, & multi-linguality
Course Overview
Motivation: Applications
Applications of speech and language processing:
  Call routing, information retrieval, question-answering, machine translation, dialog systems, spam tagging, spell- and grammar-checking, sentiment analysis, information extraction, ...
Shallow vs. Deep Processing
Shallow processing (Ling 570):
  Usually relies on surface forms (e.g., words)
  Less elaborate linguistic representations
  E.g., part-of-speech tagging, morphology, chunking
Deep processing (Ling 571):
  Relies on more elaborate linguistic representations
  E.g., deep syntactic analysis (parsing), rich spoken language understanding (NLU)
Shallow or Deep?
Applications of speech and language processing:
  Call routing, information retrieval, question-answering, machine translation, dialog systems, spam tagging, spell- and grammar-checking, sentiment analysis, information extraction, ...
Language & Intelligence
Turing Test (1950): operationalize intelligence
  Two contestants: human, computer
  Judge: human
  Test: interact via text questions
  Question: can you tell which contestant is human?
Crucially requires language use and understanding
Limitations of Turing Test
ELIZA (Weizenbaum 1966)
Simulates a Rogerian therapist:
  User: You are like my father in some ways
  ELIZA: WHAT RESEMBLANCE DO YOU SEE
  User: You are not very aggressive
  ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
  ...
Passes the Turing Test!! (sort of)
"You can fool some of the people...."
Simple pattern-matching technique: very shallow processing
Turing Test Revived
"On the web, no one knows you're a...."
Problem: 'bots'
  Automated agents swamp services
Challenge: prove you're human
  Test: something a human can do, a 'bot can't
Solution: CAPTCHAs
  Distorted images: trivial for a human; hard for a 'bot
  Key: perception, not reasoning
Knowledge of Language
What does HAL (of 2001: A Space Odyssey) need to know to converse?

  Dave: Open the pod bay doors, HAL. (request)
  HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement)

Phonetics & phonology (Ling 450/550): sounds of a language, acoustics; legal sound sequences in words
Morphology (Ling 570): recognize and produce variation in word forms
  Singular vs. plural: door + SG -> door; door + PL -> doors
  Verb inflection: be + 1st person SG present -> am
Part-of-speech tagging (Ling 570): identify word use in the sentence
  Bay (noun) here, not verb or adjective
Syntax (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing): order and group words in the sentence
  Compare: "I'm I do, sorry that afraid Dave I can't."
Semantics (Ling 571): word meaning, individual (lexical) and combined (compositional)
  'Open': AGENT causes THEME to become open
  'pod bay doors': (pod bay) doors
Pragmatics/discourse/dialogue (Ling 571, maybe): interpret utterances in context
  Speech acts (request, statement)
  Reference resolution: I = HAL; that = 'open doors'
  Politeness: I'm sorry, I'm afraid I can't
Cross-cutting Themes
Ambiguity: how can we select among alternative analyses?
Evaluation: how well does this approach perform
  On a standard data set?
  When incorporated into a full system?
Multi-linguality: can we apply this approach to other languages? How much do we have to modify it to do so?
Ambiguity
"I made her duck"
Means....
  I caused her to duck down
  I made the (carved) duck she has
  I cooked duck for her
  I cooked the duck she owned
  I magically turned her into a duck
Ambiguity: POS
"I made her duck"
The readings hinge on part-of-speech ambiguity:
  duck: N (the bird) or V (the motion)
  her: Pron (object) or Poss (possessive)
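The combinatorics behind these readings can be sketched with a toy lexicon; the tag sets below are illustrative for this one sentence, not from any real tagger:

```python
from itertools import product

# Toy lexicon: each word maps to its possible POS tags (illustrative).
lexicon = {
    "I":    ["Pron"],
    "made": ["V"],
    "her":  ["Pron", "Poss"],  # object pronoun vs. possessive
    "duck": ["N", "V"],        # the bird vs. the motion
}

sentence = "I made her duck".split()

# Every combination of one tag per word is a candidate tag sequence.
readings = list(product(*(lexicon[w] for w in sentence)))

for tags in readings:
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tags)))

# 2 tags for "her" x 2 tags for "duck" = 4 tag sequences
print(len(readings))  # 4
```

Even four words with modest tag ambiguity yield four candidate analyses; real sentences multiply this out quickly, which is why taggers must select among alternatives rather than enumerate them.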
Ambiguity: Syntax
"I made her duck"
  I made the (carved) duck she has:
    (VP (V made) (NP (POSS her) (N duck)))
  I cooked duck for her:
    (VP (V made) (NP (PRON her)) (NP (N duck)))
Ambiguity
Pervasive
Pernicious
Particularly challenging for computational systems
Problem we will return to again and again in class
Tokenization
Given input text, split into words or sentences
Tokens: words, numbers, punctuation
Example:
  Sherwood said reaction has been "very positive."
  -> Sherwood said reaction has been " very positive . "
Why tokenize? Identify basic units for downstream processing
Tokenization
Proposal 1: split on white space
Good enough? No
Why not?
Multi-linguality:
  Languages without white-space delimiters: Chinese, Japanese
  Agglutinative languages: Finnish
  Compounding languages: German
    E.g., Lebensversicherungsgesellschaftsangestellter
    "life insurance company employee"
Even for English, misses punctuation
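A minimal sketch of Proposal 1 (plain Python, using the example sentence from above) shows exactly how punctuation stays glued to adjacent words:

```python
text = 'Sherwood said reaction has been "very positive."'

# Proposal 1: split on white space only.
tokens = text.split()
print(tokens)

# The quote and the period never become separate tokens:
# '"very' and 'positive."' come out as single "words".
```

So 'positive."' and 'positive' would count as different vocabulary items, which is one reason downstream processing needs real tokenization.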
Tokenization, again
Proposal 2: split on white space and punctuation (for English)
Good enough? No
Problems: non-splitting punctuation
  1.23 should not become 1 . 23; 1,23 should not become 1 , 23
  don't should not become don t; e-mail should not become e mail; etc.
Problems: non-splitting white space
  Names: New York; collocations: pick up
What's a word?
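One way to sketch Proposal 2 together with the non-splitting-punctuation exceptions is a regular-expression tokenizer; the patterns here are illustrative only, and the assignment tokenizer will need many more cases:

```python
import re

# Keep decimal numbers, contractions, and hyphenated words intact;
# otherwise split punctuation off as its own token.
TOKEN_RE = re.compile(r"""
    \d+[.,]\d+        # numbers with internal punctuation: 1.23, 1,23
  | \w+(?:[-']\w+)*   # words, including don't and e-mail
  | [^\w\s]           # any other punctuation mark, one per token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize('Sherwood said reaction has been "very positive."'))
print(tokenize("It costs 1.23 dollars, don't e-mail me."))
```

Note that this still says nothing about the non-splitting white-space problem: New York and pick up each come out as two tokens.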
Sentence Segmentation
Similar issues
Proposal: split on '.'
Problems?
  Other punctuation: ?, !, etc.
  Non-boundary periods: 1.23, Mr. Sherwood, m.p.g., ...
Solutions?
  Rules and dictionaries, esp. for abbreviations
  Marked-up text + machine learning
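The rules-and-dictionary solution can be sketched as below; the abbreviation list is a tiny illustrative sample, and a real system would need a far larger dictionary (or the machine-learning route):

```python
import re

# Small illustrative abbreviation list; a real system needs many more.
ABBREVS = {"mr.", "mrs.", "dr.", "m.p.g.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")):
            # Not a boundary if the token is a known abbreviation
            # or a decimal number like 1.23
            if token.lower() in ABBREVS or re.fullmatch(r"\d+\.\d+", token):
                continue
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Sherwood got 30 m.p.g. on the trip. Impressive!"))
# ['Mr. Sherwood got 30 m.p.g. on the trip.', 'Impressive!']
```

Cases the dictionary misses (an abbreviation that really does end a sentence, an unlisted abbreviation) are exactly where the marked-up-text + machine-learning approach earns its keep.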
Homework #1
General Notes and Naming
All code must run on the CL cluster under Condor
Each programming question needs a corresponding Condor command file, e.g. hw1_q1.cmd, and should run with:
  $ condor_submit hw1_q1.cmd
General comments should appear in hw1.(pdf|txt)
Your code may be tested on new data
Q1-Q2: Building and Testing a Rule-Based English Tokenizer
Q1: eng_tokenize.*
  Condor file must contain: Input = <input_file>, Output = <output_file>
  You may use additional arguments
  The i-th input line should correspond to the i-th output line
  Don't waste too much time making it perfect!!
Q2: Explain how your tokenizer handles different phenomena (numbers, hyphenated words, etc.) and identify remaining problems.
Q3: make_vocab.*
Given the input text:
  The teacher bought the bike. The bike is expensive.
The output should be:
  The 2
  the 1
  teacher 1
  bought 1
  bike. 1
  bike 1
  is 1
  expensive. 1
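A minimal sketch of what Q3 asks for, assuming white-space tokens and counts sorted in decreasing order; the ordering of words with equal counts should follow the assignment spec, which this sketch does not pin down:

```python
from collections import Counter

def make_vocab(text):
    # Count white-space-delimited tokens (no tokenization yet,
    # so "bike." and "bike" are distinct vocabulary items).
    counts = Counter(text.split())
    # Highest frequency first; equal counts keep first-seen order here.
    return sorted(counts.items(), key=lambda kv: -kv[1])

text = "The teacher bought the bike. The bike is expensive."
for word, n in make_vocab(text):
    print(word, n)
```

Running Q1's tokenizer first would merge "bike." into "bike" plus ".", which is exactly the effect Q4 asks you to measure.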
Q4: Investigating Tokenization
Run programs from Q1, Q3
Compare vocabularies with and without tokenization
Next Time
Probability
Formal languages
Finite-state automata