Introduction & Tokenization
Ling570: Shallow Processing Techniques for NLP
September 28, 2011
Roadmap
Course Overview
Tokenization
Homework #1
Course Overview
Course Information
Course web page: http://courses.washington.edu/ling570
Syllabus: schedule and readings
Links to other readings, slides, and class recordings
Slides posted before class, but may be revised
Catalyst tools:
  GoPost discussion board for class issues
  CollectIt Dropbox for homework submission and TA comments
  Gradebook for viewing all grades
GoPost Discussion Board
Main venue for course-related questions and discussion
What not to post: personal or confidential questions; homework solutions
What to post: almost anything else course-related
  "Can someone explain...?" "Is this really supposed to take this long to run?"
Key location for class participation: post questions or answers
Your discussion space: Sanghoun & I will not jump in often
GoPost
Emily's 5-minute rule:
  If you've been stuck on a problem for more than 5 minutes, post to the GoPost!
Mechanics:
  Please use your UW NetID as your user id
  Please post early and often! Don't wait until the last minute
Notifications: decide how you want to receive GoPost postings
Email
Should be used only for personal or confidential issues:
  Grading issues, extended absences, other problems
General questions/comments go on GoPost
Please send email from your UW account
Include Ling570 in the subject
If you don't receive a reply in 24 hours, please follow up
Homework Submission
All homework should be submitted through CollectIt, packaged with tar:
  tar cvf hw1.tar hw1_dir
Homework due 11:45 Wednesdays
Late homework receives 10%/day penalty (incremental)
Most major programming languages accepted: C/C++/C#, Java, Python, Perl, Ruby
If you want to use something else, please check first
Please follow naming, organization guidelines in HW
Expect to spend 10-20 hours/week, including HW docs
Grading
Assignments: 90%
Class participation: 10%
No midterm or final exams
Grades in Catalyst Gradebook
TA feedback returned through CollectIt
Incomplete: only if all work completed up to the last two weeks (UW policy)
Recordings
All classes will be recorded
Links to recordings appear in the syllabus
Available to all students, DL and in class
Please remind me to:
  Record the meeting (look for the red dot)
  Repeat in-class questions
Note: the instructor's screen is projected in class
Assume that the chat window is always public
Contact Info
Gina:
  Email: [email protected]
  Office hour: Fridays 10-11 (before Treehouse meeting)
  Location: Padelford B-201, or by arrangement
  Available by Skype or Adobe Connect
  All DL students should arrange a short online meeting
TA: Sanghoun Song
  Email: [email protected]
  Office hour and location: TBD, see GoPost
Online Option
Please check you are registered for the correct section:
  CLMS online: Section A
  State-funded: Section B
  CLMS in-class: Section C
  NLT/SCE online (or in-class): Section D
Online attendance for in-class students:
  Not more than 3 times per term (e.g., missed bus, ice)
  Please enter the meeting room 5-10 minutes before the start of class
  Try to stay online throughout class
Online Tip
If you see:
  "You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. You do not have a Connect account but need to have one. For UWEO students: If you have just created your UW NetID or just enrolled in a course....."
Clear your cache, then close and restart your browser
Course Description
Course Prerequisites
Programming languages: Java/C++/Python/Perl/...
Operating systems: basic Unix/Linux
CS 326 (Data Structures) or equivalent:
  Lists, trees, queues, stacks, hash tables, ...
  Sorting, searching, dynamic programming, ...
  Automata, regular expressions, ...
Stat 391 (Probability and Statistics): random variables, conditional probability, Bayes' rule, ...
Textbook
Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008
Available from the UW Bookstore, Amazon, etc.
Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing
Topics in Ling570
Unit #1: Formal Languages and Automata (2-3 weeks)
  Formal languages
  Finite-state automata
  Finite-state transducers
  Morphological analysis
Unit #2: Ngram Language Models and HMMs
  Ngram language models and smoothing
  Part-of-speech (POS) tagging: HMM, Ngram
Unit #3: Classification (2-3 weeks)
  Intro to classification
  POS tagging with classifiers
  Chunking
  Named Entity (NE) recognition
Other topics (2 weeks)
  Intro, tokenization
  Clustering
  Information extraction
  Summary
Roadmap
Motivation: applications
Language and thought
Knowledge of language
Cross-cutting themes: ambiguity, evaluation, & multi-linguality
Course Overview
Motivation: Applications
Applications of speech and language processing:
  Call routing, information retrieval, question-answering, machine translation, dialog systems, spam tagging, spell- and grammar-checking, sentiment analysis, information extraction, ...
Shallow vs. Deep Processing
Shallow processing (Ling 570):
  Usually relies on surface forms (e.g., words)
  Less elaborate linguistic representations
  E.g., part-of-speech tagging, morphology, chunking
Deep processing (Ling 571):
  Relies on more elaborate linguistic representations
  E.g., deep syntactic analysis (parsing), rich spoken language understanding (NLU)
Shallow or Deep?
Applications of speech and language processing:
  Call routing, information retrieval, question-answering, machine translation, dialog systems, spam tagging, spell- and grammar-checking, sentiment analysis, information extraction, ...
Language & Intelligence
Turing Test (1950): operationalize intelligence
  Two contestants: human, computer
  Judge: human
  Test: interact via text questions
  Question: can you tell which contestant is human?
Crucially requires language use and understanding
Limitations of Turing Test
ELIZA (Weizenbaum 1966)
Simulates a Rogerian therapist:
  User: You are like my father in some ways
  ELIZA: WHAT RESEMBLANCE DO YOU SEE
  User: You are not very aggressive
  ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
  ...
Passes the Turing Test!! (sort of)
"You can fool some of the people...."
Simple pattern-matching technique: very shallow processing
Turing Test Revived
"On the web, no one knows you're a...."
Problem: 'bots'
  Automated agents swamp services
Challenge: prove you're human
  Test: something a human can do, a 'bot can't
Solution: CAPTCHAs
  Distorted images: trivial for a human; hard for a 'bot
  Key: perception, not reasoning
Knowledge of Language
What does HAL (of 2001: A Space Odyssey) need to know to converse?

  Dave: Open the pod bay doors, HAL. (request)
  HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement)

Phonetics & phonology (Ling 450/550): sounds of a language, acoustics; legal sound sequences in words
Morphology (Ling 570): recognize and produce variation in word forms
  Singular vs. plural: door + SG -> door; door + PL -> doors
  Verb inflection: be + 1st person SG present -> am
Part-of-speech tagging (Ling 570): identify word use in the sentence
  Bay (noun) here, not verb or adjective
Syntax (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing): order and group words in the sentence
  Compare: "I'm I do, sorry that afraid Dave I can't."
Semantics (Ling 571): word meaning, individual (lexical) and combined (compositional)
  'Open': AGENT causes THEME to become open
  'pod bay doors': (pod bay) doors
Pragmatics/discourse/dialogue (Ling 571, maybe): interpret utterances in context
  Speech acts (request, statement)
  Reference resolution: I = HAL; that = 'open doors'
  Politeness: I'm sorry, I'm afraid I can't
Cross-cutting Themes
Ambiguity: how can we select among alternative analyses?
Evaluation: how well does this approach perform
  On a standard data set?
  When incorporated into a full system?
Multi-linguality: can we apply this approach to other languages? How much do we have to modify it to do so?
Ambiguity
"I made her duck"
Means....
  I caused her to duck down
  I made the (carved) duck she has
  I cooked duck for her
  I cooked the duck she owned
  I magically turned her into a duck
Ambiguity: POS
"I made her duck"
The readings hinge on part-of-speech ambiguity:
  duck: N (the bird) or V (the motion)
  her: Pron (object) or Poss (possessive)
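The combinatorics behind these readings can be sketched with a toy lexicon; the tag sets below are illustrative for this one sentence, not from any real tagger:

```python
from itertools import product

# Toy lexicon: each word maps to its possible POS tags (illustrative).
lexicon = {
    "I":    ["Pron"],
    "made": ["V"],
    "her":  ["Pron", "Poss"],  # object pronoun vs. possessive
    "duck": ["N", "V"],        # the bird vs. the motion
}

sentence = "I made her duck".split()

# Every combination of one tag per word is a candidate tag sequence.
readings = list(product(*(lexicon[w] for w in sentence)))

for tags in readings:
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tags)))

# 2 tags for "her" x 2 tags for "duck" = 4 tag sequences
print(len(readings))  # 4
```

Even four words with modest tag ambiguity yield four candidate analyses; real sentences multiply this out quickly, which is why taggers must select among alternatives rather than enumerate them.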
Ambiguity: Syntax
"I made her duck"
  I made the (carved) duck she has:
    (VP (V made) (NP (POSS her) (N duck)))
  I cooked duck for her:
    (VP (V made) (NP (PRON her)) (NP (N duck)))
Ambiguity
Pervasive
Pernicious
Particularly challenging for computational systems
Problem we will return to again and again in class
Tokenization
Given input text, split into words or sentences
Tokens: words, numbers, punctuation
Example:
  Sherwood said reaction has been "very positive."
  -> Sherwood said reaction has been " very positive . "
Why tokenize? Identify basic units for downstream processing
Tokenization
Proposal 1: split on white space
Good enough? No
Why not?
Multi-linguality:
  Languages without white-space delimiters: Chinese, Japanese
  Agglutinative languages: Finnish
  Compounding languages: German
    E.g., Lebensversicherungsgesellschaftsangestellter
    "life insurance company employee"
Even for English, misses punctuation
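A minimal sketch of Proposal 1 (plain Python, using the example sentence from above) shows exactly how punctuation stays glued to adjacent words:

```python
text = 'Sherwood said reaction has been "very positive."'

# Proposal 1: split on white space only.
tokens = text.split()
print(tokens)

# The quote and the period never become separate tokens:
# '"very' and 'positive."' come out as single "words".
```

So 'positive."' and 'positive' would count as different vocabulary items, which is one reason downstream processing needs real tokenization.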
Tokenization, again
Proposal 2: split on white space and punctuation (for English)
Good enough? No
Problems: non-splitting punctuation
  1.23 should not become 1 . 23; 1,23 should not become 1 , 23
  don't should not become don t; e-mail should not become e mail; etc.
Problems: non-splitting white space
  Names: New York; collocations: pick up
What's a word?
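One way to sketch Proposal 2 together with the non-splitting-punctuation exceptions is a regular-expression tokenizer; the patterns here are illustrative only, and the assignment tokenizer will need many more cases:

```python
import re

# Keep decimal numbers, contractions, and hyphenated words intact;
# otherwise split punctuation off as its own token.
TOKEN_RE = re.compile(r"""
    \d+[.,]\d+        # numbers with internal punctuation: 1.23, 1,23
  | \w+(?:[-']\w+)*   # words, including don't and e-mail
  | [^\w\s]           # any other punctuation mark, one per token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize('Sherwood said reaction has been "very positive."'))
print(tokenize("It costs 1.23 dollars, don't e-mail me."))
```

Note that this still says nothing about the non-splitting white-space problem: New York and pick up each come out as two tokens.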
Sentence Segmentation
Similar issues
Proposal: split on '.'
Problems?
  Other punctuation: ?, !, etc.
  Non-boundary periods: 1.23, Mr. Sherwood, m.p.g., ...
Solutions?
  Rules and dictionaries, esp. for abbreviations
  Marked-up text + machine learning
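The rules-and-dictionary solution can be sketched as below; the abbreviation list is a tiny illustrative sample, and a real system would need a far larger dictionary (or the machine-learning route):

```python
import re

# Small illustrative abbreviation list; a real system needs many more.
ABBREVS = {"mr.", "mrs.", "dr.", "m.p.g.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")):
            # Not a boundary if the token is a known abbreviation
            # or a decimal number like 1.23
            if token.lower() in ABBREVS or re.fullmatch(r"\d+\.\d+", token):
                continue
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Sherwood got 30 m.p.g. on the trip. Impressive!"))
# ['Mr. Sherwood got 30 m.p.g. on the trip.', 'Impressive!']
```

Cases the dictionary misses (an abbreviation that really does end a sentence, an unlisted abbreviation) are exactly where the marked-up-text + machine-learning approach earns its keep.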
Homework #1
General Notes and Naming
All code must run on the CL cluster under Condor
Each programming question needs a corresponding Condor command file, e.g. hw1_q1.cmd, and should run with:
  $ condor_submit hw1_q1.cmd
General comments should appear in hw1.(pdf|txt)
Your code may be tested on new data
Q1-Q2: Building and Testing a Rule-Based English Tokenizer
Q1: eng_tokenize.*
  Condor file must contain: Input = <input_file>, Output = <output_file>
  You may use additional arguments
  The i-th input line should correspond to the i-th output line
  Don't waste too much time making it perfect!!
Q2: Explain how your tokenizer handles different phenomena (numbers, hyphenated words, etc.) and identify remaining problems.
Q3: make_vocab.*
Given the input text:
  The teacher bought the bike. The bike is expensive.
The output should be:
  The 2
  the 1
  teacher 1
  bought 1
  bike. 1
  bike 1
  is 1
  expensive. 1
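A minimal sketch of what Q3 asks for, assuming white-space tokens and counts sorted in decreasing order; the ordering of words with equal counts should follow the assignment spec, which this sketch does not pin down:

```python
from collections import Counter

def make_vocab(text):
    # Count white-space-delimited tokens (no tokenization yet,
    # so "bike." and "bike" are distinct vocabulary items).
    counts = Counter(text.split())
    # Highest frequency first; equal counts keep first-seen order here.
    return sorted(counts.items(), key=lambda kv: -kv[1])

text = "The teacher bought the bike. The bike is expensive."
for word, n in make_vocab(text):
    print(word, n)
```

Running Q1's tokenizer first would merge "bike." into "bike" plus ".", which is exactly the effect Q4 asks you to measure.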
Q4: Investigating Tokenization
Run programs from Q1, Q3
Compare vocabularies with and without tokenization
Next Time
Probability
Formal languages
Finite-state automata