Regular Expressions for NLP
Details how to use regular expressions for natural language processing.
Regular Expressions & Finite State Automata
Lecture 1
What is a Regular Expression
• Notation for specifying set of strings
• Used for search
  • Corpus: text(s) to search through / learn from
• Used to define (formal) language
Creating a Regular Expression
• Perl notation uses / / around regexes
• Expressions composed of:

Category            Symbols                   Example         Example matches
Literal characters                            /the/           the, other (but not The)
Character sets      . [ ] \d \D \w \W \s \S   /[a-zA-Z]/      A, a, t, S, Z (one character at a time, so not ab as a unit)
Disjunction         |                         /(T|t)he/       The, the
Boundaries          \b \B ^ $ \n \t           /\bthe\b/       the, the. (but not other)
Quantifiers         * + ? { }                 /colou?r/       color, colour
Special characters  \                         /.+\.com/       Yahoo.com
Capturing           ( ) \1                    /(\d{5}).+\1/   Same zip twice
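The categories above can be tried out directly. The following is a minimal sketch using Python's re module, which writes patterns as raw strings rather than between slashes. The sample texts and the helper name check are made up for illustration and are not from the slides.

```python
import re

# (category, pattern, text, should_match); sample texts are illustrative.
EXAMPLES = [
    ("literal",     r"the",            "other",                 True),
    ("literal",     r"the",            "The",                   False),  # case-sensitive
    ("char set",    r"[a-zA-Z]",       "Z",                     True),
    ("char set",    r"[a-zA-Z]",       "42",                    False),
    ("disjunction", r"(T|t)he",        "The",                   True),
    ("boundary",    r"\bthe\b",        "the.",                  True),
    ("boundary",    r"\bthe\b",        "other",                 False),
    ("quantifier",  r"colou?r",        "colour",                True),
    ("quantifier",  r"colou?r",        "color",                 True),
    ("special",     r".+\.com",        "Yahoo.com",             True),
    ("special",     r".+\.com",        "Yahoo-com",             False),  # dot is escaped
    ("capturing",   r"(\d{5}).+\1",    "95211 and again 95211", True),
    ("capturing",   r"(\d{5}).+\1",    "95211 and then 95212",  False),
]

def check(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for category, pattern, text, expected in EXAMPLES:
    print(f"{category:12} /{pattern}/ on {text!r}: {check(pattern, text)}")
```

Each printed result should agree with the should_match column; re.search is used because the slides treat a regex as a search pattern, not a whole-string test.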
Creating a Regular Expression
• Defining a regex involves iteratively improving:
  • Accuracy/Precision: minimizing false positives
    • e.g. /the/ → /\bthe\b/
  • Coverage/Recall: minimizing false negatives
    • e.g. /the/ → /(T|t)he/
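One way to make this refinement loop concrete is to count both kinds of error against a small labeled sample. The sketch below assumes a hand-made list of strings (LABELED) and a hypothetical helper named errors; neither comes from the slides. The last pattern uses [Tt], which is equivalent to (T|t).

```python
import re

# Hypothetical test strings, each labeled with whether it really contains
# the word "the" (in either capitalization).
LABELED = [
    ("the cat", True),
    ("The cat", True),
    ("other people", False),
    ("bathe daily", False),
    ("no match here", False),
]

def errors(pattern):
    """Return (false positives, false negatives) for a pattern."""
    fp = fn = 0
    for text, wanted in LABELED:
        found = re.search(pattern, text) is not None
        if found and not wanted:
            fp += 1  # matched something it should not have
        elif wanted and not found:
            fn += 1  # missed something it should have matched
    return fp, fn

for pattern in [r"the", r"\bthe\b", r"\b[Tt]he\b"]:
    print(pattern, errors(pattern))
```

On this sample, adding \b removes the false positives from "other" and "bathe", and allowing a capital T removes the remaining false negative.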
Using Regular Expressions
• Generally used to search or replace:
• Perl:
    $str = "other people";
    if ($str =~ /the/) …
• Java:
    import java.util.regex.*;
    …
    Pattern r = Pattern.compile("\\d");
    Matcher m = r.matcher("D0es th1s c0nta1n d1g1ts?");
    if (m.find()) …
• Python:
    import re
    searchObj = re.search(r'the', "other people")
    phone = "Tel: 209-867-5309"
    re.sub(r'\d', '#', phone)
References
• Good tutorials and cheat sheets available online:
  • http://regexone.com/lesson
  • http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
  • http://donovanh.com/pages/regex_list.html
• Textbook also has cheat sheet on cover
ELIZA (1966)
• Cascading regexes to simulate Rogerian psychologist
• Available online: http://nlp-addiction.com/eliza/
• Embodiment of Searle’s “Chinese Room”
ELIZA
• Cascading regexes to simulate Rogerian psychologist
  • s/I’m/YOU ARE/
  • s/\b(M|m)y\b/YOUR/
ELIZA
• Cascading regexes to simulate Rogerian psychologist
  • s/YOU ARE (depressed|sad)/I AM SORRY TO HEAR YOU ARE \1/
  • s/YOU ARE (depressed|sad)/WHY DO YOU THINK THAT YOU ARE \1/
ELIZA
• Cascading regexes to simulate Rogerian psychologist
  • s/\ball\b/IN WHAT WAY/
  • s/\balways\b/CAN YOU THINK OF A SPECIFIC EXAMPLE/
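The cascade idea can be sketched in a few lines of Python: an ordered list of substitutions, each applied to the output of the one before. This is only an illustration, not the original ELIZA program. The names RULES and respond are hypothetical, the apostrophe is a plain ASCII one, and the leading and trailing .* are an added assumption so that a keyword rule replaces the whole utterance and not just the keyword.

```python
import re

# Ordered (pattern, replacement) rules adapted from the slides.
# Earlier rules normalize the input; later rules build the reply.
RULES = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\b[Mm]y\b", "YOUR"),
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Apply every substitution in order, feeding each result to the next."""
    for pattern, replacement in RULES:
        utterance = re.sub(pattern, replacement, utterance)
    return utterance

print(respond("I'm sad"))            # I AM SORRY TO HEAR YOU ARE sad
print(respond("They're all alike"))  # IN WHAT WAY
print(respond("My boyfriend made me come here"))
```

Rule order matters: the first rule rewrites I'm to YOU ARE, which is what lets the third rule fire on the same input.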
Finite State Automata
Finite State Automata (FSAs)
• Regular expressions are a convenient way to describe an FSA:
  • Sheep language: /baa+!/
• FSAs and their probabilistic cousins (Markov models) are used extensively in NLP.
  • Perfectly capture regular languages
  • Capture parts of natural languages: phonology, morphology, syntax.
FSA representation
• States are represented by circles
• q0 or state with incoming arrow: start state
• Double-circled states: final/accepting states
• Directed links: transitions between states
• Imagine tape with input – try to match to transition:
Formal Representation
• Specify the following:
  • Q = {q0, q1, …, qN-1}: a finite set of N states
  • Σ: a finite input alphabet of symbols (symbols can have internal structure)
  • q0: the start state
  • F: the set of final states, F ⊆ Q
  • δ(q, i): a transition function that maps Q × Σ to Q
Transition Table
• Convenient for computer representation, too:

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅
D-Recognize
• Deterministic: no choice points
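A deterministic recognizer is a single loop over the input tape with one table lookup per symbol. The sketch below encodes the sheep-language transition table as a Python dict. The names TRANSITIONS, ACCEPT_STATES and d_recognize are chosen for this example, and treating state 4 as the only accepting state is an assumption based on /baa+!/.

```python
# Transition table for the sheep language /baa+!/ (states 0-4).
# Missing keys play the role of the empty cells in the table.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}  # assumption: state 4 is the only final state

def d_recognize(tape, transitions=TRANSITIONS, accept=ACCEPT_STATES, start=0):
    """Return True if the deterministic FSA accepts the whole tape."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False  # no legal move from here: reject
        state = transitions[(state, symbol)]
    # All input consumed: accept only if we ended in a final state.
    return state in accept

print(d_recognize("baaa!"))  # True
print(d_recognize("ba!"))    # False
```

Because there are no choice points, the recognizer never has to back up: each symbol either advances the state or causes an immediate rejection.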
Generative Uses
• Any model that recognizes a formal language (FSA, regex, CFG) can be used to generate valid strings.
• Starting in q0, select random transitions until a final state is reached.
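As a rough illustration of generation, the same kind of table can drive a random walk. This sketch reuses the sheep-language machine; the function name generate and the fixed seed are choices made for the example. It assumes every non-final state has at least one outgoing transition, so the walk cannot get stuck.

```python
import random

# Sheep-language FSA for /baa+!/; state 4 is assumed to be the final state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}

def generate(rng, transitions=TRANSITIONS, accept=ACCEPT_STATES, start=0):
    """Random walk from the start state until a final state is reached."""
    state, output = start, []
    while state not in accept:
        # All (symbol, next_state) moves available from the current state.
        options = sorted(
            (sym, nxt) for (src, sym), nxt in transitions.items() if src == state
        )
        symbol, state = rng.choice(options)
        output.append(symbol)
    return "".join(output)

rng = random.Random(0)  # seeded only so that runs are repeatable
samples = [generate(rng) for _ in range(5)]
print(samples)
```

Every string produced this way follows a legal path to a final state, so it is accepted by the same machine.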
Non-Deterministic FSAs
• More than one transition possible for a particular state and input combination:
• Or uses epsilon transitions, where no input characters are read:
Non-Deterministic FSAs
• In an NFSA, at least one path through the machine leads to an accept state for any string in the language defined by the machine.
• Not all paths through the machine for an acceptable string lead to an accept state.
• No paths through the machine lead to an accept state for a string not in the language.
• Challenge: what to do after making a wrong transition choice?
Resolving Non-Determinism
• Backup: on reaching a choice point, mark the state and input position (search-state); if needed, roll back to it.
• Look-Ahead: Look at following input symbols to try to choose correct transition.
• Parallelism: Follow each of the transition options in parallel.
• Convert: Every NFSA can be converted to an equivalent deterministic FSA.
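The parallelism option can be sketched by tracking the set of all states the machine could be in after each symbol. The example machine is a non-deterministic variant of the sheep-language FSA in which state 2 has two possible moves on a. The names NFSA and recognize_parallel are illustrative, and epsilon transitions are left out to keep the sketch short.

```python
# Non-deterministic sheep machine: state 2 can stay in 2 or move to 3 on "a".
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
ACCEPT_STATES = {4}  # assumption: state 4 is the only final state

def recognize_parallel(tape, transitions=NFSA, accept=ACCEPT_STATES, start=0):
    """Follow every possible transition at once by keeping a set of states."""
    current = {start}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), set())
        }
        if not current:
            return False  # every parallel path has died
    # Accept if at least one surviving path ended in a final state.
    return bool(current & accept)

print(recognize_parallel("baaa!"))  # True
print(recognize_parallel("ba!"))    # False
```

Treating each reachable set of states as a single state is also the idea behind converting an NFSA to a deterministic FSA.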
Backup
• Need to modify transition table:
  • Add epsilon transition column
  • Allow multiple destination states for given search-state.

         Input
State    b    a    !    ε
0        1    ∅    ∅    ∅
1        ∅    2    ∅    ∅
2        ∅    2,3  ∅    ∅
3        ∅    ∅    4    ∅
4        ∅    ∅    ∅    ∅
         Input
State    b    a    !    ε
0        1    ∅    ∅    ∅
1        ∅    2    ∅    ∅
2        ∅    3    ∅    ∅
3        ∅    ∅    4    2
4        ∅    ∅    ∅    ∅
NFSA Search: BFS or DFS
Keep a stack (DFS) or queue (BFS) of search-states remaining to explore.
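The agenda idea can be sketched as follows: each search-state is a (state, tape position) pair, and the only difference between DFS and BFS is which end of the agenda is popped. The function name nd_recognize, the EPSILON marker and the depth_first flag are assumptions made for this example; the machine is the sheep-language NFSA in which state 2 has two moves on a.

```python
from collections import deque

EPSILON = ""  # table column for moves that consume no input

# NFSA for the sheep language: state 2 has two choices on "a".
NFSA = {
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],
    (3, "!"): [4],
}
ACCEPT_STATES = {4}  # assumption: state 4 is the only final state

def nd_recognize(tape, depth_first=True, transitions=NFSA,
                 accept=ACCEPT_STATES, start=0):
    """Agenda-based search over (state, tape position) search-states."""
    agenda = deque([(start, 0)])
    seen = set()
    while agenda:
        # Popping from the right gives a stack (DFS); from the left, a queue (BFS).
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if (state, pos) in seen:
            continue  # already explored; also guards against epsilon loops
        seen.add((state, pos))
        if pos == len(tape) and state in accept:
            return True
        # Epsilon moves keep the tape position unchanged.
        for nxt in transitions.get((state, EPSILON), []):
            agenda.append((nxt, pos))
        # Ordinary moves consume one input symbol.
        if pos < len(tape):
            for nxt in transitions.get((state, tape[pos]), []):
                agenda.append((nxt, pos + 1))
    return False  # agenda exhausted without reaching a final state

print(nd_recognize("baaa!"))                     # True
print(nd_recognize("baaa!", depth_first=False))  # True
print(nd_recognize("ba!"))                       # False
```

A wrong choice is never fatal here: the alternative search-states are still waiting on the agenda, which is the backup strategy in code form.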
Computing Theory …
• You may recall from (or learn in) COMP 147:
The class of languages definable by regular expressions is the same as the class definable by FSAs. These are called regular languages.
Your Turn …
• Lab 1: Regular Expression Practice
• Project 1: ELIZA reborn