regular expressions (re) used for specifying text search strings. standarized and used widely (unix:...
Post on 18-Dec-2015
Regular Expressions (RE)
• Used for specifying text search strings.
• Standardized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
• A RE is a notation for characterizing a set of strings. Formally a language is defined as a (possibly infinite) set of strings over a given alphabet.
• A regular expression search consists of a search pattern and a text to search through.
Basic RE Patterns
• E.g. /woodchuck/
• Case sensitive: /Woodchuck/ is not the same as /woodchuck/
• Disjunction: /[Ww]oodchuck/ : Woodchuck or woodchuck
• Ranges
– /[A-Z]/ : [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
– /[0-9]/ : [0123456789]
• Negation
– [^a] : anything that is not an “a”
– [^A-Z] : anything that is not an uppercase letter
– But: [a^b] : one of “a”, “^” or “b” (the caret negates only when it comes first inside the brackets)
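The bracket patterns on this slide can be tried out with Python's `re` module. This is only a sketch: the helper `found` and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

# Matching is case sensitive by default.
print(found(r"woodchuck", "a Woodchuck"))      # False
# A bracketed set matches any one of the listed characters.
print(found(r"[Ww]oodchuck", "a Woodchuck"))   # True
# Ranges and negation inside brackets.
print(re.findall(r"[A-Z]", "Room 7B"))         # ['R', 'B']
print(re.findall(r"[0-9]", "Room 7B"))         # ['7']
print(re.findall(r"[^A-Z]", "aBc"))            # ['a', 'c']
```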
Basic RE Patterns
• Optional characters
– /woodchucks?/ : woodchuck or woodchucks
• Zero or more instances (Kleene star)
– /baa*!/ : ba! or baa! or baaa! or baaaa! …
– /c[ab]*c/ : cabababc or caaaac or cc …
– Note: /a*/ matches in any string, because it also matches zero instances of “a” (the empty string).
• One or more instances
– /ba+!/ : ba! or baa! or baaa! or baaaa! …
– /[0-9]+/: A string of digits.
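A short Python sketch of the three counters (`?`, `*`, `+`). The helper `whole` is a hypothetical name; it uses `re.fullmatch` so that the entire string has to fit the pattern.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"woodchucks?", "woodchuck"))    # True  (the "s" is optional)
print(whole(r"woodchucks?", "woodchucks"))   # True
print(whole(r"baa*!", "ba!"))                # True  (zero extra a's)
print(whole(r"c[ab]*c", "cc"))               # True  (zero repetitions of [ab])
print(whole(r"ba+!", "b!"))                  # False (+ needs at least one a)
# a* succeeds even where there is no "a": it matches the empty string.
print(repr(re.search(r"a*", "xyz").group())) # ''
print(re.findall(r"[0-9]+", "room 42, floor 7"))  # ['42', '7']
```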
Basic RE Patterns
• Wildcards: /./ matches any single character
– /beg.n/ : begin, begun, beg_n…
• Anchors:
– Pattern at beginning of string: /^the car/ matches “the car I drive” but not “I drive the car”
– Pattern at end of string: /the car$/ matches “I drive the car” but not “the car I drive”
– \b matches a word boundary: /\bthe\b/ matches “the” but not “other”
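The wildcard and anchor examples can be checked in Python as follows. The sample sentences are the ones from the slide, plus a few invented ones.

```python
import re

# The dot matches any single character (except a newline, by default).
print(re.findall(r"beg.n", "begin begun beg_n began"))
# ['begin', 'begun', 'beg_n', 'began']

# ^ anchors the pattern at the start, $ at the end of the string.
print(bool(re.search(r"^the car", "the car I drive")))   # True
print(bool(re.search(r"^the car", "I drive the car")))   # False
print(bool(re.search(r"the car$", "I drive the car")))   # True

# \b is a word boundary, so "the" inside "other" is skipped.
print(re.findall(r"\bthe\b", "the other one, the end"))  # ['the', 'the']
```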
Basic RE Patterns
• Parentheses: (abc)+ matches abc, abcabc, abcabcabc ...
• Disjunction: /cit(y|ies)/ matches city or cities
• Repetitions: /(abc){3}/ matches abcabcabc
• Backslash: Used for escaping special characters.
– \*, \+, \., \? ...
• Aliases
– \n: newline, \t: tab, \d: [0-9], \w: [a-zA-Z0-9_]
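A Python sketch of grouping, alternation, counted repetition, escaping and aliases. One caveat about the aliases: for Python `str` patterns `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given.

```python
import re

print(bool(re.fullmatch(r"(abc)+", "abcabc")))       # True
print(bool(re.fullmatch(r"cit(y|ies)", "cities")))   # True
print(bool(re.fullmatch(r"(abc){3}", "abcabc")))     # False (needs exactly 3)
# A backslash turns the special dot into a literal dot.
print(re.findall(r"\.", "a.b.c"))                    # ['.', '.']
# Aliases: \d for digits, \w for letters, digits and underscore.
print(re.findall(r"\d+", "3 cats, 12 dogs"))         # ['3', '12']
print(re.findall(r"\w+", "snake_case word"))         # ['snake_case', 'word']
```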
RE Substitution
• s/regexp/replacement/ E.g. s/colour/color/
• Back references: \1, \2, \3 …
– s/([0-9]+)/<\1>/ : the 35 boxes -> the <35> boxes
– s/^\s*(\w+)\W+(\w+)/\2 \1/ : reverses the first two words of a sentence.
– Also used in search REs
• /A ([a-z]+) is a \1/ : matches “A car is a car”. The parentheses capture the word that \1 refers back to.
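The Perl-style `s/…/…/` commands correspond roughly to `re.sub` in Python, which also writes back references as `\1`, `\2`. A sketch with invented sample sentences; note that a back reference in a search pattern needs a parenthesised group to refer to.

```python
import re

print(re.sub(r"colour", "color", "the colour red"))
# the color red
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))
# the <35> boxes
print(re.sub(r"^\s*(\w+)\W+(\w+)", r"\2 \1", "hello big world"))
# big hello world

# Back reference inside a search pattern: \1 must repeat the captured word.
pattern = r"A ([a-z]+) is a \1"
print(bool(re.search(pattern, "A car is a car")))    # True
print(bool(re.search(pattern, "A car is a bus")))    # False
```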
ELIZA
• Simulated the responses of a psychologist based on simple pattern substitution.
• It first cascades through a set of RE substitutions that change first-person forms in the input to second person, for example s/I’m/YOU ARE/, s/my/YOUR/ ...
• Then it runs the input through RE substitutions looking for relevant patterns and produces the appropriate output. e.g.
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR THAT YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1\?/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
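The two stages can be sketched in Python as below. This is an illustration, not the original ELIZA program: the function name `eliza_reply`, the fallback answer `PLEASE GO ON`, and the slightly relaxed patterns (no mandatory spaces around the match, so that short inputs also fire) are all assumptions.

```python
import re

# Stage 2 rules: (pattern, response template); the first match wins.
RULES = [
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    # Stage 1: cascade of substitutions that swap the person.
    text = re.sub(r"\bI[’']m\b", "YOU ARE", utterance)
    text = re.sub(r"\bmy\b", "YOUR", text)
    # Stage 2: look for a relevant pattern and build the response.
    for pattern, response in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(response)
    return "PLEASE GO ON"  # hypothetical default, not from the slides

print(eliza_reply("I'm depressed today"))
# I AM SORRY TO HEAR THAT YOU ARE depressed
print(eliza_reply("my boss always shouts"))
# CAN YOU THINK OF A SPECIFIC EXAMPLE
```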
Finite State Automata (FSA)
• REs (that don’t use back-references) can be implemented as finite-state automata.
• Conversely, any FSA can be described by a regular expression.
• The languages that can be described by a RE or a FSA form a class called Regular Languages (RL).
Finite State Automata
• A FSA is represented as a graph with a finite set of nodes (called states) and directed arcs between pairs of states (called transitions) labeled with symbols from the alphabet.
• One state is a start state, represented by an incoming arrow.
• Some states are final or accepting states represented by a double circle.
FSA Example
Sheeptalk: baa! baaa! baaaa! baaaaa! …
Equivalent to RE: /baaa*!/
FSA Recognition
Examples:
baaa! Succeeds
aba!b Fails
FSA State Transition Table
• Alternative representation for FSA
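As a sketch, a transition table for the sheeptalk automaton /baaa*!/ can be stored as a Python dictionary and used directly for recognition. The state numbering 0 to 4 and the names `TRANSITIONS` and `d_recognize` are assumptions made for this example; a missing table entry means that no transition exists.

```python
# (state, input symbol) -> next state
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop for any further a's
    (3, "!"): 4,
}
START = 0
FINAL = {4}

def d_recognize(tape):
    """Deterministic recognition: follow the table one symbol at a time."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no arc for this symbol: reject
            return False
    return state in FINAL          # accept only if we end in a final state

print(d_recognize("baaa!"))   # True
print(d_recognize("aba!b"))   # False
```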
FSA Example
Formal FSA Definition
• Q: a finite set of states (q0, q1, q2, …)
• Σ: a finite input alphabet of symbols
• q0: the start state (first state)
• F: the set of final states (a subset of Q)
• δ(q,i): the transition function from states and inputs to states. Given a state q and an input i, it returns a new state q’.
Deterministic FSA (DFSA): the recognition of a string has no choice points.
Non Deterministic FSA (NFSA)
• When in state q2 with input a, the FSA has the choice to move to state q3 or remain in state q2.
Empty Arcs
From state q3 the FSA can move to state q2, without looking at the input (without advancing the tape).
NFSA Transition Tables
An extra ε column is added.
The transitions are now sets of states (instead of single states)
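One possible encoding of such a table in Python, assuming the sheeptalk automaton with an empty arc from state 3 back to state 2. The marker `EPS` and the helper `epsilon_closure` are names invented for this sketch.

```python
EPS = ""  # hypothetical marker for the extra ε column

# state -> {symbol -> set of next states}
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {3}},
    3: {"!": {4}, EPS: {2}},   # ε arc: 3 -> 2 without reading input
    4: {},
}

def epsilon_closure(states):
    """All states reachable from `states` using only ε arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in NFSA[state].get(EPS, set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

print(epsilon_closure({3}))   # {2, 3}
```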
Accepting Strings with NFSA
• Since there is a choice of which arc to follow it is possible to take the wrong path and reject a string that should be accepted.
• All possible paths should be followed and if even one reaches a final state then the string is accepted.
• Computational approaches
– Backup: We store the current search-state (the state of the FSA and the position of the tape) at each choice point; when we reach a dead end we back up to that search-state and try another path from there.
– Lookahead: We look ahead in the input to decide which path to take.
– Parallelism: Alternative paths are explored in parallel.
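The parallel approach can be sketched by tracking the set of all states the automaton could be in after each symbol. The table below assumes the sheeptalk NFSA in which state 2 on input a may stay in 2 or move to 3; the name `accepts_parallel` is hypothetical.

```python
# (state, input symbol) -> set of possible next states
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point
    (3, "!"): {4},
}
FINAL = {4}

def accepts_parallel(tape):
    """Explore every path at once by keeping a set of current states."""
    current = {0}
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in NFSA.get((state, symbol), set())}
        if not current:            # every path has hit a dead end
            return False
    return bool(current & FINAL)   # accepted if any path ends in a final state

print(accepts_parallel("baaa!"))   # True
print(accepts_parallel("ba!"))     # False
```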
NFSA Recognition as Search
• The NFSA recognition can be seen as a search through a space of search-states. This consists of all the possible pairings of FSA-states and tape positions.
• The order in which these search-states are visited (i.e. the decision about which possible path to follow) is important for performance.
• Common strategies are depth-first and breadth-first search.
• For larger search spaces it may be necessary to use more complex search techniques (e.g. dynamic programming or A*).
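A minimal agenda-based sketch of recognition as search, assuming the same sheeptalk NFSA without empty arcs. Each agenda item is a search-state (FSA state, tape position); treating the agenda as a stack gives depth-first search, treating it as a queue gives breadth-first search. The function name and the `depth_first` flag are made up for this example.

```python
from collections import deque

NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
FINAL = {4}

def nd_recognize(tape, depth_first=True):
    agenda = deque([(0, 0)])               # start state, start of tape
    while agenda:
        # Stack (LIFO) -> depth-first; queue (FIFO) -> breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape):
            if state in FINAL:
                return True
            continue                       # dead end: back up to another item
        for nxt in NFSA.get((state, tape[pos]), set()):
            agenda.append((nxt, pos + 1))  # remember every alternative
    return False

print(nd_recognize("baaa!"))                      # True
print(nd_recognize("baaa!", depth_first=False))   # True
print(nd_recognize("aba!b"))                      # False
```

Both orders give the same answer; they differ only in how many search-states are visited before it is found.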
Relating DFSA and NFSA
• For every NFSA there exists an equivalent DFSA (i.e. that accepts exactly the same set of strings).
• The idea behind the proof is based on converting a NFSA to an equivalent DFSA. The resulting DFSA may have many more states than the original NFSA (up to 2^N states for a NFSA with N states).
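The conversion is usually done with the subset construction, in which every DFSA state is a set of NFSA states. The sketch below assumes an NFSA without empty arcs (the sheeptalk automaton with a choice at state 2); the function name `determinize` and its return shape are choices made for this example.

```python
def determinize(nfsa, start, finals, alphabet):
    """Subset construction: each DFSA state is a frozenset of NFSA states."""
    start_set = frozenset({start})
    dfsa = {}
    states = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(nxt
                               for state in current
                               for nxt in nfsa.get((state, symbol), ()))
            if not target:
                continue                    # no arc for this symbol
            dfsa[(current, symbol)] = target
            if target not in states:
                states.add(target)
                todo.append(target)
    # A subset is final if it contains at least one final NFSA state.
    dfsa_finals = {s for s in states if s & finals}
    return dfsa, start_set, dfsa_finals, states

NFSA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfsa, start, finals, states = determinize(NFSA, 0, {4}, "ba!")
print(len(states))                     # 5
print(dfsa[(frozenset({2}), "a")])     # frozenset({2, 3})
```

Only the subsets that are actually reachable from the start state are built, which is why this small example ends up with 5 states.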
Morphological Parsing and Recognition
• Morphological recognition: Accepts and rejects forms:
– Accept: geese
– Reject: gooses
• Morphological parsing: produces a morphological analysis (stem followed by morphological features)
– geese: goose +N +PL
– cats: cat +N +PL
– ground: ground +N +SG, grind +V +PPart
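A toy Python sketch of the difference: `parse` returns every analysis of a form, and an empty list means the form is rejected. The word lists are tiny hypothetical stand-ins for a real lexicon and all names are invented.

```python
REG_NOUNS = {"cat", "ground"}           # take the regular plural -s
IRREG_SG_NOUNS = {"goose"}              # singular stems with irregular plurals
IRREGULAR_FORMS = {                     # forms that must be listed explicitly
    "geese": ["goose +N +PL"],
    "ground": ["grind +V +PPart"],
}

def parse(word):
    """Return all morphological analyses of `word` (empty list = rejected)."""
    analyses = []
    if word in REG_NOUNS or word in IRREG_SG_NOUNS:
        analyses.append(f"{word} +N +SG")
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        analyses.append(f"{word[:-1]} +N +PL")
    analyses.extend(IRREGULAR_FORMS.get(word, []))
    return analyses

print(parse("geese"))    # ['goose +N +PL']
print(parse("ground"))   # ['ground +N +SG', 'grind +V +PPart']
print(parse("gooses"))   # []
```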
Morphological Parsing
• A morphological parser is composed of
– lexicon: the list of stems and affixes in a language, together with basic information about them.
– morphotactics: model of morpheme ordering, that defines which morpheme classes may follow other classes.
– orthographic rules: spelling rules used to model changes that occur in the language (e.g. city+s -> cities)
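As an illustration of an orthographic rule, the city+s example can be written as a regular expression substitution. This sketch assumes that "+" marks the morpheme boundary and covers only the one rule "consonant + y becomes ies before the plural s"; the function name is hypothetical.

```python
import re

def apply_spelling(form):
    """Turn a stem+suffix form such as 'city+s' into its surface spelling."""
    # Consonant + y before the plural suffix: y -> ie.
    form = re.sub(r"([^aeiou])y\+s$", r"\1ies", form)
    # Otherwise the boundary marker is simply dropped.
    return form.replace("+", "")

print(apply_spelling("city+s"))   # cities
print(apply_spelling("boy+s"))    # boys
print(apply_spelling("cat+s"))    # cats
```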
Lexicon
• A repository of words:
a, AAA, AA, Aachen, aardvark, aardwolf...
• Not practical to list every word in the language. Impossible for some languages (e.g. Finnish, Turkish...). Usually only the stems and the affixes are listed.
• Ideally every possible word (or stem) should be in the lexicon, including abbreviations and proper names.
• Often along with stems in the lexicon we keep information about stem classes.
– e.g. dog: reg-noun, goose: irreg-sg-noun,
– geese: irreg-pl-noun, -s: plural-suffix
Morphotactics
• Commonly represented as a FSA.
• e.g. Simple FSA for plural formation in English
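A sketch of such an FSA in Python, in which the arcs are labeled with stem classes rather than letters. The state numbers, the small lexicon and the function name `recognize` are assumptions for illustration, and no spelling rules are applied, so a form like "foxes" would need extra machinery.

```python
LEXICON = {
    "dog": "reg-noun",
    "cat": "reg-noun",
    "goose": "irreg-sg-noun",
    "geese": "irreg-pl-noun",
    "s": "plural-suffix",
}

# (state, morpheme class) -> next state
MORPHOTACTICS = {
    (0, "reg-noun"): 1,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
    (1, "plural-suffix"): 2,
}
FINAL = {1, 2}

def recognize(word, state=0):
    """Try every split of `word` into morphemes allowed by the FSA."""
    if not word:
        return state in FINAL
    for end in range(1, len(word) + 1):
        morpheme_class = LEXICON.get(word[:end])
        nxt = MORPHOTACTICS.get((state, morpheme_class))
        if nxt is not None and recognize(word[end:], nxt):
            return True
    return False

print(recognize("cats"))     # True
print(recognize("geese"))    # True
print(recognize("gooses"))   # False
```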
Morphotactics
• In cases where a morphological process is more complicated, or not fully productive (unhappy, unreal but *unbig, *unred), the morphotactics FSA may become quite complicated and many different stem classes may be necessary.