finite automata and regular expressions i206 fall 2010 john chuang some slides adapted from marti...

21
Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

Post on 20-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

Finite Automata and Regular Expressions

i206 Fall 2010

John ChuangSome slides adapted from Marti Hearst

Page 2: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 2

QuickTime™ and a decompressor

are needed to see this picture.

http

://x

kcd.

com

/208

/

Page 3: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 3

What is a Computer?

Theory of computation:- What are the fundamental capabilities and limitations of computers?

- Complexity theory: what makes some problems computationally hard and others easy?

- Computability theory: what makes some problems solvable and others unsolvable?

- Automata theory: definitions and properties of mathematical models of computation

Source: Sipser, 1997

Page 4: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 4

Finite Automata

Also known as finite state automata (FSA) or finite state machine (FSM)

A simple mathematical model of a computer

Applications: hardware design, compiler design, text processing

Page 5: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 5

A First Example

Touch-less hand-dryer

hand

hand hand

DRYEROFF

DRYERON

hand

Page 6: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 6

Finite Automata State Graphs

• The start state

• An accepting state

• A transitionx

• A state

Page 7: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 7

Finite Automata

Transition:s1 x s2

- In state s1 on input “x” go to state s2

If end of input- If in accepting state => accept- Else => reject

If no transition possible => reject

Page 8: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 8

Language of a FA

Language of finite automaton M: set of all strings accepted by M

Example:

Which of the following are in the language?- x, tmp2, 123, a?, 2apples

A language is called a regular language if it is recognized by some finite automaton

letterletter | digit

S A

Page 9: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 9

Example

What is the language of this FA?

+

digit

S

B

A-

digit

digit

Page 10: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 10

Regular Expressions

Regular expressions are used to describe regular languages

Arithmetic expression example: (8+2)*3

Regular expression example: (\+|-)?[0-9]+

Page 11: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 11

Example

What is the language of this FA? Regular expression: (\+|-)?[0-9]+

+

digit

S

B

A-

digit

digit

Page 12: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 12

Three Equivalent Representations

Finite automata

Regularexpressions

Regular languages

Each can

describethe others

Theorem:

For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.

Adapted from Jurafsky & Martin 2000

Page 13: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 13

Regex Rules

Goal: match patterns // String of characters matches the same string

woodchuck “how much wood does a woodchuck chuck?”

p “you are a good programmer”

pink elephant “this is not a pink elephant.”! “Keep you hands to yourself!”

. Wildcard; matches any character at that positionp.nt pant, pint, punt

Page 14: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 14

Regex Rules ? Zero or one occurrences of the preceding

character/regexwoodchucks? “how much wood does a woodchuck chuck?”behaviou?r “behaviour is the British spelling of

behavior”

* Zero or more occurrences of the preceding character/regexbaa* ba, baa, baaa, baaaa …ba* b, ba, baa, baaa, baaaa …

[ab]* , a, b, ab, ba, baaa, aaabbb, …[0-9][0-9]* any positive integer, or zerocat.*cat A string where “cat” appears twice anywhere

+ One or more occurrences of the preceding character/regexba+ ba, baa, baaa, baaaa …

{n} Exactly n occurrences of the preceding character/regexba{3} baaa

Page 15: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 15

Regex Rules * is greedy:

<.*> <a href=“index.html”>Home</a>

Lazy (non-greedy) quantifier:<.*?> <a href=“index.html”>Home</a>

Page 16: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 16

Regex Rules

[] Disjunction (Union)[wW]ood “how much wood does a Woodchuck chuck?”

[abcd]* “you are a good programmer”[A-Za-z0-9] (any letter or digit)[A-Za-z]* (any letter sequence)

| Disjunction (Union)(cats?|dogs?)+ “It’s raining cats and a dog.”

( ) Grouping(gupp(y|ies))* “His guppy is the king of guppies.”

Page 17: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 17

Regex Rules ^ $ \b Anchors (start/end of input string; word boundary)^The “The cat in the hat.”^The end\.$ “The end.”^The .* end\.$ “The bitter end.”(the)* “I saw him the other day.”(\bthe\b)* “I saw him the other day.”

Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT[^A-Z]* (anything not an upper case letter)

Page 18: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 18

Regex Rules \ Escape characters

\. “The + and \ characters are missing.”

\+ “The + and \ characters are missing.”

\\ “The + and \ characters are missing.”

Page 19: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 19

Operator Precedence

What is the difference? - [a-z][a-z]|[0-9]*- [a-z]([a-z]|[0-9])*

Operator Precedence

() highest

* + ? {}

sequences, anchors

| lowest

Page 20: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 20

Regex for Dollars

No commas\$[0-9]+(\.[0-9][0-9])?

With commas\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?

With or without commas \$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*| [0-9]*) (\.[0-9][0-9])?

Page 21: Finite Automata and Regular Expressions i206 Fall 2010 John Chuang Some slides adapted from Marti Hearst

John Chuang 21

Regex in Python

import re

result = re.search(pattern, string) result = re.findall(pattern, string) result = re.match(pattern, string)

Python documentation on regular expressions - http://docs.python.org/library/re.html- Some useful flags like IGNORECASE, MULTILINE, DOTALL, VERBOSE