Finite State Languages
CSE 140 - Intro to Cognitive Science
1
The Computational Modeling of Language: Finite State Languages
Lecture 1: Slides 1-21
Lecture 2: Slides 22-…
2
Language Exists at Many Levels
• Sounds
• Words
• Sentences (utterances)
• Discourse (text)
• Dialog
• Combined with other modalities
• Etc.
We will focus on a formal account of the sentence level
• Provides a formal account of grammaticality judgments
• Simple yet powerful models
3
A Formal Theory
We will present a mathematical theory of language.
Because of time constraints we will be somewhat informal in introducing concepts, but EVERYTHING we present can be made completely rigorous, starting from definitions and proceeding through proofs.
Strategy:
• First, examples from English
• Then, more abstract examples
4
The Set of Strings over an Alphabet
Given a finite alphabet Σ, the set of strings over Σ will be denoted by Σ*, including the null string.
Let Σ = {all words of English}
• Then Σ* denotes all strings of words of English, including the empty (null) string
• Only some of these strings are grammatical sentences of English
Let Σ = {a, b}. Then Σ* denotes all strings of a's and b's, including the empty (null) string.
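Since Σ* is infinite, a program can only list it up to some length. A minimal sketch, assuming Python and a hypothetical helper name `strings_up_to`, of enumerating the strings over Σ = {a, b} shortest first:

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        # repeat=0 yields one empty tuple, i.e. the null string
        for symbols in product(sorted(alphabet), repeat=n):
            yield "".join(symbols)

print(list(strings_up_to({"a", "b"}, 2)))
```

The null string shows up as the single string of length 0.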
5
A Language over Σ
A language L over Σ is a subset of Σ*
Let LE be the set of all grammatical sentences of English
• LE ⊆ Σ* is a language over Σ = {all words of English}
Sentences in LE:
John likes apples
Apples like John
Two is greater than four
The black cat is on the mat
….
(Notation: L ⊆ Σ* means L is a subset of Σ*)
6
Sentences not in LE
The the the
John like peanuts
Every student hates any course
The rat the cat the dog chased bit ate the cheese (?)
etc.
7
Another Language Over Σ
Another L = {a^m b^n | m ≥ 1, n ≥ 2} = Set of all strings of a's and b's such that
• All a's precede all b's and
• There is at least one a and
• There are at least two b's
- L is a language over {a, b}, i.e. L ⊆ {a, b}*
(Notation: a^2 means aa, a^2 b^3 means aabbb)
(Notation: {x | y} means the set of all xs such that condition y is true of those xs)
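The three conditions can be checked directly on a string. A small sketch, assuming Python; the function name `in_L` and its parameters are illustrative and not part of the slides:

```python
def in_L(w, min_a=1, min_b=2):
    """True if w = a^m b^n with m >= min_a and n >= min_b."""
    m = len(w) - len(w.lstrip("a"))   # number of leading a's
    rest = w[m:]                      # everything after the leading a's
    # rest must consist of b's only, and there must be enough of them
    return m >= min_a and len(rest) >= min_b and set(rest) <= {"b"}

for w in ["abb", "aabbb", "ab", "bb", "abab"]:
    print(w, in_L(w))
```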
8
Infinite Language from Finite Models
A language over Σ can be finite or infinite
• LE: the set of all grammatical sentences of English
• LE is potentially infinite
A potentially infinite set can often be given a finite characterization in several alternative ways:
• grammar characterization
• machine characterization
• behavioral characterization
9
Road Map to the Reading!
We will begin with Chapters 17 and 18, returning to the general characterization of grammars (Sections 16.4 and 16.5) later. (Skip Section 16.3)
Chapter 17:
• machine characterization of languages with finite state machines
• equivalent grammar characterization
• equivalent ‘behavioral’ characterization
  • in terms of terminal symbols only
  • regular expressions
10
Introduction to Finite State Automata
Finite State Automata (FSAs) are characterized by:
• States (circles), including initial and final states
• A vocabulary (here {the, a, big, very, book, poor})
• Transitions between states (arrows)
An FSA accepts a language L
[Figure: FSA diagram with states q0 to q4 and arrows labeled the, a, big, very, book, poor]
11
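Another Finite State Automaton
States: K = {q0, q1}
• Initial state: q0
• Final states: F = {q1}
A vocabulary: Σ = {a, b}
(Back to English on Slide 27…)
[Figure: FSA with states q0 and q1; arrows from q0 to q0 on a, from q0 to q1 on b, and from q1 to q1 on b]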
12
Another Finite State Automaton II
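The arrows are a graphical representation of the transition function δ:
q0, a → q0
q0, b → q1
q1, b → q1
[Figure: FSA with states q0 and q1; arrows from q0 to q0 on a, from q0 to q1 on b, and from q1 to q1 on b]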
13
A Formal Definition of an FSA M
We can characterize a language L by an FSA M = (K, Σ, δ, q0, F) where
• K is the finite set of states
• Σ is the finite input alphabet
• q0 ∈ K is the initial state
• F ⊆ K is the set of final states
• δ is the transition function
  • for each state and each input symbol, δ specifies the next state of the machine
(Notation: a ∈ A means a is an element of the set A.)
14
The Language Accepted by an FSA
Given an FSA M, the language of M (i.e., the set of strings accepted by M) is defined as follows:
L(M) = { w |
• w ∈ Σ* and
• starting with q0 and
• following the transitions as specified by δ,
• M reaches one of the final states once all of w has been read }
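The definition of L(M) translates almost line by line into a program. A minimal sketch, assuming Python and a dictionary encoding of δ (the encoding and the name `accepts` are choices made here, not part of the slides), run on the two-state machine from slides 11 and 12:

```python
def accepts(delta, start, finals, w):
    """Run a (possibly partial) deterministic FSA on the sequence w."""
    state = start
    for symbol in w:
        if (state, symbol) not in delta:   # no arrow for this input: reject
            return False
        state = delta[(state, symbol)]
    return state in finals                 # accept only if we end in a final state

# Machine from slides 11-12: K = {q0, q1}, initial state q0, F = {q1}
delta = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}

for w in ["aab", "b", "", "aba"]:
    print(repr(w), accepts(delta, "q0", {"q1"}, w))
```

Treating a missing arrow as rejection is one common convention; the dead-state construction later in the lecture makes the same behavior explicit.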
15
Definition of Finite State Language
A language L is a finite state language (fsl) or a regular language if there is an FSA M such that the language of M is L.
16
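The Language Accepted by our FSA?
L = any number of a’s (including none) followed by at least one b
= {a^n b^m | n ≥ 0, m ≥ 1}
= a* b b* (* here means any number of repetitions including none)
[Figure: FSA with states q0 and q1; arrows from q0 to q0 on a, from q0 to q1 on b, and from q1 to q1 on b]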
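One way to gain confidence that the machine and the expression a* b b* describe the same language is to compare them on every short string. A sketch assuming Python's `re` module; the function names are illustrative, and the check covers only strings up to a chosen length, so it is evidence rather than a proof:

```python
import re
from itertools import product

DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}

def fsa_accepts(w):
    state = "q0"
    for symbol in w:
        state = DELTA.get((state, symbol))
        if state is None:          # no arrow: reject
            return False
    return state == "q1"

def regex_accepts(w):
    # fullmatch requires the whole string to match a* b b*
    return re.fullmatch(r"a*bb*", w) is not None

def disagreements(max_len):
    """Strings of length <= max_len on which the FSA and the regex differ."""
    out = []
    for n in range(max_len + 1):
        for t in product("ab", repeat=n):
            w = "".join(t)
            if fsa_accepts(w) != regex_accepts(w):
                out.append(w)
    return out

print(disagreements(8))
```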
17
Let’s Do an Example!
Let Σ = {0, 1}.
Let L = the set of all strings of 0’s and 1’s that contain exactly two 1’s.
Show that L is a finite state language!
First step: L is a finite state language if….
Second step: Define such an M
(For other examples, see Exercise 3, Ch. 17)
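One possible M for the second step, sketched in Python. The choice of states here (the number of 1's seen so far, capped at 3 for "too many") is an assumption of this sketch, not necessarily the machine intended in class:

```python
def exactly_two_ones(w):
    """Simulate a 4-state FSA over {0, 1}; state = number of 1's seen, capped at 3."""
    # Reading 0 never changes the state; reading 1 moves one state to the right.
    delta = {(s, "0"): s for s in range(4)}
    delta.update({(s, "1"): min(s + 1, 3) for s in range(4)})
    state = 0                       # initial state: no 1's seen yet
    for symbol in w:
        state = delta[(state, symbol)]
    return state == 2               # the only final state

for w in ["0110", "11", "1", "111", ""]:
    print(repr(w), exactly_two_ones(w))
```

State 3 can never be left, so it behaves like the dead state introduced on the following slides.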
18
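But δ Isn’t a Function in Our Example!
So far: Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → ?
[Figure: FSA with states q0 and q1; arrows from q0 to q0 on a, from q0 to q1 on b, and from q1 to q1 on b]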
19
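But δ Isn’t a Function in Our Example!
So far: Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → ?
Needed: a dead state
[Figure: FSA with states q0 and q1; arrows from q0 to q0 on a, from q0 to q1 on b, and from q1 to q1 on b]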
20
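A Fully Specified FSA with Dead States
Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → q2
q2, a → q2
q2, b → q2
[Figure: FSA with states q0, q1 and dead state q2; arrows from q0 to q0 on a, from q0 to q1 on b, from q1 to q1 on b, from q1 to q2 on a, and from q2 to q2 on a and on b]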
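Adding the dead state can be done mechanically: every missing (state, input) pair is sent to a new state that loops on every input. A sketch in Python; the function name `complete` and the default dead-state name are choices made here, and the sketch assumes the dead-state name is not already in use:

```python
def complete(states, alphabet, delta, dead="q2"):
    """Return (states, delta) of a fully specified FSA, adding `dead` if needed."""
    full = dict(delta)
    missing = [(q, a) for q in states for a in alphabet if (q, a) not in delta]
    if not missing:                 # already fully specified
        return set(states), full
    for key in missing:
        full[key] = dead            # every missing arrow goes to the dead state
    for a in alphabet:
        full[(dead, a)] = dead      # the dead state can never be left
    return set(states) | {dead}, full

partial = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}
states, full = complete({"q0", "q1"}, "ab", partial)
print(sorted(full.items()))
```

On the two-state machine this reproduces the six transitions listed on the slide.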
21
A Taste of FSA Algebra: Complements
Definition: The complement of L = Σ* - L
i.e. the set of strings in Σ* not contained in L
To find the complement of an FSL L:
1. Find a fully specified FSA M such that L = L(M)
2. Switch the final and non-final states!
So the complements of FSLs are FSLs!
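The two steps can be tried out on the fully specified machine with dead state q2. A sketch in Python, with hypothetical names; it checks on all short strings that each string is accepted by exactly one of the two machines:

```python
from itertools import product

# Fully specified machine from slide 20 (q2 is the dead state)
K = {"q0", "q1", "q2"}
DELTA = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q2", ("q1", "b"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2",
}
F = {"q1"}
F_COMPLEMENT = K - F          # step 2: switch final and non-final states

def accepts(w, finals):
    state = "q0"
    for symbol in w:
        state = DELTA[(state, symbol)]   # always defined: M is fully specified
    return state in finals

def check(max_len):
    """True if every string up to max_len is in exactly one of L and its complement."""
    return all(
        accepts("".join(t), F) != accepts("".join(t), F_COMPLEMENT)
        for n in range(max_len + 1)
        for t in product("ab", repeat=n)
    )

print(check(6), accepts("ba", F_COMPLEMENT), accepts("ab", F_COMPLEMENT))
```

The construction relies on δ being defined for every state and input; on the partial machine, strings that get stuck would be rejected both before and after the switch.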
22
Complements: An Example
Let Σ = {a, b}.
Let L = {a^n b^m | n ≥ 0, m ≥ 1} (our old friend)
What’s the complement of L??
23
1. Find … FSA M such that L = L(M)
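Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → q2
q2, a → q2
q2, b → q2
[Figure: FSA with states q0, q1 and dead state q2; arrows from q0 to q0 on a, from q0 to q1 on b, from q1 to q1 on b, from q1 to q2 on a, and from q2 to q2 on a and on b]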
24
2. Switch Final and Non-final States
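Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → q2
q2, a → q2
q2, b → q2
[Figure: the same FSA with final and non-final states switched, so that q0 and q2 are final and q1 is not]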
25
Definition: A Deterministic FSA
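Σ = {a, b}
q0, a → q0
q0, b → q1
q1, b → q1
q1, a → q2
q2, a → q2
q2, b → q2
[Figure: FSA with states q0, q1 and dead state q2; arrows from q0 to q0 on a, from q0 to q1 on b, from q1 to q1 on b, from q1 to q2 on a, and from q2 to q2 on a and on b]
An FSA M is deterministic if δ is a function, i.e. for each state and each input there is exactly one new state. The FSAs we have considered since slide 11 are all deterministic FSAs (DFAs).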
26
Non-deterministic Finite Automata
In a non-deterministic FSA (NFA), the transition relation Δ allows any number of new states for each state and each input.
We will also allow transitions on no input (i.e., on the null string).
A string w is accepted by a non-deterministic FSA if there is at least one state sequence (starting with the initial state) that reads all of w and reaches one of the final states.
(Notation: Δ is the upper case version of δ)
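"At least one state sequence" can be checked without trying sequences one by one, by tracking the set of all states the machine could be in. A sketch in Python; the encoding of Δ as a dictionary of sets, the use of "" for moves on no input, and the small noun-phrase machine below are all assumptions made for illustration:

```python
def eps_closure(states, moves):
    """All states reachable from `states` using only moves on the null string."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in moves.get((q, ""), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_accepts(moves, start, finals, w):
    current = eps_closure({start}, moves)          # every state we could be in
    for symbol in w:
        nxt = {r for q in current for r in moves.get((q, symbol), ())}
        current = eps_closure(nxt, moves)
    return bool(current & set(finals))             # some sequence ends in a final state

# Hypothetical NFA: DET N, DET ADJ, or N alone (via a null move from q0 to q1)
MOVES = {
    ("q0", "DET"): {"q1", "q3"},
    ("q0", ""): {"q1"},
    ("q1", "N"): {"q2"},
    ("q3", "ADJ"): {"q4"},
}
FINALS = {"q2", "q4"}

for w in (["DET", "N"], ["DET", "ADJ"], ["N"], ["DET"]):
    print(w, nfa_accepts(MOVES, "q0", FINALS, w))
```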
27
A Non-deterministic FSA for English…
Simple noun phrases of English containing
• a determiner (DET) followed by a noun (N): the cat
• DET followed by an adjective (ADJ): the poor
• N only: peanuts
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ]
28
A Surprise!
While NFAs are often convenient to use, it turns out that:
For every non-deterministic FSA M there is a simple algorithm which will construct an FSA M’ such that
• M’ is deterministic and
• L(M’) = L(M)
(If M’ accepts exactly the strings accepted by M, we say M’ is equivalent to M.)
So NFAs are no more powerful than DFAs!!
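The slides do not name the algorithm; the standard one is the subset construction, in which each state of M’ is a set of states of M. A sketch in Python, assuming for brevity an NFA without moves on the null string, with illustrative names and a hypothetical noun-phrase NFA:

```python
def determinize(moves, start, finals, alphabet):
    """Subset construction: returns (delta, start_state, final_states) of a DFA."""
    start_set = frozenset({start})
    delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        S = todo.pop()
        for a in alphabet:
            # T = every NFA state reachable from some state in S on input a
            T = frozenset(r for q in S for r in moves.get((q, a), ()))
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & set(finals)}
    return delta, start_set, dfa_finals

def dfa_accepts(delta, start, finals, w):
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state in finals

# Hypothetical NFA for "DET N" and "DET ADJ": q0 has two arrows on DET
MOVES = {("q0", "DET"): {"q1", "q3"}, ("q1", "N"): {"q2"}, ("q3", "ADJ"): {"q4"}}
delta, s0, F = determinize(MOVES, "q0", {"q2", "q4"}, ["DET", "N", "ADJ"])
print(len({S for (S, _) in delta}), "DFA states")
```

The empty set of NFA states plays the role of the dead state, so the resulting δ is fully specified.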
29
An Equivalent NFA and DFA
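An NFA:
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ]
An equivalent DFA:
[Figure: DFA diagram with states q’0, q’1, q’2, q’4, q’5 and arrows labeled DET, N, N, ADJ]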
30
Back to Noun Phrases: ADJs
How can we add optional adjectives before the noun?
the black cat, the beautiful black cat
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ]
31
More About Noun Phrases: ADJs
To add optional adjectives before the noun, add to Δ: q1, ADJ → q1
the black cat, the beautiful black cat
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ, plus a new ADJ loop on q1]
32
More About Noun Phrases: ADVs
How can we add optional adverbs (ADV) on adjectives?
the very old, the very very old
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ, ADJ]
33
More About Noun Phrases: ADVs
To add optional adverbs (ADV) on adjectives, add to Δ: q3, ADV → q3
the very old, the very very old
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ, ADJ, plus a new ADV loop on q3]
34
Bug! What about the very old cat??
We need to allow optional ADVs before the ADJ from q1 as well…
[Figure: NFA diagram with states q0 to q4 and arrows labeled DET, N, DET, ADJ, ADJ, ADV]
35
Consistently adding ADVs before ADJs
Why did we need to add an extra state, q5??
[Figure: NFA diagram with states q0 to q5 and arrows labeled DET, N, DET, ADJ, ADV, ADV, ADJ]
36
Adding Prepositional Phrase Modifiers
A prepositional phrase (PP) consists of
• a preposition (P) like in, on, above, for, near
• followed by a noun phrase
on the dirty old mat
in the very old box
on the mantle
for the very poor
PPs can also modify NPs
the black cat on the dirty old mat
the very old box on the mantle
37
Extending our NFA for PP modifiers
The cat
The cat on the mat
The cat on the mat by the door in the back
[Figure: NFA diagram with states q0 to q5 and arrows labeled DET, N, DET, ADJ, ADV, ADV, ADJ, plus two arrows labeled P]
38
An NFA for Simple Sentences
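The dog chased the cat
The young admire the old
Foxes eat chickens
Looks promising…..
[Figure: NFA diagram with states q0 to q9; a subject NP part (q0 to q4) and an object NP part (q5 to q9), each with arrows labeled DET, N, DET, ADJ, joined by two arrows labeled V]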
39
An NFA for Less Simple Sentences
The very old man watched young brown puppies
The very very poor want a good education
The young puppies in the old brown box watched the cat in the corner
Looks even better….. BUT….
[Figure: NFA diagram with states q0 to q11; a subject NP part (q0 to q5) and an object NP part (q6 to q11), each with arrows labeled DET, N, DET, ADJ, ADV, ADV, ADJ, joined by two arrows labeled V]
40
Bug: The NFA “loses generalizations”
For these sentences, the Subject NP states and Object NP states are duplicates…
What would an FSA for “NP gave NP to NP” look like?
The FSA model loses the generalization that NPs are NPs are NPs.
[Figure: NFA diagram with states q0 to q11; a subject NP part (q0 to q5) and an object NP part (q6 to q11), each with arrows labeled DET, N, DET, ADJ, ADV, ADV, ADJ, joined by two arrows labeled V]
41
Another FSA Bug: (17.3.2)
More serious trouble:
The cat died. (NP V)
The cat the dog chased died. (NP NP V V)
The cat the dog the rat bit chased died. (NP^3 V^3)
The cat the dog the rat the elephant admired bit chased died. (NP^4 V^4)
These are all of the form NP^n V^n
FSAs can’t generate these, as we’ll see next…
42
A Language That Is Not an FSL
Consider L = {a^n b^n | n ≥ 1}
i.e. L consists of all strings where
• There are an equal number of a’s and b’s,
• All a’s precede all b’s.
L is not an fsl.
FSAs cannot count up to an arbitrary number!
So English, with its sentences of the form NP^n V^n, isn’t an fsl!!
43
The Pumping Lemma for fsl’s (17.2.1)
If L is fsl (regular) then for all sufficiently long strings w ∈ L we have the following property:
• w = x u y, i.e. w can be segmented into three parts, which we’ll call x, u, and y, where u is not empty
• all strings of the form x u^i y ∈ L (where u^i means i copies of u, i ≥ 0)
[Figure: states q0, q1, q3 with arrows labeled x and y, and a loop labeled u]
The loop involving u may include several states.
44
Showing L = {a^n b^n | n ≥ 1} is not an fsl:
The Pumping Lemma:
If L is fsl then for all sufficiently long strings w ∈ L:
• w = x u y
• all strings of the form x u^i y ∈ L.
1. Try locating the u segment in various places in the string a a ... b b ...
2. In each case the string obtained by iterating u is not in L.
3. Hence, L is not an fsl.
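Steps 1 and 2 can be tried exhaustively on a sample string. The sketch below, in Python with hypothetical function names, tests every way of cutting w into x u y (u non-empty) and keeps the cuts for which x u^i y stays in the language for i = 0, 1, 2, 3. It illustrates the argument on particular strings and is not a proof:

```python
def in_anbn(w):
    """Membership in {a^n b^n | n >= 1}."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n

def in_astar_bplus(w):
    """Membership in a* b b*, a finite state language, for contrast."""
    rest = w.lstrip("a")
    return len(rest) >= 1 and set(rest) == {"b"}

def pumpable_splits(w, member, max_i=3):
    """All (x, u, y) with w = x u y, u non-empty, and x u^i y in L for i = 0..max_i."""
    good = []
    for start in range(len(w)):
        for end in range(start + 1, len(w) + 1):
            x, u, y = w[:start], w[start:end], w[end:]
            if all(member(x + u * i + y) for i in range(max_i + 1)):
                good.append((x, u, y))
    return good

print(pumpable_splits("aaabbb", in_anbn))        # no choice of u survives pumping
print(pumpable_splits("aab", in_astar_bplus))    # here some choices of u do
```

For a^n b^n, a u made only of a's or only of b's unbalances the counts, and a u that straddles the middle puts a b before an a when repeated.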
45
Characterizing fsl’s Using Grammars
Finite State Grammars aka Type 3 Grammars aka Right Linear Grammars
The languages generated by right linear grammars are exactly the languages accepted by FSAs.
46
An Informal Intro to FSGs
S → John A
A → likes B
B → roasted C
C → peanuts
Derivation:
S ⇒ John A ⇒ John likes B ⇒ John likes roasted C ⇒ John likes roasted peanuts
(A = VP, B = NP, C = N)
47
Finite State Grammars
A finite state grammar G = (VT, VN, S, R) consists of
• VT the terminal vocabulary
• VN the non-terminal vocabulary
• S the start symbol, S ∈ VN
• R a finite set of rewrite rules (productions)
The rewrite rules are of the following form
A → a B where A, B ∈ VN and a ∈ VT
A → a
48
An Example FSG
G = (VT, VN, S, R) where
• VT = {John, roasted, peanuts, likes}
• VN = {S, A, B, C}
• R = {
  S → John A
  A → likes B
  B → roasted C
  C → peanuts }
49
Derivations
Derivation starts with S.
Since the right hand side of a rule has at most one non-terminal, there is only one non-terminal (if any) that can be rewritten at each step.
Derivation stops when there are no more non-terminals to be rewritten.
L(G) = language derived by G = set of all strings of terminals derived in G starting from S.
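Because at most one non-terminal is present at each step, a derivation can be followed with a very small program. A sketch in Python; the rule encoding (a terminal paired with the next non-terminal, or `None` for rules of the form A → a) and the length bound are choices made here for illustration:

```python
# Rules of the example FSG: non-terminal -> list of (terminal, next non-terminal)
RULES = {
    "S": [("John", "A")],
    "A": [("likes", "B")],
    "B": [("roasted", "C")],
    "C": [("peanuts", None)],      # C -> peanuts ends the derivation
}

def derive(rules, symbol="S", prefix=(), max_len=10):
    """Yield every terminal string of at most max_len words derivable from symbol."""
    if len(prefix) >= max_len:     # bound the search so recursive rules terminate
        return
    for terminal, nonterminal in rules.get(symbol, []):
        words = prefix + (terminal,)
        if nonterminal is None:
            yield " ".join(words)  # no non-terminal left: derivation stops
        else:
            yield from derive(rules, nonterminal, words, max_len)

print(list(derive(RULES)))
```

With a recursive rule such as S → a S alongside S → b, the same function lists a growing family of strings, which is why the `max_len` bound is there.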
50
A Derivation in our Example FSG
G = (VT, VN, S, R) where
• VT = {John, roasted, peanuts, likes}
• VN = {S, A, B, C}
• R = {
  S → John A
  A → likes B
  B → roasted C
  C → peanuts }
Derivation:
S ⇒ John A ⇒ John likes B ⇒ John likes roasted C ⇒ John likes roasted peanuts
(A = VP, B = NP, C = N)
51
The Equivalence of FSGs and FSAs
We can construct an FSG G given an FSA M:
1. Treat the states of M as the non-terminals (treat K as VN).
2. Treat the vocabulary of M as the terminals (treat Σ as VT).
3. For a transition from state A to state B on input symbol a, create a rule A → a B.
4. For a transition from state A to a final state of M on the input symbol a, also create the rule A → a.
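Steps 3 and 4 can be applied mechanically to a transition table. A sketch in Python on the two-state machine from slides 11 and 12; the rule encoding and function name are illustrative, and the sketch assumes the initial state q0 serves as the start symbol, which the steps above do not state explicitly:

```python
def fsa_to_fsg(delta, finals):
    """Return rewrite rules (A, a, B) meaning A -> a B, or (A, a, None) meaning A -> a."""
    rules = []
    for (A, a), B in sorted(delta.items()):
        rules.append((A, a, B))            # step 3: A -> a B
        if B in finals:
            rules.append((A, a, None))     # step 4: A -> a when B is final
    return rules

# Machine from slides 11-12: initial state q0 (assumed start symbol), F = {q1}
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}

for A, a, B in fsa_to_fsg(DELTA, {"q1"}):
    print(A, "->", a, B if B is not None else "")
```

The resulting grammar has the rules q0 → a q0, q0 → b q1, q0 → b, q1 → b q1, q1 → b, which derive any number of a's followed by at least one b.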
52
More to come….