
Notes on computational phonology

E. Stabler

UCLA, Spring 1999

Contents

1 Preface 5

2 Finite recognizers of languages 6

3 Some early proposals 19

4 Using non-deterministic machines 34

5 One level phonology 41

6 Optimality theory: first ideas 62

7 OTP: Primitive optimality theory 76

8 Lenient compositions: the proper treatment of OT? 86

9 Acquisition models 91

10 Exercises and speculations 113


A web page of readings:

236: some readings

the beauty of finite state machines and related topics

Yu 1997 Regular languages. In Rozenberg & Salomaa, eds., Handbook of Formal Languages, Volume 1. Perrin 1990 Finite automata. In J. van Leeuwen, Handbook of Theoretical Computer Science, Volume B. Salomaa 1973 Formal Languages. Sec 5. Hopcroft and Ullman 1979 Introduction to Automata Theory, Languages, and Computation. Secs 2, 3. Watson 1994 A taxonomy of finite automata minimization algorithms (pdf) Mohri 1997 Finite state transducers in language and speech processing (pdf) CL 23: 269-312.

Dijkstra's algorithm: dynamic programming for best paths Cormen, Leiserson & Rivest 1991 Single source shortest paths. Introduction to Algorithms. Dijkstra's algorithm demo another Dijkstra's algorithm demo

Other beautiful things: Berstel 1979 Transductions and context-free languages. Roche & Schabes 1997 Introduction. (pdf) Finite State Language Processing. Géczeg & Steinby 1997 Tree languages. In Rozenberg & Salomaa, eds., Handbook of Formal Languages, Volume 3. Béal & Perrin 1997 Symbolic dynamics and finite automata (pdf) In Rozenberg & Salomaa, eds., Handbook of Formal Languages, Volume 2.

phonology: models

multi-stratal language models Kaplan & Kay 1994 Regular models of phonological rule systems. Computational Linguistics, 20: 331-378. Karttunen 1991 Finite state constraints Karttunen 1997 The proper treatment of optimality in computational phonology (pdf) Karttunen, Kaplan, Zaenen, 1992 Two-level morphology with composition Karttunen 1992 Two level rule compiler Kiraz & Grimley-Evans 1997 Multi-Tape Automata for Speech and Language Systems: A Prolog Implementation. (pdf) In D. Wood & S. Yu (eds.), Automata Implementation, Lecture Notes in Computer Science 1436, Springer, 1998.

Bird, Ellison:


Bird, Coleman, Pierrehumbert & Scobbie 1992 Declarative phonology (pdf) Bird & Ellison 1994 One-level phonology: autosegmental representations and rules as finite automata. (pdf) Ellison 1994 Constraints, exceptions, and representations. Procs ACL SIGPHON First Meeting (pdf) Ellison 1994 Phonological derivation in optimality theory (pdf) Coling 94: 1007-1013 (Vol II)

Eisner et al: Eisner 1997 What constraints should OT allow? (pdf) LSA handout Eisner 1997 FootForm decomposed (pdf) Eisner 1997 Efficient generation in primitive optimality theory (pdf) Albro 1998 Three formal extensions to primitive optimality theory (pdf)

Smolensky & Tesar: Smolensky 1996 On the production/comprehension dilemma in child language (pdf) LI 27: 720-731. Smolensky 1996 The initial state and 'richness of the base' in optimality theory (pdf) Tesar 1998 Robust Interpretive Parsing in Metrical Stress Theory (pdf)

Hale & Reiss 1998 Formal and empirical arguments concerning phonological acquisition (pdf) Linguistic Inquiry 29(4): 656-683 Frank & Satta 1997 Optimality theory and the generative complexity of constraint violability (pdf) Walther 1996 OT SIMPLE (pdf)

HMMs and weighted automata

Pereira & Riley 1996 Speech recognition by composition of weighted finite automata (pdf) Pereira & Saul 1996: Aggregate and mixed order Markov models for statistical language processing (pdf)

acquisition

Ellison 1997 Simplicity, Psychological Plausibility and Connectionism in Language Acquisition (pdf) Ellison 1996 The universal constraint set: convention not fact Ellison 1994 The iterative learning of phonological rules (pdf) CL 20(3)

de Marcken 1996 Linguistic structure as composition and perturbation (pdf) de Marcken 1995 The unsupervised acquisition of a lexicon from continuous speech (pdf)

Tesar 1997 Multi-Recursive Constraint Demotion (pdf) Tesar & Smolensky 1996 Learnability in Optimality Theory (long version) (pdf)

Daelemans, Berck, Gillis 1996 Unsupervised discovery of phonological categories through supervised learning of morphological rules. (pdf) COLING

Vitanyi & Li 1997 On prediction by data compression (pdf) Vitanyi & Li 1997 Minimum description length induction, Bayesianism, and Kolmogorov complexity (pdf) Vitanyi & Li 1997 Ideal MDL and its relation to Bayesianism (pdf)

Grünwald 1996 A Minimum Description Length Approach to Grammar Inference (pdf) Grünwald 1996 The Minimum Description Length Principle and Non-Deductive Inference (pdf)

Vapnik 1998 Statistical Learning Theory.

more links

tools AT&T FSM Library van Noord's FSA utilities Graphviz (dot) Sicstus manual (local copy) Church's unix text processing for poets

research papers, centers SIGPHON Edinburgh computational phonology archive Edinburgh computational phonology library (local mirror) Rutgers Optimality Archive - Home XRCE: Finite-State Home Page Haskins Gestural Model page JHU: Acoustic-phonetic feature detectors

more Church 1987 Phonological parsing in speech recognition. Kluwer. Carson-Berndsen 1998 Time map phonology. Kluwer. Boersma 1998 Functional phonology Kornai 1996 Vectorized finite state automata Kornai 1993 Relating phonetic and phonological categories Karttunen 1994 Constructing lexical transducers Apostolico 1997 String editing. In Rozenberg & Salomaa, eds., Handbook of Formal Languages, Volume 2. Pereira & Wright 1996 Finite state approximation of phrase structure grammars Johnson 1997 FS approximations of constraint grammars

Edward Stabler. Last modified: Fri May 21 15:51:41 PDT 1999


1 Preface

These notes were prepared for a UCLA seminar on computational proposals in recent phonology. Very many corrections and contributions were made by the seminar participants, especially Adam Albright, Dan Albro, Marco Baroni, Leston Buell, Bruce Hayes, Gianluca Storto, Siri Tuttle. Thanks also to Ed Keenan for some corrections and suggestions. The notes are still rough (many typos are left, I'm sure). I hope to improve them! They are intended to be an accompaniment to the literature, not a replacement; they presuppose an acquaintance with the original sources that are discussed.

One of the main traditions in computational phonology is based on finite state models of phonological constraints. This is perhaps surprising, since finite state models, at least at first blush, seem to be too strong and too weak. They seem too strong because phonological relations seem to be local for the most part, in a way that dependencies in finite state languages are not. (For example, it is easy to define a finite state language with strings that have either a single a or b followed by any number of c's, followed by a repetition of the first symbol: (ac∗a) ∪ (bc∗b). The final symbol can depend on a symbol that occurred arbitrarily far back in the sequence.) And on the other hand, finite state models are too weak in the sense that some phenomena exhibit dependencies of a kind that cannot be captured by these devices: notably, reduplication. These issues come up repeatedly in these notes.

These notes go slightly beyond what is already in the literature in only a couple of places. We are perhaps clearer about the one-level/two-level distinction in §§5.2, 7.3 than the literature has been. And rather than restricting attention to finite state compositions as is sometimes done, we take the perhaps less practical but scientifically more promising route of emphasizing the prospects for composing finite state models with the grammars of larger abstract families of languages in §§7.4, 10.2.

Formal, computational models are important in linguistics for two main reasons. First, the project of making our vague ideas about language elegant and fully formal is a useful one. It improves our understanding of the real claims of the grammar, and it enables careful comparisons of competing ideas. Second, the best models we have of human language acquisition and use are computational. That is, they regard people using language as going through some changes which can be modeled as formal derivations. The idea that the relevant changes in language learning and language use are derivations of some kind is an empirical hypothesis which may well be false, but it is the best one we have. In my view, the main project of linguistic theory is to provide this computational account.

Since theoretical linguistics provides formal generative models of language, it implicitly treats human language learners and language users as computers. The existence of the artifacts we usually call computers is really beside the point. Computers are useful in the development of linguistics in just the way that they are useful in physics or biology: they sometimes facilitate calculations. These calculations are not the reason that our pursuit is called computational. The reason the subject at hand is called computational phonology is that we adopt the programmatic hypothesis that the abilities we are modeling are computational.

That said, the work reported in these notes would have been infeasible without the help of various pieces of software. I would like to gratefully acknowledge in particular the AT&T finite state tools (Mohri et al. 1998), Albro's OTP package (Albro 1997, 1998), and the AT&T GraphViz 1.4 graphing tools (Ellson, Gansner, Koutsofios, North).


2 Finite recognizers of languages

Finite systems, systems that can only have finitely many (computationally relevant) states, can recognize infinite languages, but only if, in recognizing any string, only a finite amount of information needs to be remembered at each point. They play an important role in recent computational phonology.

As we will see in §2.2.3 below, a language can be recognized with finite memory iff it can be defined with a rewrite grammar in which all the rules have one of the following forms:

C → ε (where C is any category and ε is the empty sequence)
C → a D (where C, D are any categories and a is any (terminal) vocabulary element)

For example, the following grammar, which defines {a, b}∗, has this form:

S → ε
S → aS
S → bS

And the following grammar defines (ab)∗:

S → ε
S → aB
B → bS

These grammars branch only to the right. (It turns out that languages defined by these grammars can also be defined with grammars that branch only to the left.)

2.1 A simple representation of finite machines

Grammars of the form shown above can be regarded as specifications of finite machines that can recognize (or generate) the language defined by the grammar. We just think of the categories as states, the non-empty productions as rules for going from one state to another, and the empty productions specify the final states. The machine corresponding to the grammar above can be represented by the following graph, where the initial states are indicated by a bold circle and the final states are indicated by the double circles:

[Figure: a two-state machine with states S and B: an a-arc from S to B and a b-arc from B to S; S is both initial and final.]

This kind of machine is usually formalized with the following 5 parts. (Here we follow the standard presentation of Perrin (1990) fairly closely.)

Definition 1 A finite automaton A = 〈Q, Σ, δ, I, F〉 where
Q is a finite set of states (≠ ∅);
Σ is a finite set of symbols (≠ ∅);
δ ⊆ Q × Σ × Q;
I ⊆ Q, the initial states;
F ⊆ Q, the final states.


Definition 2 A path is a sequence c = (qi, ai, qi+1), 1 ≤ i ≤ n, of transitions in δ. In any such path, q1 is its origin, qn+1 its end, the sequence a1a2 . . . an is its label, and n is its length. We add the case of a length 0 path from each state to itself, labeled by the empty string. To indicate that there is a path from q1 to qn+1 labeled with a sequence a1a2 . . . an we will sometimes write (q1, a1a2 . . . an, qn+1) ∈ δ.

NB: We have defined finite automata in such a way that every transition is labeled with an alphabet symbol. Since there is a 0-step path labeled ε going from every state to itself, to define a language that contains ε, we simply let F ∩ I ≠ ∅.

We could allow ε to label paths that change state, with only a slight change in our definitions. For any set S, let Sε = S ∪ {ε}. Then we revise our definition of finite automata just by letting δ ⊆ Q × Σε × Q. Given such an automaton, the ε transitions can be eliminated without changing the language accepted just by equating all states that are related by ε transitions.

Definition 3 A path is successful if its origin is in I and its end is in F. The language L(A) accepted by the automaton A is the set of labels of successful paths.
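To make Definitions 1–3 concrete, here is a minimal sketch in Python (not part of the original notes); the triple representation of δ and the name accepts are just illustrative choices, and the example machine is the (ab)∗ acceptor shown above.

```python
# Sketch of Definitions 1-3: a finite automaton as a 5-tuple, with acceptance
# checked by tracking the set of states reachable on each prefix of the string.

def accepts(Q, Sigma, delta, I, F, string):
    """True iff some successful path (Definition 3) is labeled by `string`."""
    states = set(I)                       # states reachable on the empty prefix
    for a in string:
        states = {r for (q, b, r) in delta if q in states and b == a}
    return bool(states & F)               # some path ended in a final state

# the (ab)* machine: an a-arc from S to B, a b-arc from B to S; S initial and final
Q = {"S", "B"}
Sigma = {"a", "b"}
delta = {("S", "a", "B"), ("B", "b", "S")}
I, F = {"S"}, {"S"}

print(accepts(Q, Sigma, delta, I, F, ""))      # True: an initial state is final
print(accepts(Q, Sigma, delta, I, F, "abab"))  # True
print(accepts(Q, Sigma, delta, I, F, "aba"))   # False
```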

2.2 Some basic results about finite machines

Definition 4 A language L ⊆ Σ∗ is regular (finite state, recognizable) iff for some finite automaton A, L = L(A).

Clearly, every finite language is regular. Given a set like {abc, abd, acd} we can construct a trivial finite automaton like this:

[Figure: a trivial acceptor for {abc, abd, acd}, with a separate chain of states spelling out each of the three strings from the single initial state.]

For any finite language L we can define an acceptor like this. This acceptor is obviously not "minimal" – that is, it has more states than necessary. One simple step for reducing states involves sharing common prefixes.

Definition 5 We define the prefixes of L, Pr(L) = {u | for some v, uv ∈ L}.

Definition 6 For finite L, the prefix tree acceptor for L, PT(L) = 〈Q, Σ, δ, I, F〉 where

Q = Pr(L);
Σ is a finite set of symbols (≠ ∅);
(w, σ, wσ) ∈ δ iff w, wσ ∈ Q;
I = {ε};
F = L.
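A small sketch of Definitions 5 and 6 (illustrative Python, not from the notes); the helper names prefixes and prefix_tree_acceptor are just my choices, and the states are represented by the prefix strings themselves.

```python
# Sketch of the prefix tree acceptor PT(L) of Definition 6:
# states are the prefixes of L, and transitions extend a prefix by one symbol.

def prefixes(L):
    return {w[:i] for w in L for i in range(len(w) + 1)}

def prefix_tree_acceptor(L):
    Q = prefixes(L)
    Sigma = {a for w in L for a in w}
    delta = {(w, a, w + a) for w in Q for a in Sigma if w + a in Q}
    I = {""}                  # the empty prefix
    F = set(L)
    return Q, Sigma, delta, I, F

Q, Sigma, delta, I, F = prefix_tree_acceptor({"abc", "abd", "acd"})
print(sorted(Q))   # '', 'a', 'ab', 'abc', 'abd', 'ac', 'acd'
print(len(Q))      # 7 states, fewer than the trivial acceptor above
```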


Example. PT({abc, abd, acd}) is smaller than the acceptor shown above, but accepts exactly the same language:

[Figure: the prefix tree acceptor whose states are the prefixes ε, a, ab, ac, abc, abd, acd, with arcs labeled a, b, c, d extending each prefix by one symbol.]

2.2.1 Deterministic finite machines

Definition 7 A finite automaton is complete iff for every q ∈ Q, a ∈ Σ there is at least one q′ ∈ Q such that (q, a, q′) ∈ δ.

For any automaton there is a complete automaton that accepts the same language. We simply add arcs that go to a "dead" state – a state from which there is no path to a final state.

For example, PT({abc, abd, acd}) is not complete, but the following automaton is, and accepts the same language:

[Figure: a complete automaton accepting {abc, abd, acd}: the prefix tree acceptor together with a dead state x, with every missing arc for a, b, c, and d redirected to x.]

Definition 8 A deterministic finite automaton (DFA) is a finite automaton where δ is a function δ : (Q × Σ) → Q and I has at most one element. When a deterministic automaton has a path from p1 to pn labeled by a1a2 . . . an we will sometimes write δ(p1, a1a2 . . . an) = pn.

(A DFA can be represented by a Q× Σ matrix.)

Theorem 1 (Myhill) A language is accepted by a DFA iff it is accepted by a finite automaton.

We use P(S) to indicate the powerset of S, that is, the set of all subsets of S. The powerset of a set S is sometimes also represented by 2^S, but we will use P(S). (Note, for example, that 2^n in the theorem just below refers to a number, not to a set of sets.)


Proof: Given NFA A = 〈Q, Σ, δ, I, F〉, define the DFA = 〈P(Q), Σ, δ′, {I}, {s ∈ P(Q) | s ∩ F ≠ ∅}〉 where

(q′i, a, q′j) ∈ δ′ iff q′j = {qj | (qi, a, qj) ∈ δ for some qi ∈ q′i}.

The proof that this DFA is equivalent is an easy induction: see for example Hopcroft and Ullman (1979, Thm 2.1) or Lewis and Papadimitriou (1981, Thm 2.3.1)
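The subset construction can be sketched as follows (illustrative Python, not from the notes); only the reachable subsets are built, and the example machine is one reading of the nondeterministic x/a/b automaton shown just below.

```python
# Sketch of the subset construction: states of the DFA are sets of NFA states,
# stored as frozensets; only subsets reachable from the start are constructed.

def determinize(Q, Sigma, delta, I, F):
    start = frozenset(I)
    dfa_delta = {}                         # (subset, symbol) -> subset
    todo, seen = [start], {start}
    while todo:
        S = todo.pop()
        for a in Sigma:
            T = frozenset(r for (q, b, r) in delta if q in S and b == a)
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    finals = {S for S in seen if S & F}    # subsets containing a final state
    return seen, Sigma, dfa_delta, start, finals

# one reading of the nondeterministic example below: 0 has x-arcs to 1 and 2
Q = {0, 1, 2, 3}
Sigma = {"x", "a", "b"}
delta = {(0, "x", 1), (0, "x", 2), (1, "x", 1), (2, "x", 2),
         (1, "a", 3), (2, "b", 3)}
states, _, d, q0, Fd = determinize(Q, Sigma, delta, {0}, {3})
print(len(states))   # 4: {0}, {1,2}, {3}, and the empty "dead" subset
```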

Example. The automata shown above are all deterministic. The following automaton is not:

[Figure: a nondeterministic automaton with states 0–3: from state 0 there are x-arcs to both 1 and 2, there are further x-arcs among 1 and 2, and the final state 3 is reached on a and on b.]

We can use the "subset construction" of the previous theorem to make this machine deterministic:

[Figure: the determinized machine: 0 goes to the subset-state {1, 2} on x, which loops on x and goes to 3 on a and on b.]

This machine is smaller than the original, but in fact a deterministic machine can be much larger than an equivalent nondeterministic one. Perrin (1990, p30) considers as an example {a, b}∗a{a, b}^n. When n = 2 this language is accepted by the following 4 state nondeterministic automaton:

[Figure: a 4-state nondeterministic automaton: state 0 loops on a and b, an a-arc leads from 0 to 1, and arcs on a and b lead from 1 through 2 to the final state 3.]

The corresponding deterministic automaton is this one:

[Figure: the corresponding deterministic automaton, with 8 states.]

Adding one state to the nondeterministic automaton, we find that its minimal deterministic equivalent doubles in size:

[Figures: the 5-state nondeterministic automaton for {a, b}∗a{a, b}³ and its minimal deterministic equivalent, which has 16 states.]

Theorem 2 There are n-state automata A such that the smallest DFA accepting L(A) has at least 2^n states.

2.2.2 The Myhill-Nerode theorem and the canonical acceptor A≡L

For finite languages L, PT(L) is not generally the minimal deterministic automaton accepting L. That is, it is not the DFA accepting L with the smallest number of states. However, it is fairly easy to construct a minimal DFA for any regular language using the equivalence classes of the Nerode equivalence relation (sometimes called the right congruence relation induced by L). These equivalence relations also give us a characterization of the finite state languages.

Definition 9 The Nerode equivalence relation for L, x ≡L y iff for all z ∈ Σ∗, xz ∈ L iff yz ∈ L.

Theorem 3 If w ∈ L and w ≡L w′ then w′ ∈ L.

Proof: By definition, letting z = ε.

Lemma 1 If σ ∈ Σ and w ≡L w′ then wσ ≡L w′σ .

Proof: Assume σ ∈ Σ, w ∈ Σ∗ and w ≡L w′. By definition, for any x ∈ Σ∗, wx ∈ L iff w′x ∈ L. So let x = σz: w(σz) ∈ L iff w′(σz) ∈ L. But then (wσ)z ∈ L iff (w′σ)z ∈ L, and so wσ ≡L w′σ.

Definition 10 Given any equivalence relation ≡, the equivalence class of w is [w]≡ = {x | w ≡ x}. (Often we use just the brackets, leaving off the subscript when no confusion will result.) The index of equivalence relation ≡, I(≡), is the number of different equivalence classes it induces, I(≡) = |{[x] | x ∈ Σ∗}|.


Theorem 4 (Myhill-Nerode Theorem) For any language L, ≡L has finite index iff L is regular.1

Proof: (⇐) Since every regular language is accepted by some DFA A = 〈Q, Σ, δ, q0, F〉, assume L = L(A). Let x ≡A y just in case δ(q0, x) = δ(q0, y). Obviously, ≡A is an equivalence relation, and its index cannot be larger than |Q|. But if x ≡A y then for all z, xz ≡A yz, and so xz ∈ L iff yz ∈ L. Hence, by the definition of the Nerode equivalence relation, if x ≡A y then x ≡L y. It follows that the index I(≡L) ≤ I(≡A), and hence I(≡L) is finite.

(⇒) Assume ≡L has finite index. We define the canonical acceptor for L, A≡L. We let equivalence classes themselves be the states of the automaton, Q = {[w] | w ∈ Pr(L)}. So, by assumption, Q is finite. Let

δ([w], σ) = [wσ] whenever w, wσ ∈ Pr(L),
F = {[w] | w ∈ L}, and
I = {[ε]}.

Now it is clear that A≡L = 〈Q, Σ, δ, I, F〉 is a deterministic automaton which accepts L, since by definition w ∈ L(A≡L) iff [w] ∈ F iff w ∈ L.

Example. The canonical acceptor for {abc, abd, acd} is smaller than PT({abc, abd, acd}). In fact, it is this:

[Figure: the canonical acceptor, with five states: an a-arc from 0 to 1; b and c arcs from 1 to 2 and 3 respectively; c and d arcs from 2 to the final state 4, and a d-arc from 3 to 4.]

Corollary 1 L = {a^n b^n | n ∈ N} is not regular.

Proof: Obviously, for each choice of n, [a^n] ≠ [a^(n+1)], and so ≡L does not have finite index.

Corollary 2 For any regular language L, the canonical acceptor A≡L has I(≡L) − 1 states if there is any string w ∉ Pr(L), and otherwise has I(≡L) states.

Proof: Every equivalence class of ≡L is a state of A≡L except for the class of strings that are not prefixes of any sentences of L, if there are any.

Corollary 3 No DFA accepting L has fewer states than A≡L .

1The Myhill-Nerode theorem is treated in Hopcroft and Ullman (1979, §3.4) at the end of their second chapter on finite automata. It is treated in Moll, Arbib, and Kfoury (1988, §8.2). In Lewis and Papadimitriou (1981), the Myhill-Nerode theorem is an exercise.


Proof: This is already implicit in the proof of the Myhill-Nerode theorem. Compare the machine A≡L with states Q to any arbitrary deterministic A′ = 〈Q′, Σ, δ′, q′0, F′〉, where L = L(A′). We show that there must be at least as many states in Q′ as in Q.

Define: x ≡A′ y iff δ′(q′0, x) = δ′(q′0, y). Since A′ is deterministic and the values of δ′ are in Q′, |Q′| ≥ I(≡A′) − 1 – that is, ≡A′ only distinguishes as many classes as there are states of Q′, plus one other class if some strings are not in Pr(L).

But notice that we also have, as in the Myhill-Nerode proof, that x ≡A′ y implies x ≡L y. (This is the key point! No machine accepting L can equate strings x, y that are not equated by ≡L!) That is, I(≡A′) ≥ I(≡L). It follows then that |Q′| ≥ |Q|.

Corollary 4 Any minimal DFA A = 〈Q′, Σ, δ′, q′0, F′〉 accepting L is isomorphic to A≡L, that is, there is a bijection g : Q → Q′ such that g(δ(q, σ)) = δ′(g(q), σ).

Note: There is an efficient algorithm for converting any deterministic machine accepting L into a minimal deterministic machine accepting L.2
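The notes do not spell such an algorithm out; the following is a minimal sketch of the standard partition-refinement idea (Moore's algorithm), assuming a complete DFA with a total transition function — a related but different algorithm from the ones cited in the footnote, given only for illustration.

```python
# Sketch of DFA minimization by partition refinement (Moore's algorithm):
# start from the final / non-final split and refine until no block splits.
# Assumes a complete DFA: delta is a total function on Q x Sigma.

def minimize(Q, Sigma, delta, q0, F):
    blocks = [b for b in (set(F), set(Q) - set(F)) if b]
    changed = True
    while changed:
        changed = False
        new_blocks = []
        for b in blocks:
            # group the states of b by which block each symbol leads to
            sig = {}
            for q in b:
                key = tuple(next(i for i, c in enumerate(blocks)
                                 if delta[(q, a)] in c)
                            for a in sorted(Sigma))
                sig.setdefault(key, set()).add(q)
            new_blocks.extend(sig.values())
            if len(sig) > 1:
                changed = True
        blocks = new_blocks
    return blocks   # each block is one state of the minimal DFA

# a complete DFA for strings over {a,b} ending in b, with a redundant state 2
Q, Sigma = {0, 1, 2}, {"a", "b"}
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 2,
         (2, "a"): 0, (2, "b"): 2}
print(minimize(Q, Sigma, delta, 0, {1, 2}))   # states 1 and 2 fall together
```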

Also notice that the previous theorem and its proof rely on the determinism of the automaton that is being compared to A≡L. In fact, we can get much smaller machines if we allow nondeterminism.

2.2.3 Grammatical representations of regular languages

Definition 11 A rewrite grammar G = 〈V, Σ, P, S〉 where
V is a finite set of symbols (≠ ∅);
Σ ⊆ V, the terminal symbols;
P ⊆ V∗(V − Σ)V∗ × V∗;
S ∈ (V − Σ).

An element 〈u,v〉 ∈ P is often written u→ v.

Definition 12 For u, w, x, y ∈ V∗, uxw ⇒ uyw iff x → y is in P. ⇒∗ is the reflexive, transitive closure of ⇒.

Definition 13 The language generated by grammar G, L(G) = {w ∈ Σ∗ | S ⇒∗ w}.

Definition 14 Given a grammar G, the sequence w0, w1, . . . , wn is a derivation of wn from w0 iff wi ⇒ wi+1 for all 0 ≤ i < n. If w0 = S, this is a derivation of wn from G.

We generalize the grammar form of the introduction just slightly, to allow single terminals as well as the empty string on the right sides of productions:

Definition 15 G is right linear iff every production in P has one of the following forms, where σ ∈ (Σ ∪ {ε}), A, B ∈ (V − Σ):

A → σB
A → σ

2Cf. Algorithm 4.5 of Aho, Hopcroft, and Ullman (1974, pp158,162); Watson (1993).


Lemma 2 If a language L ⊆ Σ∗ is accepted by automaton A, then it is generated by a right linear grammar.

We leave this as an exercise.

Lemma 3 If L is generated by a right linear grammar, then L is accepted by a finite automaton A.

Proof: (⇒) Suppose L is generated by the right linear grammar G = 〈V, Σ, P, S〉. Define A as follows:

Q = (V − Σ) ∪ {qf},
I = {S},
F = {qf},

δ(A, σ) = {B | (A → σB) ∈ P} if P has no rule of the form A → σ,
δ(A, σ) = {qf} ∪ {B | (A → σB) ∈ P} otherwise.

Call this automaton A the equivalent of the right linear G. It is now easy to show a correspondence between derivations and accepting state sequences as was done in the previous proof.
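A small illustrative sketch of this construction (Python, not from the notes); the encoding of productions as short strings and the handling of A → ε by making A itself final are assumptions of the sketch, slightly beyond the construction as stated.

```python
# Sketch of the proof of Lemma 3: from a right linear grammar to an automaton.
# Productions are pairs: (A, "aB") for A -> aB, (A, "a") for A -> a, (A, "") for A -> eps.
# Nonterminals and terminals are single characters, just to keep the sketch short.

def grammar_to_nfa(nonterminals, terminals, productions, start):
    qf = "qf"
    Q = set(nonterminals) | {qf}
    delta = set()
    finals = {qf}
    for A, rhs in productions:
        if rhs == "":                    # A -> eps: let A itself be final
            finals.add(A)
        elif len(rhs) == 1:              # A -> a
            delta.add((A, rhs, qf))
        else:                            # A -> a B
            delta.add((A, rhs[0], rhs[1]))
    return Q, set(terminals), delta, {start}, finals

# the (ab)* grammar of section 2: S -> eps, S -> aB, B -> bS
Q, Sigma, delta, I, F = grammar_to_nfa({"S", "B"}, {"a", "b"},
                                       [("S", ""), ("S", "aB"), ("B", "bS")], "S")
print(sorted(delta))   # [('B', 'b', 'S'), ('S', 'a', 'B')]
print(F)               # S and qf are final
```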

Theorem 5 L is accepted by a finite automaton A iff L is generated by a right linear grammar.

Immediate from the previous 2 lemmas.

2.2.4 The pumping lemma for regular languages

Theorem 6 If x ∈ L(A) and |x| ≥ |Q| then for some u, v, w ∈ Σ∗, x = uvw, |v| > 0, and for all k ≥ 0, uv^k w ∈ L(A).

Proof: Assume x ∈ L(A), |x| ≥ |Q|. Then there is a successful path

(q0, a1, q1), (q1, a2, q2), . . . , (qn−1, an, qn)

where x = a1 . . . an. In particular, q0 ∈ I, qn ∈ F, a1 . . . an = x and n = |x|. Since |x| ≥ |Q|, n ≥ |Q|, and so there are some qi, qj, 0 ≤ i < j ≤ n such that qi = qj and |ai+1 . . . aj| > 0. Let

u = a1 . . . ai, v = ai+1 . . . aj, w = aj+1 . . . an.

We noted already that |v| > 0. The string uvw ∈ L(A) by assumption, but we now show that for all k ≥ 0, uv^k w ∈ L(A).

So there is a successful path

(q0, a1, q1), . . . , (qi−1, ai, qi), . . . , (qj, aj+1, qj+1), . . . , (qn−1, an, qn)

such that qi = qj. So instead of going from qi−1 to qi we can go from qi−1 to qj (it is the same state). It follows that

(q0, a1, q1), . . . , (qi−1, ai, qj), (qj, aj+1, qj+1), . . . , (qn−1, an, qn)

is a successful path. Consequently, uv^0 w ∈ L(A). (For any string v, v^0 = ε.) Furthermore, instead of going from qj−1 to qj, we can just as well go back into qi to repeat the sequences 〈qi, . . . , qj−1〉 and 〈ai+1, . . . , aj〉 any number of times. Consequently, uv^k w ∈ L(A) for all k ≥ 0.


2.2.5 Regular languages are closed under union

Given two finite state machines, we can easily construct a finite state machine that accepts the union of the two languages.

Given A1 = 〈Q1, Σ1, δ1, I1, F1〉 and A2 = 〈Q2, Σ2, δ2, I2, F2〉, we can assume without loss of generality that Q1 ∩ Q2 = ∅. Then define

A = 〈Q1 ∪Q2,Σ1 ∪ Σ2, δ1 ∪ δ2, I1 ∪ I2, F1 ∪ F2〉.

It is easy to show that this automaton accepts exactly the language L(A1)∪ L(A2).

2.2.6 Regular languages are closed under intersection

Given two finite state machines, we can easily construct a finite state machine that accepts the intersection of the two languages.

Given A1 = 〈Q1,Σ, δ1, I1, F1〉 and A2 = 〈Q2,Σ, δ2, I2, F2〉, define A = 〈Q1 ×Q2,Σ, δ, I1 × I2, F1 × F2〉,where for all a ∈ Σ, q1, r1 ∈ Q1, q2, r2 ∈Q2,

([q1, q2], a, [r1, r2]) ∈ δ iff (q1, a, r1) ∈ δ1 and (q2, a, r2) ∈ δ2.

It is easy to show that this automaton accepts exactly the language L(A1)∩ L(A2).
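A minimal sketch of this product construction (illustrative Python; the 5-tuple encoding is the same one used in the earlier sketches, and the example machines are my own).

```python
# Sketch of the product construction: states are pairs, and both machines
# must step on the same symbol at once.

def intersect(A1, A2):
    (Q1, Sigma, d1, I1, F1), (Q2, _, d2, I2, F2) = A1, A2
    delta = {((q1, q2), a, (r1, r2))
             for (q1, a, r1) in d1
             for (q2, b, r2) in d2 if a == b}
    I = {(p, q) for p in I1 for q in I2}
    F = {(p, q) for p in F1 for q in F2}
    Q = I | F | {q for (q, _, _) in delta} | {r for (_, _, r) in delta}
    return Q, Sigma, delta, I, F

# strings over {a,b} containing an a, intersected with strings of even length
A1 = ({0, 1}, {"a", "b"},
      {(0, "a", 1), (0, "b", 0), (1, "a", 1), (1, "b", 1)}, {0}, {1})
A2 = ({0, 1}, {"a", "b"},
      {(0, "a", 1), (0, "b", 1), (1, "a", 0), (1, "b", 0)}, {0}, {0})
Q, Sigma, delta, I, F = intersect(A1, A2)
print((0, 0) in I, (1, 0) in F)   # True True
```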

2.2.7 Regular languages are closed under concatenation

Given two finite state machines, we can easily construct a finite state machine that accepts the concatenation of the two languages.

Given A1 = 〈Q1, Σ, δ1, I1, F1〉 and A2 = 〈Q2, Σ, δ2, I2, F2〉, intuitively, we merge the final states of A1 with the initial states of A2: for each q1 ∈ F1 and each symbol a, δ relates q1 and a to everything that δ1 relates them to, together with every q2 that δ2 relates some initial state of A2 and a to.

2.2.8 Regular languages are closed under complements

Given a finite state machine A that accepts L(A) ⊆ Σ∗, we can easily construct a finite state machine that accepts Σ∗ − L(A). Intuitively, we determinize A and then enrich it so that every element of Σ can be read from every state, if only to map the state to a "dead" state from which no final state can be reached. Then, we construct a new machine which is like the first except that it has as final states all the states that are non-final in the previous machine.
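A minimal sketch of the last step (illustrative Python), assuming the determinizing and completing described above have already been done, so that the machine is a complete DFA and it only remains to swap final and non-final states.

```python
# Sketch of the complement construction, assuming a complete DFA
# (every state has an arc for every symbol): swap final and non-final states.

def complement(Q, Sigma, delta, q0, F):
    return Q, Sigma, delta, q0, set(Q) - set(F)

def run(delta, q0, F, w):
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q in F

# complete DFA for strings over {a,b} that contain an a; its complement is b*
Q, Sigma = {0, 1}, {"a", "b"}
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}
Qc, _, dc, q0c, Fc = complement(Q, Sigma, delta, 0, {1})
print(run(dc, q0c, Fc, "bbb"), run(dc, q0c, Fc, "ba"))   # True False
```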


2.3 Finite machines with output: transducers

We can easily extend finite machines by providing each transition with an output. For example, we can modify the first fsm diagram from §2.1 to get a machine which maps each string (ab)^n to (ba)^n, the result of simultaneously switching all the a's and b's.

[Figure: a two-state transducer: an a:b arc from S to B and a b:a arc from B to S; S is initial and final.]

Input-output relations defined in this way are often called rational transductions. This kind of machine is usually formalized with the following 6 parts, where for any set S, Sε = S ∪ {ε}.

Definition 16 A finite transducer A = 〈Q, Σ1, Σ2, δ, I, F〉 where
Q is a finite set of states (≠ ∅);
Σ1 is a finite set of input symbols (≠ ∅);
Σ2 is a finite set of output symbols (≠ ∅);
δ ⊆ Q × (Σ1)ε × (Σ2)ε × Q;
I ⊆ Q, the initial states;
F ⊆ Q, the final states.

NB: As will become clear, adding ε to the possible transition labels allows transductions to be defined that could not be defined otherwise. (Remember that in the case of finite automata, we have full generality even when we allow only single alphabet symbols to label transitions.)

2.3.1 Domains and ranges of rational transductions are regular

Given a finite transducer, removing the outputs, and then eliminating ε transitions (as described in §2.1), yields a finite machine that accepts the domain of the transduction. Removing the inputs and then eliminating ε transitions yields a finite machine that accepts the range of the transduction.
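A minimal sketch of these two projections (illustrative Python, using the transducer-as-6-tuple encoding of Definition 16); the ε-elimination step mentioned above is not shown.

```python
# Sketch of section 2.3.1: dropping outputs from a transducer's arcs gives an
# acceptor for its domain; dropping inputs gives an acceptor for its range.
# (Any epsilon labels this leaves behind would still have to be eliminated.)

def domain_acceptor(T):
    Q, S1, S2, delta, I, F = T
    return Q, S1, {(q, a, r) for (q, a, b, r) in delta}, I, F

def range_acceptor(T):
    Q, S1, S2, delta, I, F = T
    return Q, S2, {(q, b, r) for (q, a, b, r) in delta}, I, F

# the (ab)^n -> (ba)^n transducer of section 2.3
T = ({"S", "B"}, {"a", "b"}, {"a", "b"},
     {("S", "a", "b", "B"), ("B", "b", "a", "S")}, {"S"}, {"S"})
print(domain_acceptor(T)[2])   # the arcs of the (ab)* acceptor
print(range_acceptor(T)[2])    # the arcs of the (ba)* acceptor
```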

2.3.2 Rational transductions are closed under union

Like the construction of the union machine in §2.2.5, it is easy to construct a machine A which defines a relation R ⊆ Σ1∗ × Σ2∗ with R = R1 ∪ R2, where R1 is the transduction defined by a finite transducer A1 and R2 is the transduction defined by a finite transducer A2.

2.3.3 Rational transductions are not closed under intersection

This is easily established by noting that we can define a transduction from a^n to b^n c∗ and a transduction from a^n to b∗ c^n, but the intersection of these relations maps a^n to b^n c^n, which cannot be defined by a finite machine.


2.3.4 Some rational transductions are essentially nondeterministic

The following transducer has no deterministic equivalent (Barton, Berwick, and Ristad, 1987). Given strings x^n a or x^n b, the machine cannot deterministically decide whether to start emitting a's or b's. Of course, some transducers can be determinized – see e.g. Roche and Schabes (1997a, §7.9) for an algorithm.

[Figure: a nondeterministic transducer mapping x^n a to a^n a and x^n b to b^n b: from state 0, an x:a arc leads to a state that loops on x:a and exits on a:a, and an x:b arc leads to a state that loops on x:b and exits on b:b, both reaching the final state 3.]

2.3.5 Rational transductions closed under intersecting their domains with regular languages

Given a finite state transducer T and a finite state machine A, we can easily construct the finite state transducer which defines the restriction of the transduction of T to the intersection Dom(T) ∩ L(A).

This point is not theoretically central, but it has practical applications and so it is mentioned in, for example, Roche and Schabes (1997b, §1.3.7). We will use it in the next section.

Given T = 〈Q1,Σ,Σ2, δ1, I1, F1〉 and A = 〈Q2,Σ, δ2, I2, F2〉,define T ′ = 〈Q1 ×Q2,Σ,Σ2, δ, I1 × I2, F1 × F2〉, where for all a ∈ Σ, b ∈ Σ2, q1, r1 ∈Q1, q2, r2 ∈ Q2,

([q1, q2], a, b, [r1, r2]) ∈ δ iff (q1, a, b, r1) ∈ δ1 and (q2, a, r2) ∈ δ2.

NB: to execute this intersection, it is important to keep in mind the "0-step path" that we have in our definition of finite automata: intuitively, there is a path from every state to itself accepting the empty string.

2.3.6 Rational transductions closed under inverses

This point is mentioned by Yu (1997, p68). We simply interchange the input and output symbols labeling each transition.

2.3.7 Rational transductions closed under compositions

Kaplan and Kay (1994) establish this one. Given T = 〈Q1, Σ1, Σ2, δ1, I1, F1〉 and A = 〈Q2, Σ2, Σ3, δ2, I2, F2〉, define T′ = 〈Q1 × Q2, Σ1, Σ3, δ, I1 × I2, F1 × F2〉, where for all a ∈ Σ1, c ∈ Σ3, q1, r1 ∈ Q1, q2, r2 ∈ Q2,

([q1, q2], a, c, [r1, r2]) ∈ δ iff for some b ∈ Σ2, (q1, a, b, r1) ∈ δ1 and (q2, b, c, r2) ∈ δ2.
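A minimal sketch of this construction (illustrative Python, not from the notes); transitions are 4-tuples, and the ε cases that the full construction must also handle are left out.

```python
# Sketch of transducer composition: a pair of arcs composes when the output
# of the first matches the input of the second; states are pairs.

def compose(T1, T2):
    (Q1, S1, S2, d1, I1, F1), (Q2, _, S3, d2, I2, F2) = T1, T2
    delta = {((q1, q2), a, c, (r1, r2))
             for (q1, a, b, r1) in d1
             for (q2, b2, c, r2) in d2 if b == b2}
    I = {(p, q) for p in I1 for q in I2}
    F = {(p, q) for p in F1 for q in F2}
    Q = I | F | {q for (q, _, _, _) in delta} | {r for (_, _, _, r) in delta}
    return Q, S1, S3, delta, I, F

# T1 rewrites x as y; T2 rewrites y as z; their composition rewrites x as z
T1 = ({0}, {"x"}, {"y"}, {(0, "x", "y", 0)}, {0}, {0})
T2 = ({0}, {"y"}, {"z"}, {(0, "y", "z", 0)}, {0}, {0})
print(compose(T1, T2)[3])   # {((0, 0), 'x', 'z', (0, 0))}
```

The same function also illustrates the use made of compositions in the next section: composing the prefix tree transducer of a string with a rule transducer applies the rule to that string.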


2.4 Exercises

1. Draw the minimal deterministic automaton that accepts

{CV, CVC, VC, V}(.{CV, CVC, VC, V})∗

2. Draw the minimal deterministic transducer which maps a sequence w ∈ {C, V, .}∗ to x^n iff w contains n occurrences of .C

3. Intersect the domain of the previous transducer with the language defined in the first exercise, and draw the result.

4. Use Nerode's theorem to show that {xx | x ∈ {a, b}∗} is not regular.


3 Some early proposals

(1) (Johnson, 1972): rules like

N → m / _p; elsewhere n
p → m / m_

can be implemented by a transducer which maps symbols on the left sides of rules to the symbols on the right sides, in context. Restricting our attention to inputs and outputs over Σ = {N, p, m} we get the following transducers T1, T2 for the preceding 2 rules:

[Figures: transducer T1, which realizes N as m before p and as n elsewhere, and transducer T2, which realizes p as m after m; the other symbols k, a, n, m are passed through unchanged.]

(2) For any finite set S of strings, define the prefix tree transducer of S, ptt(S), to be the prefix tree extended to be the identity transduction on S. So for example, ptt({kaNpan}) is this machine:

[Figure: a chain of states 0–6 with arcs k:k, a:a, N:N, p:p, a:a, n:n.]

(3) (Kaplan and Kay, 1994): The set of finite transducers is closed under composition.

So to see what T1 does to kaNpan we can compute ptt({kaNpan}) ∘ T1:

[Figure: a single path mapping kaNpan to kampan: k:k, a:a, N:m, p:p, a:a, n:n.]

What would the composition in the other order represent? In fact, T1 ∘ ptt({kaNpan}) accepts nothing. (Exercise: explain why.)

To see what T2 does to kaNpan we can compute ptt({kaNpan}) ∘ T2:

[Figure: a single path mapping kaNpan to itself: k:k, a:a, N:N, p:p, a:a, n:n.]

In this case, the other composition exists too, T2 ∘ ptt({kaNpan}):

[Figure: again a single path mapping kaNpan to itself.]

(4) Now consider T1 ∘ T2:

[Figure: the composed transducer T1 ∘ T2, with states 0–3 and arcs including N:m, N:n, p:m, and p:p.]

For the case where T1 and then T2 is applied to the example, we compute ptt({kaNpan}) ∘ T1 ∘ T2:

[Figure: a single path mapping kaNpan to kamman: k:k, a:a, N:m, p:m, a:a, n:n.]

(5) Can the composed relation T1 ∘ T2 be represented by rewrite rules? It depends exactly what is meant by "rewrite rule." There are fairly simple rewrite systems that have exactly the effect of the composed transduction, for example, the following rules, which in effect keep the context on the left side:

Np → mm
Nx → nx   for each x ∈ (Σ − {p})
mp → mm

(6) (Koskenniemi, 1983): Even though the intersection of finite transducers T1, T2 is not generally a finite transduction, it is computable:

Let's call this a two level automaton: it accepts a lexical : surface pair iff that pair is accepted by every one of the transducers fst1, . . . , fstn. (We leave aside for now the question of how such a thing really works.)

(7) "Two-level" rules can refer to both underlying and surface forms (Koskenniemi, 1983; Karttunen, 1991), defining what one of our component transducers fsti should allow:

α : β if . . .
α : β only if . . .
α : β iff . . .
α : β never . . .
etc.

(8) Karttunen considers the r[ayD]ing/wr[∧yD]ing contrast that has sometimes been taken as an argument for ordering vowel raising before flapping (Bromberger and Halle, 1989):

ay → ∧y / _[-voice]
t,d → D / V_V

These rules can be represented by finite state transducers T3, T4 (here we use A for ∧):

[Figures: transducer T3, which realizes ay as Ay before the voiceless t, and transducer T4, which realizes t and d as D between vowels; other symbols are passed through unchanged.]

We can compute the result of applying one rule after the other, T3 ∘ T4:

[Figure: the composed transducer T3 ∘ T4, with five states.]

This is complex! For a first check on whether it is doing the right thing, we can look at ptt({rayter}) ∘ T3 ∘ T4 and ptt({rayder}) ∘ T3 ∘ T4:

[Figures: a path mapping rayter to rAyDer (r:r, ay:Ay, t:D, er:er) and a path mapping rayder to rayDer (r:r, ay:ay, d:D, er:er).]

This is what we wanted.

(9) Consider now T4 ∘ T3:

[Figure: the composed transducer T4 ∘ T3, with six states.]

Looking at ptt({rayter}) ∘ T4 ∘ T3 and ptt({rayder}) ∘ T4 ∘ T3, we have:

[Figures: a path mapping rayter to rayDer and a path mapping rayder to rayDer; in both, ay:ay, so the vowel is not raised.]

This is not what we want: the standard account is that the rules are applying in the wrong order here.


(10) Now consider the two level rules, which apply simultaneously, with contexts that can refer to either surface or underlying form (or both):

ay:∧ if _ [-voice]:
t:D | d:D if V: _ V:

Consider transducer T3 – does this implement the first of these two-level rules? No. We can see that the vowel change only occurs if the vowel is followed by an underlying voiceless segment. That's good. But we need to make sure that the vowel change occurs always, if it is followed by an underlying voiceless segment. What we want is a transducer that will pass all underlying:surface pairs through except those that fail to raise the vowel in the indicated context.3

So we need a transducer that is not just subtly different from T2. The transducer T3^2L that implements the first two-level rule is shown below: really it is the first, full machine, which simply must be abbreviated to be readable. We use S for Σ = {ay, Ay, er, d, r, t, D}; we use +v for the voiced {ay, Ay, er, d, r, D}; we use -v for the unvoiced {t}:4

3When Karttunen (1991) introduces the two level rules and observes a difference between the transducer in his Figure 12 for his two-level rule 5b and the transducer in his Figure 5 for his rewrite rule 3b, he says "we reach that state [1] by encountering a surface m which can be a realization of either m or N on the lexical side." The important point to notice is that the restrictions on what m can be a realization of are not given in either rule, neither in 3b nor in 5b. So for a general approach, we want to reach his state 1 with a surface m, regardless of what was underlying that m. We take the general approach here.

4Notice that we let outputs range over Σ in some of the transitions shown here; we do not let them range over Σε. That is, we are not allowing for arbitrary deletions, or arbitrary insertions either. Clearly, in some cases we will need to allow for deletions and insertions, but for the moment we put off consideration of the issues raised by these operations.

[Figures: the full transducer T3^2L, far too large to read, and its abbreviated form: a three-state machine with arcs labeled S-ay:S, ay:Ay, ay:S, -v:S, and +v:S, arranged so that an ay realized as anything other than Ay cannot be followed by an underlying voiceless segment.]

The situation is similar for transducer T4. Transducer T4 allows flapping to occur if the consonant is surrounded by underlying vowels, but it does not require flapping to occur only if this is the case. For that, we need T4^2L (and we don't try to display the full form!). As before, we use S for Σ = {ay, Ay, er, d, r, t, D}, and now we use V for {ay, Ay, er}:

[Figure: the abbreviated form of T4^2L, a four-state machine with arcs labeled S-V:S, V:S, t:D, d:D, t:S-D, and d:S-D, arranged so that an underlying t or d flanked by underlying vowels must be realized as D.]

(11) Consider the transducer T3^2L ∘ T4^2L – does this implement what we want from both of the two-level rules? This system is so complex that we cannot display it.

To check one case, we can compute ptt({rayter}) ∘ T3^2L and ptt({rayder}) ∘ T3^2L:

[Figures: ptt({rayter}) ∘ T3^2L, in which ay must be realized as Ay (it precedes the underlying voiceless t) while every other segment may be realized as any symbol; and ptt({rayder}) ∘ T3^2L, in which every segment, including ay, may be realized as any symbol.]

And we compute ptt({rayter}) ∘ T4^2L and ptt({rayder}) ∘ T4^2L:

[Figures: ptt({rayter}) ∘ T4^2L, in which t must be realized as D (it stands between underlying vowels) while the other segments may be realized as any symbol; and ptt({rayder}) ∘ T4^2L, in which d must be realized as D.]

And finally we compute ptt({rayter}) ∘ T3^2L ∘ T4^2L and ptt({rayder}) ∘ T3^2L ∘ T4^2L:

[Figures: the resulting machines for rayter and rayder; both are very large, and notably they include arcs on which r and other segments not mentioned in the rules may be realized as any symbol.]

(12) Karttunen's suggestion, following Kaplan and Kay and others, is that what we want from both of the two level rules is the relation T3^2L ∩ T4^2L. In fact, Karttunen claims that T3 ∘ T4 is the same as T3^2L ∩ T4^2L.5 Is this true? How could we establish this?

(13) In fact, it is easy to see that, as we have defined the machines, T3 ∘ T4 is not the same as T3^2L ∩ T4^2L. This follows trivially from the fact that nothing in T3^2L ∩ T4^2L requires r to be unchanged, while this is required by T3 ∘ T4.

Open question: Is there an understanding of Karttunen's claim that makes it true (or at least plausibly true)?

Open question: Are there any feasible algorithms that could decide whether a composed transducer and a two level automaton define the same relation? (ES conjecture: no.) If not, we are really stuck here, because the composed machine T3 ∘ T4 is fairly complex, and the two level automaton T3^2L ∩ T4^2L is even much more so! Let's explore the power of two level automata just briefly.

5"The composition of the two transducers implementing [rewrite rules] (8a) and (8b) is the same as the intersection of the automata corresponding to [the two-level rules in] (9)" (Karttunen, 1991, §4.1).


(14) To decide whether two level automata are appropriate models for human phonology or morphology, we can consider:

a. Do these models appropriately capture the properties of (generalizations about) human phonology and morphology?

Here, we can notice that the two level models enforce a kind of correspondence between underlying and surface forms, anticipating one aspect of some recent proposals in phonology. We will return to this later.

b. Do these models appropriately constrain the space of possibilities, allowing the possibility of explaining why many non-human systems never occur?

It is sometimes fairly easy to bring empirical evidence to bear on this question, evidence about fairly basic and general properties of the language. We turn to this question now, because it sets the stage for understanding later significant developments in computational phonology.

(15) We have seen that we can write a transducer that defines a relation mapping a^n to b^n c∗ and one that maps a^n to b∗ c^n. The intersection of these two transductions is the relation mapping a^n to b^n c^n. We can use Nerode's theorem to see that the range of this relation is not regular. Regular languages never have two counting dependencies. In fact, b^n c^n is context free. But context free languages (CFLs) never have more than two counting dependencies, so b^n c^n d^n e^n is not context free, but it is a tree adjoining language (TAL). In turn, TALs never have more than four counting dependencies, so b^n c^n d^n e^n f^n g^n is not a TAL, but it is a 2C-TAL. It is easy to see that two-level automata can define relations whose ranges are languages with any number of counting dependencies.

Exercise: A "copy language" is a language whose strings are n > 1 repetitions of some substring. For example, {x^2 | x ∈ {a, b}∗} is a simple copy language containing strings like: abab, baabaa, . . . . To find the copied substrings we can just split any word in half. A slightly simpler copy language might mark the beginning of each copy somehow, as in {(cx)^2 | x ∈ {a, b}∗}. This language contains strings like: cabcab, cbaacbaa, . . . .

Show that two level automata can define relations whose ranges are "copy languages" containing any number of copies. That is, show how, for any n, we can define a two level machine enforcing the relation: {(x, (cx)^n) | x ∈ {a, b}∗}.

(16) Recall the Chomsky hierarchy of languages:

finite sets ⊂ regular languages ⊂ context free languages ⊂ context sensitive languages ⊂ recursively enumerable languages

and the finer hierarchy between the context free and the context sensitive languages:

context free languages ⊂ TALs = CCLs = LILs = HLs ⊂ 2C-TALs = 2f-MLs ⊂ 3C-TALs = 3f-MLs ⊂ . . . ⊂ MC-TALs = LCFRLs = MCFLs = MLs ⊂ context sensitive languages

Since two level machines can define any number of counting dependencies, we know that the ranges of two level relations are not included in any class smaller than the MCFLs.


(17) There is another hierarchy, though it is not known whether these inclusions are strict:

P: polynomial time on a deterministic TM (e.g. CFL recognition, MCFL recognition)
NP: polynomial time on a nondeterministic TM (e.g. 3SAT, travelling salesman problem)
PSPACE: polynomial space on a deterministic TM (e.g. CSL recognition, DFA intersection)
EXP: exponential time on a deterministic TM
EXP-SPACE

NB: MCFL recognition is in P. The problem of deciding whether the intersection of n DFAs is non-empty is PSPACE-complete, i.e. as hard as the hardest problems in PSPACE (Kozen, 1977, Lemma 3.2.3).

(18) Barton, Berwick, and Ristad (1987): two level automata recognition problems are NP-hard (i.e. as hard as the hardest problems in NP).

The argument goes like this:

a. the problem of deciding whether a 3-CNF formula is satisfiable is NP-complete;

b. this "3-SAT" problem can be represented as a recognition problem in a two level automaton; therefore,

c. the recognition problem for two level automata can be at least as hard as 3-SAT.

The two level formulation of a 3-SAT problem is easily sketched. We will represent an arbitrary 3-CNF formula like

(x ∨ ¬y ∨ z) ∧ (¬x ∨ ¬z) ∧ (x ∨ y)

in the following simplified form:

x-yz,-x-z,xy

and we will define a two level automaton which will accept such a formula if and only if it is satisfiable. For each variable x we have an "assignment" machine which simply ensures that each variable is mapped either to T or F throughout the whole formula. Here is the machine for variable x:

[Figure: the assignment machine for x, with three states: from the initial state, x:T leads to a state that thereafter realizes x only as T, and x:F to a state that thereafter realizes x only as F; in every state, y and z may be realized as T or F, and - and , are realized as themselves.]

And finally, we have a machine that checks every disjunction (where the disjunctions are separated by commas) to make sure that at least one disjunct is true:

[Figure: the disjunction-checking machine: it stays in a non-accepting state while the disjuncts of the current disjunction are realized as F, moves to an accepting state once some disjunct is realized as T, and only from there can it read a comma and go on to check the next disjunction.]

(19) This raises the question: do we need a recognition system that is powerful enough to represent intractable problems? While reduplication phenomena need to be accounted for, there is no evidence that we need mechanisms that can make 100 copies, or enforce 100 counting dependencies. This could be due just to "performance" restrictions, but we should consider whether there are weaker systems that can do the job needed without being able to do so many other things too.


4 Using non-deterministic machines

(1) We saw that we could represent rewrite rules as finite state transducers:

N → m / _p; elsewhere n
p → m / m_

Composing these two transducers we get:

[Figure: the composed transducer, with states 0–3: from state 0 the underlying N can be realized either as m (going to state 2) or as n (going to state 3), and p is realized as m after m.]

Notice that this transducer is nondeterministic on the underlying string: for example, from state 0 and with next underlying symbol N, we could either output m and go to state 2 or output n and go to state 3.

(Notice that although the transducer is nondeterministic, the transduction from inputs to outputs is a function. That is, although it is not determined what we should do from state 0 with next symbol N, only one path will work.)

(2) A good question: We have no way to transduce input strings efficiently using two-level machines, in general, but we did not provide a way to transduce input strings efficiently using nondeterministic finite transducers either. Is it possible?


(3) We also noticed that the following machine cannot be made deterministic:

[Figure: the transducer of §2.3.4 again: from state 0, x:a leads to a state that loops on x:a and exits to the final state on a:a, while x:b leads to a state that loops on x:b and exits on b:b.]

I don't know of a case like this in phonology, where the first output symbol depends on something arbitrarily far away, but Roche and Schabes (1997a, §7.9) point out that there are cases like this in syntax. For example, suppose one sense of keep occurs in the following sentence, call it keep1:

a. Let’s keep this new problem under control

and keep2 occurs in

b. The flood problems keep the hardest-hit areas virtually out of reach to rescuers.

The disambiguating material may be arbitrarily far away, but we can represent a simple idea about the disambiguation with a machine like this:

[Figure: a transducer that copies every other word (x:x), nondeterministically realizes keep as keep1 or keep2, and then checks for under . . . control on the keep1 path and for out of reach on the keep2 path.]

This machine, like the previous one, defines a function from input strings to output strings, but this function is provably not one that can be computed by any finite transducer whose next state and next output are a function of the current state and next input.6 Can we use a machine like this efficiently?

(4) A prior question: As discussed earlier, every non-deterministic recognizer has an equivalent deterministic one, but the deterministic one can be exponentially larger. We saw that Perrin (1990, p30) considers the following 4 state nondeterministic automaton AΣ∗aΣ2:

[Figure: the 4-state nondeterministic automaton AΣ∗aΣ2: state 0 loops on a and b, an a-arc leads from 0 to 1, and arcs on a and b lead from 1 through 2 to the final state 3.]

The corresponding minimal deterministic automaton is this one:6

[Figure: the corresponding minimal deterministic automaton, with 8 states.]

6The functions defined by the transducers displayed here are not "subsequential" in the sense of Roche and Schabes (1997b, §1.3.8).

Adding one state to the nondeterministic automaton, we find that its minimal deterministic equivalent doubles in size, AΣ∗aΣ3:

[Figures: the 5-state nondeterministic automaton AΣ∗aΣ3 and its minimal deterministic equivalent, which has 16 states.]

Clearly, when dealing with larger grammars, this kind of explosion in states can pose insurmountable practical problems. Is there a feasible way to decide whether a string is accepted by a nondeterministic finite automaton, without exploding its size exponentially?

(5) In the first place, it is obvious that, without increasing machine size, dead states – states that do not lie on any path from an initial to a final state – can be eliminated. (In the AT&T tools, this is done by fsmconnect.)

(6) Trying one path and backtracking if it fails is the simplest strategy for recognition with a nondeterministic acceptor. (Programmers' tools like flex use this "greedy" first-path method with (hopefully limited) backtracking.)

(7) We can also use an "all paths at once," "dynamic programming" recognition method. With this method, we keep a table, a "chart" of intermediate results rather than a record of "choice points" that we might need to backtrack to.

Given any finite automaton A we extend it to an identity transducer I(A) on the language L(A). Calculating I(A) ∘ ptt(Input) is essentially identical to what is sometimes called "chart parsing," and is known to require less than O(n³) time. If a final state is reached, then Input ∈ L(A). (In the AT&T tools, this is done by fsmcompose.)
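A minimal sketch of this "all paths at once" idea (illustrative Python, not the AT&T tools): the chart is just the set of reachable (position, state) pairs, and the example machine is one reading of AΣ∗aΣ2.

```python
# Sketch of chart-style recognition: the composition I(A) o ptt(w) has states
# that pair a position in w with a state of A; w is accepted iff some
# (len(w), final state) pair is reachable.

def chart_accepts(Q, Sigma, delta, I, F, w):
    reachable = {(0, q) for q in I}
    frontier = set(reachable)
    while frontier:
        new = {(i + 1, r)
               for (i, q) in frontier if i < len(w)
               for (p, a, r) in delta if p == q and a == w[i]}
        frontier = new - reachable
        reachable |= new
    return any((len(w), q) in reachable for q in F)

# one reading of the Sigma* a Sigma Sigma machine
Q, Sigma = {0, 1, 2, 3}, {"a", "b"}
delta = {(0, "a", 0), (0, "b", 0), (0, "a", 1),
         (1, "a", 2), (1, "b", 2), (2, "a", 3), (2, "b", 3)}
print(chart_accepts(Q, Sigma, delta, {0}, {3}, "aaba"))  # True
print(chart_accepts(Q, Sigma, delta, {0}, {3}, "abbb"))  # False
```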

(8) Recall how compositions are calculated. Given T = 〈Q1, Σ1, Σ2, δ1, I1, F1〉 and A = 〈Q2, Σ2, Σ3, δ2, I2, F2〉, define T′ = 〈Q1 × Q2, Σ1, Σ3, δ, I1 × I2, F1 × F2〉, where for all a ∈ Σ1, c ∈ Σ3, q1, r1 ∈ Q1, q2, r2 ∈ Q2,

([q1, q2], a, c, [r1, r2]) ∈ δ iff for some b ∈ Σ2, (q1, a, b, r1) ∈ δ1 and (q2, b, c, r2) ∈ δ2.

Consider for example I(AΣ∗aΣ2) ∘ ptt({aaba}). In effect, in attempting to build this machine, we are asking: does I(AΣ∗aΣ2) have aaba as a possible output? In constructing this automaton, we consider no paths with length longer than 4, and eventually we will find the only live path:

[Figures: the machine I(AΣ∗aΣ2) ∘ ptt({aaba}), whose states pair a position in aaba with a state of the automaton, and its only live path, which reads a, a, b, a and ends in a final state.]

Considering I(AΣ∗aΣ3) ∘ ptt({aaaba}), we find that the problem has not doubled, the way the determinized version of AΣ∗aΣ3 does:

[Figures: the machine I(AΣ∗aΣ3) ∘ ptt({aaaba}) and its only live path; the construction has grown only slightly.]

(9) Now let's return to the question of how to use nondeterministic transducers. We know that they cannot generally be determinized.

First of all, we can imagine cases worse than the machines shown in (1) and (3). Those machines both define functions, and furthermore both machines are unambiguous in the sense that each input labels at most one successful path from an initial state to a final state.


For example, the following transducer defines a function, but it is ambiguous because there is more than one successful path for the input xxa:

[Figure: a transducer with states 0–4 and arcs labeled x:a, x:b, a:a, and b:b, in which the input xxa labels more than one successful path.]

In (10ff) we consider the possibility that we have an ambiguous machine that computes a function. In this case, it is always possible to remove the ambiguity efficiently.

Once the ambiguity is removed, in (14ff) we explore one elegant way to compute transductions, even though the machine may still be non-deterministic, like those in (1) and (3) are. This can be done efficiently.

Finally, in (17) we briefly consider the worst case: a transducer which does not define a function.

(10) Lemma: If a transducer defines a function, it has an equivalent in which, whenever there are two different paths labeled with the same input:

(q0, a1, b1, q1), (q1, a2, b2, q2), . . . , (qn−1, an, bn, qn)
(q0, a1, b′1, q′1), (q′1, a2, b′2, q′2), . . . , (q′n−1, an, b′n, q′n)

then there is some point j where bj ≠ b′j.

Proof: Treating the transducer T as an acceptor that accepts sequences of pairs, we simply apply the subset construction given in Myhill's Theorem to obtain a deterministic equivalent det(T).

(11) When we compute the equivalent det(T) for the last displayed transducer T, we see that the result is unambiguous. This will not always suffice though. Consider, for example,

[Figure: a transducer with four states and arcs labeled x:0 and x:b.]

(12) We can represent the different parses of the different paths of det(T) labeled with input a1 . . . an, where each ai is transduced to bi:

b1 • b2 • · · · • bn
b′1 • b′2 • · · · • b′n

Notice that different parses can have the same concatenation:

x • ε • x
x • x • ε

The different parses can be ordered, even when they have the same concatenation. For example, in one natural alphabetic order where w < x:

(w • ε • x) < (x • ε • x) < (x • x • ε)

(13) Theorem: (Eilenberg) If a transducer defines a function, it has an unambiguous equivalent.

We construct the unambiguous equivalent by selecting the minimal paths. We sketch how this can be done (details in Roche and Schabes, §1.3.6). The equivalent machine has states which are pairs (x, S) of states x and sets of states S of the original machine, where the set of states S contains all states strictly smaller than x which could have been reached with the same inputs that lead to x. To make sure we construct only minimal paths, we block the addition of states (x, S) when x ∈ S, since this indicates that a state labeled with x can be reached with the same input along a strictly smaller path. A state (x, S) is initial if x is initial in the original machine, and the state is final if x is final in the original machine.

Applying this method to the last displayed transducer we obtain the following.

[automaton diagram: states (0, {}), (1, {}), (2, {1}) and (3, {}), with arcs labeled x:0 and x:b]

No transition is added from (2, {1}) to (3, {3}) since 3 ∈ {3} – that is, there is a strictly smaller path to state 3 in the original machine than the one that goes through 2. The resulting dead state could be pruned away.

(14) Now we turn to our original, first question: how to compute the transductions defined by nondeterministic machines like the ones shown in (1) and (3), unambiguous machines that define functions, but which are not deterministic.

Schützenberger (1961) proposes an efficient approach which is also described in Roche and Schabes (1997b, §1.3.10).

(15) For any automaton A, let det(A) be the deterministic automaton obtained with the subset construction given in Myhill's Theorem.

For any automaton A, let rev(A) be the result of reversing all the transitions (q1, a, q2) in A, so that rev(A) has just the corresponding transitions (q2, a, q1), and interchanging I and F. For any transducer T, let 1(T) be the "first projection" of T, that is, the finite acceptor obtained by removing the output from each arc. And let 2(T) be the "second projection" of T.

(16) A bimachine contains a pair of finite automata, one of which, in effect, processes the input in reverse. The finite automata in a bimachine have no final states.

Given a transducer T = (Σ1, Σ2, δ, I, F) that defines a partial function T : Σ1∗ → Σ2∗, the bimachine bi(T) is given by two alphabets, two finite automata, and an "emission" function:

(Σ1, Σ2, A1, A2, ∆)

where

A1 = (Σ1, Q1, I1, F1, δ1) = det(1(T))
A2 = (Σ1, Q2, I2, F2, δ2) = det(rev(1(T))),
∆ : Q1 × Σ1ε × Q2 → Σ2∗, where for all S1 ∈ Q1, S2 ∈ Q2, a ∈ Σ1,

∆(S1, a, S2) = b iff there are q1, q2 ∈ Q, (q1, a, b, q2) ∈ δ, q1 ∈ S1, q2 ∈ S2.

We extend ∆ to strings as follows:

∆(q1, ε, q2) = ε for all q1 ∈ Q1, q2 ∈ Q2;
∆(q1, wa, q2) = ∆(q1, w, δ2(q2, a)) ∆(δ1(q1, w), a, q2).

The emission function ∆ can be represented as a table, and then we can compute the original transduction by finding a successful path through A1, then going through A2 in reverse and emitting the output.
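A minimal sketch of how the bimachine is run, assuming deterministic transition dictionaries d1 and d2 for A1 and for the reversed machine A2, and an emission table em indexed by (left state, symbol, right state); all names here are illustrative:

    def bimachine_output(word, i1, d1, i2, d2, em):
        # Forward pass: the A1 state reached before each position.
        left = [i1]
        for a in word:
            left.append(d1[(left[-1], a)])
        # Backward pass: the A2 state reached after reading the suffix
        # beyond each position, in reverse.
        right = [i2] * (len(word) + 1)
        for i in range(len(word) - 1, -1, -1):
            right[i] = d2[(right[i + 1], word[i])]
        # Emit: position i is decided by its left context (left[i]) and
        # its right context (right[i + 1]).
        return ''.join(em[(left[i], word[i], right[i + 1])]
                       for i in range(len(word)))

This mirrors the recursive extension of ∆ given above: one deterministic pass forward, one deterministic pass backward, then a table lookup per symbol.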

(17) We are now in a position to understand what can be done if we want to compute the transduction of a string when the transduction is not a function at all. As in the case (7) of acceptors, we simply intersect the prefix acceptor of the string with the domain of the transducer, or equivalently, compose the prefix tree transducer of the string with the transducer. As already noted, this "chart parsing" step is efficient (better than O(n³)) and yields a machine that relates the input string to everything the transducer relates it to.

5 One level phonology

5.1 Bird and Ellison 1994

(1) a. Following Johnson (1972) and others, we saw that rewrite rules can be represented as transducers mapping the left side to the right side. These transducers can be composed.

b. Following Koskenniemi (1983), Karttunen (1991) and others, we saw that we can obtain a more powerful rule system, possibly one that does not need rules to apply iteratively at all, by defining transducers that constrain underlying:surface representations, in "two-level rules."

The intersection of the transductions defined by a set of two-level rules is called a two-level automaton.

Two-level automata are very expressive, capable of defining languages that are more powerful than the well-known grammars in syntax, and capable of defining intractable problems.

c. We saw that, unlike two-level automata, determining whether a sequence is accepted by a non-deterministic finite automaton is perfectly tractable, though of course not as time-efficient as with deterministic finite automata.

In particular, the standard method for computing a composition of two automata can be used to find intersections with reasonable efficiency, providing a way to check whether an arbitrary automaton accepts an arbitrary string. This procedure is a "dynamic programming" method – we keep a record of all the paths through the nondeterministic machine, up to the point when we identify a successful one, and then we can stop.

(2) Output can be associated with states rather than arcs. "Markov models" and "Moore machines" typically associate output with states.

(3) A Moore machine is given by

Q          a finite set of states
Σ1         a finite input alphabet
Σ2         a finite output alphabet
δ : Q × Σ1 → Q    a deterministic transition function
λ : Q → Σ2        the output function
{q1}       a singleton set of initial states

(See e.g. Hopcroft and Ullman 1979, §2.7; Savage 1976, §4.)

(4) Given a Moore machine (Q, Σ1, Σ2, δ0, λ, q1), the following transducer accepts the same language: T = (Q, Σ1, Σ2, δ, q0, Q), where q0 ∈ Q and:

(qi, a, b, qj) ∈ δ iff either i > 0, qj = δ0(qi, a), b = λ(δ0(qi, a)),
or i = 0, a = ε, b = λ(q1).

The outputs on the arcs of the transducer correspond to the outputs of their destination states.
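A minimal sketch of the conversion in (4), with the Moore machine given by dictionaries delta0 and lam; the state name 'q0' and the function name are my own:

    def moore_to_transducer(Q, delta0, lam, q1):
        # Add a fresh initial state q0 that emits the output of the Moore
        # machine's initial state q1 on an epsilon transition, then copy each
        # Moore transition, pairing it with the output of its target state.
        q0 = 'q0'
        delta = {(q0, '', lam[q1], q1)}
        for (qi, a), qj in delta0.items():
            delta.add((qi, a, lam[qj], qj))
        # states, transitions, initial states, final states
        return {q0} | set(Q), delta, {q0}, set(Q)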

(5) Consider for example the following Moore machine in which the arcs are labeled with inputs and the states are labeled with outputs:

[Moore machine diagram: state q1 with output 0 and state q2 with output 1; each state loops on input 0, and input 1 moves the machine back and forth between q1 and q2]

This machine maps binary strings {0,1}+ to their "parity:" after the 0 output from the initial state, a 1 is output whenever the number of 1's read from the input is odd, and a 0 is output whenever the number of 1's read from the input is even. For example, we have

00101 ↦ 000110
11101 ↦ 010110

The corresponding transducer is this one:

[transducer diagram: q0 –ε:0→ q1; q1 loops on 0:0 and goes to q2 on 1:1; q2 loops on 0:1 and returns to q1 on 1:0]

(6) Bird and Ellison (1994) define "state labeled automata" (SLA), which they describe as Moore machines that ignore their input. Since these machines ignore their input, empty output requires a special treatment.

SLA also differ from Moore machines in (i) having a specified set of final states, (ii) not requiring the transitions to be deterministic, and (iii) labeling each state with a set of output symbols rather than a single symbol.

Bird and Ellison claim that SLA are well suited for implementing autosegmental phonology.

(7) The following SLA, in which every state is an initial state, does not allow two occurrences of the same symbol in a sequence – a constraint analogous to the OCP:

[SLA diagram: states labeled a, b, c, d and ε, each an initial state, with transitions only between states carrying different labels]

This nondeterministic SLA has 5 states and 12 transitions. Here is an equivalent deterministic finite automaton:

[deterministic finite automaton diagram: states 0–4, with arcs labeled a, b, c and d between them]

This automaton has 5 states and 16 transitions. (The finite automaton displayed by Bird and Ellison is slightly smaller since it is nondeterministic, with multiple initial states.) If, instead of 4 symbols, we have 8, then the SLA needs 8+1 states; each of the 8 states has an incoming arc from each of the other states. The deterministic finite automaton also has 9 states and 64 arcs. The difference is in the number of labels, since there are more arcs than states.

(8) Representing association of autosegments:

[association diagram: an A on one tier linked to a B on another tier]

This association is first visualized as "synchronized" SLA:

[synchronized SLA diagram: a three-state A-tier SLA aligned with a three-state B-tier SLA]

Then the synchronized SLA are implemented in SLA which check multiple tiers at once:

[SLA diagram: a single state labeled A∩B]

The corresponding finite automaton is this:

[automaton diagram: 0 –A∩B→ 1, with a loop labeled A∩B on state 1]

Bird and Ellison indicate that these automata should accept the following sequences, among others (presumably they are using the center dot to signify "anything"); writing each time point as a pair top-tier:bottom-tier, the sequences are:

A:B ·:B ·:B      A:B      A:B ·:B      ·:B ·:B A:B

So perhaps the intention is that the previous automata be equivalent to something like this, using a colon to separate the A and B tiers:

[automaton diagram: state 0 with a loop labeled -:-, an arc 0 –A:B→ 1, and a loop labeled -:- on state 1]

Here we use a - where Bird and Ellison used a center dot for the whole alphabet Σ. Assuming then that we can have A or 0 on the A tier, and similarly for the B tier, then the previous machine can be represented in the following fully explicit, and familiar, form:

[automaton diagram: state 0 with loops labeled 0:0, A:0 and 0:B, an arc 0 –A:B→ 1, and loops labeled 0:0, A:0, 0:B and A:B on state 1]

Instead of using explicit pairs this way, we could replace A : B with any element that is in both A and B, and similarly for all the other pairs. – Presumably this is what Bird and Ellison intend.

(9) Consider the slightly more complicated diagram:

[association diagram: A immediately followed by B on one tier, with C on another tier linked to both A and B]

The following SLA representation is provided:

[SLA diagram: a state labeled A∩C followed by a state labeled B∩C]

Bird and Ellison indicate that this means that on one tier, A should be immediately followed by B, and C occurs on another tier, overlapping on at least one point with each of A and B. So the corresponding finite automaton is presumably this:

[automaton diagram: states 0–2, with an arc 0 –A:C→ 1 and an arc 1 –B:C→ 2; state 0 loops on 0:0, A:0, B:0, 0:C and B:C, state 1 loops on A:C, and state 2 loops on 0:0, A:0, B:0, 0:C, A:C and B:C]

Bird and Ellison introduce another, "macro" notation for associations of segments like this one:

(A:1 + B:1) ⊓ (C:2)

The square intersection is presumably the "intersection" operation on automata; that is, the product construction. So the machine denoted by this formula accepts sequences in which segment A with one association is followed by segment B with one association on one tier, while C occurs on another tier with two associations.

(10) Consider one more example from Bird and Ellison:

[association diagram: A followed by B on one tier, C followed by D on another tier, with A linked to C and B linked to D]

The following SLA representation is provided:

[SLA diagram: states labeled A∩C, A∩D, B∩C and B∩D, with A∩C initial and B∩D final]

Following our previous interpretation, this corresponds to the following transducer:

[transducer diagram: states 0–4, with arcs labeled A:C, A:D, B:C and B:D between them and loops for the remaining pairs]

As Bird and Ellison suggest, this automaton will accept the following sequences, among others (again writing each time point as a top:bottom pair):

A:C B:D      A:C B:C B:D      A:C A:D B:D

(11) Now consider the more complex associations among three different tiers, depicted in the following chart:

[three-tier association chart: A B C on tier 1, E F on tier 2, D on tier 3, with association lines among them]

Clearly, the associations in this chart, though more complex, can be handled like the previous ones.7

7 Bird and Ellison do not provide an SLA diagram for this case, but introduce another notation:

tier 1: A:1:0:0  B:0:0:1  C:1:0:0
tier 2: E:2:0:0  F:0:1:0
tier 3: D:0:1:1

This notation is not fully explained, but I think it is supposed to indicate that on tier 1, A has 1 association to tier 2, B has 1 association to tier 3, and C has 1 association to tier 2; on tier 2, E has 2 associations to tier 1, and F has one association to tier 3; and finally, on tier 3, D has 1 association to tier 1 and 1 association to tier 2. This last notation is deployed in rather complex representations of rules.

5.2 A deep question: when two are like one

(12) Very brief summary:

a. rewrite rules as transducers — Johnson (1972), . . . (ordering, cyclicity issues)
b. two-level rules as transducers — Karttunen (1991), . . . (regularity lost in intersection)
c. multiple levels as one — Bird and Ellison (1994), . . .

i. Autosegmental structure, sd-sc, synchronization points are represented by tuples, and the set of sequences of tuples is regular, and hence closed under intersection and complement.

Question: Computationally, this view is completely different from viewing the machines as transducers, defining relations. So which perspective is appropriate?

ii. Autosegmental structure (and maybe even sd-sc, synchronization points) are represented by regular sets. For example, when we say that

A:C A:D B:D

is accepted, we really mean that any sequence of elements e1e2e3 is accepted if e1 ∈ (A ∩ C), e2 ∈ (A ∩ D), e3 ∈ (B ∩ D). So then we accept e1 because it is both A and C, and the process ceases to look like a transduction.

Same Question: Suppose e1 is in A because of one of its properties and in C because of another of its properties. (To push the point to the limit, maybe e1 is in A because it is a pair whose first element is A, and it is in C because its second element is C.) Then suddenly this problem looks like a transduction again, but one where we can focus on the sequences of elements rather than on the relations "projected" by describing the sequences in the form of pairs A:C A:D B:D.

Is it appropriate to focus on the regularity of the sequences rather than on the relations "projected" by the structures of the elements of the sequences? (Specific senses of "project" will be considered below.)

The question here is fundamental: two-level machines can define intractable problems; finite state acceptors are at the opposite extreme, defining only problems that can be solved in linear time. Yet it appears that Karttunen (1991) uses two-level machines to handle

N → m / _ p; elsewhere n
p → m / m _

while Bird and Ellison (1994) propose finite state acceptors for the same thing.

(13) Bird and Ellison (1994, p. 88) address this aspect of their proposal:

We have seen that Kornai (1994) finds it necessary to choose between the imposition of restrictions on autosegmental phonology and the loss of finite stateness in the transduction relationship. As it turns out, the one-level approach does not suffer from this problem. In this section, we explain why.

Note that the natural processes by which finite-state automata are combined, and therefore by which regular languages are manipulated, are not themselves regular. To see why this is so, suppose we have two regular expressions describing the first form and the root of the Arabic verb to write:

50. C V C V C

k (•∗ t)+ •∗ b

The intersection is the following regular expression:

51. k V t V b

The associations fixing the incidence of k with the first consonant slot, t with the third, and b with the final, are made by the intersection operation. The question arises as to how we can construct the associations if the same operation for Kornai's system is not regular. The operation we have applied here – intersection – cannot be performed by a regular transducer. This does not invalidate our claim to regularity. What is regular in our theory is each individual description and generalization about phonological data. That is, the descriptions we use are all regular descriptions of phonological objects.

OK, forget about the intersection operation. Why are the results of intersecting tuples regular, when the intersection of regular transductions is not always regular?

What is not regular in one-level phonology is the relationship between different formats of the same description. There is no finite-state transducer that will form the product of two regular expressions. Multilevel analyses necessarily seek to capture relationships between different descriptions, and like the product operation, these relationships cannot be captured by finite-state transducers.

What is meant here by "different formats of the same description"? Letting the nasals N = {m,n} and labials L = {m,b}, then N ∩ L = {m}. Regular automata can, in effect, equate the descriptions N ∩ L and m, since there is a machine AN that accepts just N, and a machine AL that accepts just L, and L(AN ⊓ AL) = N ∩ L = {m}.

It is worth understanding these issues. A first step is to formulate the questions clearly.

(14) The basic mathematical results show that the set of finite acceptors is closed under intersection (even when transitions are labeled with tuples), while the set of finite transducers (= acceptors with transitions labeled by pairs) is not. So one clear question can be formulated this way:

Consider a finite acceptor A of sequences of pairs. We can think of the automaton as defining a relation:

RL(A) = {(x, y) | x = a1 . . . an, y = b1 . . . bn for some (a1, b1) . . . (an, bn) ∈ L(A)}

Now consider two finite acceptors of pairs A, B. When is RL(A)∩L(B) = RL(A) ∩ RL(B)?

The answer to this question will bear on at least interpretation 12c-i of Bird and Ellison (1994), and more generally it will bear on all theories that attempt to implement correspondences between levels (tiers, underlying-surface, . . . ) with finite acceptors of tuples. It may illuminate interpretation 12c-ii too, as we will see.

(15) Example: Here is a familiar case where RL(A)∩L(B) ≠ RL(A) ∩ RL(B), which we already considered for an (apparently) different reason.

[transducer diagrams: A has a loop a:b on state 0 and an arc ε:c to final state 1, which loops on ε:c; B has a loop ε:b on state 0 and an arc a:c to final state 1, which loops on a:c; the product A ⊓ B is the single state 0 with no transitions]

L(A) = {(ε, ε), (ε, c), (ε, c)(ε, c), . . . , (a, b), (a, b)(ε, c), (a, b)(ε, c)(ε, c), . . . , (a, b)(a, b), (a, b)(a, b)(ε, c), (a, b)(a, b)(ε, c)(ε, c), . . .}

RL(A) = {(an, bnc∗) | n ≥ 0}

L(B) = {(ε, ε), (ε, b), (ε, b)(ε, b), . . . , (a, c), (ε, b)(a, c), (ε, b)(ε, b)(a, c), . . . , (a, c)(a, c), (ε, b)(a, c)(a, c), (ε, b)(ε, b)(a, c)(a, c), . . .}

RL(B) = {(an, b∗cn) | n ≥ 0}

L(A) ∩ L(B) = {(ε, ε)} = L(A ⊓ B)
RL(A)∩L(B) = {(ε, ε)}

RL(A) ∩ RL(B) = {(an, bncn) | n ≥ 0}

We considered these automata earlier, not because RL(A)∩L(B) ≠ RL(A) ∩ RL(B), but because the intersection of the transducers fails to preserve regularity.

Now we see that the intersection of the sets of sequences of pairs gives us input-output relations that can be different from the intersection of the input-output relations defined by the respective automata.

At this point we should wonder: Are the cases where RL(A)∩L(B) ≠ RL(A) ∩ RL(B) exactly the same as the cases where RL(A) ∩ RL(B) is not regular? No. (This will soon become obvious, if it is not already.)

(16) Example: Here is a case where RL(A)∩L(B) = RL(A) ∩ RL(B).

[transducer diagrams: A is a single state 0 with loops a:b and b:a; B has two states, with an arc a:b from the start state to the second state and an arc b:a back]

L(A) = {(ε, ε), (a, b), (b, a), (a, b)(a, b), (a, b)(b, a), (b, a)(a, b), (b, a)(b, a), . . .}
RL(A) = {(x1 . . . xn, y1 . . . yn) | either n = 0 or for all 0 < i ≤ n, xi, yi ∈ {a, b}, xi ≠ yi}

L(B) = {(ε, ε), (a, b)(b, a), (a, b)(b, a)(a, b)(b, a), . . .}
RL(B) = {((ab)n, (ba)n) | n ≥ 0}

L(A) ∩ L(B) = L(A ⊓ B) = L(B)
RL(A)∩L(B) = RL(B)
RL(A) ∩ RL(B) = RL(B)

(17) One more example: Another case where RL(MA)∩L(MB) = RL(MA) ∩ RL(MB). (We call our machines MA, MB in this example just so that we can avoid confusion with sets that are named A, B.) Consider the "synchronized" SLA that is obtained as an intersection:

[synchronized SLA diagram: a three-state A-tier SLA aligned with a three-state B-tier SLA]

Bird and Ellison (1994) propose that this is obtained by an SLA intersection which intersects state labels and transitions:

[SLA diagram: the intersected three-state SLA, each state labeled by the intersection of the corresponding A-tier and B-tier labels]

So let's imagine that A and B are finite sets. Suppose A = {e1, e2} and B = {e2, e3}.

MA: [transducer diagram: states 0 and 1, with loops e1:0 and e2:0 on each state, and arcs e1:1 and e2:1 from 0 to 1]

MB: [transducer diagram: the same machine with e2, e3 in place of e1, e2]

L(MA) = {(e1, 1), (e1, 1)(e1, 0), (e1, 1)(e2, 0), . . . , (e2, 1), (e2, 1)(e2, 0), (e2, 1)(e1, 0), . . . , (e1, 0)(e1, 1), (e2, 0)(e1, 1), (e1, 0)(e2, 1), (e1, 0)(e1, 1)(e1, 0), . . .}

L(MA) = (A, 0)∗ (A, 1) (A, 0)∗
RL(MA) = {(An, 0i10j) | n > 0, i + j = n − 1}

L(MB) = {(e2, 1), (e2, 1)(e2, 0), (e2, 1)(e3, 0), . . . , (e3, 1), (e3, 1)(e3, 0), . . . , (e2, 0)(e2, 1), (e3, 0)(e2, 1), (e2, 0)(e3, 1), (e2, 0)(e2, 1)(e2, 0), . . .}

L(MB) = (B, 0)∗ (B, 1) (B, 0)∗
RL(MB) = {(Bn, 0i10j) | n > 0, i + j = n − 1}

L(MA) ∩ L(MB) = {(e2, 1), (e2, 1)(e2, 0), (e2, 0)(e2, 1), (e2, 0)(e2, 1)(e2, 0), . . .}

L(MA) ∩ L(MB) = (e2, 0)∗ (e2, 1) (e2, 0)∗
RL(MA)∩L(MB) = {(e2^n, 0i10j) | n > 0, i + j = n − 1}

RL(MA) ∩ RL(MB) = {(e2^n, 0i10j) | n > 0, i + j = n − 1}

Calculating the intersection of MA,MB as acceptors, we of course obtain:

MA ⊓ MB: [automaton diagram: states 0 and 1, with a loop e2:0 on each state and an arc e2:1 from 0 to 1]

This is a possible representation of the simple association below.

[association diagram: an A on one tier linked to a B on another tier]

This representation differs from the one we had earlier in two respects: (i) we do not accept arbitrary symbols from the initial and final state, and (ii) sequences like the following (one top:bottom pair per time point) are regarded as sequences of simple elements in intersections:

A:C B:C B:D

I think DMA is right that, w.r.t. (ii), MA ⊓ MB is closer to what Bird and Ellison (1994) intended. I leave aside the question of what to do about (i).

NB: on the simple approach sketched here, there is no requirement that the element of B ∩ C that occurs in the second position is the same element of B that occurs in the third position. This threatens the idea that a sequence of A's can be regarded as a single segment with some duration in time. Let's leave this aside for the moment and return to the main line of reasoning.

Question: Are the intersections which are done in Bird and Ellison (1994) all ones in which RL(A)∩L(B) = RL(A) ∩ RL(B)? First let's try to characterize a broad range of cases where this equality holds.

(18) Lemma: Consider finite automata A, B where ΣA and ΣB are finite alphabets of pairs. That is, ΣA ⊆ Σ1^A × Σ2^A for some finite Σ1^A, Σ2^A, and similarly for ΣB.

These are finite transducers. We already have these basic facts:

a. L(A) ∩ L(B) is always regular
(Regular languages are closed under intersection.)

b. L(A ⊓ B) = L(A) ∩ L(B)
(This is the basic result about computing intersections with the "product" machines.)

c. RL(A), RL(B), RL(A)∩L(B) = RL(A⊓B) are always finite transductions
(By the definition of transducer.)

d. it can happen that RL(A) ∩ RL(B) is not regular
(We saw this in example 15.)

e. it can happen that RL(A)∩L(B) ≠ RL(A) ∩ RL(B)
(We saw this in example 15.)

We want to clarify 18d and 18e: when do these things happen?

(19) A transducer A is same length (SL) iff for every transition (q0^A, a, b, q1^A) ∈ δA, |a| = |b|.

(20) Lemma: If x : y labels a path in an SL transducer, then |x| = |y|.

(21) Lemma: If transducers A, B are SL, so is A ⊓ B.
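A minimal sketch of checking the SL property transition by transition, using the transducer encoding of the earlier sketches (sets of (q, a, b, r) tuples, with the empty string for ε); the function name is my own:

    def is_same_length(delta):
        # A transducer is SL iff every transition's input and output labels
        # have the same length (in particular, no one-sided epsilons).
        return all(len(a) == len(b) for (_, a, b, _) in delta)

By Lemma 21, if two transducers both pass this check, their product does too.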


(22) Theorem: Consider any two SL transducers A, B where Σ1 = Σ1^A ∩ Σ1^B and Σ2 = Σ2^A ∩ Σ2^B. For any q0^A, qi^A ∈ QA, any q0^B, qj^B ∈ QB, x ∈ Σ1∗, y ∈ Σ2∗,

x : y labels a path from q0^A to qi^A in A and a path from q0^B to qj^B in B iff x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

(23) Example: Consider example (15) again:

[transducer diagrams for A, B and A ⊓ B, repeated from (15)]

The transducers A, B are not SL. Notice that (a, bc) labels a path in A and in B but not in A ⊓ B.

(24) Proof of (22):

(⇒) Assume x : y labels a path from q0^A to qi^A in SL transducer A and a path from q0^B to qj^B in SL transducer B. Show that x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

We use an induction on the length |x| (and we know |x| = |y| by Lemma 20).

(|x| = 0) By the definition of path, for any state q0^A ∈ QA there is a 0-length path from q0^A to q0^A labeled (ε, ε), and for any state q0^B ∈ QB there is a 0-length path from q0^B to q0^B labeled (ε, ε). By the definition of ⊓, for any such q0^A, q0^B there is a state (q0^A, q0^B) in A ⊓ B, and there is a 0-length path from (q0^A, q0^B) to (q0^A, q0^B) labeled (ε, ε).

(IH) The result holds for |x| ≤ k.

Assume there is a path labeled x : y from q0^A to qi^A in A and a path from q0^B to qj^B in B, where |x| = |y| = k + 1. We must show that x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

Since |x| = |y| = k + 1, there are a ∈ Σ1, b ∈ Σ2 such that x = x′a and y = y′b. Since A, B are SL, x′ : y′ labels a path from q0^A to q_{i−1}^A in A and a path from q0^B to q_{j−1}^B in B. That means:

(q_{i−1}^A, a, b, qi^A) ∈ δA    (†)
(q_{j−1}^B, a, b, qj^B) ∈ δB.

That is, the transition that accepts a must also output b, since these machines are SL.

Since |x′| ≤ k and x′ : y′ labels a path from q0^A to q_{i−1}^A and from q0^B to q_{j−1}^B, by the IH there is a path from (q0^A, q0^B) to (q_{i−1}^A, q_{j−1}^B) in A ⊓ B. But then by (†) and the definition of ⊓, x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

(⇐) This direction is trivial because the machine A ⊓ B explicitly provides the paths we need to find in A, B. That is, assume x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

This means that A ⊓ B has a path:

((q0^A, q0^B), a1, b1, (q1^A, q1^B)), . . . , ((q_{i−1}^A, q_{i−1}^B), ai, bi, (qi^A, qj^B))

where a1 . . . ai = x and b1 . . . bi = y.

Then by the definition of ⊓ there are paths

(q0^A, a1, b1, q1^A), . . . , (q_{i−1}^A, ai, bi, qi^A) in A and
(q0^B, a1, b1, q1^B), . . . , (q_{i−1}^B, ai, bi, qj^B) in B.

(25) Theorem: If A, B are such that the following condition holds, then RL(A)∩L(B) = RL(A) ∩ RL(B):

(1) x : y labels a path from q0^A to qi^A in A and a path from q0^B to qj^B in B iff x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

(26) Proof: Suppose A, B are such that (1) holds.

(⊆) This inclusion follows trivially from our definitions. Suppose x : y ∈ RL(A)∩L(B). By the definition of RL(A)∩L(B), it follows that there is some sequence (a1, b1) . . . (an, bn) in both L(A) and L(B) where a1 . . . an = x and b1 . . . bn = y. So then, by the definition of RL(A) and RL(B), x : y ∈ RL(A) and x : y ∈ RL(B). It follows that x : y ∈ RL(A) ∩ RL(B).

(⊇) Suppose x : y ∈ RL(A) ∩ RL(B). This means that there is some sequence (a0^A, b0^A) . . . (ai^A, bi^A) in L(A) such that a0^A . . . ai^A = x and b0^A . . . bi^A = y, and there is some sequence (a0^B, b0^B) . . . (aj^B, bj^B) in L(B) such that a0^B . . . aj^B = x and b0^B . . . bj^B = y.

That is, x : y labels a successful path from q0^A to qi^A in A and a successful path from q0^B to qj^B in B. By the definition of successful paths and ⊓, (q0^A, q0^B) is an initial state and (qi^A, qj^B) a final state in A ⊓ B. Since A, B respect (1), x : y labels a successful path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

So by Lemma (18b) and the definition of R, x : y ∈ RL(A)∩L(B).

(27) Corollary: (Kaplan and Kay, 1994) If A,B are SL transducers, then RL(A)∩L(B) = RL(A) ∩ RL(B).

Proof: Immediate from Theorems (22) and (25).

(28) The situation so far: In phonology or any other application of finite automata, there are three or four different ways to proceed:

a. use only acceptors of atomic symbols (whatever structures these elements might have is not "projected");

b. use acceptors of pairs (or tuples or other structured elements) but make sure that these have properties like SL which guarantee that RL(A)∩L(B) = RL(A) ∩ RL(B);

c. use acceptors of pairs (or tuples or other structured elements) where sometimes RL(A)∩L(B) ≠ RL(A) ∩ RL(B), but pay no attention to the possibly complex relations RL(A) ∩ RL(B) (this is really the same as the first option);

d. define and intersect machines in order to obtain possibly complex relations RL(A) ∩ RL(B). This is the two-level automata approach.

(29) In phonology, to take the second approach, the thing that we need to watch is insertions and deletions, since they remove the SL property. Little is said about insertions and deletions in Bird and Ellison (1994), but we will need to consider these carefully.

(30) Given the importance of deletions and insertions in phonology, it is very important to notice that the SL condition is sufficient for RL(A)∩L(B) = RL(A) ∩ RL(B), but not necessary.

That is, the converse of Corollary 27 does not hold. There are transducers A, B which are not SL, where nevertheless RL(A)∩L(B) = RL(A) ∩ RL(B).

A trivial case is provided by the intersection of any transducer A with itself; trivially, RL(A)∩L(A) = RL(A) ∩ RL(A) = RL(A).

However, there are nontrivial cases too.

(31) Example: Let's modify the earlier example (16) so that the machines are not SL (e in the graphs is ε):

[transducer diagrams: A is a single state 0 with loops b:a and a:ε; B is a two-state machine that alternates a:ε and b:a]

L(A) = {(ε, ε), (a, ε), (b, a), (a, ε)(a, ε), (a, ε)(b, a), (b, a)(a, ε), (b, a)(b, a), . . .}
RL(A) = {(a∗(ba∗)n, an) | n ≥ 0}

L(B) = {(ε, ε), (a, ε)(b, a), (a, ε)(b, a)(a, ε)(b, a), . . .}
RL(B) = {((ab)n, an) | n ≥ 0}

L(A) ∩ L(B) = L(A ⊓ B) = L(B)
RL(A)∩L(B) = RL(B)
RL(A) ∩ RL(B) = RL(B)

(32) Transducers A, B are consistently labeled (CL) iff whenever x : y labels a path from q0^A to qi^A and from q0^B to qj^B, then for any a ∈ Σ1ε, bA, bB ∈ Σ2ε,

if (qi^A, a, bA, q_{i+1}^A) ∈ δA and (qj^B, a, bB, q_{j+1}^B) ∈ δB, then |bA| = |bB|.

(33) The machines A,B in (15) are not CL, but the machines A,B in (16), (17) and (31) are.

(34) Notice that CL is a binary relation among transducers. In fact, it is an equivalence relation.

(35) Conjecture: Consider any two CL transducers A, B where Σ1 = Σ1^A ∩ Σ1^B and Σ2 = Σ2^A ∩ Σ2^B. For any q0^A, qi^A ∈ QA, any q0^B, qj^B ∈ QB, x ∈ Σ1∗, y ∈ Σ2∗,

x : y labels a path from q0^A to qi^A in A and a path from q0^B to qj^B in B iff x : y labels a path from (q0^A, q0^B) to (qi^A, qj^B) in A ⊓ B.

(36) Before attempting to establish this conjecture, we can observe immediately that it does not get us everything we want. There are many cases where we want to intersect non-CL transducers.

(37) Consider this simple example of transducers over Σ1 = Σ2 = {k, v}, where we have a transducer that says anything in Σ can change into anything in Σε, and a second transducer that says k must be deleted.

A: [a single state 0 with loops k:k, k:v, k:ε, v:v, v:k, v:ε]

B: [a single state 0 with loops v:v, v:k, v:ε, k:ε]

A ⊓ B: [a single state 0 with loops v:v, v:k, v:ε, k:ε]

These same machines could be represented in the following abbreviated form:

A: [a single state 0 with loop Σ : Σε]
B: [a single state 0 with loops k:ε and Σ−k : Σε]
A ⊓ B: [a single state 0 with loops k:ε and Σ−k : Σε]

L(A) = {(x, y)n | n ≥ 0, x ∈ Σ, y ∈ Σε}
RL(A) = {({k,v}n, {k,v}m) | m ≤ n}

L(B) = {(x, y)n | n ≥ 0, and either x ∈ (Σ − k), y ∈ Σε, or x = k, y = ε}
RL(B) = {(k∗(vk∗)n, y) | y ∈ {k,v}∗, |y| ≤ n}

L(A) ∩ L(B) = L(A ⊓ B) = L(B)
RL(A)∩L(B) = RL(B)
RL(A) ∩ RL(B) = RL(B)

So these transducers are neither SL nor CL, and yet RL(MA)∩L(MB) = RL(MA) ∩ RL(MB).

(38) Suppose that we elaborate the previous examples to allow arbitrary insertions in both transducers A, B. It appears that we still have RL(MA)∩L(MB) = RL(MA) ∩ RL(MB).

(39) These last examples are similar to the following more complex example from Karttunen (1991):

in Finnish consonant gradation, intervocalic k generally disappears in the weak grade. However, between two high labial vowels k is realized as v. Consequently, the genitive of maku 'taste' is maun but the genitive of puku 'dress' is puvun.

He proposes that this generalization be captured with the following two-level rules (though he notes that the context specifications here are not quite adequate):

a. i. k:v ⇔ u _ u C [#: | C]
   ii. k:ε | k:v ⇐ V _ V C [#: | C]

The latter rule says that intervocalic k must either be deleted or realized as v. (And remember that for Karttunen, when contextual forms are not otherwise specified, they are assumed to be lexical, underlying forms.)

Do we need two-level machines for this kind of case??

To focus on just this question, let C = {k,m,v}, V = {a,u}, Σ1 = Σ2 = (C ∪ V), and simplify these rules to the following:

b. i. k:v ⇔ u: _ u:
   ii. k:ε | k:v ⇐ V: _ V:

These rules can be represented by transductions, but these transductions are neither SL nor CL. We have assumed that these transducers, depicted in abbreviated form, are something like the following:8

i: [transducer diagram: four states with arcs labeled k:v, k : Σε−v, u : Σε and Σ−u : Σε, enforcing that k is realized as v only between underlying u's]

ii: [transducer diagram: four states with arcs labeled k:ε, k:v, k : Σ−v, V : Σε and Σ−V : Σε, enforcing that intervocalic k is deleted or realized as v]

We can compute i ⊓ ii, but the result is complex. To check the result, we can create the identity transducers for some of the inputs that we are interested in, and then compose these with the intersected transducer.

8 Note that we have allowed for arbitrary deletions but not arbitrary insertions here. Allowing arbitrary insertions, the input mk could lead to the output maku, since neither of the rules given above would apply. For the moment, let's stick with the simpler case described above.

ptt(maku): 0 –m:m→ 1 –a:a→ 2 –k:k→ 3 –u:u→ 4

ptt(muku): 0 –m:m→ 1 –u:u→ 2 –k:k→ 3 –u:u→ 4

ptt(maku) ∘ (i ⊓ ii):

[transducer diagram: states 0–4; m, a and u can surface as any symbol or be deleted, but the only arc for the intervocalic k is k:ε – the k of maku can only be deleted]

ptt(muku) ∘ (i ⊓ ii):

[transducer diagram: states 0–4; again m and u can surface as any symbol or be deleted, but the arc for the k between u's is k:v – the k of muku is realized as v]

So although the machine i ⊓ ii is too complex to assess directly, we see that it is doing what we want on these inputs, even though i and ii are neither SL nor CL.

So again: is it safe to use ⊓ rather than two-level machines for these automata? The fact that i and ii are neither SL nor CL does not suffice to show that RL(i)∩L(ii) = RL(i) ∩ RL(ii).

My conjecture is: RL(i)∩L(ii) = RL(i) ∩ RL(ii). How can we show this?

(40) Lemma: For all transducers A, B, RL(A)∩L(B) ⊆ RL(A) ∩ RL(B).

Proof: This is the easy direction.

Assume x : y ∈ RL(A)∩L(B). Then by the definition of R, there are (a0, b0) . . . (an, bn) ∈ L(A) ∩ L(B) such that a0 . . . an : b0 . . . bn = x : y. But then (a0, b0) . . . (an, bn) ∈ L(A), so x : y ∈ RL(A), and (a0, b0) . . . (an, bn) ∈ L(B), so x : y ∈ RL(B).

(41) Successful paths PA in transducer A and PB in transducer B are conspiratorial iff the following conditions hold:

a. both paths are labeled xax′ : yby′;

b. there is an initial segment of PA labeled x : y going from initial state q0^A to qi^A in A, and there is an initial segment of PB labeled x : y going from initial state q0^B to qj^B in B, such that, for some a ∈ Σ1, at least one of the following conditions holds:

i. for some b ∈ Σ2, (qi^A, ε, b, q_{i+1}^A) ∈ δA and (qj^B, a, b, q_{j+1}^B) ∈ δB, and the rest of the path in A, labeled ax′ : y′, goes from q_{i+1}^A to a final state, while the rest of the path in B, labeled x′ : y′, goes from q_{j+1}^B to a final state
   (A: . . . x:y . . . ε:b, then ax′:y′;  B: . . . x:y . . . a:b, then x′:y′);

ii. (qi^A, a, ε, q_{i+1}^A) ∈ δA and, for some b ∈ Σ2, (qj^B, a, b, q_{j+1}^B) ∈ δB, and the rest of the path in A, labeled x′ : by′, goes from q_{i+1}^A to a final state, while the rest of the path in B, labeled x′ : y′, goes from q_{j+1}^B to a final state
   (A: . . . x:y . . . a:ε, then x′:by′;  B: . . . x:y . . . a:b, then x′:y′);

iii. (qi^A, a, ε, q_{i+1}^A) ∈ δA and, for some b ∈ Σ2, (qj^B, ε, b, q_{j+1}^B) ∈ δB, and the rest of the path in A, labeled x′ : by′, goes from q_{i+1}^A to a final state, while the rest of the path in B, labeled ax′ : y′, goes from q_{j+1}^B to a final state
   (A: . . . x:y . . . a:ε, then x′:by′;  B: . . . x:y . . . ε:b, then ax′:y′).

Transducers A, B are non-conspiratorial (NC) iff they have no conspiring paths.

(42) Lemma: If transducers A,B are SL or CL, they are NC.

(43) Transducers A, B in (15) are conspiratorial. Transducers i, ii in (39), and the transducers A, B in (37), (16), (17) and (31), are all NC.

(44) Theorem: If transducers A, B are NC, RL(A)∩L(B) = RL(A) ∩ RL(B).

Proof:

(⊆) By Lemma (40).

(⊇) Assume xs : ys ∈ (RL(A) ∩ RL(B)) − RL(A)∩L(B), and we will show that this yields a contradiction.

Since xs : ys ∈ (RL(A) ∩ RL(B)), there must be a successful path in A and a successful path in B labeled xs : ys. Let the labels of the transitions in any such successful paths be (a0^A, b0^A) . . . (ai^A, bi^A) ∈ L(A) and (a0^B, b0^B) . . . (aj^B, bj^B) ∈ L(B), where a0^A . . . ai^A = a0^B . . . aj^B = xs, and b0^A . . . bi^A = b0^B . . . bj^B = ys.

But since xs : ys ∉ RL(A)∩L(B), it must be that some (ak^A, bk^A) ≠ (ak^B, bk^B) for some k with 0 ≤ k ≤ i, j.

Consider the first place (the least k) where this happens. It cannot be that ak^A, ak^B ∈ Σ1 and bk^A, bk^B ∈ Σ2, because then the two successful paths would not have the same label xs : ys. At least one of ak^A, ak^B, bk^A, bk^B is empty.

We can assume w.l.o.g. that neither A nor B has transitions labeled ε : ε, so the possibilities are:

a. i. ak^A is empty; ak^B, bk^A, bk^B are not;
   ii. ak^B is empty; ak^A, bk^A, bk^B are not;
b. i. bk^A is empty; ak^A, ak^B, bk^B are not;
   ii. bk^B is empty; ak^A, bk^A, ak^B are not;
c. i. ak^A, bk^B are empty; bk^A, ak^B are not;
   ii. bk^A, ak^B are empty; ak^A, bk^B are not.

In each of a, b, c, cases i and ii differ only in the naming of A and B, so we need consider only one from each of these pairs.

(case 44a-i) In this case, since the indicated transition occurs in a successful path labeled xs : ys, it must be the case that bk^A = bk^B, the rest of the path in A is labeled ak^B x′ : y′, and the rest of the path in B is labeled x′ : y′.

(case 44b-i) In this case, since the indicated transition occurs in a successful path labeled xs : ys, it must be the case that ak^A = ak^B, the rest of the path in A is labeled x′ : bk^B y′, and the rest of the path in B is labeled x′ : y′.

(case 44c-i) In this case, since the indicated transition occurs in a successful path labeled xs : ys, it must be the case that the rest of the path in A is labeled ak^B x′ : y′, and the rest of the path in B is labeled x′ : bk^A y′.

In all possible cases, then, A, B are conspiratorial, contradicting the hypothesis of the theorem. Our assumption that there is some xs : ys ∈ (RL(A) ∩ RL(B)) − RL(A)∩L(B) must be false, and so RL(A)∩L(B) = RL(A) ∩ RL(B).

(45) Notice that there is no class of NC transducers to be closed under intersection, since being NC is a binary relation on transducers.

(46) Theorem (44) establishes conjecture (35).

Exercise: Does it establish the conjecture at the end of (39)?

(47) The converse of 44 still does not hold. That is, there are conspiratorial transducers A, B such that RL(A)∩L(B) = RL(A) ∩ RL(B).

Exercise: Provide an example to prove this.

The NC condition is sufficient but not necessary for RL(A)∩L(B) = RL(A) ∩ RL(B) – but the NC condition is much more general than SL.

6 Optimality theory: first ideas

(1) Brief summary of previous discussion:

a. Nerode characterization of finite state languages
b. rewrite rules as transducers — Johnson (1972), . . . (ordering, cyclicity issues)
c. two-level rules as transducers — Karttunen (1991), . . . (regularity lost in intersection)
d. multiple levels as one — Bird and Ellison (1994), . . .
   The NC condition is sufficient (but not necessary) for RL(A)∩L(B) = RL(A) ∩ RL(B).

Following Ellison (1994a), Eisner (1997b), and Albro (1997), we can get quite a good implementation of a good part of optimality theory using finite state machines. The basic idea is that gen can be represented by a finite state machine, and many constraints of optimality theory can be represented by finite state transducers.9 The tableau-based reasoning can then be done rigorously by calculations on these machines. Here we sketch a simple account along these lines.

6.1 A simple example from Prince & Smolensky, §6

Inputs: {C,V}+

Candidates: parses of sequences of syllables with the standard structure, but allowing arbitrary deletions and insertions

Preference: given by some ranking of the constraints:

Ons: syllables must have onsets
NoCoda: syllables must not have codas
Fillnuc: a nucleus must be filled (by an input V)
Parse: segments of the underlying form must be parsed into syllabic positions
Fillons: an onset must be filled (by an input C)

Example 1 Given the ranking

Ons >> NoCoda >> Fillnuc >> Parse >> Fillons,

the optimal parse of /VC/ is .□V.〈C〉, as illustrated by the comparisons in the following table:

9 The idea of using (string) transducers to represent constraints in optimality theory naturally extends to the idea of using tree transducers to represent constraints in syntax. This idea is very natural, and is hinted at in some formalizations of syntactic theory (Stabler, 1992; Rogers, 1995), and is fully explicit in the work of Morwietz and Cornell (1997a).

/VC/            Ons    NoCoda    Fillnuc    Parse    Fillons
☞ .□V.〈C〉                                   *        *
.□VC.                   *                             *
.VC.            *       *
.□□.〈VC〉                          *          **       *

Each constraint can be regarded as a function that maps syllable structures to natural numbers, numbers that indicate how many times the structure violates the constraint. Corresponding to each constraint, we can define a filter which applies to a set of syllable structures, yielding just the subset of structures which are optimal with respect to the constraint – that is, the structures which are mapped to the lowest value of any structures in the whole set.

Given a strict ranking of constraints,

C1 >> C2 >> . . . >> Cn,

where each constraint Ci corresponds to a filter Fi, and given an input set gen(input), the optimal structures are

Fn(. . . F2(F1(gen(input)))).

The input can be regarded as a filter on an initial set gen, so gen(input) will be given as an intersection (input ∩ gen). The constraints Ci will be given as transducers.10 And the filtering will then be done simply by pruning suboptimal paths through the transducer, yielding a finite machine that has only the optimal paths of the transducer. Calling this pruning function bp (for "best paths"), a mapping from transducers to finite machines, the optimal structures are then exactly those that are accepted by the finite machine:

bp(Cn ∩ . . . bp(C2 ∩ bp(C1 ∩ (input ∩ gen)))).

This construction repeatedly uses the standard construction of a transducer as the intersection of a transducer and a finite machine, which we repeat again here.

6.2 Rational transductions closed under intersecting their domains with regular languages

Given a finite state transducer T and a finite state machine A, we can easily construct the finite state transducer which defines the restriction of the transduction of T to the intersection Dom(T) ∩ A.

Given T = 〈Q1, Σ, Σ2, δ1, I1, F1〉 and A = 〈Q2, Σ, δ2, I2, F2〉, define T′ = 〈Q1 × Q2, Σ, Σ2, δ, I1 × I2, F1 × F2〉, where for all a ∈ Σ, b ∈ Σ2, q1, r1 ∈ Q1, q2, r2 ∈ Q2,

([q1, q2], a, b, [r1, r2]) ∈ δ iff (q1, a, b, r1) ∈ δ1 and (q2, a, r2) ∈ δ2.

NB: to execute this intersection, it is important to keep in mind the "0-step path" that we have in our definition of finite automata: intuitively, there is a path from every state to itself accepting the empty string.

10 In the present use, these transducers can also be viewed as weighted finite acceptors.

6.3 Gen

We can write a right-branching grammar for sequences of syllables, allowing for the possible insertions and deletions. We will treat inserted elements and deleted elements 〈C〉, 〈V〉 as single symbols in this grammar, and we write [] for an unfilled syllabic position.

gen → ε      o → C r       r → V c       c → C end      end → . o        stop → ε
gen → . o    o → [] r      r → V end     c → [] end     end → . stop
             o → r         r → [] c
                           r → [] end
o → 〈C〉 o   r → 〈C〉 r     c → 〈C〉 c     end → 〈C〉 end
o → 〈V〉 o   r → 〈V〉 r     c → 〈V〉 c     end → 〈V〉 end

As observed earlier, when the grammar is in this form, the grammar transparently defines a corresponding finite machine, where the categories are the states of the machine, the start symbol of the grammar is the start state of the machine, the categories with empty expansions are the final states of the machine, and the non-empty productions are exactly the state transitions allowed by δ. Here and below we will regard grammars given in this form as finite machines. So in the grammar above, gen is a final state, and all the other rules have the binary, right-branching form Cat1 → a Cat2, except for the rule o → r. This latter rule can be regarded as an ε-transition, as we see in the following finite automaton:

[finite automaton diagram for gen: states gen, o, r, c, end, stop; arcs gen –.→ o, o –C→ r, o –[]→ r, o –ε→ r, r –V→ c, r –[]→ c, r –V→ end, r –[]→ end, c –C→ end, c –[]→ end, end –.→ o, end –.→ stop, with 〈C〉 and 〈V〉 loops on o, r, c and end]

We can eliminate the ε-transition o → r without changing the language recognized if we replace this rule by the six rules that expand o in all the ways that r can be expanded. So the grammar we will use is the following, where the start category is gen:

gen → ε      o → C r        r → V c       c → C end      end → . o        stop → ε
gen → . o    o → V c        r → V end     c → [] end     end → . stop
             o → V end      r → [] c
             o → [] r       r → [] end
             o → [] c
             o → [] end
o → 〈C〉 o   r → 〈C〉 r      c → 〈C〉 c     end → 〈C〉 end
o → 〈V〉 o   r → 〈V〉 r      c → 〈V〉 c     end → 〈V〉 end
o → 〈V〉 r
o → 〈C〉 r

This corresponds to the following acceptor:

[finite automaton diagram: the same machine with the ε-transition eliminated; o now has arcs V and [] directly to c and to end, and arcs 〈V〉 and 〈C〉 to r, in addition to the arcs of the previous machine]

Notice that this automaton is not deterministic, even when the empty transition is eliminated. The elimination of the empty transition introduces two ways to leave o with a deletion (〈V〉 or 〈C〉), and there are, from the previous automaton, already two ways to leave end with a dot. Converting this to a minimal, deterministic machine, the result is slightly less intuitive, and has the same number of states, so we will stick with this one for the moment.
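A minimal sketch of reading a right-branching grammar of this form as a finite machine; the rule encoding and names here are my own:

    def grammar_to_fsm(rules, start):
        # rules: list of (lhs, rhs) where rhs is [] (empty expansion),
        # [Cat2] (an epsilon transition), or [terminal, Cat2].
        delta, finals = set(), set()
        for lhs, rhs in rules:
            if not rhs:
                finals.add(lhs)                      # Cat -> ε : final state
            elif len(rhs) == 1:
                delta.add((lhs, '', rhs[0]))         # Cat1 -> Cat2 : ε-transition
            else:
                delta.add((lhs, rhs[0], rhs[1]))     # Cat1 -> a Cat2 : labeled arc
        return {start}, finals, delta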

6.4 Input

The input that we want to associate with a structure can be represented by a finite state grammar that includes all the possible ways to insert and delete material. So, for example, the input /VC/ can be represented by the following grammar, where the start category is in:

in → V i1       i1 → C i2       i2 → ε
in → 〈V〉 i1     i1 → 〈C〉 i2
in → . in       i1 → . i1       i2 → . i2
in → [] in      i1 → [] i1      i2 → [] i2

This is the machine:

[finite automaton diagram for Input: states in, i1, i2, with loops . and [] on each state, arcs V and 〈V〉 from in to i1, and arcs C and 〈C〉 from i1 to i2; i2 is final]

Notice that the language defined by this Input machine is infinite. It includes not only odd things like ....V..C, but also legitimate syllable structures like:

.VC.    .〈V〉.C.    .V〈C〉.

The language defined by the Input machine does not include expressions of category gen that do not include the input symbols, in order, though. So, for example, the following expressions do not have the category in:

.CVC.    .〈C〉VC.    .VC.VC.    .CV.

We can establish these facts by showing, for example, that ptt(.VC.) ∩ Input = ptt(.VC.), whereas ptt(.CVC.) ∩ Input = ∅.

6.5 Gen(Input) = Input ∩ Gen

Since the Input machine has 3 states and Gen has 6 states, the intersection machine has 18. We can represent it in grammatical form as follows, where the state [gen, in] is now the start category, and the states [stop, in] and [stop, i1] have no transitions, since stop has none:

[gen, in] → . [o, in]        [gen, i1] → . [o, i1]        [gen, i2] → . [o, i2]
                                                          [gen, i2] → ε

[o, in] → V [c, i1]          [o, i1] → C [r, i2]          [o, i2] → [] [r, i2]
[o, in] → V [end, i1]        [o, i1] → [] [r, i1]         [o, i2] → [] [c, i2]
[o, in] → [] [r, in]         [o, i1] → [] [c, i1]         [o, i2] → [] [end, i2]
[o, in] → [] [c, in]         [o, i1] → [] [end, i1]
[o, in] → [] [end, in]       [o, i1] → 〈C〉 [o, i2]
[o, in] → 〈V〉 [o, i1]        [o, i1] → 〈C〉 [r, i2]
[o, in] → 〈V〉 [r, i1]

[r, in] → V [c, i1]          [r, i1] → [] [c, i1]         [r, i2] → [] [c, i2]
[r, in] → V [end, i1]        [r, i1] → [] [end, i1]       [r, i2] → [] [end, i2]
[r, in] → [] [c, in]         [r, i1] → 〈C〉 [r, i2]
[r, in] → [] [end, in]
[r, in] → 〈V〉 [r, i1]

[c, in] → [] [end, in]       [c, i1] → C [end, i2]        [c, i2] → [] [end, i2]
[c, in] → 〈V〉 [c, i1]        [c, i1] → [] [end, i1]
                             [c, i1] → 〈C〉 [c, i2]

[end, in] → . [o, in]        [end, i1] → . [o, i1]        [end, i2] → . [o, i2]
[end, in] → . [stop, in]     [end, i1] → . [stop, i1]     [end, i2] → . [stop, i2]
[end, in] → 〈V〉 [end, i1]    [end, i1] → 〈C〉 [end, i2]

[stop, i2] → ε

(A couple of "dead states" are left in this representation. Notice for example that [gen, i2] is a final state, but there are no transitions to it.) Computing the machine and pruning away dead states, we get

[finite automaton diagram: the trimmed Input ∩ Gen machine, with 14 states (0–13) and arcs labeled ., [], V, 〈V〉, C and 〈C〉]

Clearly this machine still defines an infinite language. We can see that it accepts the examples mentioned in the previous section,

.VC.    .〈V〉.C.    .V〈C〉.

but does not accept:

.CVC.    .〈C〉VC.    .VC.VC.    .CV.

6.6 Ons

We can represent Ons with the finite state machine for syllables, except that we associate weights with each transition. All transitions have weight 0 except those that allow a syllable without an onset (weight-1 rules are written →1):

gen → ε      o → C r         r → V c       c → C end      end → . o        stop → ε
gen → . o    o →1 V c        r → V end     c → [] end     end → . stop
             o →1 V end      r → [] c
             o → [] r        r → [] end
             o →1 [] c
             o →1 [] end
o → 〈C〉 o   r → 〈C〉 r       c → 〈C〉 c     end → 〈C〉 end
o → 〈V〉 o   r → 〈V〉 r       c → 〈V〉 c     end → 〈V〉 end
o →1 〈V〉 r
o →1 〈C〉 r

This is just a weighted version of the gen machine:

[weighted finite automaton diagram: the gen machine with /0 or /1 weights on each arc; the arcs corresponding to onsetless syllables (V/1, []/1, 〈V〉/1, 〈C〉/1 out of o) carry weight 1, all others weight 0]

The successful path labeled .VC. has weight 1, the path .VC.VC. has weight 2, and the path .VC.VC.VC. has weight 3. On the other hand, the successful path .□VC. has weight 0, as does .□V〈C〉. Compare the first column of the tableau in §6.1.

These weights can be calculated by, for example, computing ptt(.VC.) ∩ Ons:

0 –./0→ 1 –V/1→ 2 –C/0→ 3 –./0→ 4/0

6.7 Ons(Gen(Input)) = BestSuccessfulPaths(Ons ∩ (Input ∩ Gen))

Using the method of §6.2, we can intersect Ons with (Input ∩ Gen). Since Gen and Ons are isomorphic, this intersection yields a machine of the same size and structure as (Input ∩ Gen); Ons simply adds weights to certain transitions.

[gen, in] → . [o, in]         [gen, i1] → . [o, i1]         [gen, i2] → . [o, i2]
                                                            [gen, i2] → ε

[o, in] →1 V [c, i1]          [o, i1] → C [r, i2]           [o, i2] → [] [r, i2]
[o, in] →1 V [end, i1]        [o, i1] → [] [r, i1]          [o, i2] →1 [] [c, i2]
[o, in] → [] [r, in]          [o, i1] →1 [] [c, i1]         [o, i2] →1 [] [end, i2]
[o, in] →1 [] [c, in]         [o, i1] →1 [] [end, i1]
[o, in] →1 [] [end, in]       [o, i1] → 〈C〉 [o, i2]
[o, in] → 〈V〉 [o, i1]         [o, i1] →1 〈C〉 [r, i2]
[o, in] →1 〈V〉 [r, i1]

[r, in] → V [c, i1]           [r, i1] → [] [c, i1]          [r, i2] → [] [c, i2]
[r, in] → V [end, i1]         [r, i1] → [] [end, i1]        [r, i2] → [] [end, i2]
[r, in] → [] [c, in]          [r, i1] → 〈C〉 [r, i2]
[r, in] → [] [end, in]
[r, in] → 〈V〉 [r, i1]

[c, in] → [] [end, in]        [c, i1] → C [end, i2]         [c, i2] → [] [end, i2]
[c, in] → 〈V〉 [c, i1]         [c, i1] → [] [end, i1]
                              [c, i1] → 〈C〉 [c, i2]

[end, in] → . [o, in]         [end, i1] → . [o, i1]         [end, i2] → . [o, i2]
[end, in] → . [stop, in]      [end, i1] → . [stop, i1]      [end, i2] → . [stop, i2]
[end, in] → 〈V〉 [end, i1]     [end, i1] → 〈C〉 [end, i2]

[stop, i2] → ε

The calculated machine is the following:

[weighted finite automaton diagram: the Input ∩ Gen machine of §6.5 with the Ons weights added; the arcs corresponding to onsetless syllables carry /1, all others /0]

After obtaining the transducer Ons ∩ (Input ∩ Gen), we can use Dijkstra's simple "single source best paths" algorithm (Dijkstra, 1959) to identify the cost of the best paths, and then we can prune away all suboptimal successful paths.

Given an n-node graph, Dijkstra's algorithm builds an n-cell table containing the costs of the best paths from the source node in the following way:

Given a graph with nodes V and start node S, we begin with just the start node S and tabulate the costs of the steps to immediately adjacent nodes. Non-adjacent nodes are counted as having infinite cost. Then we take the "closest" node S1 among the nodes in V−{S}, tabulate the costs of the nodes adjacent to S1, and update the minimum costs of getting from the start node to all the nodes adjacent to S1. Then we choose the closest node S2 in the set V−{S, S1} and tabulate the minimum costs of getting from start to nodes adjacent to S2, and so on until the whole graph V has been explored.

The correctness of this method is not completely obvious! See Aho, Hopcroft, and Ullman (1974, §5.10) or Cormen, Leiserson, and Rivest (1991, §25.2) for proofs of soundness and complexity results. Obviously, this method only works when all costs are non-negative. It turns out that the complexity of this algorithm is O(n²), and obviously, since it builds an array of length n, it is not a finite state computation.
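A minimal sketch of the single-source computation, using a heap rather than the linear scan described above; the graph encoding and names are illustrative, and all weights are assumed non-negative:

    import heapq

    def best_path_costs(arcs, source):
        # arcs: dict mapping a state to a list of (weight, successor) pairs.
        dist, queue = {source: 0}, [(0, source)]
        while queue:
            d, q = heapq.heappop(queue)
            if d > dist.get(q, float('inf')):
                continue                       # stale queue entry
            for w, r in arcs.get(q, []):
                if d + w < dist.get(r, float('inf')):
                    dist[r] = d + w
                    heapq.heappush(queue, (d + w, r))
        return dist                            # unreachable states are absent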

Our representation of Ons ∩ (Input ∩ Gen) has 16 states, and we sum the weights along any path from an initial state to a final state.

It is obvious what the result of eliminating the non-optimal paths will be, but it is worth using the algorithm so that we will understand how it works. The algorithm will work properly on cases where the outcome is not obvious!

Since the machine has 16 states, the algorithm will build a 16-column table representing the best paths from the start to each of those states. In order to be able to present the results on a single page, it will be convenient to refer to the states using the following numbers:

1. [gen, in]     2. [gen, i1]     3. [gen, i2]
4. [o, in]       5. [o, i1]       6. [o, i2]
7. [r, in]       8. [r, i1]       9. [r, i2]
10. [c, in]      11. [c, i1]      12. [c, i2]
13. [end, in]    14. [end, i1]    15. [end, i2]
16. [stop, i2]

At the first step, we place in the table the costs of getting to all the nodes immediately adjacent to 1:

                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
1               ∞  ∞  ∞  0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞

Now we select node 4 and look at its neighbors (11, 14, 7, 10, 13, 5, 8) to update the table:

1,4             ∞  ∞  ∞  0  0  ∞  0  1  ∞  1  1  ∞  1  1  ∞  ∞

At this point the lowest cost nodes other than 1, 4 are 5, 7, so we choose one of them to treat next:

1,4,5           ∞  ∞  ∞  0  0  ∞  0  0  0  1  1  0  1  1  ∞  ∞

At this point we choose one of 7, 9, 12. Choosing 7, we find better paths to 10, 11, 13 and 14 than we had before, so the cost of the shortest paths found so far goes down in these cases from the previous values:

1,4,5,7         ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  ∞  ∞

Continuing in this way:

1,4,5,7,8                              ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  ∞  ∞
1,4,5,7,8,9                            ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10                         ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10,11                      ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10,11,12                   ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10,11,12,13                ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10,11,12,13,14             ∞  ∞  ∞  0  0  ∞  0  0  0  0  0  0  0  0  0  ∞
1,4,5,7,8,9,10,11,12,13,14,15          ∞  ∞  ∞  0  0  0  0  0  0  0  0  0  0  0  0  0
1,4,5,7,8,9,10,11,12,13,14,15,16       ∞  ∞  ∞  0  0  0  0  0  0  0  0  0  0  0  0  0
1,4,5,7,8,9,10,11,12,13,14,15,16,6     ∞  ∞  ∞  0  0  0  0  0  0  0  0  0  0  0  0  0
1,4,5,7,8,9,10,11,12,13,14,15,16,6,2   ∞  ∞  ∞  0  0  0  0  0  0  0  0  0  0  0  0  0
1,4,5,7,8,9,10,11,12,13,14,15,16,6,2,3 ∞  ∞  ∞  0  0  0  0  0  0  0  0  0  0  0  0  0

Now we can use this table to prune out all the sub-optimal successful paths, using the following algorithm:

For each non-empty transition A → w B,

a. if B is non-final and the minimum cost of reaching B is not equal to the minimum cost of reaching A plus the cost of this transition, then eliminate the transition;

b. if B is final and the minimum cost of reaching a final state is not equal to the minimum cost of reaching A plus the cost of this transition, then eliminate the transition.
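A minimal sketch of this pruning step, assuming cost is the table of best-path costs from the start state and best_final_cost is the least cost of reaching any final state; the arc encoding and names are my own:

    def prune_suboptimal(arcs, cost, finals, best_final_cost):
        # arcs: set of (A, weight, label, B); keep an arc only if it can lie
        # on a best path, treating final states as in clause (b) above.
        kept = set()
        for (A, w, label, B) in arcs:
            target = best_final_cost if B in finals else cost.get(B, float('inf'))
            if cost.get(A, float('inf')) + w == target:
                kept.add((A, w, label, B))
        return kept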

This pruning algorithm differs from the standard best paths algorithm in its special treatment of final states. The need to treat final states, states reached by successful paths, in this way is noted by Albro (1997, §2.4.2.1).

In our example, the only states that have non-0 cost are 1, 2, and 3, that is: [gen, in], [gen, i1] and [gen, i2]. Notice that these states do not occur on the right side of any transition. So the pruning algorithm, in this particular case, eliminates all of the transitions with any cost associated with them.

After this pruning step, all remaining transitions are optimal ones, and so we can eliminate the weights from all arcs, to obtain the following finite machine:

[gen, in] → . [o, in]         [gen, i1] → . [o, i1]         [gen, i2] → . [o, i2]
                                                            [gen, i2] → ε

[o, in] → [] [r, in]          [o, i1] → C [r, i2]           [o, i2] → [] [r, i2]
[o, in] → 〈V〉 [o, i1]         [o, i1] → [] [r, i1]
                              [o, i1] → 〈C〉 [o, i2]

[r, in] → V [c, i1]           [r, i1] → [] [c, i1]          [r, i2] → [] [c, i2]
[r, in] → V [end, i1]         [r, i1] → [] [end, i1]        [r, i2] → [] [end, i2]
[r, in] → [] [c, in]          [r, i1] → 〈C〉 [r, i2]
[r, in] → [] [end, in]
[r, in] → 〈V〉 [r, i1]

[c, in] → [] [end, in]        [c, i1] → C [end, i2]         [c, i2] → [] [end, i2]
[c, in] → 〈V〉 [c, i1]         [c, i1] → [] [end, i1]
                              [c, i1] → 〈C〉 [c, i2]

[end, in] → . [o, in]         [end, i1] → . [o, i1]         [end, i2] → . [o, i2]
[end, in] → . [stop, in]      [end, i1] → . [stop, i1]      [end, i2] → . [stop, i2]
[end, in] → 〈V〉 [end, i1]     [end, i1] → 〈C〉 [end, i2]

[stop, i2] → ε

This machine represents the infinite set of candidates that remain optimal after the constraint Ons has applied. Notice that this machine does not accept .VC., because this string violates Ons while some other structures do not.

Getting the computer to calculate the result, we have:11

11 My version of the AT&T fsmprune refuses to prune the suboptimal paths from this machine, because the machine is "cyclic" in some sense. I use my own implementation of Dijkstra's algorithm to compute this result.

[finite automaton diagram: the machine for Ons(Gen(Input)), with 14 states and arcs labeled ., [], V, 〈V〉, C and 〈C〉; the onsetless paths have been pruned away]

We can see that the sub-optimal paths have been removed from this machine, so that the machine will not accept candidates that are starred in the first column of the tableau above, like .VC., while .□V.〈C〉 and .□VC. are accepted.

6.8 The other constraints

NoCoda: like gen except mark transitions that allow coda

Fillnuc: like gen except mark transitions that skip putting input V in the nucleus

Parse: like gen except mark transitions 〈V〉, 〈C〉 that “underparse” the inputFillons: like gen except mark transitions that skip putting input C in the onset

Each of these can be intersected and then pruned, in order of dominance. The result is a machine that accepts just the optimal candidates.

Applying NoCoda and pruning, we obtain:

(automaton diagram: the machine after Ons and NoCoda, states 0–10, with arcs labeled ., C, V, [ ], 〈C〉, and 〈V〉)

Applying Fillnuc and pruning:

(automaton diagram: a single path through states 0–5 with arcs labeled ., [ ], V, 〈C〉, .)

74

Stabler - Lx 236 1999

This automaton is unchanged by Parse and Fillons.

6.9 Problem set

Prince and Smolensky (1993, §6.2.1) observe that the simple account of syllable structure assumed here could be derived from more basic principles. Let's consider a possible derivation of one aspect of the simple system.

1. Modify the machine gen so that it accepts more than one consonant in onsets and codas, and more than one vowel in nuclei. Let's call this machine gen0.

2. Explicitly represent, as a finite state transducer, the constraint *Complex: no more than one C or V can associate to any syllable position.

3. Prince and Smolensky (1993, §6.2.1) suggest in effect that using gen is equivalent to having gen0 together with *Complex, since this constraint "will stand at the top of the hierarchy and will therefore be unviolated in every system under discussion."

There are a couple of claims here which we can now consider from our formal perspective:

a. Is the machine gen equivalent to BestPaths(gen0∩*Complex)? Defend your answer.

b. Extra Credit: Is it true that for all inputs,

BestPaths(BestPaths(gen0 ∩ *Complex) ∩ Input) = BestPaths(BestPaths(gen0 ∩ Input) ∩ *Complex)?

Defend your answer.


7 OTP: Primitive optimality theory

7.1 Review

(1) Eisner defines the “primitive optimality theory” framework, which Albro modifies and extends.

(2) Phonological representations are gestural scores (cf. Browman and Goldstein, Cole and Kisseberth). Autosegmental associations correspond to temporal coincidence.

(autosegmental diagram: an L tone linked to segments [+v,+t], [+v,−t], [−v,−t]; an H tone linked to a final [+v,+t] segment)

The corresponding gestural score:

H: ------[+]
L: [+++++]--
v: [+|+]-[+]
t: [+]---[+]

Note that consecutive edges on a given tier ][ are allowed to occur at a single point in time, and are denoted by: |. And since all features are monovalent, bivalent features correspond to two tiers, and we add a (typically undominated) constraint expressing the fact that the two features never coincide.

(3) Input: gen(input) is represented by a finite state machine that accepts everything compatible with the input, with tuples labeling the arcs that specify, intuitively, what is happening on every tier at a given point in time.

As in two level automata, distinct tiers represent underlying S and surface S. With the convention that arcs which allow anything else to happen on all other tiers are suppressed, gen(S) is something like this (n.b. interiors and exteriors are arbitrarily extensible):

(diagram: state 0 loops on −S and goes to state 1 on [S; state 1 loops on +S and goes to state 2 on ]S; state 2 loops on −S)

(4) The constraints of OTP are given as follows:

α → β   ∀α∃β (α and β coincide temporally at some point),
        where α ∈ the conjunction closure of edges x[, ]x and interiors x, and
        where β ∈ the disjunction closure of edges x[, ]x and interiors x

α ⊥ β   ∀α¬∃β (α and β coincide temporally at any point),
        where α, β ∈ the conjunction closure of edges x[, ]x and interiors x

(5) For example, nas → nas says that every surface nasal must overlap an underlying nasal at some point. cor ⊥ lab says that no segment is both coronal and labial.

(6) Implementation: Each constraint is represented as a deterministic weighted finite acceptor, where the arc labels are tuples which specify, intuitively, what is happening on every tier at the point when the arc is traversed. Each violation of each OTP constraint has a weight of 1.

Ranking is implemented by intersection followed by pruning sub-optimal successful paths, capturing the reasoning usually depicted in tables.
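As a rough illustration of the intersection step, here is a Python sketch of the product construction for two weighted acceptors, adding the weights of matching arcs so that constraint violations accumulate; the representation (dictionaries of weighted arcs) is assumed here for illustration and is not from the original notes:

def intersect_weighted(m1, m2):
    """Product construction for two weighted acceptors over the same alphabet.

    Each machine is (initial_state, set_of_finals, arcs), where arcs maps a
    state to a list of (label, weight, next_state).
    """
    (i1, f1, a1), (i2, f2, a2) = m1, m2
    arcs, finals, agenda, seen = {}, set(), [(i1, i2)], {(i1, i2)}
    while agenda:
        q1, q2 = agenda.pop()
        if q1 in f1 and q2 in f2:
            finals.add((q1, q2))
        for (lab1, w1, r1) in a1.get(q1, []):
            for (lab2, w2, r2) in a2.get(q2, []):
                if lab1 == lab2:
                    arcs.setdefault((q1, q2), []).append((lab1, w1 + w2, (r1, r2)))
                    if (r1, r2) not in seen:
                        seen.add((r1, r2))
                        agenda.append((r1, r2))
    return ((i1, i2), finals, arcs)

The pruning step is then the one sketched in the previous section.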


(7) Albro shows how, complicating gen, we can do some useful things. We can implement insertions and deletions by having an Insertion tier which indicates intervals in which underlying time is stopped relative to surface time, and a Deletion tier which indicates intervals in which surface time is stopped relative to underlying time. And a liberalized gen will also allow interspersive morphology.

7.2 Crossing associations prohibited

(8) This follows from the fact that two elements on a tier cannot overlap. So if on tier A, a1 < a2, and on tier B, b1 < b2, and a2 coincides with b1, it follows that a1 does not coincide with b2.

(9) This kind of reasoning can be captured by a tense logic based on "intervals" or "periods." van Benthem (1991, §I.3) defines a logic on periods with the basic relations inclusion ⊑, precedence <, and identity =. The axioms we want for the inclusion relation on periods certainly include the following:

∀xyz(x ⊑ y ⊑ z → x ⊑ z)   (transitivity)
∀x(x ⊑ x)                  (reflexivity)
∀xy(x ⊑ y ⊑ x → x = y)     (antisymmetry)

Defining overlap,

zOy =df ∃u(u ⊑ z ∧ u ⊑ y)

it follows, for example, that

∀xy(x ⊑ y → ∀z ⊑ x (zOy)).

The axioms for precedence include at least the following:

∀xy(x < y ∨y < x ∨ xOy) (near linearity)

And finally, the axioms which relate inclusion and precedence include the following:

∀xy(x < y → ¬xOy)                                  (separation)
∀xy(x < y → ∀u ⊑ x, u < y)                         (left monotonicity)
∀xy(x < y → ∀u ⊑ y, x < u)                         (right monotonicity)
∀xyz(x < y < z → ∀u((x ⊑ u ∧ z ⊑ u) → y ⊑ u))      (convexity)

In this logic, we can now prove the "no crossing" condition mentioned in (8), which can now be formulated as follows:

∀wxyz((w < x ∧ y < z ∧ xOy) → ¬wOz).

proof: Since xOy, we know by the definition of overlap that there is a u that is included in both x and y.
By (right monotonicity), since u is included in x and w < x, we know that w < u.
By (left monotonicity), since u is included in y and y < z, we know that u < z.
By (transitivity), since w < u < z, it follows that w < z.
By (separation), since w < z, ¬wOz follows.


(10) Enriching van Benthem's logic with the phonological predicates, the autosegmental structure and gestural scores shown in (2) provide models for the following proposition:

∃x y s1 s2 s3 s4 ( L(x) ∧ H(y) ∧ x < y
  ∧ v(s1) ∧ t(s1) ∧ v(s2) ∧ ¬t(s2) ∧ ¬v(s3) ∧ ¬t(s3) ∧ v(s4) ∧ t(s4)
  ∧ (s1 < s2 < s3 < s4)
  ∧ xOs1 ∧ xOs2 ∧ xOs3 ∧ ¬xOs4 ∧ ¬yOs3 ∧ yOs4 )

Notice that while the gestural score must be spelled out with a certain number of "time slices," no such thing is required in the logical specification.12

7.3 Are intersections safe in OTP?

(11) As discussed earlier, the language-automata situation gets much more complicated when we have acceptors of tuples, where the relations defined by two machines do not intersect to yield the same value as is defined by the intersection of the machines. Let's call intersections unsafe when this can happen.

(12) Intersections are safe when the machines are SL, but the OTP machines are not SL, as we can see from the fact that, for example, L[+] = L[++] = L[+++] = L, just as aε = aεε = aεεε = a. Another way to see this is to observe that an automaton is SL only if it has no path labeled x : y where |x| ≠ |y|. In OTP, the coordinates of the tuples labeling the arcs correspond to tiers, and there are paths where one tier has label x and another has label y where |x| ≠ |y|. This happens for example in (2) where one L on one tier corresponds to two voiced v segments. The problem is that, for all n > 0, +^n = + and −^n = −.

(13) We saw earlier that many pairs of machines A, B can be safely intersected even when they are not SL. If A, B are not conspiratorial, then their intersection is safe. Recall that a pair of machines like the following is conspiratorial. The machines A, B each map a to b, and so RA ∩ RB = {a : b}, but the intersection machine A ∩ B has no successful paths:

A:  0 --a:ε--> 1 --ε:b--> 2

B:  0 --ε:b--> 1 --a:ε--> 2

A ∩ B:  ∅

Unfortunately, OTP machines are conspiratorial in the same sense, since, for example, there are paths with tiers having the following labels:

A:  a: [++]-
    b: -[++]

B:  a: -[++]-
    b: --[++]

A ∩ B:  ∅

Both paths have an a coinciding with a b, but their intersection is empty. Paths like this appear in, for example, the simple gen(S) shown in (3), where tier a = S and b = S. The hint: Machine A conspires with itself, but obviously intersection with itself is safe – why is that? It is because, along with the conspiring paths, there are infinitely many others which will remove the effects of all conspiracies.

12A similar logical formulation is presented in Bird (1995, §2).


(14) Transducers A, B are dangerously conspiratorial iff for some x : y that labels a path in A and a path in B, every such path in A conspires with every such path in B. Transducers A, B are not-dangerously-conspiratorial (NDC) iff they are not dangerously conspiratorial.

(15) Theorem: If transducers A,B are NDC, RL(A)∩L(B) = RL(A) ∩ RL(B).

Proof sketch: If A, B are not conspiratorial, the result holds, so the only new cases are machines that are conspiratorial but not dangerously so. So consider this case. Assume A, B are conspiratorial, but for every pair of conspiratorial paths in A and B, respectively, labeled x : y, there is some other path labeled x : y in one of the machines that is not conspiratorial. Clearly then, the non-conspiratorial successful paths will be in the intersection and so we will have RL(A)∩L(B) = RL(A) ∩ RL(B).

(16) Conjecture: OTP intersections are safe.

Proof sketch: The OTP intersections are just those done in calculations like the following, for some Input and some constraints C1, . . . , Cn:

bsp(Cn ∩ . . . bsp(C1 ∩ gen(Input)) . . . )

So we must show that (i) the first intersection of any constraint C1 with any gen(Input) is safe, and (ii) if M = bsp(M′) for some earlier intersection M′, then the intersection of any constraint Ci with M is safe.

(i) We will show that for any C1 and any gen(Input), for every conspiratorial pair of paths, there is another path with the same label that is not conspiratorial.

Consider first the simple gen, ignoring the Del and Ins tiers for the moment.

It is clear that every constituent on every tier of gen(Input) has arbitrarily extensible interiors and exteriors.

Furthermore, every constraint allows arbitrarily extensible interiors and exteriors. (Check gen and each of the different cases of constraints.)

That means that although the machines gen(Input) and C1 will have conspiratorial paths of the sort shown in (13), these paths will never be dangerously conspiratorial, because in both machines, interiors and exteriors of all sizes are accepted.

We observed earlier that the single machine gen(S) has a pair of conspiring paths of the sort shown in (13). Now we are noticing that every machine that has one of these elements has the other. So there can be no dangerous conspiracy.

(ii) It is easy to see that the intersection of gen(Input) with C1 will also allow interiors and exteriors to be arbitrarily extensible, and that this property will also be preserved by bsp.

(17) Even though intersection is safe, and so there is no need to go to two-level machines, Eisner (1997b) shows that OTP generation is NP-hard when the size n of the problem is the number of tiers, showing that the NP-complete test for a Hamiltonian path in a graph (Garey and Johnson, 1979, §2.5) can be coded as an OTP generation problem.

If the number of tiers is fixed at some low bound, then we do not run into this explosion, but even a rather small number of tiers can lead to very large machines.


7.4 Enlarging the perspective

7.4.1 A first step towards phrasal stress

(18) Recall the Chomsky hierarchy of languages:

finite sets ⊂ regular languages ⊂ context free languages ⊂ context sensitive languages ⊂ recursively enumerable languages

(diagram: a refinement of the region between the context free and the context sensitive languages, showing TALs = CCLs = LILs = HLs, 2C-TALs = 2f-MLs, 3C-TALs = 3f-MLs, ..., MC-TALs = LCFRLs = MCFLs = MLs)

The class of finite sets and the class of regular sets are closed under intersections, but none of the larger classes are. However, all of these language classes are closed under intersection with regular sets. Bar-Hillel, Perles, and Shamir (1961, §8) observe that, given a context free grammar G and a finite automaton A, the construction of an intersection context free grammar G ∩ A is very simple.

(19) The construction is simplest for finite automata with a single initial and a single final state, but it is clear how to generalize it to sets of initial and final states.

Given a context free grammar G = (V, N, →, S) and a finite automaton A = (Σ, Q, δ, q0, qf), let G ∩ A = ((V ∩ Σ), Q × (N ∪ V) × Q, →, (q0, S, qf)) where

(q1, A, qn+1) → (q1, B1, q2) . . . (qn, Bn, qn+1)    if A → B1 . . . Bn and q1, . . . , qn+1 ∈ Q;
(q1, a, q2) → a    if (q1, a, q2) ∈ δ.
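A minimal Python sketch of this construction, assuming the CFG is given as a list of (lhs, rhs-list) rules and the automaton as a set of (q1, a, q2) triples; this representation is invented here for illustration:

from itertools import product

def intersect_cfg_fsa(rules, terminals, start, states, delta, q0, qf):
    """Bar-Hillel style construction of a grammar for L(G) ∩ L(A).

    rules: list of (A, [B1, ..., Bn]) productions of G
    delta: set of transitions (q1, a, q2) of the automaton A
    Returns the new start category and the productions over triple categories.
    """
    new_rules = []
    for (lhs, rhs) in rules:
        n = len(rhs)
        # choose a sequence of n+1 states q1 ... q(n+1) for each rule
        for qs in product(states, repeat=n + 1):
            new_lhs = (qs[0], lhs, qs[n])
            new_rhs = [(qs[i], rhs[i], qs[i + 1]) for i in range(n)]
            new_rules.append((new_lhs, new_rhs))
    # terminal productions follow the automaton's transitions
    for (q1, a, q2) in delta:
        if a in terminals:
            new_rules.append(((q1, a, q2), [a]))
    return (q0, start, qf), new_rules

As in the example below, useless categories would still have to be pruned from the result.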

(20) Theorem: Given G = (V, N, →, S) and A = (Σ, Q, δ, q0, qf), L(G ∩ A) = L(G) ∩ L(A).

(21) Consider the following simple grammar:

S → DP VP    DP → D NP    NP → N    NP → A N    VP → V    VP → V NP    VP → V PP
D → the    N → N N    N → map    N → city    N → center    N → children    N → turnips
DP → he    V → saw    V → told    V → grow
P → of    P → to    A → red    A → Belgian

And the finite automaton:

0 --he--> 1 --saw--> 2 --the--> 3 --city--> 4

Then G ∩ A, eliminating useless categories, is this:

(0,S,4) → (0,DP,1) (1,VP,4)    (2,DP,4) → (2,D,3) (3,NP,4)    (3,NP,4) → (3,N,4)
(1,VP,4) → (1,V,2) (2,NP,4)    (2,D,3) → (2,the,3)            (3,N,4) → (3,city,4)
(0,DP,1) → (0,he,1)            (1,V,2) → (1,saw,2)
(0,he,1) → he    (1,saw,2) → saw    (2,the,3) → the    (3,city,4) → city


(22) For weighted automata, we can modify the intersection operation to add the weights to the terminal productions.

If the grammar is also weighted, we can sum any weights on the terminal productions.

(23) Now, suppose that each lexical item is specified by a finite automaton specifying the underlying and surface phonological properties.

How could we say that each right branch should be stressed?

(24) Perhaps we can extend this to proposals about phrasal stress like those in Hayes (1995), Hayes (1990), Zec and Inkelas (1995), Zec and Inkelas (1990).

(We return to this possibility in §10.1.)

(25) We can adapt the Viterbi algorithm to select all optimal derivations. (See §10.1)


7.4.2 An approach to reduplication

(26) Seki et al. (1991), Vijay-Shanker and Weir (1993), Lang (1994) and others show how the method of Bar-Hillel, Perles, and Shamir (1961) extends to more expressive grammars.

(27) Seki et al. (1991) define "multiple context free grammars" (MCFGs). An MCFG allows more than one string component per constituent, and specifies how these string components are assembled. Writing X(α) for a category X whose string yield is α, instead of

S → NP VP    we have    S(st) → NP(s) VP(t).

In an MCFG, we could also have an "inverting" rule like this:

S → NP VP    we have    S(ts) → NP(s) VP(t).

In this notation, lexical rules like NP → children and VP → read can just be written NP(children) and VP(read).

Each string component on the right side must occur exactly once on the left side.

(28) The additional expressive power comes from the possibility of having multiple string components. For example, the following four rule grammar generates the "copy language" L(G) = {xx | x ∈ {a,b}*}:

S(st) → A(s,t)
A(as, at) → A(s,t)
A(bs, bt) → A(s,t)
A(ε, ε)

This grammar is called a 2-CFG, because no category has more than 2 string components. The grammar in (27) is a 1-CFG.

(29) It is convenient to let all lexical items be introduced by lexical productions, in which case the copy language above could be defined by having a category Ba(a) and a category Bb(b). The following grammar does it:

S(st) → A(s,t)
A(us, vt) → A(s,t) Ba(u) Ba(v)
A(us, vt) → A(s,t) Bb(u) Bb(v)
A(ε, ε)
Ba(a)
Bb(b)

(30) In general, a copy language over a vocabulary Σ is generated by the following grammar:

S(st) → A(s,t)
A(us, vt) → A(s,t) Bx(u) Bx(v)    for all x ∈ Σ
A(ε, ε)
Bx(x)    for all x ∈ Σ

(31) MCFLs are closed under intersection with regular sets, and the construction is fairly simple.

For each n-component category X, we use a category (q¹₁, X, q¹₂, q²₁, q²₂, . . . , qⁿ₁, qⁿ₂) that specifies the initial and final states of paths that consume the respective string components.

Then our rewrite rules assemble these transitions according to the way in which the string components are assembled. (See Seki et al. (1991) for full details.)

For S(st) → NP(s) VP(t), for example, we would have

(q0,S,qf)(st) → (q0,NP,q1)(s) (q1,VP,qf)(t).

Whereas, for the inverting rule S(ts) → NP(s) VP(t) we would have

(q0,S,qf)(ts) → (q1,NP,qf)(s) (q0,VP,q1)(t).

For lexical productions B(s), we have (q0,B,q1)(s) if and only if (q0, s, q1) ∈ δ.


(32) For weighted automata, we can modify the intersection operation to add the weights to the terminal productions.

If the grammar is also weighted, we can sum any weights on the terminal productions.

(33) Now, suppose that each lexical item is specified by a finite automaton specifying the underlying and surface phonological properties.

How could we say that “copied elements” should be identical in as many features as possible?

(34) Perhaps we can extend this to proposals about reduplication like those in McCarthy and Prince (1995a).

(35) We return to this topic in §10.2.


7.5 Locality in OTP-like theories

(36) A first reaction to using finite state automata to implement phonology might be this:

a. finite automata are too powerful, since they enforce non-local (in fact, unbounded) dependencies, and

b. in the special cases where non-local dependencies seem to exist, like reduplication, finite automata are too weak, since they cannot enforce unbounded "copying" correspondences.

(37) Response (36b) still seems appropriate, but the status of reduplication in phonology remains mysterious.

(38) Response (36a) is inappropriate, since the power of the finite automata is not used to enforce dependencies that are unbounded in time. Rather they serve as a mechanism for synchronizing the various tiers of phonological representation.

A constituent on a given tier is allowed to endure for an unbounded number of time-slices in order to span all the associated elements on lower tiers.

If this view is correct, we should be able to establish a formal claim to the effect that the memory on any given tier is strictly bounded in a certain sense. Let's explore this idea.

(39) Definition: For any n-tuple, we can use any number 1 ≤ i ≤ n to pick out the i'th element of the tuple.

For any 〈a1, . . . , an〉 we let, for example, 2(〈a1, . . . , an〉) = a2. Furthermore, we can extend this idea to automata that are labeled with tuples. For any automaton (Σⁿ, Q, δ, I, F) and any 1 ≤ i ≤ n, we let i(Σⁿ, Q, δ, I, F) = (Σ, Q, δᵢ, I, F) where (qi, a, qj) ∈ δᵢ iff ∃σ ∈ Σⁿ, i(σ) = a ∧ (qi, σ, qj) ∈ δ.

(40) Definition: For any OTP automaton A, let shrink(A) be the result of replacing every + and − in any label with ε.

(41) Let's say that a deterministic automaton has the n-Markov property iff its state can depend only on the previous n symbols accepted.13 That is, there cannot be any two different states that can be reached by paths whose labels contain the same last n symbols.

(42) A language L is n-Markov iff for every x, y ∈ Σ* and every w ∈ Σⁿ, xw ≡L yw.

A complete deterministic finite automaton is n-Markov iff for every q1, q2 ∈ Q and every w ∈ Σⁿ,

δ(q1, w) = δ(q2, w).

A deterministic finite automaton is n-Markov iff for every q1, q2 ∈ Q and every w ∈ Σⁿ, if δ(q1, w) and δ(q2, w) are defined, then

δ(q1, w) = δ(q2, w).

An automaton is Markov iff it is 1-Markov.

13 The usual "Markov property" is defined for (often probabilistic) state-labeled automata. A state-labeled automaton is said to be Markov, or 1-Markov, iff the probability of the next event may depend only on the current event, not on any other part of the history.
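Here is a small Python sketch of the deterministic-automaton version of this definition, checking whether a DFA has the n-Markov property; the transition-dictionary representation is assumed for illustration and is not from these notes:

from itertools import product

def is_n_markov(states, alphabet, delta, n):
    """True iff for every pair of states q1, q2 and every word w of length n,
    whenever delta*(q1, w) and delta*(q2, w) are both defined, they are equal."""
    def run(q, w):
        for a in w:
            if (q, a) not in delta:
                return None          # undefined
            q = delta[(q, a)]
        return q
    for w in product(alphabet, repeat=n):
        results = [run(q, w) for q in states]
        defined = [r for r in results if r is not None]
        if len(set(defined)) > 1:    # two states reach different states on w
            return False
    return True

# Example: a two-state machine where the last symbol determines the state.
delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}
print(is_n_markov({0, 1}, ['a', 'b'], delta, 1))   # True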


(43) In a Markov automaton, no two different states can be reached by the same symbol. In effect, then, the state tells you just what the last symbol was, and nothing more. Obviously, a Markov automaton over a particular alphabet Σ cannot have more than |Σ| + 1 states: a unique start state (since the machine is deterministic) and then at most one state per element of Σ. And obviously, for any n, the set of n-Markov automata over Σ is finite.

(44) Conjecture: In any OTP machine (Σⁿ, Q, δ, I, F), for any tier 1 ≤ i ≤ n, shrink(i(Σⁿ, Q, δ, I, F)) has the k-Markov property for some finite k.

(45) If that is true (for any n), the number of different relations definable by OTP machines over any given tiers is finite.

(46) (Gold) Any finite set of relations is identifiable in the limit from positive text.

(47) We return to this topic in §10.3


8 Lenient compositions: the proper treatment of OT?

8.1 Summary of main claims

(1) Karttunen (1998) presents a different treatment of OT which avoids repeated intersection and computation of best successful paths, a treatment which is argued to be more efficient:

For the sake of greater efficiency, we may "leniently compose" the GEN relation and all the constraints into a single finite-state machine that maps each underlying form directly into its optimal surface realizations, and vice versa.

(2) Since we are not using the Xerox system, the use of the non-standard notation of that system is a nuisance, but we can decode it into standard mathematical notation.

(Kaplan)14 Given two binary relations Q, R on some set, the priority union of Q with R is Q together with any pairs in R whose first elements are not in the domain of Q:

QPR =df Q ∪ {(a,b) ∈ R | a ∉ dom(Q)}.

Given two binary relations R, C on some set, the lenient composition of R with C is R ∘ C together with any pairs in R whose first elements are not in the domain of R ∘ C:

ROC =df (R ∘ C)PR = (R ∘ C) ∪ {(a,b) ∈ R | a ∉ dom(R ∘ C)}.
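A quick way to experiment with these definitions is to model relations as Python sets of pairs; the following is only an illustration of the definitions above, not anyone's implementation:

def compose(r, q):
    """R o Q = {(a,b) | exists c: (a,c) in R and (c,b) in Q}."""
    return {(a, b) for (a, c1) in r for (c2, b) in q if c1 == c2}

def dom(r):
    return {a for (a, _) in r}

def priority_union(q, r):
    """Q together with the pairs of R whose first elements are not in dom(Q)."""
    return q | {(a, b) for (a, b) in r if a not in dom(q)}

def lenient_composition(r, c):
    """(R o C), falling back to R for inputs that (R o C) does not cover."""
    return priority_union(compose(r, c), r)

# Checking fact (9g) below: with comparable C, C', bracketing still matters.
R, C, Cp = {("a", "d")}, {("d", "c")}, {("d", "c"), ("c", "b")}
print(lenient_composition(lenient_composition(R, C), Cp))   # {('a', 'b')}
print(lenient_composition(lenient_composition(R, Cp), C))   # {('a', 'c')}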

(3) In OTP each constraint Ci was implemented by a weighted acceptor Ci^w.

Karttunen proposes representing each Ci by identity relations Ci^0, . . . , Ci^n whose domains are restricted to things that do not violate Ci more than 0, . . . , n times, respectively.

(4) Karttunen's main claim: for OT constraints C1 >> . . . >> Cn that are not violated more than k times by optimal parses of Input:

bsp(Cn^w ∩ . . . ∩ bsp(C1^w ∩ gen(Input)) . . . ) = (. . . ((. . . (gen(Input) O C1^0) O . . . O C1^k) O C2^0) O . . . O Cn^k)

(5) A first puzzle:

a. Notice that if O is not associative (and we'll see that it's not), then (4) does not immediately support the quote in (1), since in (1) there is no machine that "maps each underlying form directly into its optimal surface realizations, and vice versa." For each Input in (4) we have a different machine; Input is provided at the beginning of the calculation, not at the end.

b. And there is a related point (attributed to Eisner) that is worth trying to understand, regarding the proposal that the constraint Parse be captured by lenient compositions up to a chosen N:

14 Like Karttunen (1998), the paper (Kaplan, 1987) where priority union is introduced is also a conference paper. It is amusing to note that, shortly after introducing the priority union operator, Kaplan (1987, p181) says, "Now there are a lot of technical issues, as Stuart [Shieber] has and, I'm sure, will remind me of, concerning this particular operator and what its algebraic properties are. These are important questions that I at least have not yet worked on . . . " As we will see, to understand Karttunen's claims, we must work out what some of these basic properties are!


The particular order in which the chosen parse constraints apply actually has no effect here on the final outcome because the constraint languages are in a strict subset relation: Parse ⊂ Parse1 ⊂ Parse2 ⊂ . . . ⊂ ParseN.

(6) What is Karttunen’s efficiency claim?

It can't be this: given constraints C1 >> . . . >> Cn that are not violated more than k times by optimal parses of Input, it is more efficient to compute bsp(Cn^w ∩ . . . ∩ bsp(C1^w ∩ gen(Input)) . . . ) than to compute (. . . ((. . . (gen(Input) O C1^0) O . . . O C1^k) O C2^0) O . . . O Cn^k). That claim makes no sense given the equation in (4).

So the claim might be that, using some standard algorithms for computing O vs. ∩ and bsp, the former are more efficient, but the algorithms are not mentioned.

Note that Eisner's NP-hardness claim mentioned earlier does not depend on which algorithm you choose (assuming NP ≠ P).

8.2 Compositions: the basics

(7) Recall that for any binary relations Q, R on some set A, we define

R ∘ Q = {〈a,b〉 | ∃c ∈ A, aRc ∧ cQb}.

(8) Fact: Given any binary relations Q, R on some set,

a. It can happen that R ∘ R ≠ R.
   For example, if R = {〈a,b〉} then R ∘ R = ∅.

b. It can happen that Q ∘ R ≠ R ∘ Q.
   For example, if Q = {〈a,b〉} and R = {〈b,c〉} then Q ∘ R = {〈a,c〉} ≠ R ∘ Q = ∅.

c. It can happen that Q ⊆ R and still Q ∘ R ≠ R ∘ Q.
   For example, let Q = {〈a,b〉} and R = {〈a,b〉, 〈b,c〉}.
   Then Q ∘ R = {〈a,c〉} ≠ R ∘ Q = ∅.

d. Given any binary relations Q, R, S on some set, (Q ∘ R) ∘ S = Q ∘ (R ∘ S).
   (⊆) Assume 〈a,b〉 ∈ (Q ∘ R) ∘ S. Then for some c, a(Q ∘ R)c and cSb. So for some d, aQd and dRc. But then d(R ∘ S)b and so 〈a,b〉 ∈ Q ∘ (R ∘ S).
   (⊇) is established similarly.

8.3 Lenient compositions: the basics

(9) Fact: Given any binary relations Q,R on some set,

a. It can happen that QOQ = Q.
   For example, Q = {〈a,b〉, 〈a,a〉}.

b. It can happen that QOR ≠ ROQ, even when Q ⊂ R.
   For example, let Q = {〈a,b〉} and R = {〈a,b〉, 〈b,c〉}. Then


QOR = (Q ∘ R) ∪ {(x,y) ∈ Q | x ∉ dom(Q ∘ R)}
    = {〈a,c〉} ∪ ∅
    = {〈a,c〉}

ROQ = (R ∘ Q) ∪ {(x,y) ∈ R | x ∉ dom(R ∘ Q)}
    = ∅ ∪ R
    = R

so QOR ≠ ROQ.

c. It can happen that COC′ ≠ C′OC, even when C′ ⊂ C and both C, C′ are identity relations.
   For example, let C = {〈c,c〉, 〈d,d〉} and C′ = {〈c,c〉}. (Then COC′ = C but C′OC = C′.)

d. When C′ ⊂ C and both C, C′ are identity relations, it can happen that (ROC)OC′ ≠ RO(COC′).
   For example, let R = {〈a,c〉, 〈a,d〉, 〈b,e〉}, C = {〈c,c〉, 〈d,d〉}, C′ = {〈c,c〉}.
   Then ROC = R, but ROC′ = R′ = {〈a,c〉, 〈b,e〉}. So

   (ROC)OC′ = ROC′ = R′

   and also (ROC′)OC = R′OC = R′. However, COC′ = C while C′OC = C′, and so

   RO(COC′) = ROC = R.

   The two centered equations establish the claim: O is not associative.

e. It can happen that (ROC)OC′ ≠ (ROC′)OC, even when C, C′ are identities.
   For example, let R = {〈a,b〉, 〈a,c〉}, C = {〈b,b〉}, C′ = {〈c,c〉}.

f. Sets A, B are comparable iff A ⊆ B or B ⊆ A.

g. Given binary relations R, C, C′ where C, C′ are comparable relations, it can still happen that (ROC)OC′ ≠ (ROC′)OC.
   Proof: Suppose R = {(a,d)}, C = {(d,c)} and C′ = {(d,c), (c,b)}.
   Then C, C′ are comparable.


(ROC) = (R ∘ C) ∪ {(x,y) ∈ R | x ∉ dom(R ∘ C)}
      = {(a,c)} ∪ ∅
      = {(a,c)}

(ROC)OC′ = {(a,c)} O C′
         = ({(a,c)} ∘ C′) ∪ {(x,y) ∈ {(a,c)} | x ∉ dom({(a,c)} ∘ C′)}
         = {(a,b)} ∪ ∅
         = {(a,b)}

(ROC′) = (R ∘ C′) ∪ {(x,y) ∈ R | x ∉ dom(R ∘ C′)}
       = {(a,c)} ∪ ∅
       = {(a,c)}

(ROC′)OC = {(a,c)} O C
         = ({(a,c)} ∘ C) ∪ {(x,y) ∈ {(a,c)} | x ∉ dom({(a,c)} ∘ C)}
         = ∅ ∪ {(a,c)}
         = {(a,c)}

So (ROC)OC′ ≠ (ROC′)OC.

(10) Theorem (suggested by Eisner, Karttunen)

Given binary relations R, C, C′ where C, C′ are comparable identity relations, then (ROC)OC′ = (ROC′)OC.

Proof: Consider arbitrary binary relations R, C, C′ where C, C′ are comparable identity relations.
(⊆) Suppose (a,b) ∈ (ROC)OC′.
By def O, either (i) (a,b) ∈ (ROC) ∘ C′, or (ii) (a,b) ∈ (ROC) and a ∉ dom((ROC) ∘ C′).
(i) In this case, a(ROC)b and bC′b. So again by def O, either (i.a) (a,b) ∈ (R ∘ C) or (i.b) (a,b) ∈ R and a ∉ dom(R ∘ C).
(i.a) In this case, aRb and bCb. So in this case we have aRb and bCb and bC′b. Consequently (a,b) ∈ (R ∘ C′) and so (a,b) ∈ (ROC′). Furthermore, (a,b) ∈ ((ROC′) ∘ C), so (a,b) ∈ ((ROC′)OC).
(i.b) In this case, (a,b) ∈ R but b ∉ dom(C), and since bC′b and C, C′ are comparable, C ⊂ C′.
So then (a,b) ∈ (R ∘ C′) and hence also (a,b) ∈ (ROC′). Furthermore, (a,b) ∈ ((ROC′)OC) because a ∉ dom((ROC′) ∘ C).
(ii) In this case, (a,b) ∈ (ROC) and b ∉ dom(C′). So (a,b) ∈ (ROC′) because (a,b) ∈ R and a ∉ dom(R ∘ C′). Furthermore, (a,b) ∈ ((ROC′)OC) because (a,b) ∈ ((ROC′) ∘ C).
(⊇) Similarly.


(11) Phonological theory has not indicated that there is a finite bound on the number of constraint violations that matter.

Not only that, the strategy of counting violations up to a finite bound that Karttunen proposes, following Frank and Satta (1998), will yield machines that are even bigger than the ones we've been tangling with, and they will be highly redundant.

(12) A final note: Eisner (1999) observes that there is another way to allow the whole phonology to be compiled into a single finite transducer: evaluate the constraints from left to right or right to left, removing candidates not on the basis of global score but incrementally, on the basis of their being non-optimal up to a given point in processing the input.

It remains to be seen whether this is empirically tenable.


9 Acquisition models

9.1 MDL models of acquisition

(1) The learning problem: Upon the very first exposure to a new phenomenon, such as a language, the properties of the phenomenon are unpredictable.

After exposure, the learner notices regularities and can then notice departures from the assumed regularities. But what is a "regularity"?

(2) (Bayes) A regularity is a hypothesis that makes the evidence relatively probable:

Bayes' law: P(H|E) = P(H)P(E|H)/P(E). What are the "prior probabilities" P(H)?

(3) The probability of a piece of data is related to the conciseness of available descriptions – a connection that was explored in various ways by Shannon (1948), Solomonoff (1964), Rissanen (1976), Wallace (1990), Vitányi and Li (1997ms) and others.

(MDL) A regularity is a hypothesis that makes the statement of the data short, even when the size of the hypothesis itself is included.

(4) When the measure of hypothesis and data size is "Kolmogorov complexity," we have what Vitányi and Li (1997ms) call "ideal MDL."

The Kolmogorov complexity of x is the length of the shortest program for computing x on a universal machine. Universal machines can simulate each other, and it can happen for some x, y that the Kolmogorov complexity of x is less than that of y on any universal machine.

This kind of complexity difference is "absolute" in a sense, but this kind of complexity difference is not what we need to model human learning.15

(5) When the size measures are based on current theoretical conceptions of the domains involved, we introduce a learning bias, one that will be appropriate to the degree that the theoretical conception is. (At least in principle, we can also ask whether our size measures conform to Kolmogorov's.)

Consequently, the success of such approaches to learning may have an empirical bearing (positive or negative) on theories of the domain.

Furthermore, the MDL formulation suggests a natural heuristic for theory discovery: try simpler theories first.

(6) Ellison (1994b) sketches an MDL approach to phonological learning.

One of the appeals of the approach is the non-arbitrariness of the size measures. In order to understand this, we digress briefly to pick up a couple of ideas from standard information theory.

15 Generic or "ideal" simplicity measures do not suffice, and in spite of the discussion in Ellison (1997) on Stabler (1984), Ellison is elsewhere quite clear about this. In the conclusion of Ellison (1994b), we find this point highlighted because the size measures are based on stipulated data structures called "templates":

It might be argued that this work is not significant because the learning is led by the choice of template. There is no doubt that the choice of template is crucial: the right templates must be used in the right order if the optimal constraint system is to be found. But this does not mean that this type of learning is uninteresting . . . supplying the learning system with information specifying the format of the analysis is unavoidable. The question is not whether this specification must be made, but how specific it has to be.

We will see below what these templates are and how they work.


9.2 Information: basics of the quantitative theory

(7) A sample space Ω is a set of outcomes. An event A ⊆ Ω. Letting 2^Ω be the set of subsets of Ω, Kolmogorov's axioms define a probability measure as a function P : 2^Ω → [0,1] such that (i) 0 ≤ P(A) ≤ 1 for all A ⊆ Ω, (ii) P(Ω) = 1, and (iii) P(A ∪ B) = P(A) + P(B) for any disjoint events A, B ∈ 2^Ω.

(8) A random (or stochastic) variable on Ω is a function X : Ω → R.

(9) The range of X is sometimes called the sample space of the stochastic variable X, ΩX .

(10) Suppose |ΩX| = 10, where these events are equally likely and partition Ω. If we find out that X = a, how much information have we gotten?

9 possibilities are ruled out. The possibilities are reduced by a factor of 10.

But Shannon (1948, p32) suggests that a more natural measure of the amount of information is the number of "bits." (A name from J.W. Tukey? Is it an acronym for BInary digiT?)

How many binary decisions would it take to pick one element out of the 10? We can pick 1 out of 8 with 3 bits; 1 out of 16 with 4 bits; so 1 out of 10 with 4 (and a little redundancy). More precisely, the number of bits we need is log2(10) ≈ 3.32.


Exponentiation and logarithms review

k^m · k^n = k^(m+n)
k^0 = 1
k^(−n) = 1/k^n
a^m / a^n = a^(m−n)

log_k x = y   iff   k^y = x
log_k(k^x) = x
and so: log_k k = 1
and: log_k 1 = 0
log_k(M/N) = log_k M − log_k N
log_k(MN) = log_k M + log_k N
so, in general: log_k(M^p) = p · log_k M
and we will use: log_k(1/x) = log_k x^(−1) = −1 · log_k x = −log_k x

E.g. 512 = 2^9 and so log_2 512 = 9. And log_10 3000 ≈ 3.48, since 3000 = 10^3 · 10^0.48. And 5^(−2) = 1/25, so log_5(1/25) = −2.

We'll stick to log_2 and "bits," but another common choice is log_e, where

e = lim_(x→0) (1 + x)^(1/x) = Σ_(n≥0) 1/n! ≈ 2.7182818284590452

Or, more commonly, e is defined as the x such that a unit area is found under the curve 1/u from u = 1 to u = x, that is, it is the positive root x of ∫_1^x (1/u) du = 1. In general: e^x = Σ_(k≥0) x^k/k!. And furthermore, as Euler discovered, e^(π√−1) + 1 = 0.

Using log_2 gives us "bits," log_e gives us "nats," and log_10 gives us "hartleys."

It will be useful to have images of some of the functions that will be defined in the next couple of pages.

(graph: log_2 x and −log_2 x on (0,1))

surprisal as a function of p(A): −log p(A)

octave makes drawing these graphs a trivial matter. The graph above is drawn with the command:
> x=(0.01:0.01:0.99)'; data = [x,log2(x)]; gplot [0:1] data


(graph: −x log_2 x on (0,1))

entropy of p(A) as a function of p(A): −p(A) log p(A)

> x=(0.01:0.01:0.99)'; data = [x,(-x .* log2(x))]; gplot [0:1] data

(graph: −(1−x) log_2(1−x) on (0,1))

entropy of 1 − p(A) as a function of p(A): −(1 − p(A)) log(1 − p(A))

> x=(0.01:0.01:0.99)'; data = [x,(-(1-x) .* log2(1-x))]; gplot [0:1] data

(graph: −x log_2 x − (1−x) log_2(1−x) on (0,1))

sum of previous two: −p(A) log p(A) − (1 − p(A)) log(1 − p(A))

> x=(0.01:0.01:0.99)'; data = [x,(-x .* log2(x))-(1-x).*log2(1-x)]; gplot [0:1] data


(11) If the outcomes of the binary decisions are not equally likely, though, we want to say something else. The amount of information (or "self-information" or the "surprisal") of an event A is

i(A) = log(1/P(A)) = −log P(A)

So if we have 10 possible events with equal probabilities of occurrence, so P(A) = 0.1, then

i(A) = log(1/0.1) = −log 0.1 ≈ 3.32

(12) The simple cases still work out properly.

In the easiest case where probability is distributed uniformly across 8 possibilities in ΩX, we would have exactly 3 bits of information given by the occurrence of a particular event A:

i(A) = log(1/0.125) = −log 0.125 = 3

The information given by the occurrence of ∪ΩX, where P(∪ΩX) = 1, is zero:

i(∪ΩX) = log(1/1) = −log 1 = 0

And obviously, if events A, B ∈ ΩX are independent, that is, P(AB) = P(A)P(B), then

i(AB) = log(1/P(AB)) = log(1/(P(A)P(B))) = log(1/P(A)) + log(1/P(B)) = i(A) + i(B)

(13) However, in the case where ΩX = {A, B} where P(A) = 0.1 and P(B) = 0.9, we will still have

i(A) = log(1/0.1) = −log 0.1 ≈ 3.32

That is, this event conveys more than 3 bits of information even though there is only one other option. The information conveyed by the other event is

i(B) = log(1/0.9) ≈ 0.15


9.2.1 Entropy

(14) Often we are interested not in the information conveyed by a particular event, but in the information conveyed by an information source:

. . . from the point of view of engineering, a communication system must face the problem of handling any message that the source can produce. If it is not possible or practicable to design a system which can handle everything perfectly, then the system should handle well the jobs it is most likely to be asked to do, and should resign itself to be less efficient for the rare task. This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce. And information, as used in communication theory, does just this. (Weaver, 1949, p14)

(15) For a source X, the average information of an arbitrary outcome in ΩX is

H = Σ_(A∈ΩX) P(A) i(A) = −Σ_(A∈ΩX) P(A) log P(A)

This is sometimes called the "entropy" of the random variable – the average number of bits per event. So called because each P(A) gives us the "proportion" of times that A occurs.
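For concreteness, a few lines of Python computing surprisal and entropy for the two-outcome distribution in (13); the function names are mine:

from math import log2

def surprisal(p):
    """Self-information of an event with probability p, in bits."""
    return -log2(p)

def entropy(dist):
    """Average information H = -sum_A P(A) log2 P(A), ignoring zero-probability events."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

dist = {"A": 0.1, "B": 0.9}
print(surprisal(dist["A"]))   # about 3.32 bits
print(surprisal(dist["B"]))   # about 0.15 bits
print(entropy(dist))          # about 0.47 bits per event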

(16) For a source X of an infinite sequence of events X1, X2, . . . , the entropy or average information of the source is usually given as the limit of the per-symbol average, calculated from the previous formula as:

H(X) = lim_(n→∞) Gn / n

where

Gn = −Σ_(A1∈ΩX) Σ_(A2∈ΩX) . . . Σ_(An∈ΩX) P(X1 = A1, X2 = A2, . . . , Xn = An) log P(X1 = A1, X2 = A2, . . . , Xn = An)

(17) When the space ΩX consists of independent time-invariant events whose union has probability 1, then

Gn = −n Σ_(A∈ΩX) P(A) log P(A),

and so the entropy or average information of the source is

H(X) = Σ_(A∈ΩX) P(A) i(A) = −Σ_(A∈ΩX) P(A) log P(A)

This is sometimes called the per word entropy of the process.

(18) If we use some measure other than bits, a measure that allows r-ary decisions rather than just binary ones, then we can define Hr(X) similarly except that we use log_r rather than log_2.

(19) Shannon shows that this measure of information has the following intuitive properties (as discussed also in the review of this result in Miller and Chomsky (1963, pp432ff)):


a. Adding any number of impossible events to ΩX does not change H(X).

b. H(X) is a maximum when all the events in ΩX are equiprobable.

(see the last graph above)

c. H(X) is additive, in the sense that H(Xi ∪ Xj) = H(Xi) + H(Xj) when Xi and Xj are independent.

(20) We can, of course, apply this notion of average information, or entropy, to a Markov chain X. In the simplest case, where the events are independent and identically distributed,

H(X) = Σ_(qi∈ΩX) P(qi) H(qi)

9.2.2 Codes and MDL

(21) Shannon considers the information in a discrete, noiseless message. Here, the space of possible events ΩX is given by an alphabet (or "vocabulary") Σ. A fundamental result is Shannon's result that the entropy of the source sets a lower bound on the size of the messages.

Using the definition of Hr in (18), Shannon (1948) proves the following famous theorem:

Suppose that X is a first order source with outcomes (or outputs) ΩX. Encoding the characters of ΩX in a code with characters Γ where |Γ| = r > 1 requires an average of at least Hr(X) characters of Γ per character of ΩX.

Furthermore, for any real number ε > 0, there is a code that uses an average of Hr(X) + ε characters of Γ per character of ΩX.

(22) Sayood (1996, p26) illustrates some basic points about codes with some examples. Consider:

message      code 1   code 2   code 3   code 4
a            0        0        0        0
b            0        1        10       01
c            1        00       110      011
d            10       11       111      0111
avg length   1.125    1.125    1.75     1.875

Notice that baa in code 2 is 100. But 100 is also the encoding of bc.

We might like to avoid this. Codes 3 and 4 have the nice property of unique decodability. That is, the map from message sequences to code sequences is 1-1.

(23) Consider encoding the sequence

9 11 11 11 14 13 15 17 16 17 20 21

a. To transmit these numbers in binary code, we would need 5 bits per element.

b. To transmit 9 different digits: 9, 11, 13, 14, 15, 16, 17, 20, 21, we could hope for a somewhat better code! 4 bits would be more than enough.


c. An even better idea: notice that the sequence is close to the function f(n) = n + 8 for n ∈ {1, 2, . . .}. The perturbation or residual Xn − f(n) = 0, 1, 0, −1, 1, −1, 0, 1, −1, −1, 1, 1, so it suffices to transmit the perturbation, which only requires two bits.

(24) Consider encoding the sequence,

27 28 29 28 26 27 29 28 30 32 34 36 38

This sequence does not look quite so regular as the previous case.

However, each value is near the previous one, so one strategy is to let your receiver know the starting point and then send just the changes:

(27) 1 1 -1 -2 1 2 -1 2 2 2 2 2

(25) Consider the following sequence of 41 elements, generated by a probabilistic source:

axbarayaranxarrayxranxfarxfaarxfaaarxaway

There are 8 symbols here, so we could use 3 bits per symbol.

On the other hand, we could use the following variable length code:

a  1
x  001
b  01100
f  0100
n  0111
r  000
w  01101
y  0101

With this code we need only about 2.58 bits per symbol

(26) Consider

1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2

Here we have P(1) = P(2) = 1/4 and P(3) = 1/2, so the entropy is 1.5 bits per symbol.

The sequence has length 20, so we should be able to encode it with 30 bits.

However, consider blocks of 2. P(1 2) = 1/2, P(3 3) = 1/2, and the entropy is 1 bit/symbol.

For the sequence of 10 blocks of 2, we need only 10 bits.

So it is often worth looking for structure in larger and larger blocks.
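A small Python check of this block-coding point (purely illustrative):

from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Entropy (bits per symbol) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

seq = [1,2,1,2,3,3,3,3,1,2,3,3,3,3,1,2,3,3,1,2]
blocks = [tuple(seq[i:i+2]) for i in range(0, len(seq), 2)]

print(entropy_bits(seq) * len(seq))        # 30.0 bits for the 20 single symbols
print(entropy_bits(blocks) * len(blocks))  # 10.0 bits for the 10 blocks of 2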

(27) Example: Suppose a learner hears Input = (CV,V,CVCV,CV).

a. The prefix tree acceptor pt(Input) accepts exactly this set:

(diagram: states 0–5, with transitions 0 –C→ 1, 0 –V→ 5, 1 –V→ 2, 2 –C→ 3, 3 –V→ 4; states 2, 4 and 5 are final)


b. However, there are smaller machines that accept exactly the set Input:

(diagram: a 5-state machine accepting exactly the set {V, CV, CVCV})

c. And there are even smaller machines which accept supersets of Input, such as the machine:

(diagram: a single final state 0 with loops labeled C and V, accepting every string over {C, V})

d. What size measure are we using here? We can specify any deterministic automaton x = (Q, Σ, δ, q0, F) by specifying the triples (q1, a, q2) ∈ δ, the initial state q0 ∈ Q and the final states F. Calculating the number of bits needed for this, we know that the following will suffice:

I(x) = |δ|(2(log2 |Q|) + log2 |Σ|) + log2 |Q| + |F|(log2 |Q|),

where |δ| is the number of triples in δ and |F| is the number of final states in F.16
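A direct Python transcription of this size measure (a sketch; the particular function name and argument list are mine):

from math import log2

def automaton_bits(n_states, n_symbols, n_transitions, n_finals):
    """I(x) = |delta|(2 log2|Q| + log2|Sigma|) + log2|Q| + |F| log2|Q|."""
    return (n_transitions * (2 * log2(n_states) + log2(n_symbols))
            + log2(n_states)
            + n_finals * log2(n_states))

print(automaton_bits(6, 2, 5, 3))   # bits for a 6-state, 5-transition machine with 3 finals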

e. So for example,

I(pt(Input)) = 6(2(log2 6 + log2 2)) + log2 6 + 3(log2 6)
             = 5(2(2.58 + 1)) + 2.58 + 3(2.58)
             = 46.12 bits

The latter, totally permissive machine A, on the other hand, has

I(A) = 2(2(log2 2 + log2 2)) + log2 2 + 1(log2 2)
     = 2(2(1 + 1)) + 1 + 1
     = 10 bits

f. Given a machine x = (Q, Σ, δ, q0, F), the result of merging states qi, qj ∈ Q is the machine A′ which has these two states replaced by a new state qij, as follows:

A′ = 〈(Q − {qi, qj}) ∪ {qij}, Σ, δ′, q0′, F′〉

where q0′ = qij if q0 is either qi or qj, F′ = (F − {qi, qj}) ∪ {qij}, and δ′ is the result of replacing all instances of both qi and qj by qij in all the triples (qn, a, qm) that define

16 A more careful calculation would include the extra bits needed to make sure that our encoding of the machine is uniquely decodable, and would take advantage of entropy differences between the different elements of the triples in δ.


δ. And let's define the relation MERGE which relates any automaton x to the result of merging two different states in x. Then it is obvious that

x MERGE y → I(y) < I(x)
x MERGE y → L(x) ⊆ L(y).

g. So one way to "generalize" from the data is to merge states, which amounts to assuming that two different prefixes are actually grammatically equivalent (i.e. they have the same "good finals"). But minimizing the automaton to the limit gives us the machine that accepts all strings, which, intuitively, amounts to the assumption that the language has no structure: every input is as good as any other.

h. What restricts generalization?
One idea: a priori constraints on the range of possible languages, as in, for example, Angluin (1982), Kanazawa (1996).
Another idea: restriction comes from the way the theory fits the data. The fit between Σ* and the input is too loose, the fit between the prefix acceptor and the input is too tight. An intermediate position can be described in MDL theory: the desired fit is the one that minimizes the combined size of the data and the model.
And of course, these two ideas can be combined.

i. The elements in a sequence accepted by a deterministic acceptor can be given by the path through the acceptor. The following number of bits is a rough estimate:

Σ_(i=1..m) Σ_(j=1..|si|) log2 z_(i,j),

where m is the number of sentences in the sequence of strings encoded, |si| is the length of the i'th string si, and z_(i,j) is the number of ways to leave the state reached on the j'th symbol of sentence si (counting "halt" as a "way to leave").
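A rough Python rendering of this estimate, assuming a deterministic acceptor given as a transition dictionary (the names and representation are mine, for illustration):

from math import log2

def data_bits(strings, delta, q0, finals):
    """Estimate the bits needed to encode `strings` as paths through a DFA.

    delta: dict mapping (state, symbol) -> next state
    At each state, the number of choices is the number of outgoing arcs,
    plus one for "halt" if the state is final.
    """
    def choices(q):
        out = sum(1 for (state, _) in delta if state == q)
        return out + (1 if q in finals else 0)
    total = 0.0
    for s in strings:
        q = q0
        for a in s:
            total += log2(choices(q))
            q = delta[(q, a)]
        total += log2(choices(q))     # the final "halt" choice
    return total

Applied to the prefix tree acceptor and to the totally permissive machine for the Input above, this should reproduce the 7-bit and 13·log2 3 ≈ 20.6-bit figures worked out in (j) below.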

j. Consider the prefix tree acceptor for Input = (CV, V, CVCV, CV). An element from this set can be specified by saying which binary choice is made in states 0 and 2. In state 0, let 0 indicate the upper path, and in state 2 let 0 mean stop. Then the Input is described by the following 7 bits:

00, 1, 01, 00.

In the smaller machine that accepts Σ*, there is a 3-way choice at state 0. We could, for example, let 0 mean stop, 1 mean "accept C" and 2 mean "accept V". Then the input is accepted by the following base-3 sequences:

120, 20, 12120, 120.

This sequence represents 13 3-way choices, which in bits is approximately

13 ∗ log2 3 = 20.605 bits.

More general models require longer specifications of the data.


k. With the measures above,

MDL(pt(Input)) = I(pt(Input)) + I(D, pt(Input)) = 46.12 + 7 = 53.12 bits
MDL(A) = I(A) + I(D, A) = 10 + 20.605 = 30.605 bits

So at this point, with these measures, A looks superior. However, if we hear the same data twice more, then pt(Input) looks better, since (46.12 + 21) < (10 + 61.815).

Features of the MDL model:

• infrequent inputs will not prompt generalization
• size measures define the pressures of finding generalizations versus fitting the data


9.3 Ellison’s iterative learning

(28) Ellison (1994b) proposes a learning theory for one-level constraints representing required associations among autosegments, where these are interpreted as constraints on a single (possibly complex) representation:

For example, the constraint that any vowel following a fronted vowel also be fronted:

[+front]

V V

As discussed in Bird and Ellison (1994, §4.6), this is interpreted as a logical implication which can be expressed in terms of concatenation, intersection and complements of simple acceptors, perhaps something like this:

¬(•∗(V front)(V ¬front)•∗).

(29) Various learning tasks of roughly the following kind were performed:

The learner is given a sequence of words, and the target is the set of constraints that hold on words in the language.

(30) Ellison (1994b) proposes that in addition to constraints, we should allow exceptions to the constraints.

So then the learner represents the constraints H, exceptions to the given constraints E|H and the data given the constraints and exceptions, with size:

I(H) + I(E|H) + I(D|H,E).

(31) I(H): Ellison (1994b) represents particular constraints like the one in (28) as the instantiation of a general template which imposes some feature F of a vowel on the following vowel:

V V

F

The F in this constraint is a “parameter” which can be instantiated in various ways.

Ellison represents the function from segments to parameters. For |A| segments, and m_s parameters, this can be done with |A| log2 m_s bits.

So Ellison says that the cost of specifying n templates is:

I(system) = I(T) + I(n) + n I(H|T).


(32) I(E|H): (pp14-15) An "exception mark" is "associated with each segment in the corpus where the constraint is restrictive."

Ellison proposes (but the rationale is difficult to follow)

I(E|H) = 2n + N log2 N − |R| log2 |R| − |E| log2 |E| − Σ_(x∈χ) log2 p_a(x),

where the last term is the cost of storing the exceptional segments χ.

where the last term is the cost of storing the exceptional segments χ.

(33) I(D|H,E): (pp16-17) Where a(x) is the segment that occurs at position x, which is not marked as exceptional. The "scaling factor" is the sum of the probabilities of the segments which are permitted to occur in position x:

p_x = Σ_(a∈P(x)) p_a.

Then the amount of information needed to specify the regular segments is

I(D|H,E) = Σ_(x∈R) −log2 (p_a(x) / p_x).

(34) Search: Exhaustive search is infeasible, so simulated annealing is used.

Note that this means our results, particularly failures of the learner, may be due to the search method rather than to the MDL measures.

(35) Iterative learning:

first search for a single constraint that gives a smaller description of the data than no constraints do; then search for additional constraints that give better descriptions of the data than the previous constraints did on their own.

Note that this means our results, particularly failures of the learner, may be due to the iterative method rather than to the MDL measures.

(36) test 1: Consonant clusters: which consonants F cannot be followed by other consonants?

F

C C

*

To avoid a general prohibition on consonant clusters, Ellison adds a prohibition against word-initial clusters, and against all clusters of more than 2 consonants. (NB: this is probably due to the iterative learning + heuristic search: once a prohibition on consonants is adopted, it cannot be recovered from.)


Given 602 Turkish words taken from continuous text, the system then found that using the template above with F = {b, c, d, g} – the voiced stops – minimized description length.

NB: 602 words is a small amount of data. With more data, the weight of exceptions might play a larger role in driving out the cruder generalizations.

(37) test 2: Spreading: what consonants can occur together?

F

C L

The best instantiation found set F to the voiceless consonants and L to the non-continuants.

(38) test 3: Floating: vowel harmony constraints. To indicate that F associates to the leftmost L and the spreading of vowel features, Ellison uses the following diagrams:

F

V LL

F

...

The learning system found two harmonies, even though these harmonies have many exceptions:

(diagrams: two harmony constraints, one spreading [+front] from a vowel to following vowels, and one involving [+rnd] and [+high] vowels)


(39) Questions for Ellison (1994b):

a. Why use a 0-order (i.e. frequency based) code for the exceptions and for the data, rather than a more sophisticated higher-order code?

Ellison indicates that higher order dependencies are not used because they are not recorded in the grammar, and we want the learner to adjust the grammar without the interference of these dependencies, but the grammar does not record frequencies either.

i. Would we get the same results with a frequency-insensitive (block-code) representation of the data?

ii. Is there a way to define a higher order code that is improved by having the right constraints?

That is, are grammatical regularities and frequency coding completely different and independent sources of succinctness, or can they collude to produce better results than the sum of their contributions?

b. Turning the previous questions toward the representation of the grammar, why not use a frequency-sensitive representation of the constraints?

This is motivated by, for example, the idea that a learner might be likely to assume that the sorts of classes identified in one constraint are likely to be relevant in other constraints.

c. Is the problem with identifying the first, consonant cluster constraint due just to the search strategy? This could be answered by calculating the MDL of the data with the desired constraint vs. the MDL of the data with the total ban on clusters.

If the total ban is more succinct, maybe this is because we are not dealing with enough data, or because the exceptions are not weighted highly enough.

i. What is the MDL of the total ban vs the desired ban?

ii. Is this comparison altered when we have 600,000 words instead of 600?

iii. Are there natural alternative representations of exceptions that give them a more prominent influence on the learning?

iv. Why is doing exhaustive calculations for 10^6 hypotheses too much? That is, what are the resource demands for the size calculations?

d. We might get interestingly different results with a more sensitive treatment of exceptions.

i. How is learning affected if each segment is marked for each constraint relative to which it is exceptional? Then violations of many constraints would be more serious than violations of a single constraint.

ii. Given a more sensitive treatment of exceptions, how should they be handled by a learner?

e. What is the mechanism for handling ranked constraints in this approach – Ellison's description is unclear.


9.4 Quantifying generalization

The basic idea of MDL theory is that learners are sensitive to the fit between their hypotheses and the data. Generalization involves loosening the fit slightly, acknowledging that certain other things could have happened, while still trying to be predictive. The MDL measures give us a quantitative grip on these relations.

Exploring some simple examples, we will be able to answer basic questions like the ones at the end of the previous section.

9.4.1 The data, given the hypotheses

(40) Ellison (1994b) uses a frequency-sensitive code for the data, but does not explain why he does this. So let's compare the options.

(41) First, let's use a frequency-insensitive coding ("block coding") of the data.

Let the "size" of each symbol be log2 n where n is the number of symbols which could appear in that position, given the hypothesized constraints.

(42) For example, in a lower case alphabetic string, the character a and the character z will have the same "size," even though, in most texts, these characters differ in their frequency of occurrence.

Since there are 26 possible lower case alphabetic characters, each character represents 1 choice out of 26, which is approximately the same as 4.700439718141093 choices out of 2 (since log2 26 ≈ 4.700439718141093). So when all and only lower case alphabetic characters are possible, ignoring frequencies, each character has about 4.7 bits.

(43) Let's consider our simplest OT example: the simple syllable structure grammar from Prince and Smolensky (1993, §6).

We showed in §6 that the language given by a particular constraint ranking can be defined by a finite state machine, since each gen and each constraint can be.

Ons >> NoCoda >> Fillnuc >> Parse >> Fillons.

If we allow gen to produce any non-empty sequence Σ+, and then apply the first two constraints, we obtain the machine

bsp(NoCoda ∩ bsp(Ons ∩ gen(Σ+)))


(automaton diagram: 5 states, 0–4, with arcs labeled ., C, V, [ ], 〈C〉, and 〈V〉)

This automaton accepts 6 different symbols, namely, ., C, V, [], <C>, and <V>.

(44) The syllable structures allowed by this device are distinctive, so let's use it to generate a sample text. We generate random text by treating all the arcs leaving each state as equiprobable.

. C [] . <V> [] <C> <C> V . [] V <C> . <V> [] <C> V <C> . <C> C []

. <C> [] V . [] [] . <V> [] V <C> . <C> <C> C [] <C> . <V> [] <C> <C><C> <C> <C> <C> <C> <C> <C> V . [] <C> V <C> . <V> C <C> [] . <V> <C><C> [] <C> [] . [] [] <C> . C <C> <C> V <C> <C> <C> <C> . <V> C <C> V<C> . C V . [] [] <C> . C V . <V> <C> [] V <C> <C> . <C> <C> [] [] <C>. C <C> [] . <V> <C> <C> <C> <C> C <C> V . <C> [] [] . <C> [] <C> V<C> . C V . <V> <C> [] [] . <C> [] <C> [] . [] [] . [] [] . <C> <C><C> C [] <C> <C> . <C> <C> <C> C <C> <C> V <C> <C> <C> . <V> C <C> []<C> . C V . [] <C> [] . <V> C V . C <C> [] <C> <C> . <C> [] V <C>. <C> C <C> [] . <C> C V <C> . [] V <C> . [] <C> <C> [] <C> . <C> [][] . C <C> <C> <C> <C> [] . [] <C> <C> <C> V . C [] . [] <C> V . <V><C> <C> [] [] <C> <C> . [] [] . <C> C V . <V> <C> [] V <C> . C <C> <C>[] . <V> <C> C V . <V> C V <C> . <V> [] [] . <V> <C> [] <C> <C> <C><C> [] . <V> C [] . C V <C> . C [] . [] [] . [] [] <C> <C> . C V . <V><C> <C> <C> <C> <C> [] V <C> <C> . [] <C> [] <C> . C [] <C> . <V> C

Using the size measure discussed above, 500 symbols, with no constraints, has size 500 ∗ log2 6 ≈ 1292.48 bits.

(45) With this free data from a perfect source, we can get as much as we want of it. If a learner hears 3 segments per syllable (some deleted), 3 syllables a minute, for 3 hours a day for 3 years, that's


3^4 ∗ 60 ∗ 365 = 1,773,900 segments. So it is not crazy to consider learners that hear some 2M segments. With that much data, we might expect some of the subtleties to be revealed.

2M ∗ log2 6 ≈ 5,169,925.001442312 bits.

(46) Are the results interestingly different with natural data? We should check this out.

But it looks like we have made the learning situation artificially easy, which may be appropriate as a first step: we have unadulterated data from an extremely simple source grammar.

(47) The bsp(Ons ∩ gen(Σ+)) machine is the following:

[figure: the finite state machine bsp(Ons ∩ gen(Σ+)), with states 0–5 and arcs labeled ., C, V, [], <C>, <V>]

(48) The size of the data with respect to bsp(Ons ∩ gen(Σ+)) is log2 of the number of choices at each point, but now the choices are more restricted.

Parsing the text above, the first symbol is determined, and so has size log2 1 = 0 bits. There are 4 possible symbols that could occur in the second position, so this symbol has size log2 4 = 2 bits. Continuing in this way, the size of the 2M-symbol text with respect to this machine is about 4,944,390.61499177 bits. So adding the Ons constraint here yields an improvement of about 4.3% over nearly total ignorance (i.e. knowing nothing but what symbols have occurred).

...
choices=5 size=1209.54640322171
choices=5 size=1211.8683313166
choices=6 size=1214.45329381732
choices=6 size=1217.03825631804
choices=5 size=1219.36018441293
choices=5 size=1221.68211250782
choices=5 size=1224.00404060271
choices=6 size=1226.58900310343
choices=6 size=1229.17396560415
choices=6 size=1231.75892810487
choices=6 size=1234.34389060559
...
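The per-position sizes in this trace come from summing log2 of the number of arcs available at each state. A minimal Python sketch of that computation (again with a hypothetical stand-in for the machine, since only the method matters here):

    import math

    # Hypothetical deterministic transition table (state -> {symbol: next state}),
    # standing in for bsp(Ons ∩ gen(Σ+)); the real machine is pictured above.
    delta = {
        0: {'.': 1},
        1: {'C': 2, 'V': 3, '<C>': 1, '<V>': 1},
        2: {'[]': 3, 'V': 3, '<C>': 2, '<V>': 2},
        3: {'.': 0, '[]': 3, '<C>': 3, '<V>': 3},
    }

    def size_wrt_machine(text, start=0):
        """Sum log2 of the number of choices available at each point while parsing text."""
        state, size = start, 0.0
        for sym in text:
            size += math.log2(len(delta[state]))   # how many symbols could occur here
            state = delta[state][sym]              # follow the arc for the one that did
        return size

    # The first symbol is determined (log2 1 = 0 bits), the second has 4 choices (2 bits), ...
    print(size_wrt_machine(['.', 'C', '[]', '.']))   # 6.0 bits for these four symbols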


(49) Finally, the size of the data with respect to bsp(NoCoda ∩ bsp(Ons ∩ gen(Σ+))) is about 4,498,853.34141284 bits. So adding NoCoda yields a 9.0% improvement over the previous grammar. And altogether we have a 12.9% improvement over total ignorance.

...
choices=5 size=1104.82019368463
choices=4 size=1106.82019368463
choices=6 size=1109.40515618535
choices=5 size=1111.72708428024
choices=5 size=1114.04901237512
choices=5 size=1116.37094047001
choices=4 size=1118.37094047001
choices=4 size=1120.37094047001
choices=4 size=1122.37094047001
choices=4 size=1124.37094047001
...

(50) Now, let's consider whether things are interestingly different when we use a frequency-sensitive coding of the data, as Ellison did.

In our data, we can calculate the frequencies of the respective symbols:

<C>  762182
.    380922
[]   380855
V    190786
C    190203
<V>   95052

There is a skew in these frequencies which can be exploited in a variable length code.

(51) There are various frequency-based coding schemes. A simple and provably optimal one is Huffman's. (Good introductions to Huffman coding are provided in, for example, Sayood (1996) and Nelson and Gailly (1996). The latter provides C programs for the basic coding methods.)

Given our 6 elements, <V> and C should have the longest codes, so we specify the last bits of their codes first, and their combined frequency is 285255:

[figure: <V> (95052) and C (190203) are joined under a new node with combined frequency 285255, its branches labeled 0 and 1]

Again, we calculate the final bits of the two least frequent codes, this time counting the combination of <V> and C as one of the candidates:

[figure: V (190786) and the 285255 node are joined under a new node with combined frequency 476041, branches labeled 0 and 1]


Continuing in this way:

[figures: continuing the construction, [] (380855) and . (380922) are joined under a node with frequency 761777; the 476041 and 761777 nodes are joined under a node with frequency 1237818; and finally <C> (762182) and the 1237818 node are joined under the root with frequency 2000000, completing the tree, with branches labeled 0 and 1 at each step]

We can now read the codes off the tree:

<C>  0
.    110
[]   111
V    100
C    1010
<V>  1011

Notice that no code is a prefix of any other code, so that reading a code from left to right there can never be any ambiguity in where one symbol ends and the next begins. So the following bits unambiguously specify the five symbols . C V [] . ; no boundary markers are needed.


1101010100111110

(52) With this Huffman code, the size of our “ignorant” representation of the data, without any linguistic hypotheses (except the one that the symbols occur with the relative frequencies calculated above), is reduced by 7.9% (from 5,169,925 bits) to 4,760,891 bits.
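Here is a small Python sketch of the Huffman construction just described, using the six frequencies from (50); the particular code strings it assigns may differ from the table above in the 0/1 labelling, but the code lengths, and hence the total size, come out the same.

    import heapq

    freqs = {'<C>': 762182, '.': 380922, '[]': 380855,
             'V': 190786, 'C': 190203, '<V>': 95052}

    def huffman(freqs):
        """Build a prefix code (symbol -> bit string) by repeatedly merging the two least frequent candidates."""
        heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)               # the two least frequent candidates
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: '0' + code for s, code in c1.items()}
            merged.update({s: '1' + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    code = huffman(freqs)
    print(code)                                           # <C> gets a 1-bit code; C and <V> get 4 bits
    print(sum(freqs[s] * len(code[s]) for s in freqs))    # 4760891 bits, the figure in (52)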

(53) How should we calculate the size of the representation with respect to grammatical hypotheses?

One strategy: adjust the probabilities at each point in the input. That means, in effect: every category induces its own code, in virtue of what symbols it allows to occur next.

We cannot simply invent new codes for each category, since that would put unique decodability at risk. We want to maintain the prefix property.

One strategy: at each point shift the possible codes into the shortest available code positions.

(54) As observed above, the simple, ignorant representation needs log2 6 ≈ 2.584962500721156 bits per symbol. But in an actual binary code, you either use 2 bits or 3, and so to represent each symbol in this case we need ⌈log2 6⌉ = 3 bits. That means that the simple representation of 2M symbols needs 6M binary symbols.

All figures in the following table are expressed in terms of binary symbols in this way.

The uncompressed “improvement” figures are relative to the starting size of 6,000,000 bits; the Huffman encoded “improvement” figures are relative to the starting size of 4,760,891 bits.

                                     simple      improvement   Huffman     improvement
ignorant                             6,000,000                 4,760,891
bsp(Ons ∩ gen(Σ+))                   5,142,572   14.2%         4,475,634   5.99%
bsp(NoCoda ∩ gen(Σ+))                4,952,722   17.45%        4,570,686   3.99%
bsp(NoCoda ∩ bsp(Ons ∩ gen(Σ+)))     4,380,921   26.9%         4,475,634   5.99%

• Notice that the frequency-based, Huffman coding, by itself, improves succinctness by 20.65% – much more than any single constraint of the grammar. But, surprisingly, simple grammar-based coding catches up with the Huffman coding when we use the ranked combination of constraints.

• Notice that the second ranked constraint by itself improves succinctness more than the first ranked constraint by itself, but the combination is unbeaten.

• Notice also that the frequency-based coding appears to leave the basic situation unchanged with respect to the contributions of the various constraints. So it is not clear that frequency-based compression of the data changes the learning problem in a useful way.

• It is peculiar that, using Huffman coding, bsp(Ons ∩ gen(Σ+)) and bsp(NoCoda ∩ bsp(Ons ∩ gen(Σ+))) yield representations of exactly the same size in this example. This might be due to the fact that the codes at ranks 2, 3, 4 all require 3 bits, and the codes at ranks 5, 6 both require 4 bits.

• Finally, it is important not to trust any general conclusions suggested by this particular example! General ideas require appropriately general support!

(55) The fundamental idea of grammar is that language is not uniform: at various points, various things are liable to happen next. If we are going to use a frequency-based coding strategy, it makes sense to take advantage of this idea, allowing both rank and codes to shift according to the state of the parser.


To do this, instead of just calculating the frequencies of the symbols in our corpus, we calculate the frequencies with which each symbol occurs immediately after entering a given state. For an automaton with k states and 6 symbols, we store these values in a k × 6 matrix.
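A small Python sketch of the bookkeeping: counting, for each state, how often each symbol occurs immediately after entering it. The automaton is again a hypothetical stand-in; only the counting procedure is the point.

    from collections import defaultdict

    # Hypothetical deterministic automaton (state -> {symbol: next state}).
    delta = {
        0: {'.': 1},
        1: {'C': 2, 'V': 3, '<C>': 1, '<V>': 1},
        2: {'[]': 3, 'V': 3, '<C>': 2, '<V>': 2},
        3: {'.': 0, '[]': 3, '<C>': 3, '<V>': 3},
    }

    def state_frequencies(text, start=0):
        """Return the k-by-6 matrix (sparsely stored) of per-state symbol counts."""
        counts = defaultdict(lambda: defaultdict(int))
        state = start
        for sym in text:
            counts[state][sym] += 1
            state = delta[state][sym]
        return {q: dict(c) for q, c in counts.items()}

    print(state_frequencies(['.', 'C', '[]', '.', '.']))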

9.4.2 The hypotheses

(56) The MDL idea is that we should not simply look for hypotheses that simplify the data; we should also look for hypotheses that are themselves simple. That is, a hypothesis should be adopted only if the improvement in data complexity is not entirely eaten up by the complexity of the hypothesis.

It is worth exploring how to calculate sizes in such a way that the learner will be led in appropriate directions by this criterion.

We return to this topic in §10.4.

9.4.3 The exceptions and ranked constraints

(57) The properties of the exceptions are particularly interesting, and so we should identify a learning method that attends to these.

9.4.4 Search

(58) The properties of the search strategy should be carefully explored.

Simulated annealing and similar methods can easily get caught in local minima, depending on the rate of “cooling,” etc.

9.5 Learning theories for OT

(59) The proposals of Tesar and Smolensky (1998) and Albro (1999) have been discussed elsewhere.


10 Exercises and speculations

(1) Brief outline of previous discussion:
§2 Nerode characterization of finite state languages, transducers
§3 rewrite rules as transducers (ordering, cyclicity issues)
§3 two level rules as transducers (regularity lost in intersection)
§4 computing with nondeterministic machines
§5 multiple levels as one (regularity preserved sometimes)

§§6,7 OT constraints as weighted transducers, ranking by best successful paths

§8 OT constraints as sequences of progressively more permissive transducers, ranking by lenient composition

§9 MDL perspectives on acquisition: language acquisition as learning

10.1 How phonology could meet syntax

(2) We suggested in §7.4.1 that phrasal effects on phonology might be modeled by intersecting the phrasal grammar with the phonological representation. We return to this suggestion to take one or two more steps.

One reason to explore this idea is that it will force us to provide a connection between the syntax and phonology, one that does not give up on the fact that syntax is not finite state.

But another reason for exploring this idea is that it is a step towards mixing finite state models with grammars that can enforce reduplication correspondences – a matter that is beyond even context free power, and one of the troubling shortcomings of finite-state-based approaches.

(3) For practice, let's start with a very simplistic example. Suppose that we wanted to predict something like the following:

(             x  )
(   x  )  (   x  )
   john      sings

(             x  )
(   x  )  (   x  )
   sings     songs

We do not want to use an inviolable rule for this, since the rightmost phrasal stress we see in these examples can be overridden in other contexts, as for example when the left element is focused:

(I don't sing;)
(   x            )
(   x  )  (   x  )
   JOHN      sings

(I write songs, but John)
(   x            )
(   x  )  (   x  )
   SINGS     songs

Selkirk (1996, p563) says “The Pitch Accent Prominence Rule is understood to take precedence over the nuclear stress rule” and notes “This is the sort of constraint interaction that finds a welcome formalization in the context of optimality theory.”

To obtain our formalization, we need to define the hierarchy of domains. The domains of stress assignment are embedded in each other, and since we want the relations of relative prominence to be preserved under embedding of the domains, we would need arbitrarily many domains if every phrase defined a new domain. But Hayes (1995) argues that, at least in English and Greek, the domains are distinctively prosodic, and perhaps bounded in depth (Hayes, 1989).


(4) For simplicity, let's consider a context free grammar G in Chomsky normal form. Chomsky normal form grammars have rules of only the following forms:

A → BC     A → a

We can intersect these grammars with a language defined by any automaton A = (Q, Σ, δ, I, F). For simplicity, let's assume that A has no ε transitions.

(5) We define the intersection grammar G′ = A ∩ G for grammar G and automaton A as a closure CL(Axioms, reduce1, reduce2), where the “axioms” are exactly the tuples in δ and the “inference relations” reduce1, reduce2 are defined as follows:17

[axiom]      (i, a, j) → a                            if (i, a, j) ∈ δ

[reduce1]    from (i, a, j) → a,
             infer (i, A, j) → (i, a, j)              if A → a

[reduce2]    from (i, B, j) → α and (j, C, k) → β,
             infer (i, A, k) → (i, B, j) (j, C, k)    for any α, β, if A → BC

The intersection L(G) ∩ L(A) is nonempty iff A ∩ G contains (i, S, j) → α for some α and some i ∈ I, j ∈ F, where S is the start category of G. It is easy to show that s ∈ L(G) ∩ L(A) iff (i, S, j) ⇒∗ s for some i ∈ I, j ∈ F.

(6) We presented the standard definition of intersection in §7.4.1, but this method avoids many (but not all) of the useless categories that the standard construction produces.

A category A is useless iff no derivation of a string from the start category uses A.

Notice also that in this definition of A ∩ G, the contribution of A is limited to the axioms, and the contribution of G is limited to the two “inference rules.” This partition of responsibility simplifies the presentation.

(7) Example: Consider the grammar

S → NP VP     VP → V NP
NP → john     NP → songs     VP → sings     V → sings

and the automaton with states 0, 1, 2 and arcs (0, john, 1) and (1, sings, 2).

Then G ∩ A = CL(axioms, reduce1, reduce2) is the following grammar:

(0, john, 1) → john            (1, sings, 2) → sings
(0, NP, 1) → (0, john, 1)      (1, VP, 2) → (1, sings, 2)
(0, S, 2) → (0, NP, 1) (1, VP, 2)

Notice that there are no rules for the terminal songs, and no rule (i, VP, k) → (i, V, j) (j, NP, k). The premises of the “inference rules” serve to block the introduction of such things, since they cannot possibly be used in any successful derivation of a string in the intersection.

17This method for calculating the intersection is adapted from the Cocke-Younger-Kasami (CYK) parsing algorithm (Aho and Ullman, 1972; Sikkel and Nijholt, 1997).
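A minimal Python sketch of the closure CL(Axioms, reduce1, reduce2) from (5), run on the toy grammar and automaton of (7). The representation (tuples for items, pairs for rules) is just one convenient choice, not anything from the text.

    # Chomsky normal form grammar of example (7): binary rules A -> B C and lexical rules A -> a.
    binary = [('S', 'NP', 'VP'), ('VP', 'V', 'NP')]
    lexical = [('NP', 'john'), ('NP', 'songs'), ('VP', 'sings'), ('V', 'sings')]

    # Automaton arcs (i, a, j), as in example (7).
    delta = [(0, 'john', 1), (1, 'sings', 2)]

    def intersect(binary, lexical, delta):
        """Compute the closure CL(Axioms, reduce1, reduce2): the rules of the intersection grammar."""
        rules = {((i, a, j), (a,)) for (i, a, j) in delta}        # axioms: (i, a, j) -> a
        changed = True
        while changed:
            changed = False
            items = {lhs for lhs, _ in rules}
            new_rules = set()
            # reduce1: if A -> a and (i, a, j) is an item, add (i, A, j) -> (i, a, j)
            for (A, a) in lexical:
                for it in items:
                    if it[1] == a:
                        new_rules.add(((it[0], A, it[2]), (it,)))
            # reduce2: if A -> B C, and (i, B, j), (j, C, k) are items, add (i, A, k) -> (i, B, j) (j, C, k)
            for (A, B, C) in binary:
                for b in items:
                    for c in items:
                        if b[1] == B and c[1] == C and b[2] == c[0]:
                            new_rules.add(((b[0], A, c[2]), (b, c)))
            if not new_rules <= rules:
                rules |= new_rules
                changed = True
        return rules

    for lhs, rhs in sorted(intersect(binary, lexical, delta), key=str):
        print(lhs, '->', *rhs)

Besides the rules listed in (7), this closure also contains the item (1, V, 2) → (1, sings, 2), a useless category that cannot feed reduce2 here; as (6) notes, the method avoids many but not all useless categories.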


(8) Example: We can assign positive weights to the productions in the grammar, as for example in the following, where only non-0 weights are marked:

S → NP VP       VP → V NP
NP → john       NP 1→ songs      VP 1→ sings      V → sings
NP 1→ jóhn      NP → sóngs       VP → síngs       V 1→ síngs

[figure: the automaton with states 0, 1, 2 and arcs (0, john, 1), (0, jóhn, 1), (1, sings, 2), (1, síngs, 2)]

(Obviously, we would like a better representation of our constraints on phrasal stress than this! This simple grammar is just for practice!)

(9) Suppose that we have a weighted context free grammar and we want to restrict it so that it only accepts strings with derivations of minimal weight. Intuitively, for each (qi, A, qj) we find the (qi, A, qj, w) with minimal weight w, and then delete all elements of the closure that are not related by an inference rule to a final item.

To adapt “Viterbi's algorithm” for probabilistic Markov models (Viterbi, 1967; Stolcke, 1995; Krenn and Samuelsson, 1996) to our simple weighted grammars, we keep just the least weight elements of the closure:

To perform this computation, we associate a “total weight” with each rule – written in parentheses.

What we want to do is to introduce a rule (i, A, j) w(wt)→ α only if the total weight of the constituents built by the rule is minimal, in the sense that there is no way to build an A between i and j with a total weight wt′ < wt. When the closure is complete, we drop the “total weights” in order to obtain a regular weighted grammar.


[axiom]      (i, a, j) w(w)→ a
             if (i, a, j, w) ∈ δ, and there is no (i, a, j, w′) ∈ δ with w′ < w

[reduce1]    from (i, a, j) w1(w1)→ a,
             infer (i, A, j) w2(w1+w2)→ (i, a, j)
             if A w2→ a, and there are no (i, a′, j) w′1(w′1)→ a′ and A w′2→ a′
             with w′1 + w′2 < w1 + w2

[reduce2]    from (i, B, j) w1(wt1)→ α and (j, C, k) w2(wt2)→ β,
             infer (i, A, k) w3(wt1+wt2+w3)→ (i, B, j) (j, C, k)
             if A w3→ BC, and there are no (i, B′, j) w′1(wt1′)→ α′, (j, C′, k) w′2(wt2′)→ β′,
             and A w′3→ B′C′ with wt1′ + wt2′ + w′3 < wt1 + wt2 + w3

(10) For example, given the grammar G and automaton A of (8), we compute G ∩ A as follows:

(0, john, 1) 0(0)→ john           (0, jóhn, 1) 0(0)→ jóhn
(1, sings, 2) 0(0)→ sings         (1, síngs, 2) 0(0)→ síngs
(0, NP, 1) 0(0)→ (0, john, 1)     (0, NP, 1) 1(1)→ (0, jóhn, 1)
(1, VP, 2) 1(1)→ (1, sings, 2)    (1, VP, 2) 0(0)→ (1, síngs, 2)
(0, S, 2) 0(0)→ (0, NP, 1) (1, VP, 2)

Notice that this grammar will not generate jóhn sings or john sings or jóhn síngs, since these all involve using productions with non-zero weight in building the S from 0 to 2, when we can get an S from 0 to 2 with no weight.
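A sketch of the Viterbi-style computation in (9): for each item (i, A, j), keep the minimal total weight of any way of building it. The production weights below follow the reconstruction of (8) given above (stress marking that departs from the Nuclear Stress pattern costs 1), so they should be read as an assumption rather than as the text's own figures.

    import math
    from collections import defaultdict

    # Weighted productions (A, B, C, weight) and (A, a, weight), as reconstructed in (8).
    binary = [('S', 'NP', 'VP', 0), ('VP', 'V', 'NP', 0)]
    lexical = [('NP', 'john', 0), ('NP', 'jóhn', 1), ('NP', 'sóngs', 0), ('NP', 'songs', 1),
               ('VP', 'síngs', 0), ('VP', 'sings', 1), ('V', 'sings', 0), ('V', 'síngs', 1)]
    # Automaton arcs (i, a, j, weight), as in (10).
    delta = [(0, 'john', 1, 0), (0, 'jóhn', 1, 0), (1, 'sings', 2, 0), (1, 'síngs', 2, 0)]

    def best_weights(binary, lexical, delta):
        """For each item (i, A, j), the minimal total weight of any derivation of it."""
        best = defaultdict(lambda: math.inf)
        for (i, a, j, w) in delta:
            best[(i, a, j)] = min(best[(i, a, j)], w)
        changed = True
        while changed:
            changed = False
            snapshot = list(best.items())
            for (A, a, w) in lexical:                                  # reduce1
                for (i, x, j), wt in snapshot:
                    if x == a and wt + w < best[(i, A, j)]:
                        best[(i, A, j)] = wt + w
                        changed = True
            for (A, B, C, w) in binary:                                # reduce2
                for (i, X, j), wt1 in snapshot:
                    for (j2, Y, k), wt2 in snapshot:
                        if X == B and Y == C and j2 == j and wt1 + wt2 + w < best[(i, A, k)]:
                            best[(i, A, k)] = wt1 + wt2 + w
                            changed = True
        return best

    print(best_weights(binary, lexical, delta)[(0, 'S', 2)])   # 0: the best S from 0 to 2, i.e. "john síngs"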

(11) If we had instead the following automaton, perhaps because of some prior application of a Focus Prominence rule or whatever, our method above will yield a grammar that generates the optimal weight-2 parse of Jóhn sings.

[figure: the automaton with states 0, 1, 2 and arcs (0, jóhn, 1), (1, sings, 2)]

(0, jóhn, 1) 0(0)→ jóhn           (1, sings, 2) 0(0)→ sings
(0, NP, 1) 1(1)→ (0, jóhn, 1)     (1, VP, 2) 1(1)→ (1, sings, 2)
(0, S, 2) 0(2)→ (0, NP, 1) (1, VP, 2)

(12) The example grammar of (8) does not provide a good representation of the Nuclear Stress rule: (i) it requires a specification on each branching rule, and (ii) it requires redundant stressed and unstressed forms for every lexical production.


To solve (i), we need to have control over how the “string components” of the categories are assembled in a 1-CFG-like grammar.18

To solve (ii), we can let every lexical item be a regular set.

(13) For any word w, let's write w for gen(w), and let @ be the phrase concatenation function:

s@t : S → s : NP   t : VP
s@t : VP → s : V   t : NP
NP → john     NP → songs     VP → sings     V → sings

Now we can define @ so that it imposes the Nuclear Stress rule, preferring stress on the right constituent if that has not already been ruled out by some other constraint.

Given regular sets s, t (or automata As, At that accept them), how should s@t be defined? The problem begins to get linguistically interesting here.

(14) Suppose we assumed that the syntactic domains are the domains of the stress rules.

Then we could implement the stress rule with a strategy rather like the one we find for primary and secondary word stress in Eisner (1997a):

• we project a domain spanning s, t
• we require that the domain contain (overlap) a stress mark (faithfulness)

• we require that the stress mark coincides with a lower stress mark (continuous column)
• we require that the stress mark is positioned as far to the right as possible, using constraints that, e.g., require the domain of the mark to be aligned on the right, and then penalize every overlap of that domain with a syllable

• we run up against the bounds of finite-stateness, though, if we try to enforce stress contrasts properly in arbitrarily deeply embedded domains

(15) These same mechanisms could be used in the better account, where the domains are properly prosodic.

Even on this account, though, we need to allow for the influence of syntax on prosodic constituency.

For example, vocatives, appositives, parentheticals, preposed clauses, and nonrestrictive relatives define their own intonational phrases.

Selkirk (1996, p567) also observes the following contrasts in examples from Liberman:

a. I (Three mathematicians in ten) I (derive a lemma)

b. *I (Three mathematicians) I (in ten derive a lemma)

c. I (Three mathematicians) I (intend to rival Emma)

(16) Other, more subtle syntactic influences are observed by Hayes (1989).

He proposes that above the word level we have clitic group, phonological phrase, intonational phrase, and utterance domains.

Clitic groups are formed by adjoining leftward or rightward to a content word (the “host,” a lexical category).

18We introduced n-CFG’s in §7.4.2.


In this framework, he finds that the propensity to cliticize is inversely related to the number of intervening syntactic boundaries.

This suggests a probabilistic, multiplicative scheme rather than an additive weight scheme of the sort proposed here.

(17) Given a specification of the exact influence of syntax on prosodic domains, we could study whether it could be implemented with a strategy like the one sketched here.

More needs to be done to see whether this is empirically and formally appealing! But we have maybe done enough to set the stage for a consideration of another place where we need to enforce dependencies that are beyond the finite state machinery that suffices for most things.

A different perspective on how to go about developing this story is proposed in the next section.


10.2 A proper treatment of reduplication

(18) Eisner (1997c, §7) says “reduplication occupies a special role in phonology, in that it is inherently non-local; it cannot be analyzed as local. Therefore, to handle reduplication in OTP we need a representational trick . . . ”

Faithfulness is enforced in OTP only at temporally coincident points of the I-B forms on separate tiers. So for correspondence, we can act as if there is B-R temporal coincidence.

To allow for juncture effects, Eisner copies the base to a special set of tiers to be treated as if temporally coincident with the base, and then uses a special mechanism to handle the juncture effects, which require that the actual temporal position of the base and reduplicant be properly recognized.

(19) Albro pointed out some problems for this idea:

One important problem is that if the base is represented (and copied) as a finite state machine, the machine does not know which path through one was taken when it is going through the other. It is at least difficult to get the correspondences properly enforced on this approach.

(20) The opposite idea is to consider what else could profitably be regarded as non-local – ideally, we would like to find other things that could naturally be treated as involving the sort of copying we find in reduplication.

The copying mechanism then would not be an ad hoc trick required for just one or two phenomena.

(21) Gafos (1998) provides an example of this strategy, arguing on the basis of data from the Malaysian language Temiar that consonantal spreading over a vowel should be regarded as a kind of reduplication, copying a consonant from one position to another.

Gafos also observes that with this analysis of “long distance consonantal spreading” (LDC), we do not need to place consonants and vowels on separate tiers (“V/C planar segregation”) to make the spreading local.

(22) Carrying this opposite idea to the limit: we adopt a formalism that uniformly subsumes “local effects” and copying correspondences.

Let’s explore this extreme.

(23) To start, we will use just one basic idea from the previous section, namely:

given a constraint that penalizes certain adjacency relations, where this constraint is implemented by an automaton A, and given a grammar G that assembles elements, e.g. with rules like

st : w → s : w   t : affix ,

we can enforce the constrained adjacency relations in an intersection G ∩ A. (Here, we don't need anything like the fancy @ which was supposed to project (or interact with the phonology to require) domain edges of appropriate kinds at appropriate intermediate points.)


(24) As a first step, suppose that we had only perfect, complete reduplication, as in the Axininca Campa nata-nata (McCarthy and Prince, 1995a, §3.7).

The lexical specification of the linear sequence of segments in the base could be given in grammar format, as in this little grammar for wnata:

n s : wnata → s : wata
a s : wata  → s : wta
t s : wta   → s : wa
a : wa

Then we can get the reduplicated form with a small change:19

n s, n t : wnata → s, t : wata
a s, a t : wata  → s, t : wta
t s, t t : wta   → s, t : wa
a, a : wa

This enforces the crossing R-B relations.
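A small Python sketch of how the two string components of this little grammar are assembled in parallel; the dictionary encoding of the rules is mine, but the derivation it produces is the nata-nata of (24).

    # Two-component rules from (24): each category pairs a base string with a reduplicant string,
    # and each rule adds the same segment to the front of both components.
    rules = {
        'wnata': (('n', 'n'), 'wata'),
        'wata':  (('a', 'a'), 'wta'),
        'wta':   (('t', 't'), 'wa'),
    }
    terminal = {'wa': ('a', 'a')}

    def expand(cat):
        """Return the pair of string components derived from category cat."""
        if cat in terminal:
            return terminal[cat]
        (x, y), child = rules[cat]
        s, t = expand(child)
        return x + s, y + t

    base, reduplicant = expand('wnata')
    print(base, reduplicant)      # nata nata
    print(base + reduplicant)     # natanata: the perfectly reduplicated form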

(25) One problem here is that the provision of the duplicated material is unlike anything else going on in the grammar.

A second problem is that this grammar enforces a perfect identity between correspondents. To allow for over- and under-application effects, we could allow less than perfect identity with MaxRB and DepRB penalties.

But let’s think first. This is not going towards our goal.

19Remember that 2-CFG's do not allow any string on the right to appear more than once on the left! That is, we do not have full copying of strings in this grammar. We could of course allow this, but then the grammar formalism is much more powerful (Seki et al., 1991). Whether we need such power in the phonology or syntax of human languages is an empirical question, but it appears not.


(26) The goal.

We have seen how, with tiers for underlying and surface forms, we can appropriately constrain the relations between them in many cases with finite state constraints applied in rank order:

s1  ...  sn−1  sn
|         |     |
s1  ...  sn−1  sn

McCarthy and Prince (1995a) suggest that in addition to these Input-Base relations, in reduplication there are two other kinds of relations that are of essentially the same kind:

        I
      /   \
     B --- R

Spelling out the sequential nature of the representations, we see the two problems for OTP: the temporal discontinuities of the I-R relations, and the crossing dependencies of the R-B relation.

B:  s1 ... sn−1 sn     R:  s1 ... sn−1 sn
I:  s1 ... sn−1 sn

What we would like is a uniform representation of I-B, I-R, and R-B relations. We do not want to use an automaton for I-B relations and something else for the crossing R-B relations.

This pushes us to the extreme view, where we subsume everything in one system.

(27) It is easy enough to represent any finite automaton, using input i and output o components:

[figure: a three-state automaton; state 0 loops on −S,−S and goes to state 1 on [S,[S; state 1 loops on +S,+S and goes to state 2 on ]S,]S; state 2 loops on −S,−S]

−S i, −S o : 0 → i, o : 0
[S i, [S o : 0 → i, o : 1
+S i, +S o : 1 → i, o : 1
]S i, ]S o : 1 → i, o : 2
−S i, −S o : 2 → i, o : 2
−S, −S : 2 → ε
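For concreteness, here is a Python sketch that enumerates a few (input, output) pairs from this right-linear two-component grammar; the rule encoding and the depth bound are mine.

    # Right-linear two-component rules from (27): state -> [((input symbol, output symbol), next state)].
    rules = {
        0: [(('−S', '−S'), 0), (('[S', '[S'), 1)],
        1: [(('+S', '+S'), 1), ((']S', ']S'), 2)],
        2: [(('−S', '−S'), 2)],
    }
    final = {2: ('−S', '−S')}          # the terminating rule: −S,−S : 2 → ε

    def derive(state, depth):
        """Yield (input, output) string pairs derivable from state, up to a depth bound."""
        if depth == 0:
            return
        if state in final:
            yield final[state]
        for (i_sym, o_sym), nxt in rules[state]:
            for i_rest, o_rest in derive(nxt, depth - 1):
                yield i_sym + ' ' + i_rest, o_sym + ' ' + o_rest

    for pair in sorted(derive(0, 5)):
        print(pair)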


(28) We can weight these productions, and although we have no general way to intersect arbitrary MCFG's, we can of course intersect “right linear” MCFG's like this one.

What other MCFG's would we want to be able to intersect? Almost everything we have seen can be represented with finite state power. The thing to check is whether we can get the proper effects in the non-finite-state constructions.

We have the ingredients we need to enforce MaxRB and DepRB constraints in just the way we enforce other constraints, and to prefix or suffix the reduplicant. That is, it appears that we have avoided the two problems that OTP faced: I-R temporal discontinuities and the crossing R-B relations.

Using input i, base b, and reduplicant r components, we will have configurations like the following. (I also include the now redundant underlining for people who have been using that.)

(prefix)   i, br : Afred + 0 → r, i, b : 0
(suffix)   i, rb : Afred + 0 → r, i, b : 0

−S r, −S i, −S b : 0 → r, i, b : 0
[S r, [S i, [S b : 0 → r, i, b : 1
+S r, +S i, +S b : 1 → r, i, b : 1
]S r, ]S i, ]S b : 1 → r, i, b : 2
−S r, −S i, −S b : 2 → r, i, b : 2
−S, −S, −S : 2 → ε

This grammar does not enforce any constraints at the rb (or br) juncture, but constraints could be enforced there upon intersection. We should try some examples to see that this will work.20

Another problem is perhaps more challenging. The affixation in the grammar above is done crudely, which leaves the question of how we could properly capture infixing reduplication phenomena. Can we implement, for example, something like the suggestion of McCarthy and Prince (1995b, p362) that this kind of phenomenon be captured by ranking Onset over Leftmostness? It seems that we have all the ingredients necessary to implement any story like this, but it will be important to check.

(29) As a first example, it is easy to show how we can use this scheme to implement the McCarthy and Prince (1995a, pp19ff) analysis of the Balangao example tagta-tagtag:

  /Red-tagtag/         ContigBR   MaxIO   NoCoda   MaxBR
☞ tag.ta-tag.tag                            ***      *
  ta.ta-tag.tag           *!                 **      **
  tag.tag-tag.tag                           ****!
  tag.ta-tag.ta                     *!
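The ranked comparison in this tableau amounts to comparing violation vectors lexicographically, constraint by constraint in ranking order. A tiny Python sketch, with the violation counts transcribed from the tableau above:

    # Violation counts in ranking order: ContigBR >> MaxIO >> NoCoda >> MaxBR.
    candidates = {
        'tag.ta-tag.tag':  (0, 0, 3, 1),
        'ta.ta-tag.tag':   (1, 0, 2, 2),
        'tag.tag-tag.tag': (0, 0, 4, 0),
        'tag.ta-tag.ta':   (0, 1, 0, 0),
    }

    # The winner has the lexicographically smallest violation vector:
    # a single violation of a higher-ranked constraint is fatal, as the !'s indicate.
    print(min(candidates, key=candidates.get))   # tag.ta-tag.tag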

XXX

20The simplicity of the grammar above is, I think, misleading in one important respect. It might look like it is a version of “persistent serialism” with re-applied copying of the sort considered in McCarthy and Prince (1995a, p56), but it is not. The grammar above just puts everything in appropriate places. (Even in this example, an infinite set of options is made available for affixation, but in other examples the sets of available affixes will be much more various.) With things arranged in this way, we allow for simultaneous constraints on IR, BR, and IB relations that will determine what actually gets affixed. The worry that remains is whether we can properly handle the affixation itself, as discussed immediately below.


(30) This scheme also suffices for the McCarthy and Prince (1995a, pp39ff) analysis of the Javanese example bda-bda, even on their suffixing analysis:

  /bdah-Red-e/       MaxBR   *VhV   MaxIO
☞ bda-bda-e                            *
  bdah-bdah-e                  *!
  bdah-bda-e           *!

XXX

10.3 Locality reconsidered

XXX

10.4 Acquisition and grammar size

XXX


References

Aho, Alfred V. and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling. Volume 1: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey.

Aho, A.W., J.E. Hopcroft, and J.D. Ullman. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts.

Albro, Daniel M. 1997. Evaluation, implementation, and extension of primitive optimality theory. M.A. thesis,UCLA.

Albro, Daniel M. 1998. Three formal extensions to primitive optimality theory. 1998 ACL Meeting.

Albro, Daniel M. 1999. Phonological learning within an optimality theoretic framework. Ph.D. thesis proposal,UCLA.

Angluin, Dana. 1982. Inference of reversible languages. Journal of the Association for Computing Machinery,29:741–765.

Bar-Hillel, Y., M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172. Reprinted in Y. Bar-Hillel, Language and Information: Selected Essays on their Theory and Application. NY: Addison-Wesley, 1964.

Barton, G. Edward, Robert C. Berwick, and Eric Sven Ristad. 1987. Computational Complexity and NaturalLanguage. MIT Press, Cambridge, Massachusetts.

Bird, Steven. 1995. Computational Phonology: A Constraint-Based Approach. Cambridge University Press, NY.

Bird, Steven and T. Mark Ellison. 1994. One level phonology: autosegmental representations and rules as finite automata. Computational Linguistics, 20:55–90.

Bromberger, Sylvain and Morris Halle. 1989. Why phonology is different. Linguistic Inquiry, 20:51–70.

Browman, Catherine and Louis Goldstein. 1989. Articulatory gestures as phonological units. Phonology, 6:201–251.

Chomsky, Noam and Morris Halle. 1968. The Sound Pattern of English. MIT Press, Cambridge, Massachusetts.

Cole, Jennifer and Charles Kisseberth. 1994. An optimal domains theory of harmony. Studies in the Linguistic Sciences, 24.

Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. 1991. Introduction to Algorithms. MIT Press,Cambridge, Massachusetts.

Dijkstra, Edsger W. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271.

Eilenberg, Samuel. 1974. Automata, Languages and Machines. Academic Press, NY.

Eisner, Jason. 1997a. Decomposing FootForm: Primitive constraints in OT. In Proceedings of Student Conference in Linguistics, SCLI VIII, MIT Working Papers in Linguistics.

Eisner, Jason. 1997b. Efficient generation in Primitive Optimality Theory. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics.

Eisner, Jason. 1997c. What constraints should OT allow? Presented at the Annual Meeting of the Linguistic Society of America, Chicago. Available at http://ruccs.rutgers.edu/roa.html, January.

Eisner, Jason. 1999. Doing OT in a straightjacket. Presented at UCLA.


Ellison, Mark T. 1994a. Phonological derivation in optimality theory. In Procs. 15th Int. Conf. on Computational Linguistics, pages 1007–1013. (Also available at the Edinburgh Computational Phonology Archive).

Ellison, T. Mark. 1994b. The iterative learning of phonological rules. Computational Linguistics, 20(3).

Ellison, T. Mark. 1997. Simplicity, psychological plausibility and connectionism in language acquisition. In Proceedings of the 1997 GALA conference on Language Acquisition: Knowledge Representation and Processing, pages 333–337.

Frank, Robert and Giorgio Satta. 1998. Optimality theory and the generative complexity of constraint violability. Computational Linguistics, 24:307–315.

Gafos, Diamandis. 1996. The Articulatory Basis of Locality in Phonology. Ph.D. thesis, Johns Hopkins University.

Gafos, Diamandis. 1998. On formalization and formal linguistics. Natural Language and Linguistic Theory,16(2):223–278.

Garey, Michael R. and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.

Gold, E. Mark. 1967. Language identification in the limit. Information and Control, 10:447–474.

Goldsmith, John. 1976. Autosegmental Phonology. Ph.D. thesis, Massachusetts Institute of Technology.

Goldsmith, John. 1990. Autosegmental and Metrical Phonology. Basil Blackwell, Oxford.

Hayes, Bruce. 1989. The prosodic hierarchy in meter. In P. Kiparsky and G. Youmans, editors, Rhythm and Meter. Academic, NY.

Hayes, Bruce. 1990. Precompiled phrasal phonology. In Sharon Inkelas and Draga Zec, editors, The Phonology-Syntax Connection. CSLI/University of Chicago Press, Chicago.

Hayes, Bruce. 1995. Metrical Stress Theory: Principles and Case Studies. University of Chicago Press, Chicago.

Hopcroft, John E. and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Massachusetts.

Johnson, C. Douglas. 1972. Formal Models of Phonology. Mouton, The Hague.

Kanazawa, Makoto. 1996. Identification in the limit of categorial grammars. Journal of Logic, Language, andInformation, 5:115–155.

Kaplan, Ronald and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20:331–378.

Kaplan, Ronald M. 1987. Three seductions of computational linguistics. In Pete Whitelock, Mary McGee Wood,Harold L. Somers, Rod Johnson, and Paul Bennett, editors, Linguistic Theory and Computer Applications.Academic Press, pages 149–188.

Karttunen, Lauri. 1991. Finite state constraints. In Proceedings of the International Conference on Current Issues in Computational Linguistics. A version of this paper also appears in John Goldsmith, ed., The Last Phonological Rule, University of Chicago Press, Chicago, Illinois, 1993.

Karttunen, Lauri. 1998. The proper treatment of optimality in computational phonology. In Proceedings of the International Workshop on Finite-State Methods in Natural Language Processing, FSMNLP'98.

Kornai, András. 1994. Formal Phonology. Garland, NY.

Koskenniemi, Kimmo. 1983. Two-level morphology. University of Helsinki.

Kozen, Dexter. 1977. Lower bounds for natural proof systems. In Proceedings of the 18th Annual Symposium on the Foundations of Computer Science, pages 254–266.


Krenn, Brigitte and Christer Samuelsson. 1996. The linguist’s guide to statistics. Available at http://coli.uni-sb.de/˜christer/.

Lang, Bernard. 1994. Recognition can be harder than parsing. Computational Intelligence, 10.

Levy, Azriel. 1979. Basic Set Theory. Springer-Verlag, NY.

Lewis, H.R. and C.H. Papadimitriou. 1981. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, New Jersey.

Maor, Eli. 1994. e: The Story of a Number. Princeton University Press, Princeton.

McCarthy, J. and Alan Prince. 1995a. Faithfulness and reduplicative identity. Technical Report Occasional Papers, University of Massachusetts, Amherst.

McCarthy, J. and Alan Prince. 1995b. Prosodic morphology. In John A. Goldsmith, editor, The Handbook of Phonological Theory. Blackwell, Oxford.

Miller, George A. and Noam Chomsky. 1963. Finitary models of language users. In R. Duncan Luce, Robert R. Bush, and Eugene Galanter, editors, Handbook of Mathematical Psychology, Volume II. Wiley, NY, pages 419–492.

Mohri, Mehryar. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23.

Mohri, Mehryar, Fernando C. N. Pereira, and Michael Riley. 1998. A rational design for a weighted finite-state transducer library. In Lecture Notes in Computer Science No. 1436. Springer, NY.

Moll, R.N., M.A. Arbib, and A.J. Kfoury. 1988. An Introduction to Formal Language Theory. Springer-Verlag,NY.

Morwietz, Frank and Tom Cornell. 1997a. Approximating principles and parameters grammars with MSO tree logics. In Proceedings of the 1997 Meeting of the Logical Aspects of Computational Linguistics.

Morwietz, Frank and Tom Cornell. 1997b. On the recognizability of relations over a tree definable in a monadic second order tree description language. Technical report 85, SFB 340, University of Tübingen.

Morwietz, Frank and Tom Cornell. 1997c. Representing constraints with automata. In Proceedings of the 1997 Meeting of the Association for Computational Linguistics.

Nelson, Mark and Jean-Loup Gailly. 1996. The Data Compression Book. M&T Books, NY.

Pereira, Fernando C.N. and Michael D. Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing. MIT Press, Cambridge, Massachusetts.

Perrin, Dominique. 1990. Finite automata. In J. van Leeuwen, editor, Handbook of Theoretical ComputerScience. Elsevier, NY, pages 1–57.

Prince, Alan and Paul Smolensky. 1993. Optimality theory: Constraint interaction in generative grammar. Forthcoming.

Rissanen, Jorma. 1976. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20:198–203.

Roche, Emmanuel and Yves Schabes. 1997a. Deterministic part-of-speech tagging with finite-state transducers. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing. MIT Press, Cambridge, Massachusetts.

Roche, Emmanuel and Yves Schabes. 1997b. Introduction. In Emmanuel Roche and Yves Schabes, editors,Finite-State Language Processing. MIT Press, Cambridge, Massachusetts.


Rogers, James. 1995. On descriptive complexity, language complexity, and GB. Available atftp://xxx.lanl.gov/cmp-lg/papers/9505/9505041.

Savage, John E. 1976. The Complexity of Computing. Wiley, New York.

Sayood, Khalid. 1996. Introduction to Data Compression. Morgan Kaufmann, San Francisco.

Schützenberger, M. P. 1961. A remark on finite transducers. Information and Control, 4:185–196.

Seki, Hiroyuki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88:191–229.

Selkirk, Elisabeth. 1996. Sentence prosody: intonation, stress, and phrasing. In John A. Goldsmith, editor, The Handbook of Phonological Theory. Blackwell, Oxford.

Shannon, Claude E. 1948. The mathematical theory of communication. Bell System Technical Journal, 27:379–423. Reprinted in Claude E. Shannon and Warren Weaver, editors, The Mathematical Theory of Communication, Chicago: University of Illinois Press.

Sikkel, Klaas and Anton Nijholt. 1997. Parsing of context free languages. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, Volume 2: Linear Modeling. Springer, NY, pages 61–100.

Solomonoff, R.J. 1964. A formal theory of inductive inference. Information and Control, 7:1–22 and 224–254.

Stabler, Edward P. 1984. Berwick and Weinberg on linguistics and computational psychology. Cognition,17:155–179.

Stabler, Edward P. 1992. The Logical Approach to Syntax: Foundations, specifications and implementations.MIT Press, Cambridge, Massachusetts.

Steriade, Donca. 1995. Underspecification and markedness. In John A. Goldsmith, editor, The Handbook ofPhonological Theory. Blackwell, Oxford.

Stolcke, Andreas. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21:165–201.

Storer, James A. 1988. Data Compression: Methods and Theory. Computer Science Press, Rockville, Maryland.

Tesar, Bruce and Paul Smolensky. 1998. Learnability in optimality theory. Linguistic Inquiry, 29:229–268.

van Benthem, Johan. 1991. Generalized quantifiers and generalized inference. In J. van der Does and J. van Eijck, editors, Generalized Quantifier Theory and Applications. Dutch Network for Language, Logic and Information, Amsterdam.

Vijay-Shanker, K. and David Weir. 1993. The use of shared forests in tree adjoining grammar parsing. InProceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics,pages 384–393.

Vitányi, Paul and Ming Li. 1997ms. Minimum description length induction, Bayesianism, and Kolmogorov complexity. Forthcoming.

Viterbi, Andrew J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13:260–269.

Wallace, C.S. 1990. Classification by minimum-message-length inference. In S.G. Akl, F. Fiala, and W.W. Koczkodaj, editors, Advances in Computing and Information, ICCI'90. Proceedings of the International Conference on Computing and Information, Lecture Notes in Computer Science 468, pages 72–81, NY: Springer-Verlag.

Watson, Bruce W. 1993. A taxonomy of deterministic finite automata minimization algorithms. Computing science report 93/44, Eindhoven University of Technology.


Weaver, Warren. 1949. Recent contributions to the mathematical theory of communication. In Claude E. Shannon and Warren Weaver, editors, The Mathematical Theory of Communication. University of Illinois Press, Chicago.

Yu, Sheng. 1997. Regular languages. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, Volume 1: Word, Language, Grammar. Springer, NY, pages 41–110.

Zec, Draga and Sharon Inkelas. 1990. Prosodically constrained syntax. In Sharon Inkelas and Draga Zec,editors, The Phonology-Syntax Connection. CSLI/University of Chicago Press, Chicago.

Zec, Draga and Sharon Inkelas. 1995. Syntax-phonology interface. In John A. Goldsmith, editor, The Handbook of Phonological Theory. Blackwell, Oxford.
