LING 438/538 Computational Linguistics Sandiway Fong Lecture 14: 10/12



2

Administrivia

• Reminder– Homework 3 due tonight

3

Last Time

• morphology
– words are composed of morphemes
– morpheme: semantic unit, e.g. -ee in employee
– Inflectional: no change in category, e.g. V + -ed → V
– Derivational: category-changing, e.g. V + -able → A

• Porter Stemmer
– normalization procedure
– based on (manually determined) ad hoc rules
– “measure” of a stem: the m in [C](VC)^m[V], where C and V are maximal runs of consonants and vowels and the bracketed parts are optional
– output: “root” (not necessarily a word)
• words that stem to the same root are considered “variants”
– English orthography
• an illustration of the gap that can occur between computation and linguistic theory
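The Porter “measure” can be made concrete with a short Prolog sketch. It is not taken from the lecture: the predicate names (measure/2, class/3, count/4) are hypothetical, and it assumes Porter's usual convention that y counts as a vowel only when it follows a consonant. The measure is computed by counting vowel-to-consonant transitions, which equals the number of VC blocks.

```prolog
% Hypothetical sketch: Porter's measure m of a lower-case word or stem.
vowel(a). vowel(e). vowel(i). vowel(o). vowel(u).

% class(+PrevClass, +Char, -Class): v = vowel, c = consonant.
% Assumption: y is a vowel only when the previous letter is a consonant.
class(_, Ch, v) :- vowel(Ch), !.
class(c, y, v) :- !.
class(_, _, c).

% measure(+Word, -M): M is the number of VC blocks in Word.
measure(Word, M) :-
    atom_chars(Word, Chars),
    count(Chars, none, 0, M).

% count(+Chars, +PrevClass, +Acc, -M): add 1 at each vowel-to-consonant step.
count([], _, M, M).
count([Ch|Chars], Prev, M0, M) :-
    class(Prev, Ch, Class),
    (   Prev == v, Class == c
    ->  M1 is M0 + 1
    ;   M1 = M0
    ),
    count(Chars, Class, M1, M).
```

For example, ?- measure(tree, M). gives M = 0, ?- measure(trouble, M). gives M = 1, and ?- measure(troubles, M). gives M = 2.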

4

Walkers. Standees.

© Sandiway Fong: sign above travelator at Pittsburgh International Airport

5

Today’s Topic

• Finite State Transducers (FST) for morphological processing

– ... also Prolog implementation

6

Recall Finite State Automata (FSA)

• from lecture 8
– (Q, s, f, Σ, δ)
1. set of states (Q): {s,x,y} (must be a finite set)
2. start state (s): s
3. end state(s) (f): y
4. alphabet (Σ): {a, b}
5. transition function δ:
signature: character × state → state
1. δ(a,s)=x
2. δ(a,x)=x
3. δ(b,x)=y
4. δ(b,y)=y

[Figure: state diagram. s → x on a; x → x on a; x → y on b; y → y on b; y is the end state]

7

Modeling English Adjectives using FSA

– from section 3.2 of textbook

• examples
– big, bigger, biggest, *unbig
– cool, cooler, coolest, coolly
– red, redder, reddest, *redly
– clear, clearer, clearest, clearly, unclear, unclearly
– happy, happier, happiest, happily
– unhappy, unhappier, unhappiest, unhappily
– real, *realer, *realest, unreal, really

• fsa (3.4)

[Figure 3.4: FSA for English adjectives; image not available]

Initial machine is overly simple

need more classes to make finer-grained distinctions

e.g. *unbig

8


Modeling English Adjectives using FSA

• divide adjectives into classes
• examples
– adj-root2: big, bigger, biggest, *unbig
– adj-root2: cool, cooler, coolest, coolly
– adj-root2: red, redder, reddest, *redly
– adj-root1: clear, clearer, clearest, clearly, unclear, unclearly
– adj-root1: happy, happier, happiest, happily
– adj-root1: unhappy, unhappier, unhappiest, unhappily
– adj-root1: real, *realer, *realest, unreal, really

• fsa (3.5)

However... Examples

uncooler
• Smoking uncool and getting uncooler.
• google: 22,800 (2006), 10,900 (2005)

*realer
• google: 3,500,000 (2006), 494,000 (2005)

*realest
• google: 795,000 (2006), 415,000 (2005)

9

Modeling English Adjectives using FSA

e.g. *unbig
google: 11,000 hits (2006)

morphology is productive
morphemes carry (compositional) meaning
can be used for dramatic effect: unbig vs. small

10

The Mapping Problem

• To map between a surface form and the decomposition of a word into its components
– e.g. root + (person/number/gender) and other features

• using spelling rules

• Example: (3.11)

Notes:
^ marks a morpheme boundary
# is the end-of-word marker

11

Stage 1: Lexical Intermediate Levels

• example:
– f o x +N +PL (lexical)
– f o x ^s# (intermediate)

• lexical level: – uninflected “dictionary” level

• intermediate level: – replace abstract morphemes by concrete ones

• key
– +N: noun
• fox can also be a verb,
• but fox +V cannot combine with +PL
– +PL: (abstract) plural morpheme
• realized in English as s (basic case)
– boundary markers ^ and #
• for use by the spelling rule machine (later)

12

Stage 1: Lexical Intermediate Levels

• example:
– f o x +N +PL (lexical)
– f o x ^s# (intermediate)

• machine idea
– character-by-character correspondences
– f → f
– o → o
– x → x
– +N → ε (ε = empty string)
– +PL → ^s#

• use a Finite State Machine with input/output mapping– Finite State Transducer (FST)

13

Stage 1: Lexical Intermediate Levels

• Example:
– g o o s e +N +PL (lexical)
– g e e s e # (intermediate)

• Example:
– g o o s e +N +SG (lexical)
– g o o s e # (intermediate)

• Example:
– m o u s e +N +PL (lexical)
– m i c e # (intermediate)

• Example:
– s h e e p +N +PL (lexical)
– s h e e p # (intermediate)
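The irregular goose/geese pair above can be sketched as a transducer in Prolog, with one predicate per state and two arguments (lexical input, intermediate output). This is only an illustration: the state names g0..g7 and h2..h6 are hypothetical, and the arc labels (o:e on the plural path, +N:ε, +SG:# and +PL:#) are read off the examples rather than copied from the textbook figure.

```prolog
% Hypothetical sketch: Stage 1 transducer for goose/geese only.
% Each predicate is a state: first argument lexical, second intermediate.
g0([g|L1], [g|L2]) :- g1(L1, L2).

% State g1 is nondeterministic: copy o (singular path) or map o:e (plural path).
g1([o|L1], [o|L2]) :- g2(L1, L2).
g1([o|L1], [e|L2]) :- h2(L1, L2).

% Singular path: o s e copied unchanged, +N:ε, +SG:#
g2([o|L1], [o|L2]) :- g3(L1, L2).
g3([s|L1], [s|L2]) :- g4(L1, L2).
g4([e|L1], [e|L2]) :- g5(L1, L2).
g5(['+N'|L1], L2) :- g6(L1, L2).
g6(['+SG'|L1], ['#'|L2]) :- g7(L1, L2).

% Plural path: second o:e, then s e copied, +N:ε, +PL:#
h2([o|L1], [e|L2]) :- h3(L1, L2).
h3([s|L1], [s|L2]) :- h4(L1, L2).
h4([e|L1], [e|L2]) :- h5(L1, L2).
h5(['+N'|L1], L2) :- h6(L1, L2).
h6(['+PL'|L1], ['#'|L2]) :- g7(L1, L2).

g7([], []).   % end state
```

Under these assumptions, ?- g0([g,o,o,s,e,'+N','+PL'],X). gives X = [g,e,e,s,e,#], and ?- g0(X,[g,e,e,s,e,#]). recovers the lexical form. Prolog's backtracking handles the choice at g1: the singular path is tried first and abandoned when +PL fails to match +SG.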

14

Stage 1: Lexical Intermediate Levels

• 3.11

Notation:

input : output

f means f:f

15

Extension to Finite State Transducers (FST)

• [Mealy machine extension to FSA]
– (Q, s, f, Σ, δ)
1. set of states (Q): {s,x,y} (must be a finite set)
2. start state (s): s
3. end state(s) (f): y
4. alphabet (Σ): pairs I:O
– I = input alphabet, O = output alphabet
– ε may be included in I and O
– transition function (or matrix) δ:
signature: i/o pair × state → state
1. δ(a:b,s)=x
2. δ(a:b,x)=x
3. δ(b:a,x)=y
4. δ(b:ε,y)=y

[Figure: state diagram. s → x on a:b; x → x on a:b; x → y on b:a; y → y on b:ε; y is the end state]
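The formal definition can be transcribed almost literally into Prolog as a table of facts plus a small interpreter. This is a sketch, not the encoding used in the lecture: delta/3, final/1, transduce/3 and emit/3 are hypothetical names, and the atom eps stands in for ε.

```prolog
% Hypothetical sketch: the toy FST as a transition table.
% delta(State, Input:Output, NextState); eps marks the empty string.
delta(s, a:b,   x).
delta(x, a:b,   x).
delta(x, b:a,   y).
delta(y, b:eps, y).

final(y).

% transduce(+State, ?Input, ?Output): run the machine from State.
transduce(State, [], []) :-
    final(State).
transduce(State, Input, Output) :-
    delta(State, I:O, Next),
    emit(I, Input, RestIn),
    emit(O, Output, RestOut),
    transduce(Next, RestIn, RestOut).

% emit(+Symbol, ?List, ?Rest): eps uses up nothing; any other symbol
% must be the head of the list.
emit(eps, L, L).
emit(C, [C|L], L) :- C \== eps.
```

With this table, ?- transduce(s,[a,a,b,b],Out). gives Out = [b,b,a]: the two a's are rewritten as b, the first b as a, and the final b is deleted by the b:ε arc.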

16

Finite State Automata (FSA)

• recall: one possible Prolog encoding strategy

– define one predicate for each state
• taking one argument (the input string)
• consume input character
• call next state with remaining input string

– query
• ?- s(L).

call start state s

17

Finite State Automata (FSA)

– from lecture 9

– define one predicate for each state
• take one argument (the input string), and consume input character
• call next state with remaining input string

– query
• ?- s(L). i.e. call start state s

– state s: (start state)
• s([a|L]) :- x(L).

– state x:
• x([a|L]) :- x(L).
• x([b|L]) :- y(L).

– state y: (end state)
• y([]).
• y([b|L]) :- y(L).

[Figure: state diagram. s → x on a; x → x on a; x → y on b; y → y on b; y is the end state]

simple extension to FST: each predicate takes two arguments: input and output
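As a sketch of that two-argument extension, here is the toy FST with arcs a:b, a:b, b:a and b:ε written in the same predicate-per-state style. The clauses are an illustration built from the arc labels, not code from the lecture.

```prolog
% Sketch: predicate-per-state encoding of the toy FST.
% First argument: input string; second argument: output string.

% state s (start state): a:b
s([a|L1], [b|L2]) :- x(L1, L2).

% state x: a:b loops, b:a moves to y
x([a|L1], [b|L2]) :- x(L1, L2).
x([b|L1], [a|L2]) :- y(L1, L2).

% state y (end state): b:ε consumes b and writes nothing
y([], []).
y([b|L1], L2) :- y(L1, L2).
```

?- s([a,a,b,b],X). gives X = [b,b,a]. The only change from the FSA version is the second list: a pair such as a:b puts a at the head of the input and b at the head of the output, while b:ε leaves the output list untouched.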

18

Stage 1: Lexical Intermediate Levels

• example
– s0([f|L1],[f|L2]) :- s1(L1,L2).
– s0([c|L1],[c|L2]) :- s3(L1,L2).
– s1([o|L1],[o|L2]) :- s2(L1,L2).
– s2([x|L1],[x|L2]) :- s5(L1,L2).
– s3([a|L1],[a|L2]) :- s4(L1,L2).
– s4([t|L1],[t|L2]) :- s5(L1,L2).
– s5(['+N'|L1],L2) :- s6(L1,L2). % +N:ε
– s6(['+PL'|L1],[^,s,#|L2]) :- s7(L1,L2). % +PL:^s#
– s7([],[]). % end state

19

Stage 1: Lexical Intermediate Levels

• FST queries
– lexical → intermediate
• ?- s0([f,o,x,'+N','+PL'],X).
– X = [f, o, x, ^, s, #]

– intermediate → lexical
• ?- s0(X,[f,o,x,^,s,#]).
– X = [f, o, x, '+N', '+PL']

– enumerator
• ?- s0(X,Y).
– X = [f, o, x, '+N', '+PL']
– Y = [f, o, x, ^, s, #] ;
– X = [c, a, t, '+N', '+PL']
– Y = [c, a, t, ^, s, #] ;
• No

inversion of a transducer T: T⁻¹

switch input and output labels

in Prolog, simply change the call

20

Stage 1: Lexical Intermediate Levels

• Figure 3.17 (top half):tape view of input/output pairs

21

The Mapping Problem

• Example: (3.11)

• (Context-Sensitive) Spelling Rule: (3.5)
ε → e / {x,s,z}^ __ s#

ε rewrites to letter e in left context x^ or s^ or z^ and right context s#

• i.e. insert e after the ^ when you see x^s# or s^s# or z^s#

• in particular, we have x^s# → x^es#
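The effect of the e-insertion rule can be sketched as a direct list rewrite in Prolog. This is not the transducer from the textbook figure: spell/2 is a hypothetical predicate that applies the rule in the intermediate → surface direction only, deletes the ^ boundary, and copies # through unchanged (the same treatment of ^ and # as in the lecture's q0 clauses).

```prolog
% Hypothetical sketch: intermediate -> surface, analysis direction only.
% spell(+Intermediate, -Surface)
spell([], []).
% e-insertion: x^s#, s^s# or z^s# becomes xes#, ses# or zes#
spell([C,'^',s,'#'|T], [C,e,s,'#'|Rest]) :-
    memberchk(C, [x,s,z]), !,
    spell(T, Rest).
% any other morpheme boundary is simply deleted
spell(['^'|T], Rest) :- !,
    spell(T, Rest).
% every other symbol (including #) is copied unchanged
spell([C|T], [C|Rest]) :-
    spell(T, Rest).
```

For example, ?- spell([f,o,x,'^',s,'#'],S). gives S = [f,o,x,e,s,#], while ?- spell([c,a,t,'^',s,'#'],S). gives S = [c,a,t,s,#] because t is not in the left context of the rule. Because of the cuts, this sketch should not be run in the generation direction.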

22

Stage 2: Intermediate Surface Levels

• also can be implemented using a FST

important! machine is designed to pass input not matching the rule through unmodified (rather than fail)

implements context-sensitive rule
q0 to q2 : left context
q3 to q0 : right context

23

Stage 2: Intermediate Surface Levels

• Example (3.17)

24

Stage 2: Intermediate Surface Levels

• Transition table for FST in 3.14

• Note:
– other: (catch-all case) means pass any remaining symbol (other than those specified explicitly in the state) to the other side unchanged
– #: # is never included in other

25

Stage 2: Intermediate Surface Levels

• in Prolog (simplified)
– with special treatment for “other”
– q0([],[]). % final state
– q0([^|L1],L2) :- !, q0(L1,L2). % ^:ε
– q0([z|L1],[z|L2]) :- !, q1(L1,L2). % repeat for s,x
– q0([#|L1],[#|L2]) :- !, q0(L1,L2).
– q0([X|L1],[X|L2]) :- q0(L1,L2). % other

• ! is known as the “cut” predicate
– it affects how Prolog searches
– it means “cut” the search off
– Prolog will not try any other compatible rule on backtracking
– problematic for generation, e.g. ^:ε case

26

Stage 2: Intermediate Surface Levels


backtrack points (labelled 1, 2, 3 on the slide): other choices

27

Stage 2: Intermediate Surface Levels

• problem for generation
– ?- q0(X,[f,o,x,e,s,#]).
X = [^|L1]
• ?- q0(L1,[f,o,x,e,s,#]).
L1 = [^|L1’]
– ?- q0(L1’,[f,o,x,e,s,#]).
– infinite loop
– Culprit: ^:ε case (morpheme boundary deletion)
– can keep introducing ^^^^^^^... ad infinitum
– requires more than finite state power to correct

q0([],[]). % final state
q0([^|L1],L2) :- !, q0(L1,L2). % ^:ε
q0([z|L1],[z|L2]) :- !, q1(L1,L2). % repeat for s,x
q0([#|L1],[#|L2]) :- !, q0(L1,L2).
q0([X|L1],[X|L2]) :- q0(L1,L2). % other

28

Stage 2: Intermediate Surface Levels

• Other cases of ^:ε do not loop. Could eliminate just the loop case.
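One way to keep the Prolog search finite in the generation direction, offered here only as a sketch and not as the fix intended in the lecture, is to cap how many ^ symbols may be reintroduced. The predicates del/2 and gen/3 are hypothetical, and length/2 and between/3 are used as provided by SWI-Prolog. The idea is to fix the length of the intermediate string before running a cut-free ^-deleting relation, so the ^:ε clause can only fire a bounded number of times.

```prolog
% del(?Intermediate, ?Surface): delete every ^ (the ^:ε arc), copy the rest.
del([], []).
del(['^'|L1], L2) :-
    del(L1, L2).
del([C|L1], [C|L2]) :-
    C \== '^',
    del(L1, L2).

% gen(+Surface, +MaxCarets, -Intermediate): enumerate intermediate strings
% containing at most MaxCarets boundary symbols. Fixing the length of
% Intermediate first guarantees that del/2 terminates.
gen(Surface, MaxCarets, Intermediate) :-
    length(Surface, N),
    Max is N + MaxCarets,
    between(N, Max, Len),
    length(Intermediate, Len),
    del(Intermediate, Surface).
```

With a cap of one boundary, ?- findall(I, gen([c,a,t],1,I), Is). collects five candidates: [c,a,t] itself plus the four ways of placing a single ^. Without the length bound, the query ?- del(X,[c,a,t]). would loop in the same way as the q0 example.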

29

Stage 2: Intermediate Surface Levels

• query (generation)
– ?- q0(X,[c,a,t,s,#]).
• X = [c, a, t, s, ^, #] ;    q0+ -> q1 -> q2 -> q0
• X = [c, a, t, s, #] ;    q0+ -> q1 -> q0
• No

30

Looking ahead

• Read Chapter 5: Probabilistic Models of (Pronunciation and) Spelling