computational lexicology, morphology and syntax diana trandab ă ț academic year 2014-2015

66
Computational Lexicology, Morphology and Syntax Diana Trandabăț Academic year 2014-2015

Upload: randell-walker

Post on 12-Jan-2016

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Computational Lexicology, Morphology and Syntax

Diana TrandabățAcademic year 2014-2015

Page 2: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Today’s Topics• Finite State Technology

• Regular Languages and Relations

• Review of Set Theory

• Understand the mathematical operations that can be performed on such Languages.

• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

Page 3: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

What is Finite State Technology?

• Finite State Technology refers to a collection of techniques for application of Finite State Automata (FSA) to a range of linguistically motivated problems.

• Such Techniques include– Design of user languages for specifying FSA– Compilation of such languages into efficient

transition networks.– Development environments and runtime systems

Page 4: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

What is Finite-State Technology Good For?

• Finite-state techniques cannot handle central embedding– the man the dog the cat bit followed ate.

• They are well suited to “lower-level” natural language processing such as – Tokenization – what is the next word?– Spelling error detection: does the next word

belong to a list?– Morphological/phonological analysis/generation– Shallow syntactic parsing and “chunking”

Page 5: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Languages,Notations and Machines

LANGUAGE(set of strings)

NOTATION MACHINE

Page 6: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Languages,Notations and Machines

FINITE STATELANGUAGE

FINITE STATENOTATION

FINITE STATEAUTOMATON

Page 7: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

FINITE STATE AUTOMATA:preliminary definition

• A finite state automaton includes:• A finite set of states• A finite set of labelled transitions between states

Page 8: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Physical Machines with Finite States

• The Lightswitch Machine

OFF ON

UP

DOWN

Page 9: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Physical Machines with Finite States

• The Lightswitch Toggle Machine

OFF ON

PUSH

PUSH

Page 10: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

The Five Cent Machine

• Problem:• Assume you have one, two, and five cent

pieces• Design a finite state automaton which accepts

exactly 5 cents.

Page 11: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

The Cola Machine

• Need to enter 25 cents (USA) to get a drink• Accepts the following coins:

– Nickel = 5 cents– Dime = 10 cents– Quarter = 25 cents

• For simplicity, our machine needs exact change

• We will model only the coin-accepting mechanism

Page 12: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Physical Machines with Finite States

• The Cola Machine

0

N

D

Q

N N NN

D D D

5 15 20 25

Start State Final State

10

Page 13: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

The Cola Machine Language• List of all the sequences of coins accepted:

– { Q, DDN, DND, NDD, DNNN, NDNN, NNDNNNND, NNNNN }

• Think of the coins as SYMBOLS or CHARACTERS

• The set of symbols accepted is the ALPHABET of the machine

• Think of sequences of coins as WORDS or “strings”

• The set of words accepted by the machine is its LANGUAGE

Page 14: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

FINITE STATE AUTOMATA:better definition

• A finite state automaton includes:• A finite set of states

– Initial State– Final State (s)

• A finite set of labelled transitions beween states• Labels are symbols from an alphabet• Recognises a language• Generates a language as well!

Page 15: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

A Network that Accepts aOne Word Language

c a n t o

Page 16: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

A Network that Accepts aThree Word Language

ca n t

o

t i g r e

m e s a

Page 17: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Scaling Up the Network

• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.

• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language

• This is the basis for a Spanish spelling error detector.

Page 18: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Looking Up a Word

ca n t

o

t i g r e

m e s a

m e s a“Apply”

Page 19: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Lookup Failure

• Lookup succeeds when all input is consumed and final state is reached. Lookup can fail because:

• Not all input is consumed ("libro", "tigra")• Input is fully consumed but state is not final

("cant")• Final state is reached but there is still

unconsumed output ("mesas")

Page 20: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Shared Structure

c l e a

e

v

r

e

Page 21: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Transducers

m e s a s“Lookup”

“Lookdown”

m e s a +Noun +Fem +Pl

m e s a 0 0 s

mesa+Noun+Fem+Pl

Page 22: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

A Morphological Analyzer

Transducer

dogs

dog +n +pl

Page 23: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

A Morphological Analyzer

Transducer

Surface Language

Lexical Language

Page 24: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

A Quick Review of Set Theory• A set is a collection of objects.

A B

D E

We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.

There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Page 25: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Uniqueness of Elements• You cannot have two or more • ‘A’ elements in the same set

A B

D E

{ A, A, D, B, E} is just a redundant specification of the set

{ A, D, B, E }.

Page 26: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Cardinality of Sets

• The Empty Set:

• A Finite Set:

• An Infinite Set: e.g. The Set of all Positive Integers

Norway Denmark Sweden

Page 27: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Simple Operations on Sets: Union

A B

C

DE

Set 1 Set 2

B C A D E

Union of Set1 and Set 2

Page 28: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Simple Operations on Sets (2): Union

A B

C

CD

Set 1 Set 2

B C A D

Union of Set1 and Set 2

Page 29: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Simple Operations on Sets (3): Intersection

A B

C

CD

Set 1 Set 2

C

Intersection of Set1 and Set 2

Page 30: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Simple Operations on Sets (4): Subtraction

A B

C

CD

Set 1 Set 2

A B

Set 1 minus Set 2

Page 31: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Formal Languages

Very Important Concept in Formal Language Theory:

A Language is just a Set of Words.

• We use the terms “word” and “string” interchangeably.

• A Language can be empty, have finite cardinality, or be infinite in size.

• You can union, intersect and subtract languages, just like any other sets.

Page 32: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Union of Languages (Sets)

dog cat rat elephant mouse

Language 1 Language 2

dog cat rat

elephant mouse

Union of Language 1 and Language 2

Page 33: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Intersection of Languages (Sets)

dog cat rat elephant mouse

Language 1 Language 2

Intersection of Language 1 and Language 2

Page 34: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Intersection of Languages (Sets)

dog cat rat rat mouse

Language 1 Language 2

Intersection of Language 1 and Language 2

rat

Page 35: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Subtraction of Languages (Sets)

dog cat rat rat mouse

Language 1 Language 2

Language 1 minus Language 2

dog cat

Page 36: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Languages

• A language is a set of words (=strings).

• Words (strings) are composed of symbols (letters) that are “concatenated” together.

• At another level, words are composed of “morphemes”.

• In most natural languages, we concatenate morphemes together to form whole words.

For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Page 37: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Concatenation of Languages

work talk walk

Root Language

0 ing ed s

Suffix Language

work working worked works talk talking talked talks walk walking walked walks

The concatenation of the Suffix language after the Root language.

Page 38: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Languages and Networks

w a l k

o r

t

Network/Language 1

Network/Language 2

s

o r

s The concatenation of Network 1 and Network 2

w a l k

t

a

as

ed

i n g

0

s

ed

i n g

0

s

Page 39: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Why is “Finite State” Computing so Interesting?

• Finite-state systems are mathematically elegant, easily manipulated and modifiable.

• Computationally efficient. Usually very compact.• The programming linguists do is declarative. We describe

the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.

• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.

• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Page 40: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Languages,Notations and Machines

FINITE STATELANGUAGE

FINITE STATENOTATION

FINITE STATEMACHINE

Page 41: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

41

Regular ExpressionsAbstract Definition

• 0 is a regular expression• ε is a regular expression• if α ϵ Σ is a letter then α is a regular

expression• if Ψ and Φ are regular expressions then so

are (Ψ + Φ) and (Ψ . Φ)• if Φ is a regular expression then so is (Φ)*• Nothing else is a regular expression

Page 42: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

42

Searching Text

• Analysis of written texts often involves searching for (and subsequent processing of):– a particular word – a particular phrase– a particular pattern of words involving gaps

• How can we specify the things we are searching for?

Page 43: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

43

Regular Expressions and Matching

• A Regular Expression is a special notation used to specify the things we want to search for.

• We use regular expressions to define patterns.• We then implement a matching operation

m(<pattern>,<text>) which tries to match the pattern against the text.

Page 44: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

44

Simple Regular Expressions

• Most ordinary characters match themselves.• For example, the pattern sing exactly

matches the string sing.• In addition, regular expressions provide us

with a set of special characters

Page 45: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

45

The Wildcard Symbol

• The “.” symbol is called a wildcard: it matches any single character.

• For example, the expression s.ng matches sang, sing, song, and sung.

• Note that "." will match not only alphabetic characters, but also numeric and whitespace characters.

• Consequently, s.ng will also match non-words such as s3ng

Page 46: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

46

Assignment

• Draw the FSM which corresponds tos.ng

Page 47: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

47

Repeated Wildcards

• We can also use the wildcard symbol for counting characters. For instance ....zy matches six-letter strings that end in zy.

• The pattern t... will match the words that and term

• It will also match the word sequence to a (since the third "." in the pattern can match the space character).

Page 48: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

48

Optionality

• The “?” symbol indicates that the immediately preceding regular expression is optional. The regular

• expression colou?r matches both British and American spellings, colour and color.

Page 49: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

49

Repetition

• The "+" symbol indicates that the immediately preceding expression is repeatable at least once

• For example, the regular expression "coo+l" matches cool, coool, and so on.

• This symbol is particularly effective when combined with the . symbol. For example,

• f .+ f matches all strings of length greater than two, that begin and end with the letter f (e.g foolproof).

Page 50: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

50

Repetition 2

• The “*” symbol indicates that the immediately preceding expression is both optional and repeatable.

• For example .*gnt.* matches all strings that contain gnt.

Page 51: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

51

Character Class

• The [ ] notation enumerates the set of characters to be matched is called a character class.

• For example, we can match any English vowel, but no consonant, using [aeiou].

• We can combine the [] notation with our notation for repeatability.

• For example, expression p[aeiou]+t matches peat, poet, and pout.

Page 52: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

52

The Choice Operator

• Often the choices we want to describe cannot be expressed at the level of individual characters.

• In such cases we use the choice operator "|" to indicate the alternate choices.

• The operands can be any expression.• For instance, jack | gill will match either jack

or gill.

Page 53: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

53

Choice Operator 2

• Note that the choice operator has wide scope, so that abc|def is a choice between abd and def, and not between abcef and abdef.

• The latter choice must be written using parentheses: ab(c|d)ef

Page 54: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

54

Ranges

• The [ ] notation is used to express a set of choices between individual characters.

• Instead of listing each character, it is also possible to express a range of characters, using the - operator.

• For example, [a-z] matches any lowercase letter

Page 55: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

55

Exercise

Write regular expressions matching• All 1 digit numbers• All 2 digit numbers• All date expressions such as 12/12/1950

Page 56: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

56

Ranges II

• Ranges can be combined with other operators.

• For example [A-Z][a-z]* matches words that have an initial capital letter followed by any number of lowercase letters.

• Ranges can be combined as in [A-Za-z] which matches any alphabetical character.

Page 57: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

57

Assignment

• What does the following expression match?• [b-df-hj-np-tv-z]+

Page 58: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

58

Complementation

• the character class [b-df-hj-np-tv-z] allows us to match consonants.

• However, this expression is quite cumbersome.

• A better alternative is to say: let’s match anything which isn’t a vowel.

• To do this, we need a way of expressing complementation.

Page 59: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

59

Complementation 2

• We do this using the symbol “^” as the first character within the class expression [ ].

• [^aeiou] is just like our earlier character class, except now the set of vowels is preceded by ^.

• The expression as a whole is interpreted as matching anything which fails to match [aeiou]

• In other words, it matches all lowercase consonants (plus all uppercase letters and non-alphabetic

Page 60: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

60

Complementation 3

• As another example, suppose we want to match any string which is enclosed by the HTML tags for boldface, namely <B> and </B>, We might try something like this: <B>.*</B>.

• This would successfully match <B>important</B>, but would also match <B>important</B> and <B>urgent</B>, since the <.*y subpattern will happily match all the characters from the end of important to the end of urgent.

Page 61: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

61

Complementation 4

• One way of ensuring that we only look at matched pairs of tags would be to use the expression <B>[^<]*</B>, where the character class matches anything other than a left angle bracket.

• Finally, note that character class complementation also works with ranges. Thus [^a-z] matches anything other than the lower case alphabetic characters a through z.

Page 62: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

62

Other Special Symbols

• Two important symbols in this are “^” and “$” which are used to anchor matches to the beginnings or ends of lines in a file.

• Note: “^” has two quite distinct uses: it is interpreted as complementation when it occurs as the first symbol within a character class, and as matching the beginning of lines when it occurs elsewhere in a pattern.

Page 63: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

63

Special Symbols 2

• As an example, [a-z]*s$ will match words ending in s that occur at the end of a line.

• Finally, consider the pattern ^$; this matches strings where no character occurs between the beginning and the end of a line — in other words, empty lines.

Page 64: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

64

Special Symbols 3• Special characters like “.”, “*”, “+” and “$” give us powerful

means to generalise over character strings.• Suppose we wanted to match against a string which itself

contains one or more special characters?• An example would be the arithmetic statement $5.00 * ($3.05

+ $0.85). • In this case, we need to resort to the so-called escape

character “\” (“backslash”).• For example, to match a dollar amount, we might use \$[1-9]

[0-9]*\.[0-9][0-9]

Page 65: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

65

Summary

• Regular Expressions are a special notation.• Regular expressions describe patterns which

can be matched in text.• A particular regular expression E stands for a

set of strings. We can thus say that E describes a language.

Page 66: Computational Lexicology, Morphology and Syntax Diana Trandab ă ț Academic year 2014-2015

Great!

See you next time!