Linguistica
Powerpoint?
This presentation borrows heavily from slides written by John Goldsmith who has graciously given me permission to use them. Thanks, John.
He also says I should enjoy my trip, and one way to do that is to not have to write as many slides while I’m here!
Linguistica
A C++ program that runs under Windows, Mac OS X, and Linux, available at:
http://humanities.uchicago.edu/faculty/goldsmith/
There are explanations, papers, and other downloadable tools available there.
References (for the 1st part)
Goldsmith (2001) “Unsupervised Learning of the Morphology of a Natural Language” Computational Linguistics
Overview
Look at Linguistica in action: English, French
Theoretical foundations
Underlying heuristics
Further work
Linguistica
A program that takes in a text in an “unknown” language…
…and produces a morphological analysis: a list of stems, prefixes, and suffixes; more deeply embedded morphological structure; and regular allomorphy.
Linguistica
Actions and outlines of information
Here: lists of stems, affixes, signatures, etc.
Here: some messages from the analyst to the user.
Read a corpus
Brown corpus: 1,200,000 words of typical English; French Encarta; or anything else you like, in a text file.
Set the number of words you want read, then select the file.
A stem’s signature is the list of suffixes it appears with in the corpus, in alphabetical order.

abilit      ies.y      abilities, ability
aboli       tion       abolition
absen       ce.t       absence, absent
absolute    NULL.ly    absolute, absolutely
List of stems
List of signatures
Signature: NULL.ed.ing.s
For example:
account   accounted   accounting   accounts
add       added       adding       adds
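To make the notion concrete, here is a minimal Python sketch of grouping stems into signatures (the `cuts` input is a hypothetical example; Linguistica itself is a C++ program and does not expose this interface):

```python
from collections import defaultdict

def signatures(cuts):
    """Group stems by the alphabetized list of suffixes they appear with.

    `cuts` maps each stem to the set of suffixes observed with it;
    the empty suffix is written as 'NULL', as on the slides.
    """
    sig_to_stems = defaultdict(list)
    for stem, suffixes in cuts.items():
        sig = ".".join(sorted(suffixes))   # e.g. 'NULL.ed.ing.s'
        sig_to_stems[sig].append(stem)
    return sig_to_stems

cuts = {"account": {"NULL", "ed", "ing", "s"},
        "add":     {"NULL", "ed", "ing", "s"},
        "abilit":  {"ies", "y"}}
for sig, stems in signatures(cuts).items():
    print(sig, sorted(stems))
# NULL.ed.ing.s ['account', 'add']
# ies.y ['abilit']
```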
Signature <e>ion . NULL
composite concentrate corporate détente discriminate evacuate inflate opposite participate probate prosecute tense
What is this?
composite and composition
composite → composit → composit + ion = composition
It infers that -ion deletes a stem-final ‘e’ before attaching.
We’ll see how we can find a more sophisticated signature…
Top signatures in English
Over-arching theory
The selection of a grammar, given the data, is an optimization problem.
Optimization means finding a maximum or minimum of some objective function
Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function.
(We’ll get to MDL in a moment)
What’s being minimized by writing a good morphology? The number of letters is part of it
Compare:
Naive Minimum Description Length
Corpus:
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total: 61 letters
Analysis:
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing
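A tiny sketch that reproduces the letter counts in this example:

```python
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]
stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    return sum(len(w) for w in words)

print(letters(corpus))                                           # 61
print(letters(stems) + letters(suffixes) + letters(unanalyzed))  # 29
```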
Minimum Description Length (MDL)
Rissanen (1989) (not a CL paper)
The best “theory” of a set of data is the one which is simultaneously:
1. most compact or concise, and
2. provides the best modeling of the data.
“Most compact” can be measured in bits, using information theory
“Best modeling” can also be measured in bits…
Essence of MDL
[Figure: Description Length = length of morphology + log prob of corpus, plotted (0 to 700,000) across candidate analyses: an elegant theory that works badly sits at one extreme, a complex theory modeled from the data at the other, and the best analysis lies at the minimum between them.]
Description Length =
Conciseness: Length of the morphology. It’s almost as if you count up the number of symbols in the morphology (in the stems, the affixes, and the rules).
Length of the modeling of the data. We want a measure which gets bigger as the morphology is a worse description of the data.
Add these two lengths together = Description Length
Conciseness of the morphology
Sum all the letters, plus all the structure inherent in the description, using information theory.
Remember Entropy?

$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$

Entropy is the weighted (by p(x)) sum of the information content, or optimal compressed length, $-\log_2 p(x)$, of each symbol x. It is called that because it is always possible to develop a compression scheme in which a symbol x, emitted with probability p(x), is represented by a code of length $-\log_2 p(x)$ bits.
Optimal Compressed Length
The reason this is mentioned is that we will have lots of pieces of information in our model, and we’d like to figure out how much “space” it takes up.
Remember, we want the smallest model possible, so we are going to want the best compression for anything in our model
Also, remember this:
$-\log p(x) = \log \frac{1}{p(x)}$
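Both quantities are easy to compute; a quick numeric sketch (the distribution is a toy, made up for illustration):

```python
import math

# Toy distribution over four symbols.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Optimal compressed length of each symbol: -log2 p(x) bits.
for x, px in p.items():
    print(x, -math.log2(px))    # a: 1.0, b: 2.0, c: 3.0, d: 3.0

# Entropy: the p(x)-weighted average of those lengths.
H = -sum(px * math.log2(px) for px in p.values())
print(H)                        # 1.75 (bits per symbol)
```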
Conciseness of stem list and suffix list
(ii) Suffix list: $\sum_{f \in \mathrm{Suffixes}} \left( \lambda |f| + \log \frac{[W_A]}{[f]} \right)$

(iii) Stem list: $\sum_{t \in \mathrm{Stems}} \left( \lambda |t| + \log \frac{[W]}{[t]} \right)$

Here $|f|$ and $|t|$ are the number of letters in the suffix and in the stem; $\lambda$ is the number of bits per letter (< 5); and each log term is the cost of setting up that entity: the length of a pointer to it, in bits.
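A sketch of those two sums in Python (the token counts and the bits-per-letter value are made-up illustrations, not values from Linguistica):

```python
import math

def list_cost(counts, total, bits_per_letter=4.7):
    """lambda * |item| bits of letters, plus a pointer of
    log2(total / count) bits, for each entry in a stem or suffix list."""
    return sum(bits_per_letter * len(item) + math.log2(total / c)
               for item, c in counts.items())

# Hypothetical corpus counts, for illustration only.
stem_counts   = {"jump": 3, "laugh": 3, "dog": 2}       # [t] per stem
suffix_counts = {"NULL": 3, "s": 2, "ing": 2, "ed": 1}  # [f] per suffix

print(list_cost(stem_counts, total=8))    # stem list,   [W]   = 8 tokens
print(list_cost(suffix_counts, total=8))  # suffix list, [W_A] = 8 tokens
```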
Signature list length
$\sum_{\sigma \in \mathrm{Signatures}} \log \frac{[W]}{[\sigma]}$   (list of pointers to signatures)

$+ \sum_{\sigma \in \mathrm{Signatures}} \left( \log \langle \mathrm{stems}(\sigma) \rangle + \log \langle \mathrm{suffixes}(\sigma) \rangle \right)$

$+ \sum_{\sigma \in \mathrm{Sigs}} \left( \sum_{t \in \mathrm{Stems}(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in \mathrm{Suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \right)$

$\langle X \rangle$ indicates the number of distinct elements in X.
Length of the modeling of the data
Probabilistic morphology: the measure is $-\log \mathrm{probability}(\mathrm{data})$,
where the morphology assigns a probability to any data set.
This is known in information theory as the optimal compressed length of the data (given the model).
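To make “optimal compressed length of the data” concrete, here is a sketch under the simplest possible model, a unigram distribution over words (Goldsmith’s actual model factors each word’s probability through its signature, stem, and suffix):

```python
import math
from collections import Counter

def compressed_length_bits(tokens):
    """-log2 P(corpus): the sum of -log2 p(word) over all tokens,
    with p(word) estimated by relative frequency."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

print(compressed_length_bits(["the", "dog", "jumps",
                              "the", "dog", "sang"]))  # about 11.5 bits
```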
Probability of a data set?
A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure).
If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.
This follows from the basic principle of rationality in the Universe:
Maximize the probability of the observed data.
From all this, it follows:
There is an objective answer to the question: which of two analyses of a given set of data is better?
However, there is no general, practical guarantee of being able to find the best analysis of a given set of data.
Hence, we need to think of (this sort of) linguistics as being divided into two parts:
An evaluator (which computes the Description Length); and
A set of heuristics, which create grammars from data, and which propose modifications of grammars, in the hopes of improving the grammar.
(Remember, these “things” are mathematical things: algorithms.)
Let’s step back for a minute
Why is this problem so hard at first? Because figuring out the best analysis of any given word generally requires having figured out the rough outlines of the whole overall morphology. (The same is true for other parts of the grammar!)
How do we start?
You all know the answer to this question already…
We start with Zellig Harris’ successor frequency!
Although we got some good answers, we also saw that it made lots of mistakes
So…
As a bootstrapping method to construct a first approximation of the signatures, Harris’ method is pretty good. We accept only stems of 5 letters or more, and only cuts where the SuccFreq is > 1 and where the neighboring SuccFreq is 1. (This setup was experiment 16 from the lab on Monday; a sketch follows.)
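Here is the promised Python sketch of successor frequency (the stem-length and SuccFreq thresholds above would be applied on top of these counts):

```python
from collections import defaultdict

def successor_frequencies(words):
    """Harris' successor frequency: for each prefix of each word,
    count how many distinct letters can follow it."""
    followers = defaultdict(set)
    for w in words:
        w += "#"                        # end-of-word marker
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])
    return {prefix: len(s) for prefix, s in followers.items()}

sf = successor_frequencies(["jump", "jumps", "jumping", "jumped"])
print(sf["jump"])   # 4 ('#', 's', 'i', 'e'): a candidate stem boundary
print(sf["jum"])    # 1 (only 'p'): no cut here
```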
Let’s look at how the work is done (in the abstract), step by step...
1. Pick a large corpus from a language: 5,000 to 1,000,000 words.
2. Feed it into the “bootstrapping” heuristic, out of which comes a preliminary morphology, which need not be superb.
3. Feed the morphology to the incremental heuristics (which we haven’t seen yet); out comes a modified morphology.
4. Is the modification an improvement? Ask MDL!
5. If it is an improvement, replace the morphology (the old one goes in the garbage).
6. Send the result back to the incremental heuristics again.
7. Continue until there are no improvements to try.
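The same loop as a Python sketch; every callable here is a stand-in for the machinery described above, not Linguistica’s actual interface:

```python
def learn_morphology(corpus, bootstrap, heuristics, description_length):
    """Bootstrap a morphology, then let each incremental heuristic
    propose a modification, keeping it only if the description length
    goes down; stop when no heuristic improves anything."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            proposal = heuristic(morphology, corpus)
            if description_length(proposal, corpus) < \
                    description_length(morphology, corpus):
                morphology = proposal    # the old one goes in the garbage
                improved = True
    return morphology
```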
The details of learning morphology
There is nothing sacred about the particular choice of heuristic steps
Steps:
1. Successor Frequency: strict.
2. Extend signatures to cases where a word is composed of a known stem and a known suffix.
3. Loose fit: Look at all unanalyzed words. See if they can be cut into stem + suffix, where the suffix already exists. Do this in all possible ways. See if any of these lead to stems with signatures that already exist. If so, take the “best” one. If not, compute the utility of the signature using MDL.
4. Check existing signatures: use MDL to find the best stem/suffix cut. Examples…
Check signatures (English)
on/ve → ion/ive
an/en → man/men
l/tion → al/ation
m/t → alism/alist, etc.
How?
Check signatures
Signature l/tion, with stems: federa, inaugura, orienta, substantia.
We need to compute the Description Length of the analysis as it stands versus as it would be if we shifted varying parts of the stems to the suffixes.
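A sketch of how such candidate re-cuttings can be enumerated (illustration only; MDL would then be computed for each variant):

```python
def shifted_variants(stems, suffixes, max_shift=3):
    """Move the last k letters shared by every stem onto the suffixes,
    e.g. l/tion with stems federa... -> al/ation with stems feder..."""
    for k in range(1, max_shift + 1):
        tails = {s[-k:] for s in stems}
        if len(tails) != 1:              # every stem must share the tail
            return
        tail = tails.pop()
        yield ({s[:-k] for s in stems},
               {tail + suf for suf in suffixes})

stems = {"federa", "inaugura", "orienta", "substantia"}
for new_stems, new_suffixes in shifted_variants(stems, {"l", "tion"}):
    print(sorted(new_suffixes), sorted(new_stems))
# ['al', 'ation'] ['feder', 'inaugur', 'orient', 'substanti']
```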
“Check signatures” French:
NULL nt r → a ant ar
NULL nt → i int
ent t → oient oit
NULL r → i ir
f on ve → sif sion sive
eur ion → seur sion
ce t → ruce rut
se x → ouse oux
l ux → al aux
me te → ume ute
eurs ion → teurs tion
f ve → dif dive
it nt → ait ant
que sme → ïque ïsme
NULL s ur → e es eur
ient nt → aient ant
f on → sif sion
nt r → ent er
100,000 tokens, 12,208 types
Step                 Stems    Signatures    Suffixes
Zellig redux         1,403    140           68
Extend signatures    -        226           -
Loose fit            2,395    702           68
Check signatures     2,409    730           110
Smooth stems         2,400    735           115
Allomorphy
Find relations among stems: find principles of allomorphy, like
“delete stem-final e before –ing” on the grounds that this simplifies the collection of Signatures:
Compare the signatures NULL.ing and e.ing.
NULL.ing and e.ing
NULL.ing: its stems do not end in e.
-ing (almost) never appears after stem-final e (exception: singeing).
So e.ing and NULL.ing can both be subsumed under <e>ing.NULL, where <e>ing means a suffix ing which deletes a preceding e.
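Operationally, the <e> notation can be read as follows (a minimal sketch, not Linguistica’s code):

```python
def attach(stem, suffix):
    """Attach a suffix to a stem; an initial <e> marks a suffix that
    deletes a preceding stem-final 'e' (the allomorphy on this slide)."""
    if suffix == "NULL":
        return stem
    if suffix.startswith("<e>"):
        suffix = suffix[3:]
        if stem.endswith("e"):
            stem = stem[:-1]
    return stem + suffix

print(attach("love", "<e>ing"))   # loving
print(attach("walk", "<e>ing"))   # walking
print(attach("love", "NULL"))     # love
```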
Find layers of affixation
Find roots (from among the Stem collection)
In other words, recursively look through our list of Stems and see if we could (or should) be analyzing them again:
readings = reading + s = read + ing + s, etc.
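A sketch of that recursion, with hypothetical stem and suffix sets:

```python
def root_of(stem, stems, suffixes):
    """Recursively strip a known suffix while the remainder is itself
    a known stem: readings -> reading + s -> read + ing + s."""
    for suf in sorted(suffixes, key=len, reverse=True):
        inner = stem[:-len(suf)]
        if stem.endswith(suf) and inner in stems:
            return root_of(inner, stems, suffixes) + [suf]
    return [stem]

stems = {"read", "reading", "readings"}
print(root_of("readings", stems, {"s", "ing"}))   # ['read', 'ing', 's']
```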
What’s the future work?
1. Identifying suffixes through syntactic behavior (syntax)
2. Better allomorphy (phonology)
3. Languages with more morphemes per word (“rich” morphology)
“Using eigenvectors of the bigram graph to infer grammatical features and categories” (Belkin & Goldsmith 2002)
Method
Build a graph in which “similar” words are adjacent;
Compute the normalized Laplacian of that graph (linear algebra: it just sounds fancy!);
Compute the eigenvectors with the lowest non-zero eigenvalues (more linear algebra);
Plot them.
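A numpy sketch of the last three steps (building the similarity graph from bigram contexts is elided, and the toy adjacency matrix is invented):

```python
import numpy as np

def spectral_embedding(A, k=2):
    """Coordinates from the normalized Laplacian
    L = I - D^(-1/2) A D^(-1/2): take the k eigenvectors with the
    smallest non-zero eigenvalues and use them as plot coordinates."""
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]             # skip the trivial zero eigenvalue

# Toy graph: two clusters of "similar" words joined by a single edge.
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
print(spectral_embedding(A))   # one 2-D point per word, ready to plot
```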
Map 1,000 English words by left-hand neighbors
non-finite verbs: be, do, go, make, see, get, take, say, put, find, give, provide, keep, run…
finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took, asked…
world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door, …
?: and, to, in, that, for, he, as, with, on, by, at, or, from…
Map 1,000 English words by right-hand neighbors
adjectives
social national white local political personal private strong medical final black French technical nuclear British
Prepositions: of in for on by at from into after through under since during against among within along across including near
End