Linguistica
Powerpoint?
This presentation borrows heavily from slides written by John Goldsmith who has graciously given me permission to use them. Thanks, John.
He also says I should enjoy my trip, and one way to do that is to not have to write as many slides while I’m here!
Linguistica
A C++ program that runs under Windows, Mac OS X, and Linux, available at:
http://humanities.uchicago.edu/faculty/goldsmith/
There are explanations, papers, and other downloadable tools available there.
References (for the 1st part)
Goldsmith (2001) “Unsupervised Learning of the Morphology of a Natural Language” Computational Linguistics
Overview
Look at Linguistica in action: English, French
Theoretical foundations
Underlying heuristics
Further work
Linguistica
A program that takes in a text in an “unknown” language…
…and produces a morphological analysis: a list of stems, prefixes, and suffixes; more deeply embedded morphological structure; and regular allomorphy.
Linguistica
Actions and outlines of information
Here: lists of stems, affixes, signatures, etc.
Here: some messages from the analyst to the user.
Read a corpus
Brown corpus: 1,200,000 words of typical English; French Encarta; or anything else you like, in a text file.
Set the number of words you want read, then select the file.
A stem’s signature is the list of suffixes it appears with in the corpus, in alphabetical order.

abilit      ies.y      abilities, ability
aboli       tion       abolition
absen       ce.t       absence, absent
absolute    NULL.ly    absolute, absolutely
List of stems
List of signatures
Signature: NULL.ed.ing.s
For example:
account   accounted   accounting   accounts
add       added       adding       adds
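To make the notion concrete, here is a minimal Python sketch of grouping stems into signatures (the `cuts` input is a hypothetical example; Linguistica itself is a C++ program and does not expose this interface):

```python
from collections import defaultdict

def signatures(cuts):
    """Group stems by the alphabetized list of suffixes they appear with.

    `cuts` maps each stem to the set of suffixes observed with it;
    the empty suffix is written as 'NULL', as on the slides.
    """
    sig_to_stems = defaultdict(list)
    for stem, suffixes in cuts.items():
        sig = ".".join(sorted(suffixes))   # e.g. 'NULL.ed.ing.s'
        sig_to_stems[sig].append(stem)
    return sig_to_stems

cuts = {"account": {"NULL", "ed", "ing", "s"},
        "add":     {"NULL", "ed", "ing", "s"},
        "abilit":  {"ies", "y"}}
for sig, stems in signatures(cuts).items():
    print(sig, sorted(stems))
# NULL.ed.ing.s ['account', 'add']
# ies.y ['abilit']
```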
Signature <e>ion . NULL
composite concentrate corporate détente discriminate evacuate inflate opposite participate probate prosecute tense
What is this?
composite and composition
composite → composit → composit + ion = composition
It infers that -ion deletes a stem-final ‘e’ before attaching.
We’ll see how we can find a more sophisticated signature…
Top signatures in English
Over-arching theory
The selection of a grammar, given the data, is an optimization problem.
Optimization means finding a maximum or minimum of some objective function
Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function.
(We’ll get to MDL in a moment)
What’s being minimized by writing a good morphology? The number of letters is part of it
Compare:
Naive Minimum Description Length
Corpus:
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total: 61 letters
Analysis:
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing
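A tiny sketch that reproduces the letter counts in this example:

```python
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]
stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    return sum(len(w) for w in words)

print(letters(corpus))                                           # 61
print(letters(stems) + letters(suffixes) + letters(unanalyzed))  # 29
```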
Minimum Description Length (MDL)
Rissanen (1989) (not a CL paper)
The best “theory” of a set of data is the one which is simultaneously:
1. most compact or concise, and
2. provides the best modeling of the data.
“Most compact” can be measured in bits, using information theory
“Best modeling” can also be measured in bits…
Essence of MDL
[Figure: Description Length = length of morphology + log prob of corpus, plotted (0 to 700,000) across candidate analyses: an elegant theory that works badly sits at one extreme, a complex theory modeled from the data at the other, and the best analysis lies at the minimum between them.]
Description Length =
Conciseness: Length of the morphology. It’s almost as if you count up the number of symbols in the morphology (in the stems, the affixes, and the rules).
Length of the modeling of the data. We want a measure which gets bigger as the morphology is a worse description of the data.
Add these two lengths together = Description Length
Conciseness of the morphology
Sum all the letters, plus all the structure inherent in the description, using information theory.
Remember Entropy?

$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$

Entropy is the weighted (by p(x)) sum of the information content, or optimal compressed length, $-\log_2 p(x)$, of each symbol x. It is called that because it is always possible to develop a compression scheme in which a symbol x, emitted with probability p(x), is represented by a code of length $-\log_2 p(x)$ bits.
Optimal Compressed Length
The reason this is mentioned is that we will have lots of pieces of information in our model, and we’d like to figure out how much “space” it takes up.
Remember, we want the smallest model possible, so we are going to want the best compression for anything in our model
Also, remember this:
$-\log p(x) = \log \frac{1}{p(x)}$
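Both quantities are easy to compute; a quick numeric sketch (the distribution is a toy, made up for illustration):

```python
import math

# Toy distribution over four symbols.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Optimal compressed length of each symbol: -log2 p(x) bits.
for x, px in p.items():
    print(x, -math.log2(px))    # a: 1.0, b: 2.0, c: 3.0, d: 3.0

# Entropy: the p(x)-weighted average of those lengths.
H = -sum(px * math.log2(px) for px in p.values())
print(H)                        # 1.75 (bits per symbol)
```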
Conciseness of stem list and suffix list
(ii) Suffix list: $\sum_{f \in \mathrm{Suffixes}} \left( \lambda |f| + \log \frac{[W_A]}{[f]} \right)$

(iii) Stem list: $\sum_{t \in \mathrm{Stems}} \left( \lambda |t| + \log \frac{[W]}{[t]} \right)$

Here $|f|$ and $|t|$ are the number of letters in the suffix and in the stem; $\lambda$ is the number of bits per letter (< 5); and each log term is the cost of setting up that entity: the length of a pointer to it, in bits.
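A sketch of those two sums in Python (the token counts and the bits-per-letter value are made-up illustrations, not values from Linguistica):

```python
import math

def list_cost(counts, total, bits_per_letter=4.7):
    """lambda * |item| bits of letters, plus a pointer of
    log2(total / count) bits, for each entry in a stem or suffix list."""
    return sum(bits_per_letter * len(item) + math.log2(total / c)
               for item, c in counts.items())

# Hypothetical corpus counts, for illustration only.
stem_counts   = {"jump": 3, "laugh": 3, "dog": 2}       # [t] per stem
suffix_counts = {"NULL": 3, "s": 2, "ing": 2, "ed": 1}  # [f] per suffix

print(list_cost(stem_counts, total=8))    # stem list,   [W]   = 8 tokens
print(list_cost(suffix_counts, total=8))  # suffix list, [W_A] = 8 tokens
```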
Signature list length
$\sum_{\sigma \in \mathrm{Signatures}} \log \frac{[W]}{[\sigma]}$   (list of pointers to signatures)

$+ \sum_{\sigma \in \mathrm{Signatures}} \left( \log \langle \mathrm{stems}(\sigma) \rangle + \log \langle \mathrm{suffixes}(\sigma) \rangle \right)$

$+ \sum_{\sigma \in \mathrm{Sigs}} \left( \sum_{t \in \mathrm{Stems}(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in \mathrm{Suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \right)$

$\langle X \rangle$ indicates the number of distinct elements in X.
Length of the modeling of the data
Probabilistic morphology: the measure is $-\log \mathrm{probability}(\mathrm{data})$,
where the morphology assigns a probability to any data set.
This is known in information theory as the optimal compressed length of the data (given the model).
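To make “optimal compressed length of the data” concrete, here is a sketch under the simplest possible model, a unigram distribution over words (Goldsmith’s actual model factors each word’s probability through its signature, stem, and suffix):

```python
import math
from collections import Counter

def compressed_length_bits(tokens):
    """-log2 P(corpus): the sum of -log2 p(word) over all tokens,
    with p(word) estimated by relative frequency."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

print(compressed_length_bits(["the", "dog", "jumps",
                              "the", "dog", "sang"]))  # about 11.5 bits
```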
Probability of a data set?
A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure).
If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.
This follows from the basic principle of rationality in the Universe:
Maximize the probability of the observed data.
From all this, it follows:
There is an objective answer to the question: which of two analyses of a given set of data is better?
However, there is no general, practical guarantee of being able to find the best analysis of a given set of data.
Hence, we need to think of (this sort of) linguistics as being divided into two parts:
An evaluator (which computes the Description Length); and
A set of heuristics, which create grammars from data, and which propose modifications of grammars, in the hopes of improving the grammar.
(Remember, these “things” are mathematical things: algorithms.)
Let’s step back for a minute
Why is this problem so hard at first? Because figuring out the best analysis of any given word generally requires having figured out the rough outlines of the whole overall morphology. (The same is true for other parts of the grammar!)
How do we start?
You all know the answer to this question already…
We start with Zellig Harris’ successor frequency!
Although we got some good answers, we also saw that it made lots of mistakes
So…
As a bootstrapping method to construct a first approximation of the signatures, Harris’ method is pretty good. We accept only stems of 5 letters or more, and only cuts where the SuccFreq is > 1 and where the neighboring SuccFreq is 1. (This setup was experiment 16 from the lab on Monday; a sketch follows.)
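Here is the promised Python sketch of successor frequency (the stem-length and SuccFreq thresholds above would be applied on top of these counts):

```python
from collections import defaultdict

def successor_frequencies(words):
    """Harris' successor frequency: for each prefix of each word,
    count how many distinct letters can follow it."""
    followers = defaultdict(set)
    for w in words:
        w += "#"                        # end-of-word marker
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])
    return {prefix: len(s) for prefix, s in followers.items()}

sf = successor_frequencies(["jump", "jumps", "jumping", "jumped"])
print(sf["jump"])   # 4 ('#', 's', 'i', 'e'): a candidate stem boundary
print(sf["jum"])    # 1 (only 'p'): no cut here
```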
Let’s look at how the work is done (in the abstract), step by step...
1. Pick a large corpus from a language: 5,000 to 1,000,000 words.
2. Feed it into the “bootstrapping” heuristic, out of which comes a preliminary morphology, which need not be superb.
3. Feed the morphology to the incremental heuristics (which we haven’t seen yet); out comes a modified morphology.
4. Is the modification an improvement? Ask MDL!
5. If it is an improvement, replace the morphology (the old one goes in the garbage).
6. Send the result back to the incremental heuristics again.
7. Continue until there are no improvements to try.
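The same loop as a Python sketch; every callable here is a stand-in for the machinery described above, not Linguistica’s actual interface:

```python
def learn_morphology(corpus, bootstrap, heuristics, description_length):
    """Bootstrap a morphology, then let each incremental heuristic
    propose a modification, keeping it only if the description length
    goes down; stop when no heuristic improves anything."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            proposal = heuristic(morphology, corpus)
            if description_length(proposal, corpus) < \
                    description_length(morphology, corpus):
                morphology = proposal    # the old one goes in the garbage
                improved = True
    return morphology
```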
The details of learning morphology
There is nothing sacred about the particular choice of heuristic steps
Steps:
1. Successor Frequency: strict.
2. Extend signatures to cases where a word is composed of a known stem and a known suffix.
3. Loose fit: Look at all unanalyzed words. See if they can be cut into stem + suffix, where the suffix already exists. Do this in all possible ways. See if any of these lead to stems with signatures that already exist. If so, take the “best” one. If not, compute the utility of the signature using MDL.
4. Check existing signatures: use MDL to find the best stem/suffix cut. Examples…
Check signatures (English)
on/ve → ion/ive
an/en → man/men
l/tion → al/ation
m/t → alism/alist, etc.
How?
Check signatures
Signature l/tion, with stems: federa, inaugura, orienta, substantia.
We need to compute the Description Length of the analysis as it stands versus as it would be if we shifted varying parts of the stems to the suffixes.
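A sketch of how such candidate re-cuttings can be enumerated (illustration only; MDL would then be computed for each variant):

```python
def shifted_variants(stems, suffixes, max_shift=3):
    """Move the last k letters shared by every stem onto the suffixes,
    e.g. l/tion with stems federa... -> al/ation with stems feder..."""
    for k in range(1, max_shift + 1):
        tails = {s[-k:] for s in stems}
        if len(tails) != 1:              # every stem must share the tail
            return
        tail = tails.pop()
        yield ({s[:-k] for s in stems},
               {tail + suf for suf in suffixes})

stems = {"federa", "inaugura", "orienta", "substantia"}
for new_stems, new_suffixes in shifted_variants(stems, {"l", "tion"}):
    print(sorted(new_suffixes), sorted(new_stems))
# ['al', 'ation'] ['feder', 'inaugur', 'orient', 'substanti']
```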
“Check signatures” French:
NULL nt r → a ant ar
NULL nt → i int
ent t → oient oit
NULL r → i ir
f on ve → sif sion sive
eur ion → seur sion
ce t → ruce rut
se x → ouse oux
l ux → al aux
me te → ume ute
eurs ion → teurs tion
f ve → dif dive
it nt → ait ant
que sme → ïque ïsme
NULL s ur → e es eur
ient nt → aient ant
f on → sif sion
nt r → ent er
100,000 tokens, 12,208 types
Step                 Stems    Signatures    Suffixes
Zellig redux         1,403    140           68
Extend signatures    -        226           -
Loose fit            2,395    702           68
Check signatures     2,409    730           110
Smooth stems         2,400    735           115
Allomorphy
Find relations among stems: find principles of allomorphy, like
“delete stem-final e before –ing” on the grounds that this simplifies the collection of Signatures:
Compare the signatures NULL.ing and e.ing.
NULL.ing and e.ing
NULL.ing: its stems do not end in e.
-ing (almost) never appears after stem-final e (exception: singeing).
So e.ing and NULL.ing can both be subsumed under <e>ing.NULL, where <e>ing means a suffix ing which deletes a preceding e.
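Operationally, the <e> notation can be read as follows (a minimal sketch, not Linguistica’s code):

```python
def attach(stem, suffix):
    """Attach a suffix to a stem; an initial <e> marks a suffix that
    deletes a preceding stem-final 'e' (the allomorphy on this slide)."""
    if suffix == "NULL":
        return stem
    if suffix.startswith("<e>"):
        suffix = suffix[3:]
        if stem.endswith("e"):
            stem = stem[:-1]
    return stem + suffix

print(attach("love", "<e>ing"))   # loving
print(attach("walk", "<e>ing"))   # walking
print(attach("love", "NULL"))     # love
```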
Find layers of affixation
Find roots (from among the Stem collection)
In other words, recursively look through our list of Stems and see if we could (or should) be analyzing them again:
readings = reading + s = read + ing + s, etc.
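A sketch of that recursion, with hypothetical stem and suffix sets:

```python
def root_of(stem, stems, suffixes):
    """Recursively strip a known suffix while the remainder is itself
    a known stem: readings -> reading + s -> read + ing + s."""
    for suf in sorted(suffixes, key=len, reverse=True):
        inner = stem[:-len(suf)]
        if stem.endswith(suf) and inner in stems:
            return root_of(inner, stems, suffixes) + [suf]
    return [stem]

stems = {"read", "reading", "readings"}
print(root_of("readings", stems, {"s", "ing"}))   # ['read', 'ing', 's']
```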
What’s the future work?
1. Identifying suffixes through syntactic behavior (syntax)
2. Better allomorphy (phonology)
3. Languages with more morphemes per word (“rich” morphology)
“Using eigenvectors of the bigram graph to infer grammatical features and categories” (Belkin & Goldsmith 2002)
Method
Build a graph in which “similar” words are adjacent;
Compute the normalized Laplacian of that graph (linear algebra: it just sounds fancy!);
Compute the eigenvectors with the lowest non-zero eigenvalues (more linear algebra);
Plot them.
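A numpy sketch of the last three steps (building the similarity graph from bigram contexts is elided, and the toy adjacency matrix is invented):

```python
import numpy as np

def spectral_embedding(A, k=2):
    """Coordinates from the normalized Laplacian
    L = I - D^(-1/2) A D^(-1/2): take the k eigenvectors with the
    smallest non-zero eigenvalues and use them as plot coordinates."""
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]             # skip the trivial zero eigenvalue

# Toy graph: two clusters of "similar" words joined by a single edge.
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
print(spectral_embedding(A))   # one 2-D point per word, ready to plot
```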
Map 1,000 English words by left-hand neighbors
non-finite verbs: be, do, go, make, see, get, take, say, put, find, give, provide, keep, run…
finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took, asked…
world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door, …
?: and, to, in, that, for, he, as, with, on, by, at, or, from…
Map 1,000 English words by right-hand neighbors
adjectives
social national white local political personal private strong medical final black French technical nuclear British
Prepositions: of in for on by at from into after through under since during against among within along across including near
End