Introduction Collocations Alignment Technical uses Linguistic applications Summary and conclusions
Automatic alignment and parallel corpora
Hanne Eckhoff Dag Haug
University of Oslo
October 13, 2009
Hanne Eckhoff, Dag Haug Automatic alignment and parallel corpora October 13, 2009 1 / 33
Exploiting translations
Codex Marianus is a translation from Greek, and this gives opportunities and challenges
In cases where categories do not fully overlap between the languages, the study of translation correspondences can shed light on both languages
Where categories overlap fully, the suspicion of Greek influence arises
Adding alignments to the database makes quantitative studies possible
The problem
The Bible is ‘sort of’ aligned via the verse numbering, which is approximately the same in each version
Mark 6.38
онъ же глагола имъ. колико имате хлѣбъ. ідѣте и видите. і
оувѣдѣвъше глаголашѧ. пѧть хлѣбъ и дъвѣ рыбѣ.
ho de legei autois. posous artous ekhete; upagete idete. kai gnontes legousin: pente, kai duo ikhthuas.
Potentially each Slavic word in this verse could be a translation of each Greek word
A simple solution that does not work
For each Slavic word we could just count all Greek words that occur in the same verse and select the most frequent one as the probable translation equivalent
However, we would run into problems with frequent words
The article occurs in practically every Greek verse and could end up as the best translation of many words
Although we could eliminate this lemma from consideration, other frequent words would create similar problems
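The naive count can be sketched as follows. This is only an illustration: the verses are invented toy data, `ryba` stands in as a transliteration for the Slavic ‘fish’ lemma, and `naive_best` is a hypothetical helper name.

```python
from collections import Counter

# Toy verse-aligned data: (Slavic lemmas, Greek lemmas) per verse.
# Invented for illustration; not taken from the corpus.
verses = [
    (["ryba", "duva"], ["ho", "ikhthus", "duo"]),
    (["ryba", "khlebu"], ["ho", "ikhthus", "artos"]),
    (["khlebu"], ["ho", "artos"]),
    (["ryba"], ["ho", "kai"]),
]

def naive_best(slavic_lemma, verses):
    """Return the Greek lemma that most often shares a verse with slavic_lemma."""
    counts = Counter()
    for slavic, greek in verses:
        if slavic_lemma in slavic:
            counts.update(set(greek))  # count each Greek lemma once per verse
    return counts.most_common(1)[0][0]

best = naive_best("ryba", verses)
```

On this toy data the article `ho` wins for `ryba` (three verses against two for `ikhthus`), which is exactly the frequent-word problem described above.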
Correspondences of въспѣти
Greek lemma                 cooccurrences
kai ‘and’                   4
exerkhomai ‘go out’         3
eis ‘to’                    3
humneo ‘sing’               2
elaia ‘olive’               2
alektor ‘rooster’           2
foneo ‘make a sound’        2
oros ‘mountain’             2
Correspondences of въспѣти 2
Greek lemma                 cooccurrences   occurrences   ratio
kai ‘and’                   4               2497          0.2%
exerkhomai ‘go out’         3               151           2.0%
eis ‘to’                    3               654           0.5%
humneo ‘sing’               2               2             100%
elaia ‘olive’               2               11            18.2%
alektor ‘rooster’           2               10            10%
foneo ‘make a sound’        2               32            6.3%
oros ‘mountain’             2               41            4.9%
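As a small sketch, the ratio column can be recomputed from the two count columns. The counts are copied from the table; `rank_by_ratio` is a hypothetical helper name. Note that the counts given for alektor (2 of 10) yield 20% rather than the 10% shown, so the sketch relies only on the counts.

```python
# (cooccurrences with the Slavic lemma, total occurrences) per Greek lemma,
# copied from the table above.
table = {
    "kai": (4, 2497),
    "exerkhomai": (3, 151),
    "eis": (3, 654),
    "humneo": (2, 2),
    "elaia": (2, 11),
    "alektor": (2, 10),
    "foneo": (2, 32),
    "oros": (2, 41),
}

def rank_by_ratio(table):
    """Sort Greek lemmas by cooccurrences / occurrences, highest ratio first."""
    return sorted(table, key=lambda g: table[g][0] / table[g][1], reverse=True)

ranking = rank_by_ratio(table)
```

Ranked this way, `humneo` comes first and `kai` last, the reverse of the raw cooccurrence ranking.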
A solution that does work
Looking at the number of cooccurrences relative to the total frequency of the Greek lemma gives us the right result
But if we want to have a measure of how good a correspondence is, we need to take into account the frequency of the Slavic lemma as well
Verses with въспѣти contain humneo in 100% of cases (but there are only two cases)
Intuitively, a correspondence would be more secure if it were based on more cooccurrences
This is where collocations become useful
Collocation analysis
Firth’s collocational meanings
One of the meanings of ‘ass’ is its habitual collocation with an immediately preceding ‘you silly ...’ (Firth 1951)
A collocation is also sometimes defined as a non-compositional expression, but this is not necessarily true of our collocations
The notion of collocation as lexical proximity can be formalized through statistical significance measures
Contingency tables
For each pair of occurring Greek and Slavic lemmata, we construct 2x2 contingency tables
              рыба    ¬ рыба
ikhthus       17      0
¬ ikhthus     6       3529
These data can be subjected to various statistical tests
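A table of this kind could be built from verse-aligned lemma lists roughly as below. The verses are invented toy data with transliterated Slavic lemmas, and `contingency` is a hypothetical helper; rows are Greek lemma present/absent, columns Slavic lemma present/absent, as in the table above.

```python
# Toy verse-aligned data: (Slavic lemmas, Greek lemmas) per verse.
# Invented for illustration; not taken from the corpus.
verses = [
    (["ryba"], ["ikhthus"]),
    (["ryba", "khlebu"], ["artos", "ikhthus"]),
    (["ryba"], ["artos"]),
    (["khlebu"], ["artos"]),
    (["i"], ["kai"]),
]

def contingency(slavic_lemma, greek_lemma, verses):
    """Count verses by presence/absence of the two lemmas.

    Returns [[both, greek only], [slavic only, neither]].
    """
    both = greek_only = slavic_only = neither = 0
    for slavic, greek in verses:
        has_s = slavic_lemma in slavic
        has_g = greek_lemma in greek
        if has_g and has_s:
            both += 1
        elif has_g:
            greek_only += 1
        elif has_s:
            slavic_only += 1
        else:
            neither += 1
    return [[both, greek_only], [slavic_only, neither]]

toy_table = contingency("ryba", "ikhthus", verses)
```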
Collocation measures
Chi-square is a standard test for significance in 2x2 contingency tables
But it is less well suited to skewed data like ours
Other options include maximum likelihood measures
or the Fisher exact test
Maximum likelihood measure
Likelihood approaches measure the probability of getting the observed distribution given the null hypothesis that the distribution is completely random
The less likely the distribution is, the more evidence we have against the null hypothesis (and thus for a non-random distribution)
The test is two-sided: both positive and negative association give high scores, since they both correspond to unlikely results given a random distribution (but this is in practice less of a problem)
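One common likelihood-based score for 2x2 tables is the log-likelihood ratio G². Whether this is the exact measure used here is an assumption; the sketch only illustrates the two-sidedness noted above, using the ikhthus table from the contingency slide. The function name `g2` is hypothetical.

```python
import math

def g2(table):
    """Log-likelihood ratio statistic: 2 * sum(O * ln(O / E)) over the four cells."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            if observed > 0:  # a zero cell contributes nothing (0 * ln 0 = 0)
                total += observed * math.log(observed / expected)
    return 2 * total

positive = [[17, 0], [6, 3529]]   # counts from the ikhthus table
negative = [[0, 17], [3529, 6]]   # same counts with the columns swapped

score_positive = g2(positive)
score_negative = g2(negative)
```

Swapping the columns turns a strong positive association into a strong negative one, yet the score is unchanged, so the score alone does not tell the two apart.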
Hypothesis testing
Hypothesis testing looks not just at how unexpected the actual distribution is, but also at the probability of getting more ‘extreme’ results than the observed distribution
There are several hypothesis tests, e.g. chi-square and Fisher
These tests are one-sided, which makes it possible to look at only positive association
p-values and statistical significance
The intuition behind p-values is that they express the likelihood that we would get the given distribution, or a more extreme one, if the lemmas were distributed in verses in a completely arbitrary way
For рыба and ikhthus this chance is 0.000000000000000000000000000000000000000016 (about 1.6 × 10^-41)
But notice that p-values ‘conflate’ population size and effect size
Once you have a large number of observations, most distributions are ‘rare’
p-values and statistical significance 2
               рыба    ¬ рыба
eis ‘into’     10      644
¬ eis ‘into’   13      2855
Table: Fisher exact p = 0.004737432
Word distributions are never random
There are all sorts of reasons why words cooccur (for example, fishers tend to put things into the sea to catch fish)
But as a relative measure p-values work well: translation equivalents have even more extreme distributions
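A one-sided Fisher exact test can be sketched with the standard library alone, summing hypergeometric probabilities for the observed count and all larger ones. The cell counts come from the two tables in this talk; the function name is hypothetical, and whether the quoted p-values were computed in exactly this way is an assumption.

```python
from math import comb

def fisher_one_sided(table):
    """P(at least the observed number of cooccurrences) with all margins fixed."""
    (a, b), (c, d) = table
    row1 = a + b            # verses containing the Greek lemma
    col1 = a + c            # verses containing the Slavic lemma
    n = a + b + c + d       # all verses
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)

p_ikhthus = fisher_one_sided([[17, 0], [6, 3529]])
p_eis = fisher_one_sided([[10, 644], [13, 2855]])
```

Both values fall below conventional significance levels, but the translation equivalent is smaller by dozens of orders of magnitude, which is what makes the p-value usable as a relative ranking score.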
Dictionary generation
Even in the collocation approach, frequent words more often turn up as translation equivalents than they should (because population size tends to make distributions more unlikely)
To correct for this, we compute ‘inverse collocations’ as well, i.e. we rank the Slavic words as equivalents of the Greek ones
For each Slavic lemma, we then combine the scores of the Slavic lemma as a correspondence to the Greek one and vice versa
We then create a dictionary where words are ranked according to this combined score
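How the two directions are combined is not spelled out here. The sketch below assumes one simple option, adding the rank of the Greek lemma among the candidates for the Slavic lemma to the rank of the Slavic lemma among the candidates for the Greek one. The p-values, the transliterated lemmas, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical p-values for (Slavic lemma, Greek lemma) pairs; lower = stronger.
scores = {
    ("vuspeti", "kai"): 1e-5,
    ("vuspeti", "humneo"): 1e-4,
    ("i", "kai"): 1e-200,
    ("zhe", "kai"): 1e-50,
}

def ranks(scores, key_index):
    """Rank pairs (1 = lowest p-value) within each group sharing pair[key_index]."""
    groups = defaultdict(list)
    for pair, p in scores.items():
        groups[pair[key_index]].append((p, pair))
    out = {}
    for members in groups.values():
        for rank, (_, pair) in enumerate(sorted(members), start=1):
            out[pair] = rank
    return out

def build_dictionary(scores):
    forward = ranks(scores, 0)  # Greek candidates ranked per Slavic lemma
    inverse = ranks(scores, 1)  # Slavic candidates ranked per Greek lemma
    combined = {pair: forward[pair] + inverse[pair] for pair in scores}
    dictionary = defaultdict(list)
    for (slavic, greek), score in sorted(combined.items(), key=lambda kv: kv[1]):
        dictionary[slavic].append((greek, score))
    return dict(dictionary)

dictionary = build_dictionary(scores)
```

In this toy example `kai` is the best forward candidate for `vuspeti`, but `vuspeti` is only the third-best candidate for `kai`, so `humneo` ends up ranked first in the combined dictionary.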
Adding information
We have a dictionary which ranks candidates
We could pick the best available candidate for each Slavic word
But this approach would ignore the other information in the corpus
Other relevant factors
Part of speech
Position in the sentence
Syntactic relation
The algorithm
First identify ‘anchors’, i.e. secure alignments
These are the ones where word order matches and the translation is the best candidate in the dictionary
Then accept progressively worse alignments until a certain threshold is reached, using the other factors
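A much simplified, greedy best-first sketch of this idea follows. It scores candidate links only by dictionary rank and relative position, ignoring part of speech and syntactic relation, and it does not re-score positions after anchors are found. The cost formula, its weight, the threshold, the toy dictionary, and the transliterated lemmas are all assumptions of the sketch.

```python
def align(slavic, greek, rank, threshold=3.0):
    """Greedy best-first alignment of two lemma lists.

    rank(s, g) returns the dictionary rank of g as a translation of s
    (1 = best), or None if the pair is not in the dictionary.
    Returns a sorted list of (slavic_index, greek_index) links.
    """
    candidates = []
    for i, s in enumerate(slavic):
        for j, g in enumerate(greek):
            r = rank(s, g)
            if r is None:
                continue
            # Relative positions, so verses of different length are comparable.
            distortion = abs(i / len(slavic) - j / len(greek))
            cost = (r - 1) + 10 * distortion
            candidates.append((cost, i, j))
    used_s, used_g, links = set(), set(), []
    for cost, i, j in sorted(candidates):  # cheapest (most secure) links first
        if cost > threshold:
            break  # everything that remains is worse than the threshold
        if i not in used_s and j not in used_g:
            links.append((i, j))
            used_s.add(i)
            used_g.add(j)
    return sorted(links)

# Toy dictionary and the end of Mark 6.38 as lemma lists (transliterated).
toy_ranks = {
    ("peti", "pente"): 1,
    ("i", "kai"): 1,
    ("duva", "duo"): 1,
    ("ryba", "ikhthus"): 1,
    ("khlebu", "artos"): 1,
}
slavic = ["peti", "khlebu", "i", "duva", "ryba"]
greek = ["pente", "kai", "duo", "ikhthus"]
links = align(slavic, greek, lambda s, g: toy_ranks.get((s, g)))
```

Here `khlebu` stays unaligned because its only dictionary candidate does not occur in the Greek list.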
Sample alignments
онъ же глагола имъ колико имате
ho de legei autois posous artous
хлѣбъ ідѣте и видите і оувѣдѣвъше
ekhete upagete idete kai gnontes legousin
глаголашѧ пѧть хлѣбъ и дъвѣ рыбѣ
pente kai duo ikhthuas.
First we get the ‘anchors’, the perfect alignments. At this point we do not know which и each kai belongs to!
Next, we get good alignments like видите and idete, дъвѣ and duo, ідѣте and upagete
Now we know where the two kai belong! We have also waited with things like глагола and legei: although this is a perfect match in terms of lexicon, position, POS and relation, it is also duplicated.
Finally хлѣбъ and artous are aligned; this comes late because it implies an inverted word order
The best alignment the system can now find is ho and онъ, which is rejected because of a misannotation. The algorithm stops.
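The walkthrough above can be read as a greedy loop: take the best remaining candidate pair, accept it if it is good enough, rescore, and stop as soon as the best candidate is rejected. The following is a minimal sketch of that reading, not the PROIEL aligner itself. The scoring terms (a lexicon hit, a penalty for duplicated words, a penalty for crossing an existing link), their weights, the `min_score` threshold and the toy clause are all assumptions made for illustration.

```python
def crosses(s, t, aligned):
    """True if linking source s to target t would cross an existing link."""
    return any((s - s2) * (t - t2) < 0 for s2, t2 in aligned.items())


def score(s, t, src, tgt, free_src, free_tgt, aligned, lexicon):
    """Hypothetical score: lexicon hit, minus penalties (weights are invented)."""
    value = 1.0 if (src[s], tgt[t]) in lexicon else 0.0
    duplicated = (sum(src[i] == src[s] for i in free_src) > 1
                  or sum(tgt[j] == tgt[t] for j in free_tgt) > 1)
    if duplicated:
        value -= 0.3  # ambiguous words are held back until anchors exist
    if crosses(s, t, aligned):
        value -= 0.6  # links that invert the word order come late
    return value


def greedy_align(src, tgt, lexicon, min_score=0.5):
    """Repeatedly accept the best free pair; stop when the best one is rejected."""
    aligned = {}
    free_src, free_tgt = set(range(len(src))), set(range(len(tgt)))
    while free_src and free_tgt:
        candidates = [
            (score(s, t, src, tgt, free_src, free_tgt, aligned, lexicon), s, t)
            for s in sorted(free_src) for t in sorted(free_tgt)
        ]
        best_score, s, t = max(candidates, key=lambda c: c[0])
        if best_score < min_score:
            break  # best remaining candidate rejected: the algorithm stops
        aligned[s] = t
        free_src.remove(s)
        free_tgt.remove(t)
    return aligned


# Toy clause loosely modelled on the sample: two и compete for one kai.
src = ["идѣте", "и", "видите", "и"]
tgt = ["upagete", "idete", "kai"]
lexicon = {("идѣте", "upagete"), ("видите", "idete"), ("и", "kai")}
result = greedy_align(src, tgt, lexicon)
```

In this toy run the two unambiguous pairs are linked first. Only then can the loop tell the two и apart, because linking the first one to kai would cross the видите/idete link.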
Tag transfer
The PROIEL corpus relies on multi-level manual annotation
Good-quality alignment makes it possible to minimize efforts by transferring tags from one language to the others
Information structure: transfer of tags and anaphoric links, correction
Customized tagging: tagging in the Greek, transfer to the other languages, correction
Transferring animacy tags
Lemma-level annotation of all Greek nouns
93% of all Greek nouns are translated into OCS nouns; the rest are mostly denominal adjectives
Transfer program:
finds all OCS nouns and adjectives that are translations of Greek nouns
finds the most frequently occurring of all the tags of the associated Greek lemmata (since one OCS lemma may translate several Greek ones)
assigns that tag to the OCS lemma
About 95% success
Books and writings
кънигы comes out as NONCONC. Why?
Greek lemma                          occurrences   animacy tag
biblion ‘document, book’             6             CONCRETE
biblos ‘book’                        2             CONCRETE
gramma ‘letter, piece of writing’    2             CONCRETE
graphe ‘writing, scripture’          20            NONCONC
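One reading of the transfer program that explains this outcome is a majority vote weighted by occurrences: the 20 occurrences of graphe outvote the 10 occurrences of the three CONCRETE lemmata. The sketch below assumes that weighting; the function name `transfer_tag` and the tuple layout are illustrative, and the counts are the ones in the table.

```python
from collections import Counter


def transfer_tag(greek_sources):
    """Pick the tag with the most aligned occurrences.

    greek_sources: (greek_lemma, occurrences, animacy_tag) triples for the
    Greek lemmata that one OCS lemma translates (assumed input format).
    """
    votes = Counter()
    for _lemma, occurrences, tag in greek_sources:
        votes[tag] += occurrences
    return votes.most_common(1)[0][0]


# Greek lemmata translated by the OCS lemma for 'books', as in the table.
sources = [
    ("biblion", 6, "CONCRETE"),
    ("biblos", 2, "CONCRETE"),
    ("gramma", 2, "CONCRETE"),
    ("graphe", 20, "NONCONC"),
]
tag = transfer_tag(sources)
```

A vote over lemmata rather than occurrences would have given CONCRETE (three lemmata against one), so the choice of weighting matters for cases like this.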
Word order phenomena
The OCS translation notoriously follows the Greek word order, much more closely than it follows the Greek syntax
The token alignments offer us a quick way of picking out the few word order discrepancies
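A simple way to pick out such discrepancies is to look for alignment links that cross. The sketch below assumes each sentence's alignment is available as (OCS position, Greek position) pairs; the function name and the data layout are hypothetical. The three-word example follows the word order of the sample alignment, with the имате/ekhete and колико/posous links assumed for illustration.

```python
def crossing_pairs(links):
    """Return pairs of alignment links whose order differs between the languages.

    links: (ocs_position, greek_position) tuples, one per aligned token.
    """
    ordered = sorted(links)  # sort by OCS position
    crossings = []
    for i, (o1, g1) in enumerate(ordered):
        for o2, g2 in ordered[i + 1:]:
            if g2 < g1:  # later in OCS but earlier in Greek
                crossings.append(((o1, g1), (o2, g2)))
    return crossings


ocs = ["колико", "имате", "хлѣбъ"]
greek = ["posous", "artous", "ekhete"]
links = [(0, 0), (1, 2), (2, 1)]  # колико-posous, имате-ekhete, хлѣбъ-artous
discrepancies = crossing_pairs(links)
```

Sentences with an empty result follow the Greek order token for token, so only the non-empty ones need to be inspected.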
Animacy
Test case: The interrelationship between Greek definiteness and OCS animacy marking
OCS object marking is represented by an interplay between morphological, syntactic and semantic tagging
All genitive-formed tokens are tagged as morphological genitives
Regular transitive verbs take OBJs, also when genitive-marked (animacy, negation, partitivity)
Verbs consistently requiring the genitive take OBLs
Verbs with uncertain valency take OBJs when the object is accusative, but the supertag ARG when it is genitive-marked
Animacy and definiteness
Huntley 1993 (137–138): ‘There is a strong tendency for the genitive-accusative to refer to a definite object, and for the nominative-accusative to refer to an indefinite object’.
Huntley counts objects with definite and indefinite reference
How do OCS objects correlate with Greek definite and indefinite objects?
The data set
All OCS common and proper nouns with the following restrictions
Syntactic tag: OBJ
Has token alignment
Aligned token animacy: human
Aligned token part of speech: noun
Not negated
Gender: masculine
Number: singular
Lemma form: not a-stem noun
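The restrictions combine as a plain conjunction, which can be sketched as a filter over token records. The `Token` class and its field names below are hypothetical and are not the PROIEL database schema; they only mirror the list above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Token:
    pos: str                        # OCS part of speech
    relation: str                   # syntactic tag
    aligned_pos: Optional[str]      # None when there is no token alignment
    aligned_animacy: Optional[str]  # animacy tag of the aligned Greek token
    negated: bool
    gender: str
    number: str
    a_stem: bool


def in_data_set(tok):
    """True if the token satisfies every restriction in the list."""
    return (
        tok.pos in {"common noun", "proper noun"}
        and tok.relation == "OBJ"
        and tok.aligned_pos == "noun"        # also implies an alignment exists
        and tok.aligned_animacy == "human"
        and not tok.negated
        and tok.gender == "masculine"
        and tok.number == "singular"
        and not tok.a_stem
    )


example = Token(pos="common noun", relation="OBJ", aligned_pos="noun",
                aligned_animacy="human", negated=False, gender="masculine",
                number="singular", a_stem=False)
unaligned = replace(example, aligned_pos=None, aligned_animacy=None)
selected = [t for t in (example, unaligned) if in_data_set(t)]
```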
The actual data set
ocs obj,ocs id,greek,greek id,ocs case,has article,noun type,info status,saliency
съвѧзьнѣ,550311,desmion,113682,g,false,Nb,new,0
съвѧзьнѣ,580761,desmion,286379,g,false,Nb,"",0
воина,542596,spekoulatora,105926,g,false,Nb,new,0
сътъника,550920,kenturiona,114298,g,true,Nb,old,0.00934579439252336
жениха,539711,numphion,102949,g,true,Nb,"",0
вартоломѣа,540216,Bartholomaion,103458,g,false,Ne,0.617214912280702
матътеа,540219,Maththaion,103460,g,false,Ne,acc inf,0.615376676986584
вельзѣволъ,540288,Beelzeboul,103524,a,false,Ne,"",0
. . .
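An export like this can be cross-tabulated directly. The sketch below reads three of the rows shown above with Python's standard `csv` module and counts OCS case by Greek definiteness. Treating the `has article` column as the stand-in for definiteness is an assumption, and the OCS forms in the sample are restored from the slide's damaged encoding, so they may not be exact.

```python
import csv
import io
from collections import Counter

# Three complete rows from the slide; OCS forms restored (may be inexact).
SAMPLE = """\
ocs obj,ocs id,greek,greek id,ocs case,has article,noun type,info status,saliency
съвѧзьнѣ,550311,desmion,113682,g,false,Nb,new,0
воина,542596,spekoulatora,105926,g,false,Nb,new,0
сътъника,550920,kenturiona,114298,g,true,Nb,old,0.00934579439252336
"""


def case_by_definiteness(text):
    """Count rows per (Greek definiteness, OCS case) combination."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: an article on the Greek noun marks it as definite.
        definiteness = "definite" if row["has article"] == "true" else "indefinite"
        counts[(definiteness, row["ocs case"])] += 1
    return counts


counts = case_by_definiteness(SAMPLE)
```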
Greek and OCS token aligned objects
Greek object   OCS accusative   per cent   OCS genitive   per cent
definite       20               10.6%      168            89.4%
indefinite     24               22.6%      82             77.4%
Table: Human token-aligned OBJs, masc. sg., corrected for negation. P-value 0.0099
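The slide does not say which significance test produced the P-value. As one common choice for a 2×2 table of counts, the sketch below implements a two-sided Fisher's exact test with the standard library only; its result need not match the reported figure exactly if a different test was used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )


# Rows: Greek definite / indefinite; columns: OCS accusative / genitive.
p_value = fisher_exact_two_sided(20, 168, 24, 82)
```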
Proper noun bias?
Animacy annotation can be crossed with morphological annotation for proper/common nouns
Proper nouns are at the top of the animacy hierarchy
Do proper and common nouns behave differently?
Proper nouns vs. common nouns
Greek object        OCS acc.   per cent   OCS gen.   per cent
proper definite     0          0%         57         100%
proper indefinite   1          2.2%       45         97.8%
common definite     20         15.3%      111        84.7%
common indefinite   23         38.3%      37         61.7%
Table: Human OBJs, masc. sg., corrected for negation and grouped by noun type, compared to their Greek alignments
Proper nouns in the nominative-accusative are very rare
The real difference is between definite and indefinite common nouns (P-value 0.0007)
Not animacy alone
The distribution of the genitive-accusative was not regulated by animacy and social prominence alone
Discourse prominence also plays a role
Genitive-accusatives are more prone to be old or easily accessible information than nominative-accusatives
Summary and conclusions
Automatic dictionary creation and token alignment
Technically useful: tag transfers
Powerful tool for contrastive linguistics
Availability
The corpus is available for everyone to use.
We publish XML files with raw data as well.
All our data is released under a Creative Commons license.
Visit http://www.hf.uio.no/ifikk/proiel/ for details.