Introduction Collocations Alignment Technical uses Linguistic applications Summary and conclusions

Automatic alignment and parallel corpora

Hanne Eckhoff Dag Haug

University of Oslo

October 13, 2009

Hanne Eckhoff, Dag Haug Automatic alignment and parallel corpora October 13, 2009 1 / 33

Exploiting translations

Codex Marianus is a translation from Greek, and this gives opportunities and challenges

In cases where categories do not fully overlap between the languages, the study of translation correspondences can shed light on both languages

Where categories overlap fully, the suspicion of Greek influence arises

Adding alignments to the database makes it possible to make quantitative studies

The problem

The Bible is ‘sort of’ aligned via the verse numbering, which is approximately the same in each version

Mark 6.38

онъ же глагола имъ. колико имате хлѣбъ. ідѣте и видите. і

оувѣдѣвъше глаголашѧ. пѧть хлѣбъ и дъвѣ рыбѣ.

ho de legei autois. posous artous ekhete; upagete idete. kai gnontes legousin: pente, kai duo ikhthuas.

Potentially each Slavic word in this verse could be a translation of each Greek word

A simple solution that does not work

For each Slavic word we could just count all Greek words that occur in the same verse and select the most frequent one as the probable translation equivalent

However, we would run into problems with frequent words

The article occurs in practically every Greek verse and could end up as the best translation of many words

Although we could eliminate this lemma from consideration, other frequent words would create similar problems

Correspondences of въспѣти

Greek lemma               cooccurrences

kai ‘and’                 4
exerkhomai ‘go out’       3
eis ‘to’                  3
humneo ‘sing’             2
elaia ‘olive’             2
alektor ‘rooster’         2
foneo ‘make a sound’      2
oros ‘mountain’           2

Correspondences of въспѣти 2

Greek lemma               cooccurrences   occurrences   ratio

kai ‘and’                 4               2497          0.2%
exerkhomai ‘go out’       3               151           2.0%
eis ‘to’                  3               654           0.5%
humneo ‘sing’             2               2             100%
elaia ‘olive’             2               11            18.2%
alektor ‘rooster’         2               10            10%
foneo ‘make a sound’      2               32            6.3%
oros ‘mountain’           2               41            4.9%
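
The ratio column is just cooccurrences divided by total occurrences of the Greek lemma. A minimal sketch, using three rows copied from the table (the dictionary layout and the function name `ratio` are illustrative choices, not part of the original):

```python
# (cooccurrences with въспѣти, total occurrences) for three rows of the table
rows = {
    "kai": (4, 2497),
    "exerkhomai": (3, 151),
    "humneo": (2, 2),
}

def ratio(cooccurrences, occurrences):
    """Share of the Greek lemma's occurrences that fall in verses with the Slavic lemma."""
    return cooccurrences / occurrences

# Rank the Greek candidates by ratio instead of by raw cooccurrence count
ranked = sorted(rows, key=lambda lemma: ratio(*rows[lemma]), reverse=True)
for lemma in ranked:
    print(f"{lemma:12s} {ratio(*rows[lemma]):6.1%}")
```

Ranked by raw count, `kai` comes first; ranked by ratio, `humneo` does.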

A solution that does work

Looking at the number of cooccurrences relative to the total frequency of the Greek lemma gives us the right result

But if we want to have a measure of how good a correspondence is, we need to take into account the frequency of the Slavic lemma as well

Verses with въспѣти contain humneo in 100% of the cases, but there are only two cases

Intuitively, a correspondence would be more secure if it was based on more cooccurrences

This is where collocations become useful

Collocation analysis

Firth’s collocational meanings

One of the meanings of ‘ass’ is its habitual collocation with an immediately preceding ‘you silly ...’ (Firth 1951)

A collocation is also sometimes defined as a non-compositional expression, but this is not necessarily true of our collocations

The notion of collocation as lexical proximity can be formalized through statistical significance measures

Contingency tables

For each pair of occurring Greek and Slavic lemmata, we construct 2x2 contingency tables

                рыба    ¬ рыба

ikhthus         17      0
¬ ikhthus       6       3529

These data can be subjected to various statistical tests
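
One way such a table could be built is from the sets of verses in which each lemma occurs. The sketch below is an assumption about the bookkeeping, not the authors' implementation: the function name `contingency` and the synthetic verse ids are hypothetical, chosen only so that the counts match the table above.

```python
def contingency(greek_verses, slavic_verses, all_verses):
    """Return ((a, b), (c, d)) for one Greek/Slavic lemma pair.

    a: verses with both lemmas      b: verses with the Greek lemma only
    c: verses with the Slavic only  d: verses with neither lemma
    Both verse sets are assumed to be subsets of all_verses.
    """
    a = len(greek_verses & slavic_verses)
    b = len(greek_verses - slavic_verses)
    c = len(slavic_verses - greek_verses)
    d = len(all_verses) - a - b - c
    return (a, b), (c, d)

# Synthetic verse ids chosen to reproduce the counts in the table
all_verses = set(range(3552))
ikhthus = set(range(17))   # 17 verses containing the Greek lemma
ryba = set(range(23))      # the same 17 verses plus 6 more
table = contingency(ikhthus, ryba, all_verses)
print(table)
```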

Collocation measures

Chi-square is a standard test for significance in 2x2 contingency tables

But it is less well suited for skewed data like ours

Other options include maximum likelihood measures

or the Fisher exact test
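
Why skewed data is a problem for chi-square can be seen from the expected cell counts. The sketch below computes them for the рыба / ikhthus counts (17, 0, 6, 3529); the function name `expected_counts` is illustrative. A common rule of thumb, stated here as background rather than as something the slides claim, is that the chi-square approximation becomes unreliable when expected counts fall below about 5.

```python
def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table ((a, b), (c, d))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return tuple(tuple(r * k / n for k in cols) for r in rows)

# Rows: ikhthus / not ikhthus; columns: рыба / not рыба
expected = expected_counts(((17, 0), (6, 3529)))
print(expected[0][0])  # expected number of verses containing both lemmas
```

With rare lemmas the expected count for the cell of interest is a small fraction of one verse, far below that threshold.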

Maximum likelihood measure

Likelihood approaches measure the probability of getting the observed distribution given the null hypothesis that the distribution is completely random

The less likely the distribution is, the more evidence we have against the null hypothesis (and thus for a non-random distribution)

The test is two-sided: both positive and negative association give high scores, since they both correspond to unlikely results given a random distribution (but this is in practice less of a problem)
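
The slides do not give a formula, so the following is an assumption: one widely used likelihood-based association score is the log-likelihood ratio statistic G² for a 2x2 table. The sketch illustrates the two-sidedness described above; the second and third tables are hypothetical counts invented for the comparison.

```python
import math

def g2(table):
    """Log-likelihood ratio statistic G^2 = 2 * sum(O * ln(O / E)) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n   # expected count under independence
            if o > 0:                   # a zero cell contributes nothing
                total += o * math.log(o / e)
    return 2 * total

positive = g2(((17, 0), (6, 3529)))        # lemmas that attract each other
negative = g2(((0, 500), (500, 2552)))     # hypothetical lemmas that avoid each other
independent = g2(((10, 90), (90, 810)))    # hypothetical, exactly proportional cells
print(positive, negative, independent)
```

Both the attracting and the avoiding pair receive a high score, while the proportional table scores zero, so the score alone does not tell the direction of the association.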

Hypothesis testing

Hypothesis testing looks not just at how unexpected the actual distribution is, but also at the probability of getting more ‘extreme’ results than the observed distribution

There are several hypothesis tests, e.g. chi-square and Fisher

These tests are one-sided, which makes it possible to look at only positive association

p-values and statistical significance

The intuition behind p-values is that they express the likelihood that we would get the given distribution, or a more extreme one, if the lemmas were distributed in verses in a completely arbitrary way

For рыба and ikhthus this chance is 0.000000000000000000000000000000000000000016

But notice that p-values ‘conflate’ population size and effect size

Once you have a large number of observations, most distributions are ‘rare’
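
A minimal sketch of a one-sided Fisher exact test, written with the standard library only. It is an illustration under the assumption that the quoted p-value is the upper-tail hypergeometric probability; the function name `fisher_one_sided` and the second pair of tables (a hypothetical table and the same table with every cell multiplied by 10) are not from the slides.

```python
from math import comb

def fisher_one_sided(table):
    """P(X >= a) under the hypergeometric null for a 2x2 table ((a, b), (c, d))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1 = a + b    # verses with the Greek lemma
    col1 = a + c    # verses with the Slavic lemma
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)

# рыба / ikhthus counts: 17 shared verses, 0 ikhthus-only, 6 рыба-only, 3529 neither
p_fish = fisher_one_sided(((17, 0), (6, 3529)))
print(f"{p_fish:.1e}")

# Hypothetical table and a tenfold copy: same proportions, more observations
p_small = fisher_one_sided(((10, 640), (10, 2840)))
p_large = fisher_one_sided(((100, 6400), (100, 28400)))
print(p_small, p_large)
```

The tenfold table has exactly the same proportions as the small one but a much smaller p-value, which is the sense in which p-values mix up sample size and strength of association.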

p-values and statistical significance 2

                рыба    ¬ рыба

eis ‘into’      10      644
¬ eis ‘into’    13      2855

Table: Fisher exact p = 0.004737432

Word distributions are never random

There are all sorts of reasons why words cooccur (for example, fishers tend to put things into the sea to catch fish)

But as a relative measure p-values work well: translation equivalents have even more extreme distributions
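
Using p-values as a relative measure amounts to ranking the Greek candidates for one Slavic lemma by p-value. The sketch below does this for the two рыба tables shown on these slides; the helper `upper_tail_p` is a hypothetical standard-library stand-in for a one-sided Fisher exact test, and the dictionary layout is an illustrative choice.

```python
from math import comb

def upper_tail_p(a, b, c, d):
    """One-sided Fisher exact p-value P(X >= a) for the 2x2 cells a, b, c, d."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)

# Cell counts for рыба against two Greek candidates, copied from the tables
candidates = {
    "ikhthus": (17, 0, 6, 3529),
    "eis": (10, 644, 13, 2855),
}
pvalues = {greek: upper_tail_p(*cells) for greek, cells in candidates.items()}

# Smaller p-value means a more extreme distribution, so it ranks first
for greek in sorted(pvalues, key=pvalues.get):
    print(f"{greek:8s} {pvalues[greek]:.3g}")
```

Both p-values count as "significant", but the translation equivalent is separated from the merely topical cooccurrence by dozens of orders of magnitude.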

Dictionary generation

Even in the collocation approach, frequent words turn up as translation equivalents more often than they should (because population size tends to make distributions more unlikely)

To correct for this, we compute ‘inverse collocations’ as well, i.e. we rank the Slavic words as equivalents of the Greek ones

For each Slavic lemma, we then combine the scores of the Slavic lemma as a correspondence to the Greek one and vice versa

We then create a dictionary where words are ranked according to this combined score
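The combination step can be sketched as follows. The slides do not say how the two scores are combined, so this sketch makes two assumptions: scores are treated as p-values (smaller is better), and the two directions are multiplied. The function name, the transliterated lemmata and all numbers are invented for illustration.

```python
def build_dictionary(forward, inverse):
    """Rank Greek candidates for each Slavic lemma by a combined score.

    forward[(slavic, greek)]: score of greek as an equivalent of slavic.
    inverse[(greek, slavic)]: score of slavic as an equivalent of greek.
    Scores are treated as p-values and multiplied (assumptions of this sketch).
    """
    dictionary = {}
    for (slavic, greek), score in forward.items():
        back = inverse.get((greek, slavic))
        if back is None:
            continue  # no evidence in the inverse direction
        dictionary.setdefault(slavic, []).append((score * back, greek))
    # Smallest combined score first.
    return {s: [g for _, g in sorted(c)] for s, c in dictionary.items()}


# Hypothetical scores: the frequent word "kai" looks good in one direction only.
forward = {("ryba", "ikhthus"): 1e-9, ("ryba", "kai"): 1e-4, ("ryba", "eis"): 5e-3}
inverse = {("ikhthus", "ryba"): 1e-8, ("kai", "ryba"): 0.4, ("eis", "ryba"): 5e-3}
ranking = build_dictionary(forward, inverse)
print(ranking)
```

In this toy data the frequent word outranks "eis" in the forward direction alone, but drops below it once the inverse score is taken into account.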

Adding information

We have a dictionary which ranks candidates

We could pick the best available candidate for each Slavic word

But this approach would ignore the other information in the corpus

Other relevant factors

Part of speech

Position in the sentence

Syntactic relation

The algorithm

First identify ‘anchors’, i.e. secure alignments

These are the ones where word order matches and the translation is the best candidate in the dictionary

Then accept progressively worse alignments until a certain threshold is reached, using the other factors
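A rough sketch of a greedy aligner in this spirit: score every candidate pair, accept the best ones first, and stop once the score falls below a threshold. The scoring formula, its weights, the threshold and the transliterated toy tokens are all invented here and are not the authors' implementation. As a simplification, each pair is scored once and is not re-scored against anchors that have already been accepted.

```python
def align(slavic, greek, dictionary, threshold=0.5):
    """Greedy one-to-one alignment of two token lists.

    slavic, greek: lists of (lemma, pos) tuples.
    dictionary: maps a Slavic lemma to its ranked Greek candidates.
    """
    def score(i, j):
        (s_lemma, s_pos), (g_lemma, g_pos) = slavic[i], greek[j]
        candidates = dictionary.get(s_lemma, [])
        if g_lemma not in candidates:
            return 0.0
        lexical = 1.0 / (1 + candidates.index(g_lemma))  # 1.0 for the best candidate
        position = 1.0 / (1 + abs(i - j))                # 1.0 when word order matches
        pos_match = 1.0 if s_pos == g_pos else 0.5
        return lexical * position * pos_match

    pairs = sorted(
        ((score(i, j), i, j) for i in range(len(slavic)) for j in range(len(greek))),
        reverse=True,
    )
    links, used_s, used_g = [], set(), set()
    for value, i, j in pairs:  # anchors first, then progressively worse pairs
        if value < threshold:
            break              # everything that remains is too weak
        if i not in used_s and j not in used_g:
            links.append((i, j))
            used_s.add(i)
            used_g.add(j)
    return sorted(links)


# Toy sentence with an inverted word order between the last two words.
slavic = [("koliko", "A"), ("imate", "V"), ("xlebu", "N")]
greek = [("posous", "A"), ("artous", "N"), ("ekhete", "V")]
dictionary = {"koliko": ["posous"], "imate": ["ekhete"], "xlebu": ["artous"]}
print(align(slavic, greek, dictionary))
```

Raising the threshold keeps only the anchor-like pair at matching positions; lowering it lets the inverted pairs through.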

Sample alignments

онъ же глагола имъ колико
ho de legei autois posous artous

имате хлѣбъ ідѣте и видите і
ekhete upagete idete kai

оувѣдѣвъше глаголашѧ пѧть хлѣбъ и дъвѣ рыбѣ
gnontes legousin pente kai duo ikhthuas.

First we get the ‘anchors’, the perfect alignments. At this point we do not know which и kai belongs to!

Next, we get good alignments like видите and idete, дъвѣ and duo, ідѣте and upagete

Now we know where the two kai belong! We have also held back pairs like глагола and legei: although this is a perfect match in terms of lexicon, position, POS and relation, it is also duplicated.

Finally хлѣбъ and artous are aligned; this comes late because it implies an inverted word order

The best alignment the system can now find is ho and онъ, which is rejected because of a misannotation. The algorithm stops.

Tag transfer

The PROIEL corpus relies on multi-level manual annotation

Good-quality alignment makes it possible to minimize effort by transferring tags from one language to the others

Information structure: transfer of tags and anaphoric links, correction

Customized tagging: tagging in the Greek, transfer to the other languages, correction

Transferring animacy tags

Lemma-level annotation of all Greek nouns

93% of all Greek nouns are translated into OCS nouns; the rest are mostly denominal adjectives

Transfer program:

  finds all OCS nouns and adjectives that are translations of Greek nouns
  finds the most frequently occurring of all the tags of the associated Greek lemmata (since one OCS lemma may translate several Greek ones)
  assigns that tag to the OCS lemma

About 95% success

Books and writings

кънигы comes out as NONCONC. Why?

Greek lemma                          occurrences  animacy tag
biblion ‘document, book’                       6  CONCRETE
biblos ‘book’                                  2  CONCRETE
gramma ‘letter, piece of writing’              2  CONCRETE
graphe ‘writing, scripture’                   20  NONCONC
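The outcome follows from the majority vote in the transfer program, which can be sketched as below. The function name is hypothetical, and the sketch assumes that votes are weighted by token occurrences rather than counted once per Greek lemma; that reading is consistent with the NONCONC result, since a per-lemma count would give CONCRETE three votes to one.

```python
from collections import Counter


def transfer_tag(aligned_greek):
    """Pick the most frequent tag among the Greek tokens aligned to one OCS lemma.

    aligned_greek: list of (greek_lemma, occurrences, tag) triples.
    Returns the winning tag and the vote totals.
    """
    votes = Counter()
    for _lemma, occurrences, tag in aligned_greek:
        votes[tag] += occurrences  # weight each tag by its token count (assumption)
    return votes.most_common(1)[0][0], dict(votes)


# Counts and tags from the table for the OCS lemma in question.
aligned = [
    ("biblion", 6, "CONCRETE"),
    ("biblos", 2, "CONCRETE"),
    ("gramma", 2, "CONCRETE"),
    ("graphe", 20, "NONCONC"),
]
tag, votes = transfer_tag(aligned)
print(tag, votes)
```

The twenty occurrences of graphe outweigh the ten concrete ones, so the single most frequent Greek source decides the tag.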

Word order phenomena

The OCS translation notoriously follows the Greek word order, much more than the Greek syntax

The token alignments offer us a quick way of picking out the few word order discrepancies
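One simple way to pick out such discrepancies is to look for alignment links that cross. This is a sketch under the assumption that each sentence's alignment is available as a list of position pairs; the function name and the example positions are invented.

```python
def crossing_links(links):
    """Return pairs of alignment links whose order differs between the two texts.

    links: list of (ocs_position, greek_position) pairs for one sentence.
    """
    ordered = sorted(links)
    crossings = []
    for a in range(len(ordered)):
        for b in range(a + 1, len(ordered)):
            # OCS positions increase here, so a drop in Greek position is an inversion.
            if ordered[b][1] < ordered[a][1]:
                crossings.append((ordered[a], ordered[b]))
    return crossings


# OCS words 0, 1, 2 aligned to Greek words 0, 2, 1: the last two are inverted.
inversions = crossing_links([(0, 0), (1, 2), (2, 1)])
print(inversions)
```

A sentence with no crossings returns an empty list, so filtering a corpus for non-empty results isolates the sentences where the translation departs from the source order.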

Animacy

Test case: The interrelationship between Greek definiteness and OCS animacy marking

OCS object marking is represented by an interplay between morphological, syntactic and semantic tagging

  All genitive-formed tokens are tagged as morphological genitives
  Regular transitive verbs take OBJs, also when genitive-marked (animacy, negation, partitivity)
  Verbs consistently requiring the genitive take OBLs
  Verbs with uncertain valency take accusative OBJs, but genitive-marked objects get the supertag ARG

Animacy and definiteness

Huntley 1993 (137–138): ‘There is a strong tendency for the genitive-accusative to refer to a definite object, and for the nominative-accusative to refer to an indefinite object’.

Huntley counts objects with definite and indefinite reference

How do OCS objects correlate with Greek definite and indefinite objects?

The data set

All OCS common and proper nouns with the following restrictions:

  Syntactic tag: OBJ
  Has token alignment
  Aligned token animacy: human
  Aligned token part of speech: noun
  Not negated
  Gender: masculine
  Number: singular
  Lemma form: not a-stem noun

The actual data set

ocs obj,ocs id,greek,greek id,ocs case,has article,noun type,info status,saliency
съвѧзьнѣ,550311,desmion,113682,g,false,Nb,new,0
съвѧзьнѣ,580761,desmion,286379,g,false,Nb,"",0
воина,542596,spekoulatora,105926,g,false,Nb,new,0
сътъника,550920,kenturiona,114298,g,true,Nb,old,0.00934579439252336
жениха,539711,numphion,102949,g,true,Nb,"",0
вартоломѣа,540216,Bartholomaion,103458,g,false,Ne,0.617214912280702
матътеа,540219,Maththaion,103460,g,false,Ne,acc inf,0.615376676986584
вельѕѣволъ,540288,Beelzeboul,103524,a,false,Ne,"",0
. . .

Greek and OCS token aligned objects

Greek object   OCS accusative   per cent   OCS genitive   per cent
definite       20               10.6%      168            89.4%
indefinite     24               22.6%      82             77.4%

Table: Human token-aligned OBJs, masc. sg., corrected for negation. P-value 0.0099
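The row percentages in the table can be recomputed from the counts, and a significance test can be run on the 2x2 table. The slides do not say which test produced the reported P-value, so the two-sided Fisher exact test below is an assumption chosen for illustration; its result need not match the reported figure exactly.

```python
from math import comb


def row_percentages(row):
    """Percentages of each cell within one table row, to one decimal."""
    total = sum(row)
    return [round(100 * n / total, 1) for n in row]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Counts from the table: rows are Greek definite / indefinite,
# columns are OCS accusative / genitive
definite = [20, 168]
indefinite = [24, 82]

print(row_percentages(definite), row_percentages(indefinite))
p = fisher_exact_two_sided(*definite, *indefinite)
print(f"two-sided Fisher exact p = {p:.4f}")
```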

Proper noun bias?

Animacy annotation can be crossed with morphological annotation for proper/common nouns

Proper nouns are at the top of the animacy hierarchy

Do proper and common nouns behave differently?

Proper nouns vs. common nouns

Greek object        OCS acc.   per cent   OCS gen.   per cent
proper definite     0          0%         57         100%
proper indefinite   1          2.2%       45         97.8%
common definite     20         15.3%      111        84.7%
common indefinite   23         38.3%      37         61.7%

Table: Human OBJs, masc. sg., corrected for negation and grouped by noun type, compared to their Greek alignments

Proper nouns in the nominative-accusative are very rare

The real difference is between definite and indefinite common nouns (P-value 0.0007)
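Two checks on the grouped counts can be sketched in code: that they add up to the ungrouped definite and indefinite totals, and that the contrast within common nouns is significant. The slides do not name the test behind the reported P-value; the chi-square test with Yates continuity correction used here is an assumption, so the result is only expected to be of the same order of magnitude.

```python
from math import erfc, sqrt

# (OCS accusative, OCS genitive) counts from the table
counts = {
    ("proper", "definite"): (0, 57),
    ("proper", "indefinite"): (1, 45),
    ("common", "definite"): (20, 111),
    ("common", "indefinite"): (23, 37),
}


def collapse(definiteness):
    """Sum accusative and genitive counts over noun types."""
    rows = [v for (_, d), v in counts.items() if d == definiteness]
    return tuple(sum(col) for col in zip(*rows))


def chi2_yates(a, b, c, d):
    """Chi-square statistic with Yates correction for [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den


def p_value_df1(chi2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


# The grouped counts reproduce the ungrouped totals (20/168 and 24/82)
print(collapse("definite"), collapse("indefinite"))

chi2 = chi2_yates(*counts[("common", "definite")],
                  *counts[("common", "indefinite")])
p = p_value_df1(chi2)
print(f"common nouns: chi2 = {chi2:.2f}, p = {p:.4f}")
```

With one degree of freedom the chi-square tail reduces to the complementary error function, so no statistics library is needed for this check.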

Not animacy alone

The distribution of the genitive-accusative was not regulated by animacy and social prominence alone

Discourse prominence also plays a role

Genitive-accusatives are more likely than nominative-accusatives to be old or easily accessible information

Summary and conclusions

Automatic dictionary creation and token alignment

Technically useful: tag transfers

Powerful tool for contrastive linguistics

Availability

The corpus is available for everyone to use.

We publish XML files with raw data as well.

All our data is released under a Creative Commons license.

Visit http://www.hf.uio.no/ifikk/proiel/ for details.
