Translationese: Between Human and Machine Translation

Translationese: Between Human and Machine Translation. Shuly Wintner, Department of Computer Science, University of Haifa, Haifa, Israel.


TRANSLATIONESE: BETWEEN HUMAN AND MACHINE TRANSLATION

Shuly Wintner

Department of Computer Science, University of Haifa

Haifa, Israel
shuly@cs.haifa.ac.il

Introduction

ORIGINAL OR TRANSLATION

EXAMPLE (O OR T)

We want to see countries that can produce the best product for the best price in that particular business. I have to agree with the member that free trade agreements by definition do not mean that we have to be less vigilant all of a sudden.

EXAMPLE (T OR O)

I would like, as my final point, to say that we support free trade, but we must learn from past mistakes. Let us hope that negotiations for free trade agreements with the four Central American countries introduce a number of other dimensions absent from these first-generation agreements.

© Shuly Wintner (University of Haifa), Translationese, 12 December 2016

Introduction

UNDERLYING ASSUMPTIONS

Research in Translation Studies can inform Natural Language Processing and, in particular, improve the quality of machine translation.

Computational methodologies can shed light on pertinent questions in Translation Studies.


Introduction

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

Translated texts differ from original ones

The differences do not indicate poor translation, but rather a statistical phenomenon: translationese (Gellerstam 1986).

Toury (1980, 1995) defines two laws of translation:

THE LAW OF INTERFERENCE: fingerprints of the source text that are left in the translation product.

THE LAW OF GROWING STANDARDIZATION: the effort to standardize the translation product according to existing norms in the target language and culture.


Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker 1993)

“features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”

SIMPLIFICATION (Blum-Kulka and Levenston 1978, 1983)

EXPLICITATION (Blum-Kulka 1986)

NORMALIZATION (Chesterman 2004)


Introduction Translationese

TRANSLATIONESE: WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al. 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al. 2009, Lembersky et al. 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova 2014, Aharoni et al. 2014)

Understanding the properties of human translation (Ilisei et al. 2010, Ilisei and Inkpen 2011, Ilisei 2013, Volansky et al. 2015, Avner et al. 2016)


Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab 1996).

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa 1998).

Corpus-based evidence for the simplification hypothesis (Laviosa 2002).


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  O (freq.)  T (freq.)  Ratio  LL  Weight
,       37.83      49.79    0.76   T   T1
(        0.42       0.72    0.58   T   T2
'        1.94       2.53    0.77   T   T3
)        0.40       0.72    0.56   T   T4
“        0.30       0.30    1.00   —   —
[        0.01       0.02    0.45   T   —
]        0.01       0.02    0.44   T   —
”        0.33       0.22    1.46   O   O7
!        0.22       0.17    1.25   O   O6
.       38.20      34.60    1.10   O   O5
:        1.20       1.17    1.03   —   O4
;        0.84       0.83    1.01   —   O3
?        1.57       1.11    1.41   O   O2
-        2.68       2.25    1.19   O   O1


Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification
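The supervised pipeline above can be sketched in a few lines. This is an illustrative pure-Python stand-in: the actual experiments use Weka with an SVM classifier, and the function-word list and tiny centroid classifier below are hypothetical simplifications, not the real feature sets.

```python
# Illustrative stand-in for the classification pipeline: chunks become
# numeric feature vectors (here, frequencies of a few function words),
# and a simple linear-style classifier separates O from T.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "moreover", "therefore", "but"]

def features(text):
    toks = text.lower().split()
    counts = Counter(toks)
    n = max(len(toks), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

class CentroidClassifier:
    """Assign a vector to the nearest class centroid (toy model)."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        def sqdist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lab: sqdist(self.centroids[lab]))

train_X = [features("moreover the vote was therefore postponed"),
           features("but the house rose and left")]
train_y = ["T", "O"]
clf = CentroidClassifier().fit(train_X, train_y)
```

In the real setup, each chunk's feature vector would be far richer and a held-out fold would serve as the 'unseen' test set.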


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes.

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances.

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other.

The trained classifier is tested on an 'unseen' text: the test set.

Classifiers assign weights to the features


Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al. 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al. 2013)


Identification of Translationese

IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010), Ilisei and Inkpen (2011), Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015), Avner et al. (2016)


Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish.

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary).

Part-of-speech tagging

Classification with Weka (Hall et al. 2009), using SVM with a default linear kernel.
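The chunking step above can be sketched as follows. Whole sentences are accumulated until the chunk reaches the target token count, so every chunk ends on a sentence boundary (a toy target size is used in the example; the experiments use roughly 2,000 tokens).

```python
# Sketch of the chunking step: sentences (token lists) are grouped into
# chunks of at least `target` tokens, closing each chunk on a sentence boundary.
def chunk_sentences(sentences, target=2000):
    chunks, current, size = [], [], 0
    for sent in sentences:
        current.append(sent)
        size += len(sent)
        if size >= target:        # close the chunk at a sentence boundary
            chunks.append(current)
            current, size = [], 0
    if current:                   # trailing, under-sized chunk
        chunks.append(current)
    return chunks

sents = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
sizes = [sum(map(len, ch)) for ch in chunk_sentences(sents, target=5)]  # [7, 7]
```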


Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983, Vanderauwera 1985, Baker 1993).

EXPLICITATION: the tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986, Øverås 1998, Baker 1993).

NORMALIZATION: efforts to standardize texts (Toury 1995); “a strong preference for conventional grammaticality” (Baker 1993).

INTERFERENCE: the fingerprints of the source language on the translation output (Toury 1979).

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text.

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in content, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts.


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N) / (1 − V1/V)

MEAN WORD LENGTH: in characters.

SYLLABLE RATIO: the number of vowel sequences that are delimited by consonants or spaces in a word.

MEAN SENTENCE LENGTH: in words.

LEXICAL DENSITY: the frequency of tokens that are not nouns, adjectives, adverbs, or verbs.

MEAN WORD RANK: the average rank of words in a frequency-ordered list of the 6,000 most frequent English words.

MOST FREQUENT WORDS: the frequencies of the N most frequent words, N = 5, 10, 50.
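The three lexical-variety measures above can be computed per chunk as in this small sketch. Note the third measure is undefined when every type occurs exactly once (V1 = V); real chunks of ~2,000 tokens avoid that edge case.

```python
# The three type-token ratio (TTR) measures:
#   V  = number of types, N = number of tokens, V1 = hapax legomena.
import math
from collections import Counter

def lexical_variety(tokens):
    counts = Counter(tokens)
    V, N = len(counts), len(tokens)
    V1 = sum(1 for c in counts.values() if c == 1)   # types occurring once
    ttr1 = V / N
    ttr2 = math.log(V) / math.log(N)
    ttr3 = 100 * math.log(N) / (1 - V1 / V)
    return ttr1, ttr2, ttr3

t1, t2, t3 = lexical_variety("the cat sat on the mat".split())
```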


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

O: Merkel
T: der israelische Ministerpräsident Benjamin Netanjahu

EXPLICIT NAMING: the ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: the frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: the average length (in tokens) of proper nouns.

COHESIVE MARKERS: the frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
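As a small illustration of the explicit-naming feature above, assuming Penn-Treebank-style POS tags (PRP for personal pronouns, NNP/NNPS for proper nouns) on (token, tag) pairs; the tag names are an assumption, as the slides do not fix a tagset:

```python
# Explicit naming: ratio of personal pronouns to proper nouns in a chunk.
def explicit_naming(tagged):
    pronouns = sum(1 for _, t in tagged if t == "PRP")
    proper = sum(1 for _, t in tagged if t in ("NNP", "NNPS"))
    return pronouns / proper if proper else 0.0

tagged = [("He", "PRP"), ("met", "VBD"), ("Merkel", "NNP"), (".", ".")]
ratio = explicit_naming(tagged)   # 1.0
```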


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: the frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS: the ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: the average PMI (Church and Hanks 1990) of all bigrams in the chunk.

THRESHOLD PMI: the number of bigrams in the chunk whose PMI is above 0.
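The two PMI-based features can be sketched as below. For illustration, unigram and bigram probabilities are estimated from the chunk itself; the slides do not specify the estimation corpus, so that choice is an assumption.

```python
# Average PMI and threshold PMI over a chunk's bigrams:
#   PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )
import math
from collections import Counter

def bigram_pmis(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    pmis = []
    for (w1, w2), c in bigrams.items():
        p_joint = c / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        pmis.append(math.log(p_joint / (p1 * p2)))
    return pmis

pmis = bigram_pmis("stand idly by and stand idly by".split())
avg_pmi = sum(pmis) / len(pmis)
threshold_pmi = sum(1 for v in pmis if v > 0)   # bigrams with PMI above 0
```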


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: “source language shining through” (Teich 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: the number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: the frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: the frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
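Positional token frequency can be sketched as below (raw counts here; in the experiments these would be normalized into frequencies per chunk):

```python
# Count how often each token occupies the first, second, antepenultimate,
# penultimate, and last positions of a sentence.
from collections import Counter

POSITIONS = ["first", "second", "antepenultimate", "penultimate", "last"]

def positional_counts(sentences):
    counts = {pos: Counter() for pos in POSITIONS}
    for sent in sentences:
        if len(sent) < 5:         # skip sentences too short to fill 5 slots
            continue
        for pos, idx in zip(POSITIONS, (0, 1, -3, -2, -1)):
            counts[pos][sent[idx]] += 1
    return counts

pc = positional_counts([["But", "we", "must", "learn", "from", "mistakes", "."],
                        ["But", "the", "house", "rose", "quickly", "."]])
```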


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: the list of Koppel and Ordan (2011).

PRONOUNS: the frequency of pronoun occurrences in the chunk.

PUNCTUATION: . ? ! : ; - ( ) [ ] ‘ ’ “ ”

1 The normalized frequency of each punctuation mark in the chunk.
2 A non-normalized notion of frequency: n/tokens.
3 n/p, where p is the total number of punctuation marks in the chunk.

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: token unigrams and bigrams.
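The three punctuation variants listed above can be sketched per chunk as follows (the mark inventory is taken from the slide's list, with straight quotes standing in for the curly ones):

```python
# Punctuation features for one chunk: raw counts n, counts normalized by
# chunk tokens (n/tokens), and counts normalized by total punctuation (n/p).
from collections import Counter

MARKS = list(".?!:;-()[]'\",")

def punctuation_features(tokens):
    counts = Counter(t for t in tokens if t in MARKS)
    total_punct = sum(counts.values())
    n_tokens = len(tokens)
    per_token = {m: counts[m] / n_tokens for m in MARKS}
    per_punct = {m: counts[m] / total_punct for m in MARKS} if total_punct else {}
    return counts, per_token, per_punct

toks = ["Well", ",", "we", "agree", ",", "mostly", "."]
counts, per_token, per_punct = punctuation_features(toks)
```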


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
Sanity    Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986, Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features perform very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas -ble and the are typical of T.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny 2001).

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function-words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  O (freq.)  T (freq.)  Ratio  LL  Weight
,       37.83      49.79    0.76   T   T1
(        0.42       0.72    0.58   T   T2
'        1.94       2.53    0.77   T   T3
)        0.40       0.72    0.56   T   T4
“        0.30       0.30    1.00   —   —
[        0.01       0.02    0.45   T   —
]        0.01       0.02    0.44   T   —
”        0.33       0.22    1.46   O   O7
!        0.22       0.17    1.25   O   O6
.       38.20      34.60    1.10   O   O5
:        1.20       1.17    1.03   —   O4
;        0.84       0.83    1.01   —   O3
?        1.57       1.11    1.41   O   O2
-        2.68       2.25    1.19   O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modalities (written vs. spoken), registers, genres, dates, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature / corpus   EUR   HAN   LIT   TED
FW                 96.3  98.1  97.3  97.7
char-trigrams      98.8  97.1  99.5  100.0
POS-trigrams       98.5  97.2  98.7  92.0
contextual FW      95.2  96.8  94.1  86.3
cohesive markers   83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation
EUR            —     60.8  56.2  96.3
HAN            59.7  —     58.7  98.1
LIT            64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation
EUR + HAN      —     —     63.8  94.0
EUR + LIT      —     64.1  —     92.9
HAN + LIT      59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature / corpus   EUR   HAN   LIT   TED
FW                 88.6  88.9  78.8  87.5
char-trigrams      72.1  63.8  70.3  78.6
POS-trigrams       96.9  76.0  70.7  76.1
contextual FW      92.9  93.2  68.2  67.0
cohesive markers   63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2 × √(JSD(PX ‖ PCi))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class.

label(C1) = “O” if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1); “T” otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting over several feature sets.
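The labeling rule above can be sketched end to end with smoothed unigram models and Jensen-Shannon divergence. All marker words and counts below are invented toy data; the real markers come from Europarl and Hansard samples.

```python
# Toy sketch of cluster labeling: smoothed unigram models for O-markers,
# T-markers, and two clusters; JSD decides which cluster is O and which is T.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def smoothed_model(counts, vocab, eps=0.001):
    total = sum(counts.get(w, 0) for w in vocab)
    return [(counts.get(w, 0) + eps) / (total + eps * len(vocab)) for w in vocab]

vocab = ["but", "moreover", "therefore", "the"]
o_markers = {"but": 10, "the": 5}             # words associated with O (toy)
t_markers = {"moreover": 8, "therefore": 6}   # words associated with T (toy)
cluster1 = {"moreover": 4, "therefore": 3, "the": 1}
cluster2 = {"but": 7, "the": 4}

P_O = smoothed_model(o_markers, vocab)
P_T = smoothed_model(t_markers, vocab)
P_C1 = smoothed_model(cluster1, vocab)
P_C2 = smoothed_model(cluster2, vocab)

# label C1 as "O" iff d(O,C1) * d(T,C2) < d(O,C2) * d(T,C1)
label_c1 = ("O" if jsd(P_O, P_C1) * jsd(P_T, P_C2)
                   < jsd(P_O, P_C2) * jsd(P_T, P_C1) else "T")
```

Since the decision only compares products of divergences, the constant factor in DJS can be dropped.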


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method / corpus                                                       EUR   HAN   LIT   TED
FW                                                                    88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                     91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                     95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers  94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method / corpus                  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method / corpus  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat             92.5     60.7     77.5         66.8
Two-phase        91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
“When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.”

Warren Weaver 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al. 2002).


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... w_{i−1}) )^(1/N)

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al 2002)
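The perplexity formula can be computed in log space, as in this sketch for the unigram case, where P(wi | w1 ... w_{i−1}) reduces to P(wi); the toy model below is invented data:

```python
# Perplexity of a language model on a reference text: the N-th root of the
# product of inverse probabilities, computed in log space for stability.
import math

def perplexity(lm, words):
    n = len(words)
    neg_log_sum = sum(-math.log(lm[w]) for w in words)
    return math.exp(neg_log_sum / n)

lm = {"the": 0.5, "house": 0.25, "rose": 0.25}
pp = perplexity(lm, ["the", "house", "the", "rose"])  # 64 ** 0.25 = 2 * sqrt(2)
```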


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          451.50   93.00   69.36   66.47
O-EN         468.09  103.74   79.57   76.79
T-DE         443.14   88.48   64.99   62.07
T-FR         460.98   99.90   76.23   73.38
T-IT         465.89  102.31   78.50   75.67
T-NL         457.02   97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          472.05   99.04   75.60   72.68
O-EN         500.56  115.48   91.14   88.31
T-DE         486.78  108.50   84.39   81.41
T-FR         463.58   94.59   71.24   68.37
T-IT         476.05  102.69   79.23   76.36
T-NL         490.09  110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99   88.46   67.35   64.40
O-EN         415.47   99.92   79.27   76.34
T-DE         404.64   95.22   73.73   70.85
T-FR         395.99   89.44   68.38   65.54
T-IT         384.55   81.90   60.85   57.91
T-NL         411.58   98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89   90.73   69.05   66.08
O-EN         448.11  100.17   78.23   75.46
T-DE         437.68   93.67   71.54   68.57
T-FR         445.00   97.32   75.59   72.55
T-IT         448.11  100.19   78.06   75.19
T-NL         423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
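The four abstraction levels can be sketched on a toy POS-tagged sentence. The tag names and named-entity marks below are invented for illustration; the actual experiments used automatic POS and named-entity taggers:

```python
# Each token is (word, pos, is_named_entity); the annotations are invented
# for illustration -- the experiments used automatic POS/NE taggers.
sentence = [("Mrs", "NNP", True), ("Sudre", "NNP", True),
            ("acts", "VBZ", False), ("quickly", "RB", False),
            (".", "PUNCT", False)]

no_punct = [w for w, pos, ne in sentence if pos != "PUNCT"]
no_ne    = ["NE" if ne else w for w, pos, ne in sentence]
no_nouns = [pos if pos.startswith("NN") else w for w, pos, ne in sentence]
pos_only = [pos for w, pos, ne in sentence]

print(no_punct)   # ['Mrs', 'Sudre', 'acts', 'quickly']
print(no_ne)      # ['NE', 'NE', 'acts', 'quickly', '.']
print(no_nouns)   # ['NNP', 'NNP', 'acts', 'quickly', '.']
print(pos_only)   # ['NNP', 'NNP', 'VBZ', 'RB', 'PUNCT']
```

Each level strips away more lexical content, so comparing LM perplexities across levels shows how much of the original-vs-translated difference survives without content words.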


ABSTRACTION RESULTS

No Punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17


ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89


ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91


ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM      BLEU
MIX     21.43
O-EN    21.10
T-DE    21.90
T-FR    21.16
T-IT    21.29
T-NL    21.20

FR to EN
LM      BLEU
MIX     28.67
O-EN    27.98
T-DE    28.01
T-FR    29.14
T-IT    28.75
T-NL    28.11

IT to EN
LM      BLEU
MIX     25.41
O-EN    24.69
T-DE    24.62
T-FR    25.37
T-IT    25.96
T-NL    24.77

NL to EN
LM      BLEU
MIX     24.20
O-EN    23.40
T-DE    24.26
T-FR    23.56
T-IT    23.87
T-NL    24.52
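The BLEU scores above combine modified n-gram precision with a brevity penalty (Papineni et al 2002). A toy sentence-level sketch; real BLEU is computed at corpus level over 1- to 4-grams and needs smoothing when a precision is zero, and the example sentences here are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    # Modified n-gram precision: candidate counts are clipped by
    # reference counts, so repeating a matching word does not pay off.
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    # Geometric mean of precisions (would need smoothing if any is 0).
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
score = bleu(cand, ref)  # 1-gram precision 5/6, 2-gram precision 3/5
```

Here `score` is the geometric mean sqrt(5/6 · 3/5) ≈ 0.707, with no brevity penalty since the lengths match.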


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be adapted to the unique properties of the translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T    T → S
FR-EN   33.64    30.88
EN-FR   32.11    30.35
DE-EN   26.53    23.67
EN-DE   16.96    16.17
IT-EN   28.70    26.84
EN-IT   23.81    21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S → T    T → S
250K            34.35    31.33
500K            35.21    32.38
750K            36.12    32.90
1M              35.73    33.07
1.25M           36.24    33.23
1.5M            36.43    33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S → T    T → S
250K            27.74    26.58
500K            29.15    27.19
750K            29.43    27.63
1M              29.94    27.88
1.25M           30.63    27.84
1.5M            29.89    27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation
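The perplexity-minimization idea (Sennrich 2012) can be sketched as follows: linearly interpolate the two phrase tables and choose the weight that minimizes perplexity over held-out phrase pairs drawn from the direction we want to match. All phrase pairs and probabilities below are invented; the real method optimizes the weight numerically over full phrase tables:

```python
import math

# Toy phrase translation probabilities p(target | source) estimated from
# the S->T-translated and T->S-translated subsets; values are invented.
table_st = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
table_ts = {("maison", "house"): 0.4, ("maison", "home"): 0.6}

# Held-out phrase pairs from the translation direction of the MT task.
dev = [("maison", "house"), ("maison", "house"), ("maison", "home")]

def perplexity(weight):
    # Perplexity of the interpolated table on the held-out pairs.
    logsum = 0.0
    for pair in dev:
        p = weight * table_st[pair] + (1 - weight) * table_ts[pair]
        logsum += math.log(p)
    return math.exp(-logsum / len(dev))

# Coarse grid search for the interpolation weight; enough to show the idea.
best = min((w / 100 for w in range(101)), key=perplexity)
```

Since the dev pairs favor "house", the search settles on a weight near 0.89, leaning on the S→T table that assigns "house" higher probability.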


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX     MIX-EO   MIX-FO
S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System     MIX     MIX-FO   MIX-EO
S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011).

EXAMPLE (TRAIN ON L1 TEST ON L2)

       IT   FR   ES   DE   FI
IT     99   93   91   75   63
FR     90   98   90   83   70
ES     85   90   98   82   69
DE     73   74   74   98   70
FI     58   70   64   81   99
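The train-on-L1, test-on-L2 protocol can be sketched with a toy nearest-centroid classifier. The actual experiments used SVMs over content-independent features; the two-dimensional vectors below stand in for, e.g., POS-trigram frequencies and are invented:

```python
# Toy cross-classification: train an O-vs-T classifier on chunks
# translated from one source language, test it on chunks translated
# from another. Feature vectors are invented for illustration.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest(v, centroids):
    # Assign v to the class whose centroid is closest (squared Euclidean).
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(v, centroids[label]))

train_O = [[0.9, 0.1], [0.8, 0.2]]        # original English chunks
train_T_fr = [[0.2, 0.8], [0.3, 0.7]]     # English translated from French
centroids = {"O": centroid(train_O), "T": centroid(train_T_fr)}

test_T_it = [[0.25, 0.75], [0.35, 0.65]]  # English translated from Italian
predictions = [nearest(v, centroids) for v in test_T_it]
print(predictions)  # ['T', 'T']
```

If translationese from Italian resembles translationese from French, the L1-trained classifier transfers, which is exactly what the high off-diagonal accuracies in the matrix above show for related languages.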


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
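The unsupervised step can be sketched as single-linkage agglomerative clustering over per-language translationese vectors. The languages and vectors below are invented stand-ins; the real experiments clustered translations from 14-16 source languages:

```python
# Toy single-linkage agglomerative clustering of "translationese"
# feature vectors; vectors are invented for illustration.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

vectors = {"ES": [0.1, 0.2], "PT": [0.12, 0.22],   # Romance sources
           "DE": [0.8, 0.7], "NL": [0.82, 0.74]}   # Germanic sources

clusters = [frozenset([lang]) for lang in vectors]
merges = []
while len(clusters) > 1:
    # Merge the pair of clusters with the smallest single-link distance.
    (i, j) = min(((i, j) for i in range(len(clusters))
                  for j in range(i + 1, len(clusters))),
                 key=lambda ij: min(dist(vectors[a], vectors[b])
                                    for a in clusters[ij[0]]
                                    for b in clusters[ij[1]]))
    merged = clusters[i] | clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

The merge order recovers the family structure: the two Romance sources join first, then the two Germanic ones, mirroring how the real trees approximate the gold typology.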


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

Feature               Dist (EN T)   Dist (FR T)   Dist_w (EN T)   Dist_w (FR T)
POS-trigrams          1.02          1.16          1.07            1.34
positional tokens     1.10          1.15          1.30            1.37
function words        1.32          1.82          1.43            1.90
cohesive markers      1.95          1.95          2.15            2.03
feature combination   1.00          1.46          1.13            1.58
random tree           2.00                        1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


NATIVE LANGUAGE IDENTIFICATION

        ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)   R (%)   F1
ARA      80    0    2    1    3    4    1    0    4    2    3    80.8    80.0    80.4
CHI       3   80    0    1    1    0    6    7    1    0    1    88.9    80.0    84.2
FRE       2    2   81    5    1    2    1    0    3    0    3    86.2    81.0    83.5
GER       1    1    1   93    0    0    0    1    1    0    2    87.7    93.0    90.3
HIN       2    0    0    1   77    1    0    1    5    9    4    74.8    77.0    75.9
ITA       2    0    3    1    1   87    1    0    3    0    2    82.1    87.0    84.5
JPN       2    1    1    2    0    1   87    5    0    0    1    78.4    87.0    82.5
KOR       1    5    2    0    1    0    9   81    1    0    0    80.2    81.0    80.6
SPA       2    0    2    0    1    8    2    1   78    1    5    77.2    78.0    77.6
TEL       0    1    0    0   18    1    2    1    1   73    3    85.9    73.0    78.9
TUR       4    0    2    2    0    2    2    4    4    0   80    76.9    80.0    78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.


Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

Translated texts differ from original ones.

The differences do not indicate poor translation, but rather a statistical phenomenon: translationese (Gellerstam 1986).

Toury (1980, 1995) defines two laws of translation:

THE LAW OF INTERFERENCE: Fingerprints of the source text that are left in the translation product.

THE LAW OF GROWING STANDARDIZATION: Effort to standardize the translation product according to existing norms in the target language and culture.


TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker 1993):

"features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems"

SIMPLIFICATION (Blum-Kulka and Levenston 1978 1983)

EXPLICITATION (Blum-Kulka 1986)

NORMALIZATION (Chesterman 2004)


TRANSLATIONESE: WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al 2009; Lembersky et al 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova 2014; Aharoni et al 2014)

Understanding the properties of human translation (Ilisei et al 2010; Ilisei and Inkpen 2011; Ilisei 2013; Volansky et al 2015; Avner et al 2016)


Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab 1996).

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa 1998).

Corpus-based evidence for the simplification hypothesis (Laviosa 2002).


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Frequency O   Frequency T   Ratio   LL   Weight
       37.83         49.79         0.76    T    T1
(       0.42          0.72         0.58    T    T2
'       1.94          2.53         0.77    T    T3
)       0.40          0.72         0.56    T    T4
        0.30          0.30         1.00    —    —
[       0.01          0.02         0.45    T    —
]       0.01          0.02         0.44    T    —
"       0.33          0.22         1.46    O    O7
        0.22          0.17         1.25    O    O6
       38.20         34.60         1.10    O    O5
        1.20          1.17         1.03    —    O4
        0.84          0.83         1.01    —    O3
        1.57          1.11         1.41    O    O2
-       2.68          2.25         1.19    O    O1


Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))


TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes.

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances.

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other.

The trained classifier is tested on an 'unseen' text: the test set.

Classifiers assign weights to the features


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the genderage of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al (2010) Ilisei and Inkpen (2011) Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al (2015) Avner et al (2016)


EXPERIMENTAL SETUP

Corpus EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish.

The corpus is tokenized and then partitioned into chunks of approximately 2000 tokens (ending on a sentence boundary).

Part-of-speech tagging.

Classification with Weka (Hall et al 2009), using SVM with a default linear kernel.
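The chunking step of the setup can be sketched as follows. The sentence lengths below are invented; the real corpus uses ~2000-token chunks that always end on a sentence boundary:

```python
# Partition a tokenized, sentence-split corpus into chunks of roughly
# `size` tokens, always ending on a sentence boundary.

def chunk(sentences, size=2000):
    chunks, current, count = [], [], 0
    for sent in sentences:
        current.append(sent)
        count += len(sent)
        if count >= size:       # close the chunk at this sentence boundary
            chunks.append(current)
            current, count = [], 0
    if current:                 # keep the trailing partial chunk
        chunks.append(current)
    return chunks

# Toy sentences of 600, 900, 700, and 500 tokens.
sentences = [["w"] * 600, ["w"] * 900, ["w"] * 700, ["w"] * 500]
chunks = chunk(sentences, size=2000)
sizes = [sum(len(s) for s in c) for c in chunks]
print(sizes)  # [2200, 500]
```

Ending on a sentence boundary keeps each chunk a coherent unit, which is why chunk sizes are only approximately 2000 tokens.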


Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS

Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or spaces in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
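The three lexical-variety measures can be sketched directly from their definitions. Note that the third measure is undefined when every type is a hapax (V1 = V); the sketch assumes a chunk with at least one repeated type.

```python
import math

# The three type-token ratio (TTR) measures defined above, computed per
# chunk: V = number of types, N = number of tokens, V1 = hapax types.

def ttr_measures(tokens):
    N = len(tokens)
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    V = len(counts)
    V1 = sum(1 for c in counts.values() if c == 1)
    ttr1 = V / N
    ttr2 = math.log(V) / math.log(N)
    ttr3 = 100 * math.log(N) / (1 - V1 / V)  # undefined if V1 == V
    return ttr1, ttr2, ttr3
```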


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu; O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
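Three of these features can be sketched over POS-tagged input, given as (word, tag) pairs with Penn Treebank tags (PRP = personal pronoun, NNP = proper noun). The cohesive-marker list here is a small illustrative subset of the full list of markers used.

```python
# Sketch of three explicitation features over a POS-tagged chunk.

COHESIVE = {"because", "but", "hence", "therefore"}  # illustrative subset

def explicitation_features(tagged):
    n = len(tagged)
    pronouns = sum(1 for _, t in tagged if t == "PRP")  # personal pronouns
    propers = sum(1 for _, t in tagged if t == "NNP")   # proper nouns
    # Single naming: a proper noun with no proper-noun neighbor.
    single = sum(
        1 for i, (_, t) in enumerate(tagged)
        if t == "NNP"
        and (i == 0 or tagged[i - 1][1] != "NNP")
        and (i == n - 1 or tagged[i + 1][1] != "NNP"))
    markers = sum(1 for w, _ in tagged if w.lower() in COHESIVE)
    return {
        "explicit_naming": pronouns / propers if propers else 0.0,
        "single_naming": single / n,
        "cohesive_markers": markers / n,
    }
```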


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ("Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?")

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ("I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?")

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
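The two PMI features can be sketched as below. As a simplification, the unigram and bigram probabilities here are estimated from the chunk itself; the actual features estimate them from the corpus.

```python
import math

# Average PMI and threshold PMI of a chunk's bigrams (Church & Hanks style
# pointwise mutual information), with probabilities estimated in-chunk.

def pmi_features(tokens):
    n = len(tokens)
    uni, bi = {}, {}
    for t in tokens:
        uni[t] = uni.get(t, 0) + 1
    for a, b in zip(tokens, tokens[1:]):
        bi[(a, b)] = bi.get((a, b), 0) + 1
    m = n - 1  # number of bigram slots
    pmis = []
    for (a, b), c in bi.items():
        p_ab = c / m
        p_a, p_b = uni[a] / n, uni[b] / n
        pmis.append(math.log2(p_ab / (p_a * p_b)))
    avg_pmi = sum(pmis) / len(pmis)
    above_zero = sum(1 for v in pmis if v > 0)  # counted over bigram types
    return avg_pmi, above_zero
```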


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text; "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words, and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate and last positions in a sentence
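The positional token frequency feature can be sketched as follows; sentences are token lists, and the minimum-length cutoff for counting a sentence is an assumption of this sketch.

```python
from collections import Counter

# Positional token frequency: for each of the five sentence positions
# (first, second, antepenultimate, penultimate, last), count how often
# each token occupies that position across the chunk's sentences.

def positional_token_frequency(sentences):
    positions = {"first": 0, "second": 1, "antepenult": -3,
                 "penult": -2, "last": -1}
    feats = {name: Counter() for name in positions}
    for sent in sentences:
        if len(sent) < 5:  # skip sentences too short to have 5 distinct slots
            continue
        for name, idx in positions.items():
            feats[name][sent[idx]] += 1
    return feats
```

Counts such as feats["first"]["But"] feed directly into the sentence-initial "But" observation discussed later.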


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

Three frequency variants:
1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams

Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100

Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65

Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE') [figure]

Supervised Classification Explicitation

RESULTS EXPLICITATION

Category Feature Accuracy ()

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54

Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times

Supervised Classification Normalization

RESULTS NORMALIZATION

Category Feature Accuracy ()

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE') [figure]

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE) [figure]

Supervised Classification Interference

RESULTS INTERFERENCE

Category Feature Accuracy ()

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts) [figure]

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001)

This is, in fact, standardization rather than interference

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem we use only the top 300 features in several of the classifiers, with a minor effect on the results

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors:

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  O (freq.)  T (freq.)  Ratio  LL  Weight
,     3.783      4.979      0.76   T   T1
(     0.42       0.72       0.58   T   T2
'     1.94       2.53       0.77   T   T3
)     0.40       0.72       0.56   T   T4
…     0.30       0.30       1.00   —   —
[     0.01       0.02       0.45   T   —
]     0.01       0.02       0.44   T   —
"     0.33       0.22       1.46   O   O7
…     0.22       0.17       1.25   O   O6
.     3.820      3.460      1.10   O   O5
…     1.20       1.17       1.03   —   O4
…     0.84       0.83       1.01   —   O3
…     1.57       1.11       1.41   O   O2
-     2.68       2.25       1.19   O   O1

(Some marks in the first column were lost in extraction and are shown as "…".)

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           -     60.8  56.2  96.3
HAN           59.7  -     58.7  98.1
LIT           64.3  61.5  -     97.3

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     -     -     63.8  94.0
EUR + LIT     -     64.1  -     92.9
HAN + LIT     59.8  -     -     96.0

Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

D_JS(X, Ci) = 2 × √(JSD(P_X || P_Ci))

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting over several feature sets
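The labeling rule can be sketched as follows. The marker lists and counts below are toy stand-ins, and since any constant factor on the Jensen-Shannon distance cancels in the cross-wise comparison, the sketch uses the plain square-root distance.

```python
import math

# Sketch of cluster labeling: build smoothed unigram models for the
# O-marker and T-marker word lists and for each cluster, then label the
# clusters by comparing Jensen-Shannon distances cross-wise.

def lm(counts, vocab, eps=0.001):
    """Additively smoothed unigram model over a fixed vocabulary."""
    total = sum(counts.get(w, 0) for w in vocab)
    return {w: (counts.get(w, 0) + eps) / (total + eps * len(vocab))
            for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def jsd(p, q):
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_model, t_model, c1, c2):
    """Return labels for (c1, c2) using the cross-wise comparison rule."""
    d = lambda x, c: math.sqrt(jsd(x, c))  # Jensen-Shannon distance
    if d(o_model, c1) * d(t_model, c2) < d(o_model, c2) * d(t_model, c1):
        return "O", "T"
    return "T", "O"
```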


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                   EUR   HAN   LIT   TED
FW                                                88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                 91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                 95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers              94.1  91.0  79.2  88.6

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     -            92.2

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach (first cluster by domain, then by translation status within each domain)

A flat approach (cluster directly into O vs. T)

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'" (Warren Weaver, 1955)

The noisy channel model:
The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)
                         (translation model × language model)
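The decision rule can be illustrated with a toy decoder. The candidate sentences and probabilities below are invented for the demonstration, not outputs of any real system.

```python
# Toy illustration of the noisy-channel decision rule: among candidate
# English sentences E for a fixed foreign sentence F, pick the E that
# maximizes P(F | E) x P(E). All probabilities are made-up numbers.

candidates = {
    # candidate E: {"tm": P(F|E), "lm": P(E)}
    "it is nevertheless something valuable": {"tm": 0.25, "lm": 0.20},
    "this is a valuable thing":              {"tm": 0.20, "lm": 0.30},
    "valuable something nevertheless":       {"tm": 0.30, "lm": 0.01},
}

def decode(candidates):
    """Return the candidate maximizing translation-model x language-model score."""
    return max(candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])
```

With these numbers, a fluent candidate wins even though a disfluent one has the highest translation-model score; this is exactly the role the language model plays.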


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = [ ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... w(i-1)) ]^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs, and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)
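For a unigram language model the history is empty and the perplexity formula reduces to the sketch below; the toy model is illustrative.

```python
import math

# Perplexity of a unigram language model on a reference token sequence:
# PP = 2 ** (-(1/N) * sum_i log2 P(w_i)), the geometric-mean inverse
# probability, matching the N-th-root formula above.

def perplexity(lm_probs, tokens):
    log_sum = sum(math.log2(lm_probs[t]) for t in tokens)
    return 2 ** (-log_sum / len(tokens))
```

A uniform model over V words gives perplexity exactly V, which is a handy sanity check; lower perplexity means a better fit to the reference corpus.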


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          451.50  93.00   69.36   66.47
O-EN         468.09  103.74  79.57   76.79
T-DE         443.14  88.48   64.99   62.07
T-FR         460.98  99.90   76.23   73.38
T-IT         465.89  102.31  78.50   75.67
T-NL         457.02  97.34   73.54   70.56

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          472.05  99.04   75.60   72.68
O-EN         500.56  115.48  91.14   88.31
T-DE         486.78  108.50  84.39   81.41
T-FR         463.58  94.59   71.24   68.37
T-IT         476.05  102.69  79.23   76.36
T-NL         490.09  110.67  86.61   83.55

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99  88.46   67.35   64.40
O-EN         415.47  99.92   79.27   76.34
T-DE         404.64  95.22   73.73   70.85
T-FR         395.99  89.44   68.38   65.54
T-IT         384.55  81.90   60.85   57.91
T-NL         411.58  98.78   76.98   73.94

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89  90.73   69.05   66.08
O-EN         448.11  100.17  78.23   75.46
T-DE         437.68  93.67   71.54   68.57
T-FR         445.00  97.32   75.59   72.55
T-IT         448.11  100.19  78.06   75.19
T-NL         423.13  83.99   62.17   59.09

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation
Orig. Lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94
T-DE         122.50      7.16
T-FR         99.52       24.58
T-IT         112.71      14.58
T-NL         126.44      4.17

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX          93.88       18.51
O-EN         115.20
T-DE         107.48      6.70
T-FR         88.96       22.77
T-IT         99.17       13.91
T-NL         110.72      3.89

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62
T-DE         38.67       4.81
T-FR         34.75       14.46
T-IT         36.85       9.30
T-NL         39.44       2.91

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN        FR to EN        IT to EN        NL to EN
LM     BLEU     LM     BLEU     LM     BLEU     LM     BLEU
MIX    21.43    MIX    28.67    MIX    25.41    MIX    24.20
O-EN   21.10    O-EN   27.98    O-EN   24.69    O-EN   23.40
T-DE   21.90    T-DE   28.01    T-DE   24.62    T-DE   24.26
T-FR   21.16    T-FR   29.14    T-FR   25.37    T-FR   23.56
T-IT   21.29    T-IT   28.75    T-IT   25.96    T-IT   23.87
T-NL   21.20    T-NL   28.11    T-NL   24.77    T-NL   24.52

Applications for Machine Translation Language Models

DOES SIZE MATTER



Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S → T     35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S → T     29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34

The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference: Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99
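The protocol can be sketched in a few lines: extract POS-trigram features, train an O-vs-T classifier on chunks whose T side was translated from L1, and apply it to chunks translated from L2. The POS sequences below are synthetic toys, and a trivial nearest-centroid rule stands in for the SVM actually used in these experiments.

```python
from collections import Counter

def trigrams(pos_seq):
    """POS-trigram counts of a space-separated tag sequence."""
    toks = pos_seq.split()
    return Counter(zip(toks, toks[1:], toks[2:]))

# Synthetic POS-tag chunks (stand-ins for 2,000-token EUROPARL chunks); the
# invented translationese marker "PRP MD VB" is shared by both L1 and L2.
train = [("DT NN VBZ JJ NN", "O"), ("DT JJ NN VBD RB NN", "O"),
         ("PRP MD VB DT NN", "T"), ("PRP MD VB RB JJ", "T")]   # T from L1
test  = [("DT NN VBZ RB JJ", "O"), ("PRP MD VB JJ NN", "T")]   # T from L2

# Train: accumulate per-class trigram counts (nearest-centroid stand-in for SVM).
centroids = {"O": Counter(), "T": Counter()}
for chunk, label in train:
    centroids[label].update(trigrams(chunk))

def classify(chunk):
    """Assign the class whose trigram profile overlaps most with the chunk."""
    feats = trigrams(chunk)
    return max(("O", "T"),
               key=lambda label: sum(min(feats[g], centroids[label][g]) for g in feats))

print([classify(chunk) for chunk, _ in test])  # -> ['O', 'T']
```

Because the synthetic interference marker is source-language-independent, the classifier trained on L1 transfers to L2, mirroring the high off-diagonal accuracies between related languages in the table above.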


The Power of Interference: Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compared to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
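The unsupervised step can be sketched as naive agglomerative clustering over per-language feature vectors. The two-dimensional vectors below are invented so that Romance and Germanic source languages land close together, and centroid averaging replaces whatever linkage criterion the actual experiments used.

```python
import math

# Invented 2-D feature vectors per source language (the real experiments used
# high-dimensional vectors of, e.g., POS-trigram frequencies).
vectors = {
    "FR": [0.80, 0.10], "ES": [0.78, 0.12], "IT": [0.70, 0.20],
    "DE": [0.20, 0.80], "NL": [0.26, 0.74],
}

# Naive agglomerative clustering: repeatedly merge the two closest clusters,
# representing each cluster by the centroid of the two merged centroids.
clusters = {name: ([name], vec) for name, vec in vectors.items()}
merges = []
while len(clusters) > 1:
    a, b = min(((a, b) for a in clusters for b in clusters if a < b),
               key=lambda p: math.dist(clusters[p[0]][1], clusters[p[1]][1]))
    names = sorted(clusters[a][0] + clusters[b][0])
    centroid = [(x + y) / 2 for x, y in zip(clusters[a][1], clusters[b][1])]
    merges.append(names)
    del clusters[a], clusters[b]
    clusters["+".join(names)] = (names, centroid)

print(merges)  # the first merges join the closest 'languages'
```

With these toy vectors the first merges recover the Romance and Germanic groups, which is the behavior the typology trees below evaluate on real data.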


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (“GOLD” TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist             Distw
Feature               EN T    FR T     EN T    FR T
POS-trigrams          1.02    1.16     1.07    1.34
positional tokens     1.10    1.15     1.30    1.37
function words        1.32    1.82     1.43    1.90
cohesive markers      1.95    1.95     2.15    2.03
feature combination   1.00    1.46     1.13    1.58
random tree           2.00 (both)      1.95 (both)


The Power of Interference: Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors’ native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 100 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)   F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4
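The P (%), R (%) and F1 columns follow directly from the confusion counts (rows are true classes, columns are predicted classes, 100 essays per class); for instance, for ARA, precision is 80/99, recall is 80/100:

```python
# Per-class precision/recall/F1 from a confusion matrix. The counts are the
# 11-way NLI confusion matrix from the slide above.
labels = ["ARA","CHI","FRE","GER","HIN","ITA","JPN","KOR","SPA","TEL","TUR"]
confusion = [
    [80, 0, 2, 1, 3, 4, 1, 0, 4, 2, 3],
    [ 3,80, 0, 1, 1, 0, 6, 7, 1, 0, 1],
    [ 2, 2,81, 5, 1, 2, 1, 0, 3, 0, 3],
    [ 1, 1, 1,93, 0, 0, 0, 1, 1, 0, 2],
    [ 2, 0, 0, 1,77, 1, 0, 1, 5, 9, 4],
    [ 2, 0, 3, 1, 1,87, 1, 0, 3, 0, 2],
    [ 2, 1, 1, 2, 0, 1,87, 5, 0, 0, 1],
    [ 1, 5, 2, 0, 1, 0, 9,81, 1, 0, 0],
    [ 2, 0, 2, 0, 1, 8, 2, 1,78, 1, 5],
    [ 0, 1, 0, 0,18, 1, 2, 1, 1,73, 3],
    [ 4, 0, 2, 2, 0, 2, 2, 4, 4, 0,80],
]

def prf(matrix, i):
    """Precision, recall, and F1 for class i (rows true, columns predicted)."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)   # column sum
    actual = sum(matrix[i])                     # row sum
    p, r = tp / predicted, tp / actual
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf(confusion, labels.index("ARA"))
print(round(100 * p, 1), round(100 * r, 1), round(100 * f1, 1))  # -> 80.8 80.0 80.4
```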


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation “universals” should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the ‘fingerprints’ of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models built from text translated in the same direction as that of the SMT task are better than ones built from text translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

- Translation
- Non-native speakers
- Learners’ productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A “universal” of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, August 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a ‘minority’ literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 112 112

Page 3: Translationese - Between Human and Machine Translation

Introduction

UNDERLYING ASSUMPTIONS

Research in Translation Studies can inform Natural LanguageProcessing and in particular improve the quality of machinetranslation

Computational methodologies can shed light on pertinent questionsin Translation Studies

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 3 112

Introduction

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 4 112

Introduction Translationese

TRANSLATIONESETHE LANGUAGE OF TRANSLATED TEXTS

Translated texts differ from original ones

The differences do not indicate poor translation but rather astatistical phenomenon translationese (Gellerstam 1986)

Toury (1980 1995) defines two laws of translation

THE LAW OF INTERFERENCE Fingerprints of the source text thatare left in the translation product

THE LAW OF GROWING STANDARDIZATION Effort to standardizethe translation product according to existing norms in the targetlanguage and culture

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 5 112

Introduction Translationese

TRANSLATIONESETHE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker 1993)

ldquofeatures which typically occur in translated text ratherthan original utterances and which are not the result ofinterference from specific linguistic systemsrdquo

SIMPLIFICATION (Blum-Kulka and Levenston 1978 1983)

EXPLICITATION (Blum-Kulka 1986)

NORMALIZATION (Chesterman 2004)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 6 112

Introduction Translationese

TRANSLATIONESEWHY DOES IT MATTER

Language models for statistical machine translation(Lembersky et al 2011 2012b)

Translation models for statistical machine translation(Kurokawa et al 2009 Lembersky et al 2012a 2013)

Cleaning parallel corpora crawled from the Web(Eetemadi and Toutanova 2014 Aharoni et al 2014)

Understanding the properties of human translation (Ilisei et al2010 Ilisei and Inkpen 2011 Ilisei 2013 Volansky et al 2015Avner et al 2016)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 7 112

Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 8 112

Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio)than originals (Al-Shabab 1996)

Their mean sentence length and lexical density (ratio of content tonon-content words) are lower (Laviosa 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 9 112

Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)Frequency

Mark O T Ratio LL Weight

3783 4979 076 T T1( 042 072 058 T T2rsquo 194 253 077 T T3) 040 072 056 T T4 030 030 100 mdash mdash[ 001 002 045 T mdash] 001 002 044 T mdashrdquo 033 022 146 O O7 022 017 125 O O6 3820 3460 110 O O5 120 117 103 mdash O4 084 083 101 mdash O3 157 111 141 O O2- 268 225 119 O O1

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 10 112

Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 11 112

Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation ten-fold cross-validation

Unsupervised classification

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 12 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 13 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning a training corpus lists instancesof both classes

Each instance in the two classes is represented by a set of numericfeatures that are extracted from the instances

A generic machine-learning algorithm is trained to distinguishbetween feature vectors representative of one class and thoserepresentative of the other

The trained classifier is tested on an lsquounseenrsquo text the test set

Classifiers assign weights to the features

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 14 112

Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION)APPLICATIONS

Determine the genderage of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 15 112

Identification of Translationese

IDENTIFYING TRANSLATIONESEUSING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al (2010) Ilisei and Inkpen (2011) Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al (2015) Avner et al (2016)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 16 112

Identification of Translationese

EXPERIMENTAL SETUP

Corpus EUROPARL (Koehn 2005)

4 million tokens in English (O)

400000 tokens translated from each of the ten source languages(T) Danish Dutch Finnish French German Greek ItalianPortuguese Spanish and Swedish

The corpus is tokenized and then partitioned into chunks ofapproximately 2000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al 2009) using SVM with adefault linear kernel

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 17 112

Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 18 112

Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION Rendering complex linguistic features in the sourcetext into simpler features in the target (Blum-Kulka and Levenston1983 Vanderauwerea 1985 Baker 1993)

EXPLICITATION The tendency to spell out in the target text utterancesthat are more implicit in the source (Blum-Kulka 1986 Oslashveras 1998Baker 1993)

NORMALIZATION Efforts to standardize texts (Toury 1995) ldquoa strongpreference for conventional grammaticalityrdquo (Baker 1993)

INTERFERENCE The fingerprints of the source language on thetranslation output (Toury 1979)

MISCELLANEOUS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 19 112

Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to bepresent in the two types of text

2 Be content-independent indicating formal and stylistic differencesbetween the texts that are not derived from differences in contentsdomain genre etc

3 Be easy to interpret yielding insights regarding the differencesbetween original and translated texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 20 112

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY Three different type-token ratio (TTR) measureswhere V is the number of types and N is the number of tokens perchunk V1 is the number of types occurring only once in the chunk

VN log(V )log(N) 100times log(N)(1minusV1V)

MEAN WORD LENGTH In characters

SYLLABLE RATIO the number of vowel-sequences that are delimited byconsonants or space in a word

MEAN SENTENCE LENGTH In words

LEXICAL DENSITY The frequency of tokens that are not nounsadjectives adverbs or verbs

MEAN WORD RANK The average rank of words in a frequency-orderedlist of 6000 English most frequent words

MOST FREQUENT WORDS The frequencies of the N most frequentwords N = 51050

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 21 112

Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T Oisraelische Ministerprasident Benjamin Netanjahu Merkel

EXPLICIT NAMING The ratio of personal pronouns to proper nounsthis models the tendency to spell out pronouns

SINGLE NAMING The frequency of proper nouns consisting of a singletoken not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING The average length (in tokens) of propernouns

COHESIVE MARKERS The frequencies of several cohesive markers(because but hence in fact therefore )

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 22 112

Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS The frequency of content words (words tagged as nounsverbs adjectives or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS BEN-ARI (1998))O Manchmal denke ich Haben Sie vielleicht ein kaltes Herz HabenSie wirklich ein kaltes Herz

T Je pense quelquefois aurait-elle un cœur insensible Avez-vousvraiment un cœur de glace

CONTRACTIONS Ratio of contracted forms to their counterpart fullform

AVERAGE PMI The average PMI (Church and Hanks 1990) of allbigrams in the chunk

THRESHOLD PMI Number of bigrams in the chunk whose PMI isabove 0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 23 112

Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text ldquosource language shining throughrdquo(Teich 2003)

Not necessarily a mark of bad translation Rather a differentdistribution of elements in T and O

Positive vs negative interference

EXAMPLE (INTERFERENCE)POSITIVE doch is under-represented in translation to German

NEGATIVE One is is over-represented in translations from German toEnglish (triggered by Man ist)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 24 112

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS Unigrams bigrams and trigrams of POS tags modelingvariations in syntactic structure

CHARACTER n-GRAMS Unigrams bigrams and trigrams of charactersmodeling shallow morphology

PREFIXES AND SUFFIXES The number of words in a chunk that beginor end with each member of a list of prefixessuffixes respectively

CONTEXTUAL FUNCTION WORDS The frequencies in the chunk oftriplets 〈w1w2w3〉 where at least two of the elements are functionwords and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY The frequency of tokens appearing inthe first second antepenultimate penultimate and last positions in asentence

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 25 112

Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS Frequency of the occurrences of pronouns in the chunk

PUNCTUATION - ( ) [ ] lsquo rsquo ldquo rdquo

1 The normalized frequency of each punctuation mark in the chunk2 A non-normalized notion of frequency ntokens3 nfraslp where p is the total number of punctuations in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 26 112

Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 27 112

Supervised Classification

RESULTS SANITY CHECK

Category Feature Accuracy ()

SanityToken unigrams 100Token bigrams 100

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 28 112

Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category Feature Accuracy ()

Simplification

TTR (1) 72TTR (2) 72TTR (3) 76Mean word rank (1) 69Mean word rank (2) 77N most frequent words 64Mean word length 66Syllable ratio 61Lexical density 53Mean sentence length 65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 29 112

Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text beingnearly on chance level (53 accuracy)

The first two TTR measures perform relatively well (72) and theindirect measures of lexical variety (mean word length 66 andsyllable ratio 61) are above chance level

Mean word rank (77) is closely related to the feature studied byLaviosa (1998) (n top words) with two differences our list offrequent items is much larger and we generate the frequency listnot from the corpora under study but rather from an externalmuch larger reference corpus

In contrast the design that follows Laviosa (1998) more strictly (nmost frequent words) has a lower predictive power (64)

While mean sentence length is much above chance level (65) theresults are contrary to common assumptions in Translation Studies

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 30 112

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER lsquoLANGUAGErsquo)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 31 112

Supervised Classification Explicitation

RESULTS EXPLICITATION

Category Feature Accuracy ()

Explicitation

Cohesive Markers 81Explicit naming 58Single naming 56Mean multiple naming 54

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 32 112

Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming single naming and mean multiple naming do notexceed 58 accuracy

On the other hand the classifier that uses 40 cohesive markers(Blum-Kulka 1986 Koppel and Ordan 2011) achieves 81accuracy

Such cohesive markers are far more frequent in T than in O

For example moreover is used 175 times more frequently in Tthan in O thus 4 times more frequently and besides 38 times

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 33 112

Supervised Classification Normalization

RESULTS NORMALIZATION

Category: Normalization

Feature          Accuracy (%)
Repetitions      55
Contractions     50
Average PMI      52
Threshold PMI    66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
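The threshold-PMI feature amounts to counting bigrams whose pointwise mutual information exceeds a cutoff; a sketch under the usual PMI definition (the threshold and minimum count below are arbitrary illustrative values, not the paper's settings):

```python
import math
from collections import Counter

def high_pmi_bigrams(tokens, threshold, min_count=2):
    """Bigrams whose pointwise mutual information exceeds a threshold.
    PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), with MLE probabilities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    out = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2(c * n / (unigrams[a] * unigrams[b]))
        if pmi > threshold:
            out[(a, b)] = pmi
    return out

tokens = ("stand idly " * 3 + "the " * 10).split()
print(high_pmi_bigrams(tokens, threshold=2.0))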


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category: Interference

Feature                      Accuracy (%)
POS unigrams                 90
POS bigrams                  97
POS trigrams                 98
Character unigrams           85
Character bigrams            98
Character trigrams           100
Prefixes and suffixes        80
Contextual function words    100
Positional token frequency   97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they capture both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas trigrams typical of T are -ble and the.

The part-of-speech trigram classifier is extremely cheap and efficient.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.
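Counting a POS trigram such as MD+VB+VBN is straightforward once the text is tagged; a minimal sketch (the tagging itself is assumed to come from any off-the-shelf Penn-Treebank-style tagger):

```python
from collections import Counter

def pos_trigram_counts(tags):
    """Counts of consecutive part-of-speech trigrams."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# 'Measures must be taken' tagged NNS MD VB VBN.
tags = ["NNS", "MD", "VB", "VBN"]
print(pos_trigram_counts(tags)[("MD", "VB", "VBN")])
```

Normalizing such counts per chunk gives the feature values fed to the classifier.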


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is, in fact, standardization rather than interference.
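The sentence-initial 'But' signal reduces to a simple positional count; a sketch (toy sentences, illustrative only):

```python
def opening_rate(sentences, word):
    """Share of sentences whose first token is the given word."""
    hits = sum(1 for s in sentences if s.split()[:1] == [word])
    return hits / len(sentences)

o = ["But we disagree .", "We agree .", "But costs rose .", "Yes ."]
t = ["However , we disagree .", "We agree .", "Indeed .", "Yes ."]
print(opening_rate(o, "But"), opening_rate(t, "But"))
```

Positional token frequency generalizes this idea to the first and last few token positions of every sentence.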


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category: Interference

Feature                      Accuracy (%)
POS bigrams                  96
POS trigrams                 96
Character bigrams            95
Character trigrams           96
Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous

Feature                               Accuracy (%)
Function words                        96
Punctuation (1)                       81
Punctuation (2)                       85
Punctuation (3)                       80
Pronouns                              77
Ratio of passive forms to all verbs   65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.
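Two punctuation frequencies already separate O from T quite well; the study's actual classifier is not spelled out here, so this nearest-centroid sketch (with purely illustrative frequency values) only shows how such a two-feature decision could work:

```python
def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def classify(x, cen_o, cen_t):
    """Nearest-centroid decision on ['.', ','] frequency vectors."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return "O" if dist(x, cen_o) < dist(x, cen_t) else "T"

# Hypothetical per-100-token frequencies of '.' and ',' per chunk.
O = [[4.9, 3.4], [5.1, 3.6]]
T = [[4.2, 4.1], [4.0, 4.3]]
print(classify([4.1, 4.2], centroid(O), centroid(T)))
```

A chunk with relatively fewer periods and more commas (longer sentences, heavier clause linking) lands on the T side, matching the table below.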


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,      37.83     49.79     0.76    T    T1
(      0.42      0.72      0.58    T    T2
'      1.94      2.53      0.77    T    T3
)      0.40      0.72      0.56    T    T4
/      0.30      0.30      1.00    -    -
[      0.01      0.02      0.45    T    -
]      0.01      0.02      0.44    T    -
"      0.33      0.22      1.46    O    O7
!      0.22      0.17      1.25    O    O6
.      38.20     34.60     1.10    O    O5
:      1.20      1.17      1.03    -    O4
;      0.84      0.83      1.01    -    O3
?      1.57      1.11      1.41    O    O2
-      2.68      2.25      1.19    O    O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modalities (written vs. spoken), registers, genres, dates, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED
FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation
EUR             -      60.8   56.2   96.3
HAN             59.7   -      58.7   98.1
LIT             64.3   61.5   -      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation
EUR + HAN       -      -      63.8   94.0
EUR + LIT       -      64.1   -      92.9
HAN + LIT       59.8   -      -      96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED
FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)        p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

D_JS(X, Ci) = 2·√(JSD(P_X ∥ P_Ci))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label
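The labeling procedure can be sketched end-to-end; the smoothing constant, marker vocabulary, and toy counts below are illustrative assumptions, not the experimental settings:

```python
import math

def smoothed_lm(counts, vocab, eps=0.01):
    """Add-epsilon unigram language model over a fixed marker vocabulary."""
    total = sum(counts.get(w, 0) for w in vocab)
    return {w: (counts.get(w, 0) + eps) / (total + eps * len(vocab)) for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label clusters 1 and 2 as 'O'/'T' by the product-of-distances rule."""
    d = lambda x, c: 2 * math.sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"

# Toy marker counts: O-markers dominated by 'but', T-markers by
# 'moreover'/'thus'; cluster 1 looks T-like, cluster 2 O-like.
vocab = ["but", "moreover", "thus"]
p_o = smoothed_lm({"but": 8}, vocab)
p_t = smoothed_lm({"moreover": 5, "thus": 4}, vocab)
c1 = smoothed_lm({"but": 2, "moreover": 4, "thus": 4}, vocab)
c2 = smoothed_lm({"but": 7, "moreover": 2, "thus": 1}, vocab)
print(label_clusters(p_o, p_t, c1, c2))
```

Because the rule weighs both clusters' proximities jointly, one very tight match can compensate for a looser one on the other side.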

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                            EUR    HAN    LIT    TED
FW                                         88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams          91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW          95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers       94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpora               EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain     93.7      99.5      99.8          92.2
XMeans: # of clusters found    2         2         3             3
XMeans: accuracy by domain     93.6      99.5      -             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora    EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat                92.5      60.7      77.5          66.8
Two-phase           91.3      79.4      85.3          67.5


Applications for Machine Translation



APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

E* = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E ∈ English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model.
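The decision rule above is a one-line argmax once the two model scores are available; a sketch with purely hypothetical scores over a toy candidate set:

```python
def best_translation(f, candidates, tm, lm):
    """Noisy-channel decision rule: E* = argmax_E P(F | E) * P(E)."""
    return max(candidates, key=lambda e: tm(f, e) * lm(e))

# Hypothetical model scores, not estimated from data.
tm = lambda f, e: {"the house": 0.6, "the home": 0.4}[e]
lm = lambda e: {"the house": 0.7, "the home": 0.3}[e]
print(best_translation("la maison", ["the house", "the home"], tm, lm))
```

Real decoders search a vast candidate space rather than enumerating it, but the objective they maximize is exactly this product.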


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM, w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... w(i-1)) )^(1/N)
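For the unigram special case (each token scored independently), perplexity can be computed directly; the toy model below is illustrative only:

```python
import math

def perplexity(prob, words):
    """Perplexity of a language model on a held-out token sequence.
    `prob` maps a token to P_LM(w); this is the unigram special case,
    where the history w1 ... w(i-1) is ignored."""
    log_prob = sum(math.log2(prob[w]) for w in words)
    return 2 ** (-log_prob / len(words))

lm = {"the": 0.5, "red": 0.25, "house": 0.25}
print(perplexity(lm, ["the", "red", "house"]))
```

Lower perplexity means the model finds the reference text less surprising, which is how the LM fitness comparisons below are scored.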

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al. 2002).
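BLEU is built from clipped n-gram precisions (combined over n = 1..4 with a brevity penalty); a sketch of the core component only, not the full metric:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the core component of the BLEU score."""
    ngrams = lambda s: Counter(zip(*(s[i:] for i in range(n))))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

# Clipping stops a candidate from being rewarded for repeating a word.
print(modified_precision("the the the".split(), "the cat".split(), 1))
```

Each candidate n-gram is credited at most as many times as it appears in the reference, which is what makes degenerate repetitive outputs score poorly.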


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50   93.00    69.36    66.47
O-EN         468.09   103.74   79.57    76.79
T-DE         443.14   88.48    64.99    62.07
T-FR         460.98   99.90    76.23    73.38
T-IT         465.89   102.31   78.50    75.67
T-NL         457.02   97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05   99.04    75.60    72.68
O-EN         500.56   115.48   91.14    88.31
T-DE         486.78   108.50   84.39    81.41
T-FR         463.58   94.59    71.24    68.37
T-IT         476.05   102.69   79.23    76.36
T-NL         490.09   110.67   86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99   88.46    67.35    64.40
O-EN         415.47   99.92    79.27    76.34
T-DE         404.64   95.22    73.73    70.85
T-FR         395.99   89.44    68.38    65.54
T-IT         384.55   81.90    60.85    57.91
T-NL         411.58   98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89   90.73    69.05    66.08
O-EN         448.11   100.17   78.23    75.46
T-DE         437.68   93.67    71.54    68.57
T-FR         445.00   97.32    75.59    72.55
T-IT         448.11   100.19   78.06    75.19
T-NL         423.13   83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       -
T-DE         122.50       7.16
T-FR         99.52        24.58
T-IT         112.71       14.58
T-NL         126.44       4.17


ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          93.88        18.51
O-EN         115.20       -
T-DE         107.48       6.70
T-FR         88.96        22.77
T-IT         99.17        13.91
T-NL         110.72       3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        -
T-DE         38.67        4.81
T-FR         34.75        14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         -
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


MACHINE TRANSLATION RESULTS

BLEU scores per task and LM (in each column, the LM built from texts translated from the task's source language scores best):

LM     DE-EN   FR-EN   IT-EN   NL-EN
MIX    21.43   28.67   25.41   24.20
O-EN   21.10   27.98   24.69   23.40
T-DE   21.90   28.01   24.62   24.26
T-FR   21.16   29.14   25.37   23.56
T-IT   21.29   28.75   25.96   23.87
T-NL   21.20   28.11   24.77   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O Finally which is serious in the report of Mr Olivier is that itproposes a constitution tripotage

T Finally and this is serious in the report by Mr olivier it is because itproposes a constitution tripotage

SOURCE C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O Even when it is something of valuable which has been pointed out byall the members of the European Council

T It is nevertheless something of a valuable which has been pointedout by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE Une telle Europe serait un gage de paix et marquerait le refusde tout nationalisme ethnique

O Such a Europe would be a show of peace and would the rejection ofany ethnic nationalism

T Such a Europe would be a show of peace and would mark therefusal of all ethnic nationalism

SOURCE Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O Your report Mrs Sudre its emphasis quite rightly on the need toact in the long term

T Your report Mrs Sudre places the emphasis quite rightly on theneed to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O We do not say nothing more on the responsibility of themanufacturers particularly in Britain which were the first responsible

T We do not say anything either on the responsibility of themanufacturers particularly in great Britain who were the firstresponsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task     S → T    T → S
FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset    S → T    T → S
250K             34.35    31.33
500K             35.21    32.38
750K             36.12    32.90
1M               35.73    33.07
1.25M            36.24    33.23
1.5M             36.43    33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset    S → T    T → S
250K             27.74    26.58
500K             29.15    27.19
750K             29.43    27.63
1M               29.94    27.88
1.25M            30.63    27.84
1.5M             29.89    27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System      MIX     MIX-EO   MIX-FO
S → T       35.21   35.21    35.73
UNION       35.27   35.36    35.94
PPLMIN-1    35.46   35.59    36.26
PPLMIN-2    35.75   35.65    36.20
CrEnt       35.54   35.45    36.75
PplRatio    35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System      MIX     MIX-FO   MIX-EO
S → T       29.15   29.15    29.94
UNION       29.27   29.44    30.01
PPLMIN-1    29.64   29.94    29.65
PPLMIN-2    29.50   30.45    30.12
CrEnt       29.47   29.45    30.44
PplRatio    29.65   29.62    30.34


The Power of Interference



THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identifiedby looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations into English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist            Dist_w
Feature               EN T    FR T    EN T    FR T
POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree           2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


NATIVE LANGUAGE IDENTIFICATION

       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)  F1
ARA    80   0    2    1    3    4    1    0    4    2    3      80.8   80.0   80.4
CHI    3    80   0    1    1    0    6    7    1    0    1      88.9   80.0   84.2
FRE    2    2    81   5    1    2    1    0    3    0    3      86.2   81.0   83.5
GER    1    1    1    93   0    0    0    1    1    0    2      87.7   93.0   90.3
HIN    2    0    0    1    77   1    0    1    5    9    4      74.8   77.0   75.9
ITA    2    0    3    1    1    87   1    0    3    0    2      82.1   87.0   84.5
JPN    2    1    1    2    0    1    87   5    0    0    1      78.4   87.0   82.5
KOR    1    5    2    0    1    0    9    81   1    0    0      80.2   81.0   80.6
SPA    2    0    2    0    1    8    2    1    78   1    5      77.2   78.0   77.6
TEL    0    1    0    0    18   1    2    1    1    73   3      85.9   73.0   78.9
TUR    4    0    2    2    0    2    2    4    4    0    80     76.9   80.0   78.4


Conclusion



CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:
Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.



BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.



BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.



BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.



BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

Translated texts differ from original ones

The differences do not indicate poor translation, but rather a statistical phenomenon: translationese (Gellerstam, 1986)

Toury (1980, 1995) defines two laws of translation:

THE LAW OF INTERFERENCE: Fingerprints of the source text that are left in the translation product

THE LAW OF GROWING STANDARDIZATION: Effort to standardize the translation product according to existing norms in the target language and culture


Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker 1993)

"features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems"

SIMPLIFICATION (Blum-Kulka and Levenston, 1978, 1983)

EXPLICITATION (Blum-Kulka 1986)

NORMALIZATION (Chesterman 2004)


Introduction Translationese

TRANSLATIONESE: WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al., 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al., 2009; Lembersky et al., 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova, 2014; Aharoni et al., 2014)

Understanding the properties of human translation (Ilisei et al., 2010; Ilisei and Inkpen, 2011; Ilisei, 2013; Volansky et al., 2015; Avner et al., 2016)


Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio)than originals (Al-Shabab 1996)

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa, 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa, 2002)


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  Ratio  LL  Weight
.     37.83    49.79    0.76   T   T1
(      0.42     0.72    0.58   T   T2
'      1.94     2.53    0.77   T   T3
)      0.40     0.72    0.56   T   T4
/      0.30     0.30    1.00   —   —
[      0.01     0.02    0.45   T   —
]      0.01     0.02    0.44   T   —
"      0.33     0.22    1.46   O   O7
!      0.22     0.17    1.25   O   O6
,     38.20    34.60    1.10   O   O5
:      1.20     1.17    1.03   —   O4
?      0.84     0.83    1.01   —   O3
;      1.57     1.11    1.41   O   O2
-      2.68     2.25    1.19   O   O1



Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL. (2012))


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
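The pipeline above can be sketched in a few lines (illustrative only: the talk's experiments use an SVM in Weka; here a trivial nearest-centroid classifier stands in, and the feature values are made-up numbers):

```python
import math

def train_centroids(chunks, labels):
    """Average the feature vectors of each class: a trivial
    nearest-centroid classifier standing in for the SVM."""
    sums, counts = {}, {}
    for vec, y in zip(chunks, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, vec):
    # assign the label of the nearest class centroid
    return min(centroids, key=lambda y: math.dist(centroids[y], vec))

# Each chunk is reduced to numeric features (e.g., TTR, mean word length):
train_X = [[0.50, 4.1], [0.52, 4.3], [0.40, 3.8], [0.38, 3.7]]
train_y = ["O", "O", "T", "T"]
model = train_centroids(train_X, train_y)
print(classify(model, [0.51, 4.2]))  # → O
```

The pipeline shape (train on labeled feature vectors, predict on unseen chunks) is the same regardless of the learner plugged in.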


Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al., 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al., 2013)


Identification of Translationese

IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn, 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using SVM with a default linear kernel
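The chunking step can be sketched as follows (a hypothetical helper, not the original experiment code):

```python
def chunk_corpus(sentences, chunk_size=2000):
    """Split a tokenized corpus into chunks of roughly `chunk_size`
    tokens, always closing a chunk on a sentence boundary."""
    chunks, current, count = [], [], 0
    for sent in sentences:          # each sentence is a list of tokens
        current.append(sent)
        count += len(sent)
        if count >= chunk_size:     # close the chunk at this boundary
            chunks.append([tok for s in current for tok in s])
            current, count = [], 0
    if current:                     # remainder (possibly < chunk_size)
        chunks.append([tok for s in current for tok in s])
    return chunks

sents = [["a"] * 700, ["b"] * 700, ["c"] * 700, ["d"] * 100]
chunks = chunk_corpus(sents, chunk_size=2000)
print([len(c) for c in chunks])  # → [2100, 100]
```

Chunks slightly overshoot 2,000 tokens rather than splitting a sentence, matching the "ending on a sentence boundary" requirement above.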



Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1. Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2. Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3. Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences delimited by consonants or spaces in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
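The three lexical-variety measures can be sketched as follows (an illustrative helper; note that the third measure is undefined when every type is a hapax, i.e. V1 = V):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR variants listed above (illustrative sketch).
    Undefined when all types occur exactly once (V1 == V)."""
    N = len(tokens)
    counts = Counter(tokens)
    V = len(counts)                                  # number of types
    V1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
    return (V / N,
            math.log(V) / math.log(N),
            100 * math.log(N) / (1 - V1 / V))

chunk = "the cat sat on the mat".split()
ttr1, ttr2, ttr3 = lexical_variety(chunk)
print(round(ttr1, 3))  # → 0.833
```

In the experiments these measures are computed per ~2,000-token chunk, so the degenerate all-hapax case does not arise in practice.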


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
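The first two explicitation features can be sketched from a POS-tagged chunk (a hypothetical helper; Penn Treebank tags are assumed: PRP = personal pronoun, NNP = proper noun):

```python
def explicitation_features(tagged):
    """Explicit naming and single naming over a POS-tagged chunk
    (illustrative sketch, not the original feature extractor)."""
    pronouns = sum(1 for _, tag in tagged if tag == "PRP")
    propers = [i for i, (_, tag) in enumerate(tagged) if tag == "NNP"]
    explicit_naming = pronouns / len(propers) if propers else 0.0
    # proper nouns with no proper-noun neighbour on either side
    single = sum(1 for i in propers
                 if i - 1 not in propers and i + 1 not in propers)
    single_naming = single / len(tagged)
    return explicit_naming, single_naming

chunk = [("He", "PRP"), ("met", "VBD"), ("Angela", "NNP"),
         ("Merkel", "NNP"), ("in", "IN"), ("Berlin", "NNP")]
print(explicitation_features(chunk))
```

Here "Angela Merkel" is a multi-token proper noun (so neither token counts as single naming), while "Berlin" does.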


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois : aurait-elle un cœur insensible ? Avez-vous vraiment un cœur de glace ?

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
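The two PMI features can be sketched as follows (a hypothetical helper; PMI is estimated here from the chunk itself, whereas the experiments may use larger reference counts):

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI and threshold PMI (count of bigrams with PMI > 0)
    over a chunk's bigrams; an illustrative sketch."""
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = Counter(tokens), Counter(bigrams)
    n_uni, n_bi = len(tokens), len(bigrams)

    def pmi(pair):
        p_xy = bi[pair] / n_bi
        p_x = uni[pair[0]] / n_uni
        p_y = uni[pair[1]] / n_uni
        return math.log2(p_xy / (p_x * p_y))

    scores = [pmi(b) for b in bigrams]
    return sum(scores) / len(scores), sum(1 for s in scores if s > 0)

avg_pmi, threshold_pmi = pmi_features("a b a b".split())
print(threshold_pmi)  # → 3
```

Highly associated pairs (fixed expressions such as stand trial) get large positive PMI, which is what the threshold feature counts.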


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text; the "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words, and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
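Two of these feature families can be sketched as follows (a hypothetical helper; the slot names and the sentence representation are assumptions for illustration):

```python
from collections import Counter

def interference_features(sentences):
    """Character trigrams and positional token frequency (tokens in
    the first, second, antepenultimate, penultimate, and last
    sentence positions); an illustrative sketch."""
    char_trigrams = Counter()
    positional = Counter()
    for sent in sentences:               # each sentence: list of tokens
        text = " ".join(sent)
        char_trigrams.update(text[i:i + 3] for i in range(len(text) - 2))
        if len(sent) >= 5:               # all five slots must be distinct
            for slot, idx in [("first", 0), ("second", 1),
                              ("antepenult", -3), ("penult", -2),
                              ("last", -1)]:
                positional[(slot, sent[idx])] += 1
    return char_trigrams, positional

sents = [["But", "we", "must", "learn", "from", "mistakes"]]
tri, pos = interference_features(sents)
print(pos[("first", "But")])  # → 1
```

Counting sentence-initial But, for example, directly captures the O-marker discussed later in the talk.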


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: . ( ' ) / [ ] " ! , : ? ; -

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams



Supervised Classification

RESULTS: SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS: SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS: EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS: NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS: NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up on highly associated words and thus to attest to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny, 2001):

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


Supervised Classification Interference

ANALYSIS: INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): these features capture both affixes and function words

For example, typical trigrams in O are -ion and all, whereas -ble and the are typical of T

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND IN TEN TS)


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001)

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark ',' (effectively indicating sentence length) is a strong feature of O; '.' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  Ratio  LL  Weight
.     37.83    49.79    0.76   T   T1
(      0.42     0.72    0.58   T   T2
'      1.94     2.53    0.77   T   T3
)      0.40     0.72    0.56   T   T4
/      0.30     0.30    1.00   —   —
[      0.01     0.02    0.45   T   —
]      0.01     0.02    0.44   T   —
"      0.33     0.22    1.46   O   O7
!      0.22     0.17    1.25   O   O6
,     38.20    34.60    1.10   O   O5
:      1.20     1.17    1.03   —   O4
?      0.84     0.83    1.01   —   O3
;      1.57     1.11    1.41   O   O2
-      2.68     2.25    1.19   O   O1



Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament, 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, 2001–2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but it cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T (with add-ε smoothing over the vocabulary V):

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2^√(JSD(PX || PCi))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label
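A minimal sketch of the divergence computation and the labeling rule, assuming the distributions are given as aligned probability lists over a shared vocabulary:

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

def d_js(p, q):
    """Jensen-Shannon distance: the square root of the divergence."""
    return math.sqrt(jsd(p, q))

def label_clusters(p_O, p_T, p_C1, p_C2):
    """Label C1 as "O" iff C1 is close to the O-markers AND C2 to the T-markers."""
    if d_js(p_O, p_C1) * d_js(p_T, p_C2) < d_js(p_O, p_C2) * d_js(p_T, p_C1):
        return "O", "T"
    return "T", "O"
```

The product in the rule lets strong evidence from either cluster compensate for weak evidence from the other.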

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
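Majority voting over the labelings produced by several feature sets can be sketched as follows (labelings are assumed to be aligned per chunk):

```python
from collections import Counter

def majority_vote(labelings):
    """Combine per-chunk labels from several feature sets by majority.

    labelings: a list of label sequences, one per feature set,
    aligned so that position i always refers to the same chunk."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*labelings)]
```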


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                      EUR    HAN    LIT    TED

FW                                                   88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                    91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                    95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers                 94.1   91.0   79.2   88.6

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

(figures not included in this transcript)

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


method                              EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain            93.7      99.5        99.8        92.2
XMeans: generated # of clusters        2         2           3           3
XMeans: accuracy by domain            93.6      99.5          –         92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


method       EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat           92.5      60.7       77.5        66.8
Two-phase      91.3      79.4       85.3        67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el

The best translation:

Ê = argmaxE P(E | F)
  = argmaxE P(F | E) × P(E) / P(F)
  = argmaxE P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmaxE∈English P(F | E) × P(E)
    where P(F | E) is the translation model and P(E) is the language model
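The final argmax can be illustrated with a toy reranker over candidate translations; `tm_logprob` and `lm_logprob` are stand-ins for scores produced by real translation and language models:

```python
def best_translation(candidates, tm_logprob, lm_logprob, foreign):
    """argmax over E of log P(F | E) + log P(E): pick the candidate that
    best balances faithfulness (translation model) and fluency (language model)."""
    return max(candidates, key=lambda e: tm_logprob(foreign, e) + lm_logprob(e))
```

In the toy usage below, both candidates are equally faithful, so the language model breaks the tie in favor of the fluent word order.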


A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)
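A toy sentence-level BLEU, for illustration only; real evaluations use corpus-level BLEU with multiple references and smoothing, as in Papineni et al.'s original definition:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        # Floor zero precisions to avoid log(0); a crude stand-in for smoothing.
        log_prec += math.log(overlap / sum(cand.values())) if overlap else math.log(1e-9)
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_prec / max_n)
```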


Applications for Machine Translation Language Models

LANGUAGE MODELS
RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS
METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏i=1..N 1 / PLM(wi | w1 … wi−1) )^(1/N)
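The formula above is computed in log space in practice, to avoid underflow on long texts; a sketch assuming the per-token conditional probabilities have already been obtained from the LM:

```python
import math

def perplexity(token_probs):
    """Perplexity over a test text, given P_LM(w_i | w_1..w_{i-1}) per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

Lower perplexity means the model is less "surprised" by the reference corpus, i.e., a better fit.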

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang.   1-gram    2-gram    3-gram    4-gram
Mix           451.50     93.00     69.36     66.47
O-EN          468.09    103.74     79.57     76.79
T-DE          443.14     88.48     64.99     62.07
T-FR          460.98     99.90     76.23     73.38
T-IT          465.89    102.31     78.50     75.67
T-NL          457.02     97.34     73.54     70.56

French-to-English translations

Orig. Lang.   1-gram    2-gram    3-gram    4-gram
Mix           472.05     99.04     75.60     72.68
O-EN          500.56    115.48     91.14     88.31
T-DE          486.78    108.50     84.39     81.41
T-FR          463.58     94.59     71.24     68.37
T-IT          476.05    102.69     79.23     76.36
T-NL          490.09    110.67     86.61     83.55

Italian-to-English translations

Orig. Lang.   1-gram    2-gram    3-gram    4-gram
Mix           395.99     88.46     67.35     64.40
O-EN          415.47     99.92     79.27     76.34
T-DE          404.64     95.22     73.73     70.85
T-FR          395.99     89.44     68.38     65.54
T-IT          384.55     81.90     60.85     57.91
T-NL          411.58     98.78     76.98     73.94

Dutch-to-English translations

Orig. Lang.   1-gram    2-gram    3-gram    4-gram
Mix           434.89     90.73     69.05     66.08
O-EN          448.11    100.17     78.23     75.46
T-DE          437.68     93.67     71.54     68.57
T-FR          445.00     97.32     75.59     72.55
T-IT          448.11    100.19     78.06     75.19
T-NL          423.13     83.99     62.17     59.09

LANGUAGE MODELS
ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX             105.91        19.73
O-EN            131.94          –
T-DE            122.50         7.16
T-FR             99.52        24.58
T-IT            112.71        14.58
T-NL            126.44         4.17

Named-entity abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX              93.88        18.51
O-EN            115.20          –
T-DE            107.48         6.70
T-FR             88.96        22.77
T-IT             99.17        13.91
T-NL            110.72         3.89

Noun abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX              36.02        11.34
O-EN             40.62          –
T-DE             38.67         4.81
T-FR             34.75        14.46
T-IT             36.85         9.30
T-NL             39.44         2.91

Part-of-speech abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX               7.99         2.66
O-EN              8.20          –
T-DE              8.08         1.47
T-FR              7.89         3.77
T-IT              8.00         2.47
T-NL              8.11         1.11

MACHINE TRANSLATION RESULTS

DE to EN             FR to EN             IT to EN             NL to EN
LM      BLEU         LM      BLEU         LM      BLEU         LM      BLEU
MIX     21.43        MIX     28.67        MIX     25.41        MIX     24.20
O-EN    21.10        O-EN    27.98        O-EN    24.69        O-EN    23.40
T-DE    21.90        T-DE    28.01        T-DE    24.62        T-DE    24.26
T-FR    21.16        T-FR    29.14        T-FR    25.37        T-FR    23.56
T-IT    21.29        T-IT    28.75        T-IT    25.96        T-IT    23.87
T-NL    21.20        T-NL    28.11        T-NL    24.77        T-NL    24.52

DOES SIZE MATTER

(figures not included in this transcript)

TRANSLATION EXAMPLES
COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES
VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES
INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS
RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS
DIRECTION MATTERS

Task     S → T    T → S

FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28

Task: French-to-English

Corpus subset    S → T    T → S
250K             34.35    31.33
500K             35.21    32.38
750K             36.12    32.90
1M               35.73    33.07
1.25M            36.24    33.23
1.5M             36.43    33.73

Task: English-to-French

Corpus subset    S → T    T → S
250K             27.74    26.58
500K             29.15    27.19
750K             29.43    27.63
1M               29.94    27.88
1.25M            30.63    27.84
1.5M             29.89    27.83

TRANSLATION MODELS
DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation


TRANSLATION MODELS
ADAPTATION RESULTS

Task: French-to-English

System      MIX      MIX-EO   MIX-FO
S → T       35.21    35.21    35.73
UNION       35.27    35.36    35.94
PPLMIN-1    35.46    35.59    36.26
PPLMIN-2    35.75    35.65    36.20
CrEnt       35.54    35.45    36.75
PplRatio    35.59    35.78    36.22

Task: English-to-French

System      MIX      MIX-FO   MIX-EO
S → T       29.15    29.15    29.94
UNION       29.27    29.44    30.01
PPLMIN-1    29.64    29.94    29.65
PPLMIN-2    29.50    30.45    30.12
CrEnt       29.47    29.45    30.44
PplRatio    29.65    29.62    30.34

The Power of Interference


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test    IT    FR    ES    DE    FI
IT              99    93    91    75    63
FR              90    98    90    83    70
ES              85    90    98    82    69
DE              73    74    74    98    70
FI              58    70    64    81    99
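A cross-classification experiment of this shape can be sketched with a simple nearest-centroid classifier (a stand-in for the stronger classifiers actually used); `train_by_class` would hold feature vectors of chunks translated from L1, and `test_vectors` chunks translated from L2:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_centroid(train_by_class, test_vectors):
    """Fit one centroid per class, then assign each test vector
    to the class with the nearest centroid (squared Euclidean)."""
    cents = {label: centroid(vecs) for label, vecs in train_by_class.items()}
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(cents, key=lambda lab: dist2(v, cents[lab])) for v in test_vectors]
```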


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments
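The hierarchical grouping can be illustrated with a toy single-linkage agglomerative clusterer over per-language feature vectors; the names and vectors in the test are fabricated for illustration only:

```python
import math

def single_linkage_tree(names, vecs):
    """Bottom-up single-linkage clustering; returns a nested-tuple 'tree'.

    At each step, merge the two clusters whose closest members
    are nearest to each other (single linkage)."""
    clusters = [(name, [v]) for name, v in zip(names, vecs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

With vectors derived from translationese features, languages whose interference patterns are similar end up merged early, yielding a tree that can be compared against the gold typological tree.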


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

(figure not included in this transcript)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

(figure not included in this transcript)


EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

(figure not included in this transcript)


EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                       Dist             DistW
Feature              EN T   FR T     EN T   FR T
POS-trigrams         1.02   1.16     1.07   1.34
positional tokens    1.10   1.15     1.30   1.37
function words       1.32   1.82     1.43   1.90
cohesive markers     1.95   1.95     2.15   2.03
feature combination  1.00   1.46     1.13   1.58
random tree             2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)   F1
ARA     80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI      3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE      2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER      1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN      2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA      2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN      2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR      1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA      2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL      0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR      4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4
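The per-class precision, recall, and F1 reported in the table can be derived directly from the confusion matrix; a sketch (rows are true classes, columns predicted classes):

```python
def prf(confusion, i):
    """Precision, recall, and F1 for class i of a square confusion matrix."""
    tp = confusion[i][i]
    fp = sum(row[i] for row in confusion) - tp  # predicted i, but true class differs
    fn = sum(confusion[i]) - tp                 # true i, but predicted otherwise
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```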


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

Translated texts differ from original ones

The differences do not indicate poor translation, but rather a statistical phenomenon: translationese (Gellerstam, 1986)

Toury (1980, 1995) defines two laws of translation:

THE LAW OF INTERFERENCE: Fingerprints of the source text that are left in the translation product

THE LAW OF GROWING STANDARDIZATION: Effort to standardize the translation product according to existing norms in the target language and culture


TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker, 1993):

"features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems"

SIMPLIFICATION (Blum-Kulka and Levenston, 1978, 1983)

EXPLICITATION (Blum-Kulka, 1986)

NORMALIZATION (Chesterman, 2004)


TRANSLATIONESE: WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al., 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al., 2009; Lembersky et al., 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova, 2014; Aharoni et al., 2014)

Understanding the properties of human translation (Ilisei et al., 2010; Ilisei and Inkpen, 2011; Ilisei, 2013; Volansky et al., 2015; Avner et al., 2016)


CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab, 1996)

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa, 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa, 2002)


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq O  Freq T  Ratio  LL  Weight
,      37.83   49.79   0.76   T   T1
(       0.42    0.72   0.58   T   T2
'       1.94    2.53   0.77   T   T3
)       0.40    0.72   0.56   T   T4
…       0.30    0.30   1.00   —   —
[       0.01    0.02   0.45   T   —
]       0.01    0.02   0.44   T   —
"       0.33    0.22   1.46   O   O7
…       0.22    0.17   1.25   O   O6
.      38.20   34.60   1.10   O   O5
…       1.20    1.17   1.03   —   O4
…       0.84    0.83   1.01   —   O3
…       1.57    1.11   1.41   O   O2
-       2.68    2.25   1.19   O   O1


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))


TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
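The pipeline on this slide can be sketched in miniature. This is a hypothetical illustration only: the mini function-word list is invented, and a nearest-centroid rule stands in for the SVM actually used in the experiments.

```python
from collections import Counter

# Hypothetical mini feature set; the experiments use a much larger list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]

def feature_vector(tokens):
    """Represent a text chunk by normalized function-word frequencies."""
    counts = Counter(t.lower() for t in tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def train_centroids(labeled_chunks):
    """Average the vectors of each class (a stand-in for training an SVM)."""
    sums, sizes = {}, {}
    for tokens, label in labeled_chunks:
        vec = feature_vector(tokens)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        sizes[label] = sizes.get(label, 0) + 1
    return {lab: [v / sizes[lab] for v in acc] for lab, acc in sums.items()}

def classify(tokens, centroids):
    """Assign the label of the nearest class centroid."""
    vec = feature_vector(tokens)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(vec, centroids[lab]))
```

The same shape (extract numeric features per chunk, train, test on held-out chunks) underlies all the classifiers discussed below.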


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al., 2003)

Tell Shakespeare from Marlowe (Juola, 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al., 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn, 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using SVM with a default linear kernel
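The chunk-partitioning step above can be sketched as follows; a minimal illustration (`chunk_corpus` and its greedy packing are assumptions, not the talk's actual preprocessing code):

```python
def chunk_corpus(sentences, chunk_size=2000):
    """Greedily pack tokenized sentences into chunks of roughly
    `chunk_size` tokens, always ending on a sentence boundary."""
    chunks, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= chunk_size:
            chunks.append(current)
            current = []
    if current:          # keep the (shorter) final chunk, if any
        chunks.append(current)
    return chunks
```

Ending each chunk on a sentence boundary keeps sentence-level features (e.g., positional token frequency) well defined within every chunk.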


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwerea, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
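The three TTR variants listed above can be computed directly from a chunk's token counts. A minimal sketch (the function name and the natural-log base are assumptions; the slides do not specify the log base):

```python
import math
from collections import Counter

def ttr_measures(tokens):
    """Three type-token ratio variants from the feature list:
    V/N, log(V)/log(N), and 100*log(N)/(1 - V1/V),
    where V1 is the number of types occurring exactly once.
    Note: the third measure is undefined when every type is a
    hapax (V1 == V); real chunks of ~2,000 tokens avoid this."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))
```

Each value becomes one numeric feature of the chunk's vector.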


EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu ('the Israeli Prime Minister Benjamin Netanyahu'); O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ('Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?')

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ('I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?')

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: Number of bigrams in the chunk whose PMI is above 0
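The two PMI-based normalization features can be sketched as below. This is a simplified illustration: probabilities are estimated from the chunk itself, whereas the experiments presumably estimate them from larger counts.

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI of the chunk's bigrams, and the number of bigram
    types with PMI > 0 (a crude proxy for fixed expressions).
    PMI(w1, w2) = log p(w1, w2) / (p(w1) * p(w2))."""
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = Counter(tokens), Counter(bigrams)
    n_uni, n_bi = len(tokens), len(bigrams)

    def pmi(w1, w2):
        p_joint = bi[(w1, w2)] / n_bi
        return math.log(p_joint / ((uni[w1] / n_uni) * (uni[w2] / n_uni)))

    scores = [pmi(w1, w2) for w1, w2 in set(bigrams)]
    avg = sum(scores) / len(scores)
    above = sum(1 for s in scores if s > 0)
    return avg, above
```

Highly associated bigrams (stand trial, stand firm) push the threshold count up; the later results slide shows O scores higher here than T.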


INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
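Two of these interference features are easy to sketch. A minimal illustration (function names, the 5-token sentence cutoff, and lowercasing are assumptions, not the experiments' exact implementation):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, modeling shallow morphology."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Count tokens appearing in the first, second, antepenultimate,
    penultimate, and last positions of each sentence."""
    feats = Counter()
    for sent in sentences:
        if len(sent) < 5:      # skip sentences too short for 5 slots
            continue
        slots = ("first", "second", "antepenult", "penult", "last")
        toks = (sent[0], sent[1], sent[-3], sent[-2], sent[-1])
        for slot, tok in zip(slots, toks):
            feats[(slot, tok.lower())] += 1
    return feats
```

Features such as ("first", "but") directly capture the sentence-initial But pattern analyzed later.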


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


RESULTS SANITY CHECK

Category  Feature         Accuracy (%)

Sanity    Token unigrams  100
          Token bigrams   100


RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)

Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


RESULTS EXPLICITATION

Category       Feature               Accuracy (%)

Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


RESULTS NORMALIZATION

Category       Feature        Accuracy (%)

Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)

Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams is an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001)

This is in fact standardization rather than interference


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)

Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)

Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq O  Freq T  Ratio  LL  Weight
,      37.83   49.79   0.76   T   T1
(       0.42    0.72   0.58   T   T2
'       1.94    2.53   0.77   T   T3
)       0.40    0.72   0.56   T   T4
…       0.30    0.30   1.00   —   —
[       0.01    0.02   0.45   T   —
]       0.01    0.02   0.44   T   —
"       0.33    0.22   1.46   O   O7
…       0.22    0.17   1.25   O   O6
.      38.20   34.60   1.10   O   O5
…       1.20    1.17   1.03   —   O4
…       0.84    0.83   1.01   —   O3
…       1.57    1.11   1.41   O   O2
-       2.68    2.25   1.19   O   O1


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus   EUR   HAN   LIT   TED

FW                 96.3  98.1  97.3   97.7
char-trigrams      98.8  97.1  99.5  100.0
POS-trigrams       98.5  97.2  98.7   92.0
contextual FW      95.2  96.8  94.1   86.3
cohesive markers   83.6  86.9  78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation

EUR            —     60.8  56.2  96.3
HAN            59.7  —     58.7  98.1
LIT            64.3  61.5  —     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation

EUR + HAN      —     —     63.8  94.0
EUR + LIT      —     64.1  —     92.9
HAN + LIT      59.8  —     —     96.0


CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus   EUR   HAN   LIT   TED

FW                 88.6  88.9  78.8  87.5
char-trigrams      72.1  63.8  70.3  78.6
POS-trigrams       96.9  76.0  70.7  76.1
contextual FW      92.9  93.2  68.2  67.0
cohesive markers   63.1  81.2  67.1  63.0


CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T, with add-ε smoothing over the vocabulary V:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2 · √(JSD(PX ‖ PCi))
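The labeling computation can be sketched as follows. A minimal illustration over ready-made probability vectors; the function names are assumptions, and base-2 logs are assumed so that JSD is bounded by 1.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence with base-2 logs."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label cluster 1 as "O" iff that assignment is supported by both
    C1's proximity to O and C2's proximity to T (the slides' rule)."""
    d = lambda x, c: 2 * math.sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return ("O", "T")
    return ("T", "O")
```

Because the rule compares products of distances, it uses evidence from both clusters at once rather than labeling each cluster independently.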


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets


CLUSTERING CONSENSUS

method \ corpus                                EUR   HAN   LIT   TED

FW                                             88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams              91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW              95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers           94.1  91.0  79.2  88.6


SENSITIVITY ANALYSIS



MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


MIXED-DOMAIN CLASSIFICATION

method \ corpora              EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

KMeans: accuracy by domain    93.7     99.5     99.8         92.2
XMeans: # clusters generated  2        2        3            3
XMeans: accuracy by domain    93.6     99.5     —            92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora   EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

Flat               92.5     60.7     77.5         66.8
Two-phase          91.3     79.4     85.3         67.5


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 · · · fm into an English sentence E = e1 · · · el.

The best translation:

Ê = argmaxE P(E | F)
  = argmaxE P(F | E) × P(E) / P(F)
  = argmaxE P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmaxE∈English P(F | E) × P(E)
                    [translation model] [language model]


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1, w2, . . . , wN) = ( ∏i=1..N 1 / PLM(wi | w1 . . . wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)
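The perplexity computation can be sketched directly from its definition; a minimal illustration in which the language model is abstracted as a function returning log2-probabilities (the function names are assumptions):

```python
import math

def perplexity(logprob_fn, tokens):
    """Perplexity of a token sequence under a language model.
    `logprob_fn(history, word)` returns log2 P(word | history);
    lower perplexity means the LM fits the text better."""
    total = sum(logprob_fn(tokens[:i], w) for i, w in enumerate(tokens))
    return 2 ** (-total / len(tokens))
```

For instance, a model that assigns every word probability 1/4 (log2-probability −2) has perplexity 4 on any sequence, matching the intuition that perplexity is the effective branching factor.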


PERPLEXITY RESULTS

German to English translations

Orig Lang  1-gram  2-gram  3-gram  4-gram
Mix        451.50   93.00   69.36   66.47
O-EN       468.09  103.74   79.57   76.79
T-DE       443.14   88.48   64.99   62.07
T-FR       460.98   99.90   76.23   73.38
T-IT       465.89  102.31   78.50   75.67
T-NL       457.02   97.34   73.54   70.56


PERPLEXITY RESULTS

French to English translations

Orig Lang  1-gram  2-gram  3-gram  4-gram
Mix        472.05   99.04   75.60   72.68
O-EN       500.56  115.48   91.14   88.31
T-DE       486.78  108.50   84.39   81.41
T-FR       463.58   94.59   71.24   68.37
T-IT       476.05  102.69   79.23   76.36
T-NL       490.09  110.67   86.61   83.55


PERPLEXITY RESULTS

Italian to English translations

Orig Lang  1-gram  2-gram  3-gram  4-gram
Mix        395.99   88.46   67.35   64.40
O-EN       415.47   99.92   79.27   76.34
T-DE       404.64   95.22   73.73   70.85
T-FR       395.99   89.44   68.38   65.54
T-IT       384.55   81.90   60.85   57.91
T-NL       411.58   98.78   76.98   73.94


PERPLEXITY RESULTS

Dutch to English translations

Orig Lang  1-gram  2-gram  3-gram  4-gram
Mix        434.89   90.73   69.05   66.08
O-EN       448.11  100.17   78.23   75.46
T-DE       437.68   93.67   71.54   68.57
T-FR       445.00   97.32   75.59   72.55
T-IT       448.11  100.19   78.06   75.19
T-NL       423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No Punctuation

Orig Lang  Perplexity  Improvement (%)
MIX        105.91      19.73
O-EN       131.94      —
T-DE       122.50       7.16
T-FR        99.52      24.58
T-IT       112.71      14.58
T-NL       126.44       4.17


ABSTRACTION RESULTS

Named Entity Abstraction

Orig Lang  Perplexity  Improvement (%)
MIX         93.88      18.51
O-EN       115.20      —
T-DE       107.48       6.70
T-FR        88.96      22.77
T-IT        99.17      13.91
T-NL       110.72       3.89


ABSTRACTION RESULTS

Noun Abstraction

Orig Lang  Perplexity  Improvement (%)
MIX        36.02       11.34
O-EN       40.62       —
T-DE       38.67        4.81
T-FR       34.75       14.46
T-IT       36.85        9.30
T-NL       39.44        2.91


ABSTRACTION RESULTS

Part-of-speech abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20        —
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to ENLM BLEU

MIX 2143O-EN 2110T-DE 2190T-FR 2116T-IT 2129T-NL 2120

FR to ENLM BLEU

MIX 2867O-EN 2798T-DE 2801T-FR 2914T-IT 2875T-NL 2811

IT to ENLM BLEU

MIX 2541O-EN 2469T-DE 2462T-FR 2537T-IT 2596T-NL 2477

NL to ENLM BLEU

MIX 2420O-EN 2340T-DE 2426T-FR 2356T-IT 2387T-NL 2452


Applications for Machine Translation Language Models

DOES SIZE MATTER?




Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be adapted to the unique properties of the translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T   T → S
FR-EN  33.64   30.88
EN-FR  32.11   30.35
DE-EN  26.53   23.67
EN-DE  16.96   16.17
IT-EN  28.70   26.84
EN-IT  23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S → T   T → S
250K           34.35   31.33
500K           35.21   32.38
750K           36.12   32.90
1M             35.73   33.07
1.25M          36.24   33.23
1.5M           36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S → T   T → S
250K           27.74   26.58
500K           29.15   27.19
750K           29.43   27.63
1M             29.94   27.88
1.25M          30.63   27.84
1.5M           29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation
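The perplexity-minimization idea can be pictured with a stand-alone sketch: grid-search the interpolation weight between two component models so that the mixture minimizes perplexity on a development set. Sennrich's method interpolates phrase-table feature distributions; here, purely for illustration, two unigram word models stand in, and all the data is invented:

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram distribution."""
    c = Counter(tokens)
    n = len(tokens)
    return {w: c[w] / n for w in c}

def mix_perplexity(p1, p2, lam, dev, floor=1e-9):
    """Perplexity of the lam*p1 + (1-lam)*p2 mixture on dev tokens."""
    lp = 0.0
    for w in dev:
        p = lam * p1.get(w, 0.0) + (1 - lam) * p2.get(w, 0.0)
        lp += math.log2(max(p, floor))
    return 2 ** (-lp / len(dev))

def best_lambda(p1, p2, dev, steps=101):
    """Grid search for the weight that minimizes dev perplexity."""
    return min((i / (steps - 1) for i in range(steps)),
               key=lambda lam: mix_perplexity(p1, p2, lam, dev))

p_same = mle("we support free trade we support it".split())
p_other = mle("fisheries policy reform is overdue".split())
dev = "we support free trade".split()
# The dev data matches the first component, so it receives most of the weight.
assert best_lambda(p_same, p_other, dev) > 0.9
```

In the adaptation experiments the two components would be the S → T and T → S subsets, and the dev set comes from the translation direction of the task.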


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX     MIX-EO  MIX-FO
S → T     35.21   35.21   35.73
UNION     35.27   35.36   35.94
PPLMIN-1  35.46   35.59   36.26
PPLMIN-2  35.75   35.65   36.20
CrEnt     35.54   35.45   36.75
PplRatio  35.59   35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX     MIX-FO  MIX-EO
S → T     29.15   29.15   29.94
UNION     29.27   29.44   30.01
PPLMIN-1  29.64   29.94   29.65
PPLMIN-2  29.50   30.45   30.12
CrEnt     29.47   29.45   30.44
PplRatio  29.65   29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

     IT   FR   ES   DE   FI
IT   99   93   91   75   63
FR   90   98   90   83   70
ES   85   90   98   82   69
DE   73   74   74   98   70
FI   58   70   64   81   99
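The cross-classification protocol above can be sketched end to end with a toy nearest-centroid classifier standing in for the SVM; the two-dimensional feature vectors (think type-token ratio and pronoun rate) are invented:

```python
# Train an O-vs-T classifier on chunks translated from one language (L1),
# then apply it to chunks translated from another language (L2).
def centroid(vecs):
    return [sum(xs) / len(xs) for xs in zip(*vecs)]

def classify(v, c_o, c_t):
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "O" if d(v, c_o) < d(v, c_t) else "T"

# Hypothetical French-based training chunks: T shows lower lexical variety.
train_O_fr = [[0.52, 0.10], [0.55, 0.11], [0.50, 0.09]]
train_T_fr = [[0.40, 0.15], [0.42, 0.16], [0.38, 0.14]]
c_o, c_t = centroid(train_O_fr), centroid(train_T_fr)

# Chunks translated from Italian: the same translationese signal carries over.
test_T_it = [[0.41, 0.15], [0.39, 0.16]]
assert all(classify(v, c_o, c_t) == "T" for v in test_T_it)
```

The matrix above is the real-data version of this experiment: accuracy stays high off the diagonal when the training and test source languages are related (IT/FR/ES), and drops for distant pairs (FI).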


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments
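The clustering step can be sketched with a minimal average-linkage agglomerative procedure; the cluster names and two-dimensional vectors (stand-ins for POS-trigram distributions) are invented:

```python
# Minimal average-linkage agglomerative clustering over toy feature vectors.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(points):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = {name: [vec] for name, vec in points.items()}
    merges = []
    while len(clusters) > 1:
        # Average pairwise distance between the members of two clusters.
        pairs = [(sum(dist(a, b) for a in clusters[x] for b in clusters[y])
                  / (len(clusters[x]) * len(clusters[y])), x, y)
                 for i, x in enumerate(clusters) for y in list(clusters)[i + 1:]]
        _, x, y = min(pairs)
        clusters[x + "+" + y] = clusters.pop(x) + clusters.pop(y)
        merges.append((x, y))
    return merges

# Translations from closer source languages should merge first.
points = {"T-ES": [0.0, 0.1], "T-IT": [0.1, 0.1], "T-FI": [2.0, 3.0]}
merges = cluster(points)
assert merges[0] == ("T-ES", "T-IT")  # the Romance pair merges before Finnish
```

Recording the merge order (and merge distances) is what yields the dendrograms compared against the gold language tree below.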


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist           Distw
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0  80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0  84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0  83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0  90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0  75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0  84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0  82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0  80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0  77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0  78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0  78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best-performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky Noam Ordan Ella Rabinovich Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications 2(1–2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities 30(1):98–118, April 2015.



Introduction Translationese

TRANSLATIONESE: THE LANGUAGE OF TRANSLATED TEXTS

TRANSLATION UNIVERSALS (Baker 1993)

"features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems"

SIMPLIFICATION (Blum-Kulka and Levenston 1978 1983)

EXPLICITATION (Blum-Kulka 1986)

NORMALIZATION (Chesterman 2004)


Introduction Translationese

TRANSLATIONESE: WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al 2009; Lembersky et al 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova 2014; Aharoni et al 2014)

Understanding the properties of human translation (Ilisei et al 2010; Ilisei and Inkpen 2011; Ilisei 2013; Volansky et al 2015; Avner et al 2016)


Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab 1996)

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa 2002)


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark    O      T      Ratio   LL   Weight
,       37.83  49.79  0.76    T    T1
(        0.42   0.72  0.58    T    T2
'        1.94   2.53  0.77    T    T3
)        0.40   0.72  0.56    T    T4
'        0.30   0.30  1.00    —    —
[        0.01   0.02  0.45    T    —
]        0.01   0.02  0.44    T    —
"        0.33   0.22  1.46    O    O7
"        0.22   0.17  1.25    O    O6
.       38.20  34.60  1.10    O    O5
:        1.20   1.17  1.03    —    O4
;        0.84   0.83  1.01    —    O3
?        1.57   1.11  1.41    O    O2
-        2.68   2.25  1.19    O    O1


Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
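The steps above can be illustrated with a tiny linear classifier; the talk used an SVM in Weka, so the perceptron here is only a stand-in, and the two-dimensional feature vectors and labels are invented:

```python
# Toy mistake-driven perceptron: +1 = original (O), -1 = translated (T).
def train_perceptron(data, epochs=20):
    w = [0.0] * (len(data[0][0]) + 1)           # last weight is the bias
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else -1
            if pred != y:                        # update only on mistakes
                for i, xi in enumerate(x + [1.0]):
                    w[i] += y * xi
    return w

# Invented feature vectors (e.g., lexical variety vs. cohesive-marker rate).
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1),       # O chunks
        ([0.2, 0.8], -1), ([0.3, 0.7], -1)]     # T chunks
w = train_perceptron(data)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x + [1.0]))
assert score([0.85, 0.15]) > 0 and score([0.25, 0.75]) < 0
```

The learned weight vector plays the role noted on the slide: inspecting which features receive large weights is what makes the classifier interpretable.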


Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al2013)


Identification of Translationese

IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al (2010) Ilisei and Inkpen (2011) Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al (2015) Avner et al (2016)


Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al 2009), using SVM with a default linear kernel
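The chunking step is simple to sketch: pack whole sentences greedily until a chunk reaches the target size, so that every chunk ends on a sentence boundary. The tiny chunk size below is for demonstration only (the talk used roughly 2,000 tokens):

```python
def chunk_sentences(sentences, chunk_size):
    """Greedily pack whole sentences until a chunk reaches chunk_size tokens."""
    chunks, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= chunk_size:   # close the chunk on a sentence boundary
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)           # final, possibly short, chunk
    return chunks

sents = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
chunks = chunk_sentences(sents, chunk_size=5)
assert [len(c) for c in chunks] == [7, 7]
```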


Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types, N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N, log(V)/log(N), 100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
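The three lexical-variety measures can be computed directly from token counts; a sketch following the formulas on the slide:

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR variants: V/N, log V / log N,
    and 100 * log N / (1 - V1/V), where V1 counts hapax legomena."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))

tokens = "the cat sat on the mat and the dog sat too".split()
ttr, log_ttr, bilog = lexical_variety(tokens)
assert ttr == 8 / 11  # 8 types, 11 tokens
```

Note that the third measure is undefined when every type is a hapax (V1 = V); real chunks of ~2,000 tokens never hit that case.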


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, …)
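A sketch of the explicit-naming feature over (token, POS) pairs; the tag names follow the Penn Treebank convention, and the sample chunk is invented:

```python
def explicit_naming(tagged):
    """Ratio of personal pronouns (PRP) to proper nouns (NNP/NNPS)."""
    pron = sum(1 for _, t in tagged if t == "PRP")
    propn = sum(1 for _, t in tagged if t in ("NNP", "NNPS"))
    return pron / propn if propn else 0.0

chunk = [("He", "PRP"), ("met", "VBD"), ("Merkel", "NNP"),
         ("and", "CC"), ("she", "PRP"), ("agreed", "VBD")]
assert explicit_naming(chunk) == 2.0  # two pronouns, one proper noun
```

A lower ratio in translated text would be consistent with the explicitation hypothesis: translators tend to replace pronouns with full names.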


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS The frequency of content words (words tagged as nounsverbs adjectives or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS BEN-ARI (1998))O Manchmal denke ich Haben Sie vielleicht ein kaltes Herz HabenSie wirklich ein kaltes Herz

T Je pense quelquefois aurait-elle un cœur insensible Avez-vousvraiment un cœur de glace

CONTRACTIONS The ratio of contracted forms to their counterpart full forms

AVERAGE PMI The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI The number of bigrams in the chunk whose PMI is above 0
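The two PMI features can be sketched as follows. For simplicity this estimates all probabilities from the chunk itself; in practice the counts would come from a larger reference corpus (an assumption of this sketch, not stated on the slides):

```python
from collections import Counter
from math import log2

def pmi_features(tokens):
    """Average PMI over all bigrams in a chunk, and the number of
    bigrams whose PMI is above 0 (the threshold-PMI feature)."""
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = Counter(tokens), Counter(bigrams)
    n_uni, n_bi = len(tokens), len(bigrams)
    def pmi(x, y):
        # PMI(x, y) = log2( p(x,y) / (p(x) * p(y)) )
        return log2((bi[(x, y)] / n_bi) / ((uni[x] / n_uni) * (uni[y] / n_uni)))
    scores = [pmi(x, y) for x, y in bigrams]
    avg = sum(scores) / len(scores)
    above_zero = sum(1 for s in scores if s > 0)
    return avg, above_zero
```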


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
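The POS n-gram features above reduce to relative frequencies of tag sequences per chunk; a minimal sketch (any tagger's tag set can be used; Penn Treebank tags are assumed here for illustration):

```python
from collections import Counter

def pos_trigram_features(pos_tags):
    """Relative frequencies of POS-tag trigrams in a chunk,
    one scalar feature per observed trigram."""
    trigrams = list(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    n = len(trigrams)
    return {t: c / n for t, c in Counter(trigrams).items()}
```

Bigram and unigram variants are obtained the same way by zipping shorter windows.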


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS Frequency of the occurrences of pronouns in the chunk

PUNCTUATION , . ? : ; - ( ) [ ] ‘ ’ “ ”

1 The frequency of each punctuation mark, normalized by the number of tokens in the chunk
2 The non-normalized count n of each mark
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams
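The three punctuation frequency notions can be sketched as follows (the mark inventory passed in is an assumption for illustration; the slides use the full list above):

```python
from collections import Counter

def punctuation_features(tokens, marks=(",", ".", "?", ":", ";", "-", "(", ")", "[", "]")):
    """For each punctuation mark return a triple:
    (count / #tokens, raw count n, n / p),
    where p is the total number of punctuation tokens in the chunk."""
    counts = Counter(t for t in tokens if t in marks)
    p = sum(counts.values())
    feats = {}
    for m in marks:
        n = counts.get(m, 0)
        feats[m] = (n / len(tokens), n, n / p if p else 0.0)
    return feats
```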


Supervised Classification


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100

Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category: Simplification
Feature                Accuracy (%)
TTR (1)                72
TTR (2)                72
TTR (3)                76
Mean word rank (1)     69
Mean word rank (2)     77
N most frequent words  64
Mean word length       66
Syllable ratio         61
Lexical density        53
Mean sentence length   65

Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE') [figure omitted]

Supervised Classification Explicitation

RESULTS EXPLICITATION

Category: Explicitation
Feature               Accuracy (%)
Cohesive markers      81
Explicit naming       58
Single naming         56
Mean multiple naming  54

Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS NORMALIZATION

Category: Normalization
Feature        Accuracy (%)
Repetitions    55
Contractions   50
Average PMI    52
Threshold PMI  66

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features perform very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words, and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THE THRESHOLD, ACCORDING TO 'LANGUAGE') [figure omitted]

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE) [figure omitted]

Supervised Classification Interference

RESULTS INTERFERENCE

Category: Interference
Feature                     Accuracy (%)
POS unigrams                90
POS bigrams                 97
POS trigrams                98
Character unigrams          85
Character bigrams           98
Character trigrams          100
Prefixes and suffixes       80
Contextual function words   100
Positional token frequency  97

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams constitute an extremely cheap and efficient classifier

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND IN TEN Ts) [figure omitted]

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001)

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category: Interference
Feature                     Accuracy (%)
POS bigrams                 96
POS trigrams                96
Character bigrams           95
Character trigrams          96
Positional token frequency  93

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous
Feature                              Accuracy (%)
Function words                       96
Punctuation (1)                      81
Punctuation (2)                      85
Punctuation (3)                      80
Pronouns                             77
Ratio of passive forms to all verbs  65

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq O  Freq T  Ratio  LL  Weight
,      3.783   4.979   0.76  T   T1
(      0.42    0.72    0.58  T   T2
'      1.94    2.53    0.77  T   T3
)      0.40    0.72    0.56  T   T4
(?)    0.30    0.30    1.00  —   —
[      0.01    0.02    0.45  T   —
]      0.01    0.02    0.44  T   —
"      0.33    0.22    1.46  O   O7
(?)    0.22    0.17    1.25  O   O6
.      38.20   34.60   1.10  O   O5
:      1.20    1.17    1.03  —   O4
;      0.84    0.83    1.01  —   O3
?      1.57    1.11    1.41  O   O2
-      2.68    2.25    1.19  O   O1

(Two mark symbols, shown as (?), were lost in extraction.)

Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train      test (left-out corpus)  X-validation
EUR + HAN  63.8 (LIT)              94.0
EUR + LIT  64.1 (HAN)              92.9
HAN + LIT  59.8 (EUR)              96.0

Unsupervised Classification Unsupervised Classification

CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = √(JSD(PX || PCi))

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
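The labeling rule can be sketched as follows. The sketch assumes the four unigram models are already given as probability vectors over a shared vocabulary (the function names are mine):

```python
from math import log2, sqrt

def kl(p, q):
    # Kullback-Leibler divergence (base 2); assumes q is strictly positive
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence (Lin, 1991)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label the two clusters as O/T by Jensen-Shannon distance to the
    O-marker and T-marker models; returns the labels of (C1, C2)."""
    d = lambda x, c: sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"
```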


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                   EUR   HAN   LIT   TED
FW                                                                88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                 91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                 95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + coh. markers  94.1  91.0  79.2  88.6

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS [figures omitted]

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     –            92.2

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5

Applications for Machine Translation


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                          (translation model)  (language model)
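The argmax over the two components can be illustrated with a toy selection step. The tables below are hypothetical stand-ins for a real phrase table and n-gram LM:

```python
def best_translation(candidates, tm, lm):
    """Noisy-channel selection: among candidate target sentences E for a
    foreign sentence F, pick argmax_E P(F|E) * P(E).
    tm maps E -> P(F|E); lm maps E -> P(E) (toy tables for illustration)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])

# Both candidates are equally "faithful"; the language model prefers
# the fluent word order.
tm = {"the house": 0.6, "house the": 0.6}
lm = {"the house": 0.09, "house the": 0.0001}
```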


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translatedfrom other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1..N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)
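The perplexity formula above can be sketched directly. For simplicity the sketch uses a unigram LM, so the conditioning history drops out (an assumption of this sketch; the experiments use n-gram LMs up to 4-grams):

```python
def perplexity(lm, words):
    """Perplexity of a unigram language model on a reference sequence:
    the N-th root of the product of inverse token probabilities."""
    n = len(words)
    prod = 1.0
    for w in words:
        prod *= 1.0 / lm[w]  # lm[w] = P(w); history dropped for unigrams
    return prod ** (1.0 / n)
```

A uniform model over k equiprobable words has perplexity k, which is a handy sanity check.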


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50  93.00   69.36   66.47
O-EN        468.09  103.74  79.57   76.79
T-DE        443.14  88.48   64.99   62.07
T-FR        460.98  99.90   76.23   73.38
T-IT        465.89  102.31  78.50   75.67
T-NL        457.02  97.34   73.54   70.56

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05  99.04   75.60   72.68
O-EN        500.56  115.48  91.14   88.31
T-DE        486.78  108.50  84.39   81.41
T-FR        463.58  94.59   71.24   68.37
T-IT        476.05  102.69  79.23   76.36
T-NL        490.09  110.67  86.61   83.55

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99  88.46   67.35   64.40
O-EN        415.47  99.92   79.27   76.34
T-DE        404.64  95.22   73.73   70.85
T-FR        395.99  89.44   68.38   65.54
T-IT        384.55  81.90   60.85   57.91
T-NL        411.58  98.78   76.98   73.94

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89  90.73   69.05   66.08
O-EN        448.11  100.17  78.23   75.46
T-DE        437.68  93.67   71.54   68.57
T-FR        445.00  97.32   75.59   72.55
T-IT        448.11  100.19  78.06   75.19
T-NL        423.13  83.99   62.17   59.09

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation
Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50      7.16
T-FR        99.52       24.58
T-IT        112.71      14.58
T-NL        126.44      4.17

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         93.88       18.51
O-EN        115.20      —
T-DE        107.48      6.70
T-FR        88.96       22.77
T-IT        99.17       13.91
T-NL        110.72      3.89

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67       4.81
T-FR        34.75       14.46
T-IT        36.85       9.30
T-NL        39.44       2.91

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM    BLEU
MIX   21.43
O-EN  21.10
T-DE  21.90
T-FR  21.16
T-IT  21.29
T-NL  21.20

FR to EN
LM    BLEU
MIX   28.67
O-EN  27.98
T-DE  28.01
T-FR  29.14
T-IT  28.75
T-NL  28.11

IT to EN
LM    BLEU
MIX   25.41
O-EN  24.69
T-DE  24.62
T-FR  25.37
T-IT  25.96
T-NL  24.77

NL to EN
LM    BLEU
MIX   24.20
O-EN  23.40
T-DE  24.26
T-FR  23.56
T-IT  23.87
T-NL  24.52

Applications for Machine Translation Language Models

DOES SIZE MATTER? [figures omitted]

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34

The Power of Interference


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1 TEST ON L2)

train \ test  IT  FR  ES  DE  FI
IT            99  93  91  75  63
FR            90  98  90  83  70
ES            85  90  98  82  69
DE            73  74  74  98  70
FI            58  70  64  81  99

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE) [figure omitted]

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE) [figure omitted]

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT)) [figure omitted]

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                     Dist           Distw
Feature              EN T   FR T   EN T   FR T
POS-trigrams         1.02   1.16   1.07   1.34
positional tokens    1.10   1.15   1.30   1.37
function words       1.32   1.82   1.43   1.90
cohesive markers     1.95   1.95   2.15   2.03
feature combination  1.00   1.46   1.13   1.58
random tree: Dist 2.00, Distw 1.95

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

     ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA   80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI    3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE    2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER    1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN    2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA    2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN    2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR    1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA    2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL    0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR    4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4

Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 102 112

Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 103 112

Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 104 112

Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 105 112

Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 106 112

Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 107 112

Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 108 112

Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 109 112

Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 110 112

Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 111 112

Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwerea. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 112 112


Introduction Translationese

TRANSLATIONESE
WHY DOES IT MATTER?

Language models for statistical machine translation (Lembersky et al., 2011, 2012b)

Translation models for statistical machine translation (Kurokawa et al., 2009; Lembersky et al., 2012a, 2013)

Cleaning parallel corpora crawled from the Web (Eetemadi and Toutanova, 2014; Aharoni et al., 2014)

Understanding the properties of human translation (Ilisei et al., 2010; Ilisei and Inkpen, 2011; Ilisei, 2013; Volansky et al., 2015; Avner et al., 2016)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 7 112

Introduction Corpus-based Translation Studies

CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 8 112

Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-token ratio) than originals (Al-Shabab, 1996)

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa, 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa, 2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 9 112

Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark      O      T   Ratio   LL   Weight
,     37.83  49.79    0.76    T   T1
(      0.42   0.72    0.58    T   T2
'      1.94   2.53    0.77    T   T3
)      0.40   0.72    0.56    T   T4
·      0.30   0.30    1.00    —   —
[      0.01   0.02    0.45    T   —
]      0.01   0.02    0.44    T   —
"      0.33   0.22    1.46    O   O7
!      0.22   0.17    1.25    O   O6
.     38.20  34.60    1.10    O   O5
·      1.20   1.17    1.03    —   O4
·      0.84   0.83    1.01    —   O3
·      1.57   1.11    1.41    O   O2
-      2.68   2.25    1.19    O   O1

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 10 112

Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 11 112

Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 12 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL. (2012))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 13 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
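The generic setup above can be sketched with scikit-learn (an assumption for illustration; the experiments reported later used Weka's SVM). The two-class "documents" below are toy stand-ins for full-sized chunks:

```python
# Minimal sketch of supervised text classification: bag-of-words
# feature vectors and a linear SVM (toy data, hypothetical labels).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["the the of of", "the of the of",                   # class O (toy)
               "moreover besides thus", "thus moreover besides"]   # class T (toy)
train_labels = ["O", "O", "T", "T"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)  # one feature vector per instance
clf = LinearSVC().fit(X, train_labels)     # weights are assigned to features

# classify an 'unseen' test chunk
print(clf.predict(vectorizer.transform(["besides moreover thus thus"])))
```

Inspecting `clf.coef_` afterwards shows which features the classifier weights toward each class.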

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 14 112

Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION)
APPLICATIONS

Determine the gender/age of the author (Koppel et al., 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al., 2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 15 112

Identification of Translationese

IDENTIFYING TRANSLATIONESE
USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 16 112

Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn, 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using SVM with a default linear kernel
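The partitioning step can be sketched as follows; `chunk_sentences` is a hypothetical helper, and tokenization plus sentence splitting are assumed to have been done already:

```python
# Greedily fill chunks of approximately `size` tokens, always closing
# a chunk at a sentence boundary (sentences are token lists).
def chunk_sentences(sentences, size=2000):
    chunks, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= size:   # only close at a sentence boundary
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)     # trailing, possibly short, chunk
    return chunks

sents = [["a"] * 7] * 5            # five 7-token toy sentences
chunks = chunk_sentences(sents, size=10)
print([len(c) for c in chunks])    # [14, 14, 7]: full chunks end on boundaries
```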

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 17 112

Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 18 112

Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwerea, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 19 112

Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 20 112

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
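The three lexical-variety measures can be computed per chunk as a direct transcription of the formulas above:

```python
# The three TTR measures: V/N, log(V)/log(N), and 100·log(N)/(1 − V1/V).
from collections import Counter
from math import log

def ttr_measures(tokens):
    counts = Counter(tokens)
    V, N = len(counts), len(tokens)
    V1 = sum(1 for c in counts.values() if c == 1)  # types occurring once
    return (V / N,
            log(V) / log(N),
            100 * log(N) / (1 - V1 / V))

tokens = "the cat sat on the mat the end".split()  # 6 types, 8 tokens
v_n, logratio, bilog = ttr_measures(tokens)
print(round(v_n, 2))  # 6/8 = 0.75
```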

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 21 112

Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
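The explicit-naming feature, for instance, reduces to a ratio over POS tags; the Penn-Treebank-tagged sentence below is illustrative:

```python
# Explicit naming: ratio of personal pronouns (PRP) to proper nouns
# (NNP/NNPS) in a POS-tagged chunk.
def explicit_naming(tagged):
    prp = sum(1 for _, tag in tagged if tag == "PRP")
    nnp = sum(1 for _, tag in tagged if tag.startswith("NNP"))
    return prp / nnp if nnp else 0.0

tagged = [("He", "PRP"), ("met", "VBD"), ("Angela", "NNP"),
          ("Merkel", "NNP"), ("and", "CC"), ("she", "PRP"),
          ("agreed", "VBD")]
print(explicit_naming(tagged))  # 2 pronouns / 2 proper nouns = 1.0
```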

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 22 112

Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
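The two PMI features can be sketched as follows, estimating PMI(w1, w2) = log2(p(w1 w2) / (p(w1) p(w2))) from chunk-internal counts (a simplification; the original experiments may have estimated probabilities differently):

```python
# Average PMI and threshold PMI (count of bigram types with PMI > 0)
# over the bigrams of a token list.
from collections import Counter
from math import log2

def pmi_features(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    pmis = []
    for (w1, w2), c in bigrams.items():
        p12 = c / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        pmis.append(log2(p12 / (p1 * p2)))
    if not pmis:
        return 0.0, 0
    return (sum(pmis) / len(pmis),          # "average PMI"
            sum(1 for p in pmis if p > 0))  # "threshold PMI"

avg, above = pmi_features("a b a b".split())
print(above)  # both bigram types here have positive PMI
```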

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 23 112

Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 24 112

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets 〈w1, w2, w3〉, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
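As an illustration of the POS n-gram features, a chunk can be represented by the normalized frequencies of its tag trigrams (toy tag sequence below):

```python
# Represent a chunk by normalized frequencies of its POS-tag trigrams.
from collections import Counter

def pos_trigram_freqs(tags):
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(trigrams.values())
    return {t: c / total for t, c in trigrams.items()}

# e.g. MD+VB+VBN covers "must be taken", "should be given", ...
tags = ["MD", "VB", "VBN", ".", "MD", "VB", "VBN"]
freqs = pos_trigram_freqs(tags)
print(freqs[("MD", "VB", "VBN")])  # 2 of the 5 trigrams = 0.4
```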

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 25 112

Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronoun occurrences in the chunk

PUNCTUATION: marks such as - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 26 112

Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 27 112

Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)

Sanity    Token unigrams  100
          Token bigrams   100

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 28 112

Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)

Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 29 112

Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 30 112

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 31 112

Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)

Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 32 112

Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 33 112

Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)

Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 34 112

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 35 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 36 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 37 112

Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)

Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 38 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up on both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams constitute an extremely cheap and efficient classifier

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 39 112

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 40 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001)

This is, in fact, standardization rather than interference

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 41 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 42 112

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)

Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 43 112

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)

Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 44 112

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  O/T ratio  LL  Weight
,     3.783    4.979    0.76       T   T1
(     0.042    0.072    0.58       T   T2
'     0.194    0.253    0.77       T   T3
)     0.040    0.072    0.56       T   T4
(?)   0.030    0.030    1.00       —   —
[     0.001    0.002    0.45       T   —
]     0.001    0.002    0.44       T   —
"     0.033    0.022    1.46       O   O7
(?)   0.022    0.017    1.25       O   O6
.     3.820    3.460    1.10       O   O5
(?)   0.120    0.117    1.03       —   O4
(?)   0.084    0.083    1.01       —   O3
(?)   0.157    0.111    1.41       O   O2
-     0.268    0.225    1.19       O   O1

((?) marks symbols lost in the source)


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)
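The additively smoothed unigram estimates above can be sketched in Python. This is a minimal illustration only: the toy vocabulary, the marker sample, and the value of ε are invented, and |Om| is read as the token count of the marker corpus.

```python
from collections import Counter

def unigram_lm(corpus_tokens, vocab, eps=0.001):
    """Additively smoothed unigram model: p(w) = (tf(w) + eps) / (|corpus| + eps * |V|)."""
    tf = Counter(corpus_tokens)
    denom = len(corpus_tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

# Hypothetical toy data (not from the actual experiments)
vocab = {"the", "however", "moreover", "of"}
o_markers = ["the", "of", "the"]  # toy sample of O-marker occurrences
p_o = unigram_lm(o_markers, vocab)

# the smoothed probabilities form a proper distribution over the vocabulary
assert abs(sum(p_o.values()) - 1.0) < 1e-9
```

Smoothing with ε guarantees that every vocabulary word receives non-zero probability, which matters later when divergences between distributions are computed.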

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = √JSD(PX ‖ PCi)


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label
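The labeling rule can be sketched as follows. This is a minimal illustration with invented toy distributions over a three-word vocabulary; JSD is computed from scratch over fixed-length probability vectors.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_O, p_T, p_C1, p_C2):
    """Label C1/C2 as O/T by the product-of-distances rule from the slide."""
    d = lambda x, c: math.sqrt(jsd(x, c))
    if d(p_O, p_C1) * d(p_T, p_C2) < d(p_O, p_C2) * d(p_T, p_C1):
        return "O", "T"
    return "T", "O"

# Hypothetical marker distributions; C1 resembles O, C2 resembles T
p_O = [0.7, 0.2, 0.1]
p_T = [0.1, 0.2, 0.7]
assert label_clusters(p_O, p_T, [0.65, 0.25, 0.1], [0.15, 0.2, 0.65]) == ("O", "T")
```

Because the rule compares products of distances, it uses both clusters' proximities jointly rather than labeling each cluster independently.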

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                    EUR   HAN   LIT   TED
FW                                                 88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                  91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                  95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers               94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

corpus mix                       EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus mix  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat                 92.5     60.7     77.5         66.8
Two-phase            91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts.


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / PLM(wi | w1 … wi−1) )^{1/N}
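The perplexity formula can be sketched in Python, computed in log space for numerical stability. The uniform toy model below is an assumption for illustration, not one of the LMs evaluated in the experiments.

```python
import math

def perplexity(lm_prob, sentence):
    """PP = (prod_i 1 / P(w_i | history))^(1/N), evaluated in log2 space."""
    logsum = 0.0
    for i, w in enumerate(sentence):
        logsum += -math.log2(lm_prob(sentence[:i], w))
    return 2 ** (logsum / len(sentence))

# A uniform model over a 4-word vocabulary has perplexity exactly 4,
# regardless of the sentence it is evaluated on.
uniform = lambda history, w: 0.25
assert abs(perplexity(uniform, ["a", "b", "c", "d", "a"]) - 4.0) < 1e-9
```

Lower perplexity means the model finds the reference corpus less surprising, which is why it is used below to compare LMs built from original vs. translated texts.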

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50  93.00   69.36   66.47
O-EN        468.09  103.74  79.57   76.79
T-DE        443.14  88.48   64.99   62.07
T-FR        460.98  99.90   76.23   73.38
T-IT        465.89  102.31  78.50   75.67
T-NL        457.02  97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05  99.04   75.60   72.68
O-EN        500.56  115.48  91.14   88.31
T-DE        486.78  108.50  84.39   81.41
T-FR        463.58  94.59   71.24   68.37
T-IT        476.05  102.69  79.23   76.36
T-NL        490.09  110.67  86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99  88.46   67.35   64.40
O-EN        415.47  99.92   79.27   76.34
T-DE        404.64  95.22   73.73   70.85
T-FR        395.99  89.44   68.38   65.54
T-IT        384.55  81.90   60.85   57.91
T-NL        411.58  98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89  90.73   69.05   66.08
O-EN        448.11  100.17  78.23   75.46
T-DE        437.68  93.67   71.54   68.57
T-FR        445.00  97.32   75.59   72.55
T-IT        448.11  100.19  78.06   75.19
T-NL        423.13  83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50      7.16
T-FR        99.52       24.58
T-IT        112.71      14.58
T-NL        126.44      4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         93.88       18.51
O-EN        115.20      —
T-DE        107.48      6.70
T-FR        88.96       22.77
T-IT        99.17       13.91
T-NL        110.72      3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67       4.81
T-FR        34.75       14.46
T-IT        36.85       9.30
T-NL        39.44       2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores with different LMs:

LM \ task  DE→EN  FR→EN  IT→EN  NL→EN
MIX        21.43  28.67  25.41  24.20
O-EN       21.10  27.98  24.69  23.40
T-DE       21.90  28.01  24.62  24.26
T-FR       21.16  29.14  25.37  23.56
T-IT       21.29  28.75  25.96  23.87
T-NL       21.20  28.11  24.77  24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase-table interpolation, using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
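Phrase-table interpolation can be sketched as a linear mixture of the two tables' translation probabilities. In Sennrich's (2012) method the interpolation weight is tuned by perplexity minimization on held-out data; here it is fixed, and all phrase pairs and probabilities are invented for illustration.

```python
def interpolate(pt_direct, pt_reverse, lam):
    """Linearly interpolate two phrase tables' translation probabilities.
    lam would normally be tuned (e.g. by perplexity minimization);
    here it is a fixed illustrative value."""
    pairs = set(pt_direct) | set(pt_reverse)
    return {p: lam * pt_direct.get(p, 0.0) + (1 - lam) * pt_reverse.get(p, 0.0)
            for p in pairs}

# Hypothetical phrase tables estimated from S->T and T->S subsets
pt_st = {("maison", "house"): 0.9, ("maison", "home"): 0.1}
pt_ts = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
mixed = interpolate(pt_st, pt_ts, 0.5)
assert abs(mixed[("maison", "house")] - 0.8) < 1e-9
```

With the weight pushed toward the subset translated in the task's direction, the mixture preserves that subset's distributional fingerprints while still covering phrase pairs seen only in the other subset.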


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test  IT  FR  ES  DE  FI
IT            99  93  91  75  63
FR            90  98  90  83  70
ES            85  90  98  82  69
DE            73  74  74  98  70
FI            58  70  64  81  99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                     Dist           Distw
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA     80    0    2    1    3    4    1    0    4    2    3    80.8   80.0  80.4
CHI      3   80    0    1    1    0    6    7    1    0    1    88.9   80.0  84.2
FRE      2    2   81    5    1    2    1    0    3    0    3    86.2   81.0  83.5
GER      1    1    1   93    0    0    0    1    1    0    2    87.7   93.0  90.3
HIN      2    0    0    1   77    1    0    1    5    9    4    74.8   77.0  75.9
ITA      2    0    3    1    1   87    1    0    3    0    2    82.1   87.0  84.5
JPN      2    1    1    2    0    1   87    5    0    0    1    78.4   87.0  82.5
KOR      1    5    2    0    1    0    9   81    1    0    0    80.2   81.0  80.6
SPA      2    0    2    0    1    8    2    1   78    1    5    77.2   78.0  77.6
TEL      0    1    0    0   18    1    2    1    1   73    3    85.9   73.0  78.9
TUR      4    0    2    2    0    2    2    4    4    0   80    76.9   80.0  78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. URL http://www.aclweb.org/anthology/P07-2045

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-1132

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. URL http://www.aclweb.org/anthology/D11-1034


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


CORPUS-BASED TRANSLATION STUDIES

EXAMPLE (SUN AND SHREVE (2013))


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab 1996).

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa 1998).

Corpus-based evidence for the simplification hypothesis (Laviosa 2002).


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

              Frequency
Mark        O        T    Ratio   LL   Weight
,       37.83    49.79     0.76    T   T1
(        0.42     0.72     0.58    T   T2
'        1.94     2.53     0.77    T   T3
)        0.40     0.72     0.56    T   T4
         0.30     0.30     1.00    –   –
[        0.01     0.02     0.45    T   –
]        0.01     0.02     0.44    T   –
"        0.33     0.22     1.46    O   O7
         0.22     0.17     1.25    O   O6
.       38.20    34.60     1.10    O   O5
         1.20     1.17     1.03    –   O4
         0.84     0.83     1.01    –   O3
         1.57     1.11     1.41    O   O2
-        2.68     2.25     1.19    O   O1


Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))


TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes.

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances.

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other.

The trained classifier is tested on an 'unseen' text: the test set.

Classifiers assign weights to the features.
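As a concrete illustration of this pipeline, here is a minimal, self-contained sketch: chunks are mapped to numeric feature vectors (function-word frequencies), a classifier is trained, and an unseen text is classified. The data, the tiny function-word list, and the nearest-centroid classifier (standing in for the SVM used in the actual experiments) are all illustrative assumptions.

```python
# Hypothetical function-word list; the real experiments use a much larger one.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]

def features(chunk):
    """Represent a chunk as normalized function-word frequencies."""
    tokens = chunk.lower().split()
    return [tokens.count(w) / len(tokens) for w in FUNCTION_WORDS]

def train_centroids(labeled_chunks):
    """Average the feature vectors of each class (a stand-in for an SVM)."""
    sums, counts = {}, {}
    for text, label in labeled_chunks:
        vec = features(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(text, centroids):
    """Assign the label of the nearest class centroid."""
    vec = features(text)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))
```

The point is only the shape of the workflow: feature extraction, training on labeled instances, prediction on unseen text.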


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al. 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al. 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al. 2009), using SVM with a default linear kernel
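The chunking step can be sketched as follows: accumulate sentences until the running token count reaches the threshold, then close the chunk at that sentence boundary. This is an illustrative reading of the setup, not the project's actual preprocessing script.

```python
def chunk_corpus(sentences, chunk_size=2000):
    """Partition a corpus (a list of token lists, one per sentence) into
    chunks of approximately `chunk_size` tokens, each ending on a
    sentence boundary."""
    chunks, current, n_tokens = [], [], 0
    for sent in sentences:
        current.append(sent)
        n_tokens += len(sent)
        if n_tokens >= chunk_size:   # close the chunk at this boundary
            chunks.append(current)
            current, n_tokens = [], 0
    if current:                      # keep the shorter trailing chunk
        chunks.append(current)
    return chunks
```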


Translation Studies Hypotheses


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
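The three lexical-variety measures can be computed directly from their definitions. A sketch (note that the third measure is undefined for a chunk in which every type is a hapax, i.e., V1 = V):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """Return the three TTR measures for one chunk of tokens."""
    counts = Counter(tokens)
    V = len(counts)                                   # number of types
    N = len(tokens)                                   # number of tokens
    V1 = sum(1 for c in counts.values() if c == 1)    # hapax legomena
    ttr1 = V / N
    ttr2 = math.log(V) / math.log(N)
    ttr3 = 100 * math.log(N) / (1 - V1 / V)           # undefined if V1 == V
    return ttr1, ttr2, ttr3
```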


EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu (German: 'the Israeli Prime Minister Benjamin Netanjahu')
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
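Two of these features can be sketched over a POS-tagged chunk, represented here as (token, tag) pairs. The tag names (PRP for personal pronoun, NNP for proper noun) and the short marker list are illustrative assumptions:

```python
# A short, hypothetical subset of the cohesive-marker list.
COHESIVE_MARKERS = ["because", "but", "hence", "therefore", "moreover"]

def explicit_naming(tagged):
    """Ratio of personal pronouns to proper nouns in a tagged chunk."""
    pronouns = sum(1 for _, t in tagged if t == "PRP")
    propers = sum(1 for _, t in tagged if t == "NNP")
    return pronouns / propers if propers else 0.0

def cohesive_marker_freq(tagged):
    """Frequency of cohesive markers per token."""
    words = [w.lower() for w, _ in tagged]
    return sum(words.count(m) for m in COHESIVE_MARKERS) / len(words)
```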


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? (German; "ein kaltes Herz", 'a cold heart', is repeated)

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? (French translation; the repetition is avoided: "un cœur insensible" / "un cœur de glace")

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI: Number of bigrams in the chunk whose PMI is above 0
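The two PMI-based features can be sketched as follows. For illustration the probabilities are estimated from the chunk itself; PMI(w1, w2) = log2 p(w1, w2) / (p(w1) p(w2)), following Church and Hanks (1990):

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Return (average PMI over bigram types, number of bigrams with PMI > 0)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1

    def pmi(w1, w2):
        p_joint = bigrams[(w1, w2)] / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        return math.log2(p_joint / (p1 * p2))

    scores = [pmi(w1, w2) for (w1, w2) in bigrams]
    avg_pmi = sum(scores) / len(scores)
    above_threshold = sum(1 for s in scores if s > 0)
    return avg_pmi, above_threshold
```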


INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
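Two of the cheaper feature extractors can be sketched directly (illustrative code; only three of the five sentence positions are shown, and sentences are assumed to have at least two tokens):

```python
from collections import Counter

def char_trigrams(text):
    """Counts of character trigrams, a shallow proxy for morphology."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def positional_tokens(sentences):
    """How often each token appears sentence-initially, second, or last.
    `sentences` is a list of token lists with at least two tokens each."""
    counts = Counter()
    for sent in sentences:
        counts[("first", sent[0])] += 1
        counts[("second", sent[1])] += 1
        counts[("last", sent[-1])] += 1
    return counts
```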


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: marks such as - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


Supervised Classification


RESULTS SANITY CHECK

Category Feature Accuracy (%)

Sanity

Token unigrams 100
Token bigrams 100


RESULTS SIMPLIFICATION

Category Feature Accuracy (%)

Simplification

TTR (1) 72
TTR (2) 72
TTR (3) 76
Mean word rank (1) 69
Mean word rank (2) 77
N most frequent words 64
Mean word length 66
Syllable ratio 61
Lexical density 53
Mean sentence length 65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


RESULTS EXPLICITATION

Category Feature Accuracy (%)

Explicitation

Cohesive markers 81
Explicit naming 58
Single naming 56
Mean multiple naming 54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


RESULTS NORMALIZATION

Category Feature Accuracy (%)

Normalization

Repetitions 55
Contractions 50
Average PMI 52
Threshold PMI 66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE A THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


RESULTS INTERFERENCE

Category Feature Accuracy (%)

Interference

POS unigrams 90
POS bigrams 97
POS trigrams 98
Character unigrams 85
Character bigrams 98
Character trigrams 100
Prefixes and suffixes 80
Contextual function words 100
Positional token frequency 97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is in fact standardization rather than interference.


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category Feature Accuracy (%)

Interference

POS bigrams 96
POS trigrams 96
Character bigrams 95
Character trigrams 96
Positional token frequency 93


RESULTS MISCELLANEOUS

Category Feature Accuracy (%)

Miscellaneous

Function words 96
Punctuation (1) 81
Punctuation (2) 85
Punctuation (3) 80
Pronouns 77
Ratio of passive forms to all verbs 65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

              Frequency
Mark        O        T    Ratio   LL   Weight
,       37.83    49.79     0.76    T   T1
(        0.42     0.72     0.58    T   T2
'        1.94     2.53     0.77    T   T3
)        0.40     0.72     0.56    T   T4
         0.30     0.30     1.00    –   –
[        0.01     0.02     0.45    T   –
]        0.01     0.02     0.44    T   –
"        0.33     0.22     1.46    O   O7
         0.22     0.17     1.25    O   O6
.       38.20    34.60     1.10    O   O5
         1.20     1.17     1.03    –   O4
         0.84     0.83     1.01    –   O3
         1.57     1.11     1.41    O   O2
-        2.68     2.25     1.19    O   O1


Unsupervised Classification


SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus     EUR    HAN    LIT    TED

FW                   96.3   98.1   97.3   97.7
char-trigrams        98.8   97.1   99.5   100.0
POS-trigrams         98.5   97.2   98.7   92.0
contextual FW        95.2   96.8   94.1   86.3
cohesive markers     83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR            –      60.8   56.2   96.3
HAN            59.7   –      58.7   98.1
LIT            64.3   61.5   –      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR + HAN      –      –      63.8   94.0
EUR + LIT      –      64.1   –      92.9
HAN + LIT      59.8   –      –      96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus     EUR    HAN    LIT    TED

FW                   88.6   88.9   78.8   87.5
char-trigrams        72.1   63.8   70.3   78.6
POS-trigrams         96.9   76.0   70.7   76.1
contextual FW        92.9   93.2   68.2   67.0
cohesive markers     63.1   81.2   67.1   63.0


CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = √(JSD(PX ‖ PCi))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O" if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1), "T" otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting over several feature sets.
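The labeling procedure can be sketched end-to-end: smoothed unigram models over a marker vocabulary, the JSD-based distance, and the product criterion above. The counts below are toy data; the actual experiments use function-word distributions estimated from 600 chunks per corpus.

```python
import math

def smoothed_lm(counts, vocab, eps=0.01):
    """Add-epsilon-smoothed unigram model over a fixed vocabulary."""
    total = sum(counts.get(w, 0) for w in vocab)
    return {w: (counts.get(w, 0) + eps) / (total + eps * len(vocab))
            for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions over one vocab."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(p, q):
    return math.sqrt(jsd(p, q))

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Label clusters (c1, c2) as O/T; all arguments are word->count dicts."""
    O, T = smoothed_lm(o_markers, vocab), smoothed_lm(t_markers, vocab)
    P1, P2 = smoothed_lm(c1, vocab), smoothed_lm(c2, vocab)
    if d_js(O, P1) * d_js(T, P2) < d_js(O, P2) * d_js(T, P1):
        return "O", "T"
    return "T", "O"
```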


CLUSTERING CONSENSUS

method \ corpus                                        EUR    HAN    LIT    TED

FW                                                     88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                      91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                      95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers                   94.1   91.0   79.2   88.6


SENSITIVITY ANALYSIS


MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpora                  EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      –             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 · · · fm into an English sentence E = e1 · · · el.

The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)
                         [translation model] [language model]
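A toy instance of this decision rule: the decoder searches for the candidate E maximizing P(F | E) × P(E). All probabilities below are invented for illustration; a real decoder scores phrase tables and n-gram models rather than whole-sentence lookup dictionaries.

```python
def best_translation(candidates, tm, lm):
    """candidates: list of E; tm[E] = P(F|E); lm[E] = P(E)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])

candidates = ["house the", "the house"]
tm = {"house the": 0.2, "the house": 0.2}    # both equally faithful to F
lm = {"house the": 0.01, "the house": 0.3}   # fluency prefers "the house"
```

Here the language model breaks the tie between two equally faithful candidates, which is exactly the division of labor the decomposition expresses.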


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al. 2002)


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM, w1 w2 · · · wN) = ( ∏_{i=1}^{N} 1 / PLM(wi | w1 · · · wi−1) )^{1/N}

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al. 2002).
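For a unigram model the perplexity formula reduces to a short computation, done here in log space for numerical stability. This is a sketch of the definition, not the computation performed by an actual LM toolkit:

```python
import math

def perplexity(lm, tokens):
    """PP(LM, w1..wN) for a unigram model lm: word -> probability."""
    log_sum = sum(-math.log(lm[w]) for w in tokens)   # sum of -log P(wi)
    return math.exp(log_sum / len(tokens))            # geometric mean, inverted
```

A uniform model over two words assigns every token probability 0.5, so its perplexity on any text over that vocabulary is exactly 2: lower perplexity means the model fits the reference corpus better.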


PERPLEXITY RESULTS

German to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian to English translations

Orig. lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99   88.46   67.35   64.40
O-EN         415.47   99.92   79.27   76.34
T-DE         404.64   95.22   73.73   70.85
T-FR         395.99   89.44   68.38   65.54
T-IT         384.55   81.90   60.85   57.91
T-NL         411.58   98.78   76.98   73.94


PERPLEXITY RESULTS

Dutch to English translations

Orig. lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89   90.73   69.05   66.08
O-EN         448.11  100.17   78.23   75.46
T-DE         437.68   93.67   71.54   68.57
T-FR         445.00   97.32   75.59   72.55
T-IT         448.11  100.19   78.06   75.19
T-NL         423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
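The four abstraction schemes can be sketched over a POS-tagged chunk; the tag names below (`PUNCT`, `NE`, `NN*`) are illustrative stand-ins, not the tag set actually used in the experiments:

```python
def abstract(tagged, mode):
    """Abstract a chunk of (token, POS) pairs for LM training."""
    if mode == "no-punctuation":
        return [w for w, t in tagged if t != "PUNCT"]
    if mode == "no-named-entities":
        # replace each named entity with a generic placeholder
        return ["NE" if t == "NE" else w for w, t in tagged]
    if mode == "no-nouns":
        # replace nouns with their POS tag, keep everything else
        return [t if t.startswith("NN") else w for w, t in tagged]
    if mode == "pos-only":
        return [t for _, t in tagged]
    raise ValueError(mode)

tagged = [("Mary", "NE"), ("runs", "VB"), ("home", "NN"), (".", "PUNCT")]
print(abstract(tagged, "pos-only"))  # → ['NE', 'VB', 'NN', 'PUNCT']
```

Each scheme removes progressively more lexical material, testing how much of the translated/original distinction survives abstraction.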


ABSTRACTION RESULTS

No Punctuation

Orig. lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94      —
T-DE         122.50       7.16
T-FR          99.52      24.58
T-IT         112.71      14.58
T-NL         126.44       4.17


ABSTRACTION RESULTS

Named-Entity Abstraction

Orig. lang.  Perplexity  Improvement (%)
MIX           93.88      18.51
O-EN         115.20      —
T-DE         107.48       6.70
T-FR          88.96      22.77
T-IT          99.17      13.91
T-NL         110.72       3.89


ABSTRACTION RESULTS

Noun Abstraction

Orig. lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62       —
T-DE         38.67        4.81
T-FR         34.75       14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


ABSTRACTION RESULTS

Part-of-Speech Abstraction

Orig. lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20        —
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T   T → S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S → T   T → S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S → T   T → S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to Moses.
Phrase-table interpolation using perplexity minimization (Sennrich, 2012).
Add a feature in the phrase table indicating the direction of translation.
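The perplexity-minimization idea can be illustrated in miniature: choose the interpolation weight for two probability tables that maximizes the likelihood (equivalently, minimizes the perplexity) of a held-out set. This toy sketch works on plain probability dictionaries and a grid search; Sennrich (2012) does this per phrase-table feature, with proper optimization:

```python
import math

def tune_lambda(p1, p2, dev, grid=11):
    """Grid-search the weight of p1 in lam*p1 + (1-lam)*p2 that
    maximizes log-likelihood (minimizes perplexity) on `dev` events."""
    best_lam, best_ll = 0.0, float("-inf")
    for i in range(grid):
        lam = i / (grid - 1)
        # small floor for events unseen in a table
        ll = sum(math.log(lam * p1.get(x, 1e-9) + (1 - lam) * p2.get(x, 1e-9))
                 for x in dev)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

# the dev set looks more like p1, so the weight leans toward it
p1, p2 = {"a": 0.8, "b": 0.2}, {"a": 0.2, "b": 0.8}
print(tune_lambda(p1, p2, ["a", "a", "b"]))  # → 0.8
```

The same principle lets a mixed-direction bi-text be weighted toward the subset whose translation direction matches the task.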


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System      MIX     MIX-EO  MIX-FO
S → T       35.21   35.21   35.73
UNION       35.27   35.36   35.94
PPLMIN-1    35.46   35.59   36.26
PPLMIN-2    35.75   35.65   36.20
CrEnt       35.54   35.45   36.75
PplRatio    35.59   35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System      MIX     MIX-FO  MIX-EO
S → T       29.15   29.15   29.94
UNION       29.27   29.44   30.01
PPLMIN-1    29.64   29.94   29.65
PPLMIN-2    29.50   30.45   30.12
CrEnt       29.47   29.45   30.44
PplRatio    29.65   29.62   30.34


The Power of Interference


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist            DistW
Feature              EN T    FR T    EN T    FR T
POS-trigrams         1.02    1.16    1.07    1.34
positional tokens    1.10    1.15    1.30    1.37
function words       1.32    1.82    1.43    1.90
cohesive markers     1.95    1.95    2.15    2.03
feature combination  1.00    1.46    1.13    1.58
random tree          2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated by the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best-performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

Translated texts exhibit lower lexical variety (type-to-token ratio) than originals (Al-Shabab, 1996)

Their mean sentence length and lexical density (ratio of content to non-content words) are lower (Laviosa, 1998)

Corpus-based evidence for the simplification hypothesis (Laviosa, 2002)


COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,       37.83     49.79    0.76    T    T1
(        0.42      0.72    0.58    T    T2
'        1.94      2.53    0.77    T    T3
)        0.40      0.72    0.56    T    T4
         0.30      0.30    1.00    —    —
[        0.01      0.02    0.45    T    —
]        0.01      0.02    0.44    T    —
"        0.33      0.22    1.46    O    O7
         0.22      0.17    1.25    O    O6
.       38.20     34.60    1.10    O    O5
         1.20      1.17    1.03    —    O4
         0.84      0.83    1.01    —    O3
         1.57      1.11    1.41    O    O2
-        2.68      2.25    1.19    O    O1


Identification of Translationese


METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL. (2012))


TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al., 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al., 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn, 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks ofapproximately 2000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using SVM with a default linear kernel
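The chunking step described above can be sketched as follows (a sketch; `sentences` is assumed to be a list of already-tokenized sentences):

```python
def chunk(sentences, size=2000):
    """Greedily pack tokenized sentences into chunks of roughly `size`
    tokens, always ending a chunk on a sentence boundary."""
    chunks, current, n = [], [], 0
    for sent in sentences:
        current.append(sent)
        n += len(sent)
        if n >= size:
            chunks.append([tok for s in current for tok in s])
            current, n = [], 0
    if current:  # remainder: one final, smaller chunk
        chunks.append([tok for s in current for tok in s])
    return chunks

sents = [["a", "b", "c"], ["d", "e"], ["f"]]
print([len(c) for c in chunk(sents, size=4)])  # → [5, 1]
```

Ending each chunk on a sentence boundary keeps every feature (sentence length, POS n-grams, etc.) well-defined within a chunk.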


Translation Studies Hypotheses


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N,  log(V)/log(N),  100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences delimited by consonants or spaces in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
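The three lexical-variety measures follow directly from their definitions; a sketch over a tokenized chunk (note that the third, "corrected" TTR is undefined when every type occurs exactly once):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR variants: V/N, log V / log N,
    and 100 * log(N) / (1 - V1/V)."""
    counts = Counter(tokens)
    V, N = len(counts), len(tokens)
    V1 = sum(1 for c in counts.values() if c == 1)  # hapax types
    return (V / N,
            math.log(V) / math.log(N),
            100 * math.log(N) / (1 - V1 / V))

ttr, log_ttr, corrected = lexical_variety("a b b c c c".split())
print(round(ttr, 2))  # → 0.5
```

The logarithmic and corrected variants are less sensitive to chunk length than plain V/N, which is why all three are used side by side.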


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING The frequency of proper nouns consisting of a singletoken not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING The average length (in tokens) of propernouns

COHESIVE MARKERS The frequencies of several cohesive markers(because but hence in fact therefore )


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS, BEN-ARI (1998))

O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
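The two PMI-based features can be sketched in a few lines. This is an illustrative helper rather than the authors' implementation; here, as an assumption to keep it self-contained, PMI is estimated from the chunk's own counts:

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI and threshold PMI (count of bigrams with PMI > 0)."""
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = Counter(tokens), Counter(bigrams)
    n_uni, n_bi = len(tokens), len(bigrams)

    def pmi(w1, w2):
        # PMI = log( p(w1,w2) / (p(w1) * p(w2)) )
        p_joint = bi[(w1, w2)] / n_bi
        return math.log(p_joint / ((uni[w1] / n_uni) * (uni[w2] / n_uni)))

    scores = [pmi(w1, w2) for (w1, w2) in bigrams]
    return {
        "average_pmi": sum(scores) / len(scores),
        "threshold_pmi": sum(1 for s in scores if s > 0),
    }

pmi_feats = pmi_features("a b a b".split())
```

Highly associated pairs (like stand trial) get large positive PMI, so original texts, with more fixed expressions, score higher on the threshold feature.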


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)

POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate and last positions in a sentence
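Two of these feature families can be sketched in a few lines (an illustrative Python helper, not the authors' code):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts: a shallow proxy for morphology
    (affixes like -ion, -ble) and for short function words."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Counts of tokens occurring in the first, second, antepenultimate,
    penultimate and last positions of each sentence."""
    feats = Counter()
    for sent in sentences:
        if len(sent) < 5:  # need five distinct positions
            continue
        for name, idx in [("first", 0), ("second", 1),
                          ("antepenultimate", -3), ("penultimate", -2),
                          ("last", -1)]:
            feats[(name, sent[idx])] += 1
    return feats

tri = char_ngrams("stand firm")
pos = positional_tokens([["but", "we", "must", "learn", "now"]])
```

The counts would then be normalized per chunk and used as feature vectors.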


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronoun occurrences in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


RESULTS: SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS: SIMPLIFICATION

Category        Feature                 Accuracy (%)
Simplification  TTR (1)                 72
                TTR (2)                 72
                TTR (3)                 76
                Mean word rank (1)      69
                Mean word rank (2)      77
                N most frequent words   64
                Mean word length        66
                Syllable ratio          61
                Lexical density         53
                Mean sentence length    65


ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS: EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


ANALYSIS: EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS: NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS: NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011); they pick up both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used
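To illustrate how such features feed a classifier, here is a toy sketch with made-up tag sequences and a simple nearest-centroid rule standing in for whatever supervised learner is actually used; it is not the experimental setup, only the shape of it:

```python
from collections import Counter

def pos_trigrams(tags):
    """POS-trigram counts for one chunk (here: one toy tag sequence)."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def centroid(vectors):
    """Mean feature vector of a list of Counter vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}

def classify(vec, centroids):
    """Assign the label of the nearest centroid (squared Euclidean)."""
    def dist(v, c):
        keys = set(v) | set(c)
        return sum((v.get(k, 0) - c.get(k, 0)) ** 2 for k in keys)
    return min(centroids, key=lambda label: dist(vec, centroids[label]))

# Hypothetical training chunks: O-like vs. T-like tag sequences.
cents = {
    "O": centroid([pos_trigrams(["DT", "NN", "VBD", "IN", "DT", "NN"])]),
    "T": centroid([pos_trigrams(["MD", "VB", "VBN", "IN", "DT", "NN"])]),
}
pred = classify(pos_trigrams(["MD", "VB", "VBN"]), cents)
```

A chunk dominated by the MD+VB+VBN pattern lands closer to the T centroid, mirroring the observation above.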


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001)

This is, in fact, standardization rather than interference


ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

       Frequency
Mark   O       T       Ratio   LL   Weight
,      37.83   49.79   0.76    T    T1
(      0.42    0.72    0.58    T    T2
’      1.94    2.53    0.77    T    T3
)      0.40    0.72    0.56    T    T4
;      0.30    0.30    1.00    —    —
[      0.01    0.02    0.45    T    —
]      0.01    0.02    0.44    T    —
”      0.33    0.22    1.46    O    O7
!      0.22    0.17    1.25    O    O6
.      38.20   34.60   1.10    O    O5
?      1.20    1.17    1.03    —    O4
:      0.84    0.83    1.01    —    O3
‘      1.57    1.11    1.41    O    O2
-      2.68    2.25    1.19    O    O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modalities (written vs. spoken), registers, genres, dates, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus   EUR    HAN    LIT    TED
FW                 96.3   98.1   97.3   97.7
char-trigrams      98.8   97.1   99.5   100.0
POS-trigrams       98.5   97.2   98.7   92.0
contextual FW      95.2   96.8   94.1   86.3
cohesive markers   83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR            –      60.8   56.2   96.3
HAN            59.7   –      58.7   98.1
LIT            64.3   61.5   –      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN      –      –      63.8   94.0
EUR + LIT      –      64.1   –      92.9
HAN + LIT      59.8   –      –      96.0


CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus   EUR    HAN    LIT    TED
FW                 88.6   88.9   78.8   87.5
char-trigrams      72.1   63.8   70.3   78.6
POS-trigrams       96.9   76.0   70.7   76.1
contextual FW      92.9   93.2   68.2   67.0
cohesive markers   63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)        p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2 × √JSD(PX ‖ PCi)


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O" if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1), and "T" otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
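The labeling rule above can be sketched as follows. This is an assumed implementation: the smoothing value and the toy marker lists and clusters are illustrative, not the experimental ones:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.001):
    """Smoothed unigram model: p(w) = (tf(w) + eps) / (len + eps * |V|)."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions over one vocab."""
    m = {w: (p[w] + q[w]) / 2 for w in p}
    def kl(a, b):
        return sum(a[w] * math.log(a[w] / b[w]) for w in a)
    return (kl(p, m) + kl(q, m)) / 2

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Label C1/C2 as O/T by comparing products of JSD-based distances."""
    def d(x, c):
        return 2 * math.sqrt(jsd(unigram_lm(x, vocab), unigram_lm(c, vocab)))
    if d(o_markers, c1) * d(t_markers, c2) < d(o_markers, c2) * d(t_markers, c1):
        return "O", "T"
    return "T", "O"

labels = label_clusters(
    ["but", "but", "the"],                # toy O-markers
    ["moreover", "therefore", "moreover"],  # toy T-markers
    ["moreover", "therefore"],            # cluster C1 (T-like)
    ["but", "the", "but"],                # cluster C2 (O-like)
    vocab=["but", "the", "moreover", "therefore"],
)
```

Because the rule uses both clusters' proximities at once, a cluster is labeled O only when that assignment also pushes the other cluster toward T.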


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                     EUR    HAN    LIT    TED
FW                                                                  88.6   88.9   78.8   87.5
FW, char-trigrams, POS-trigrams                                     91.1   86.2   78.2   90.9
FW, POS-trigrams, contextual FW                                     95.8   89.8   72.3   86.3
FW, char-trigrams, POS-trigrams, contextual FW, cohesive markers    94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


MIXED-DOMAIN CLASSIFICATION

method \ corpus               EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain    93.7      99.5      99.8          92.2
XMeans: # of clusters         2         2         3             3
XMeans: accuracy by domain    93.6      99.5      –             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat              92.5      60.7      77.5          66.8
Two-phase         91.3      79.4      85.3          67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ··· fm into an English sentence E = e1 ··· el

The best translation:

E^ = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E^ = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                           (translation model)  (language model)


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)
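The noisy-channel decision rule can be illustrated with toy, made-up probabilities (real decoders search a huge candidate space; here the candidates and their scores are hypothetical):

```python
# Candidate English translations E for one foreign sentence F,
# each with an invented translation-model score P(F|E) and
# language-model score P(E).
candidates = {
    #  E             P(F|E)  P(E)
    "he is here":   (0.20,  0.05),
    "he here is":   (0.30,  0.001),   # faithful but disfluent
    "she is here":  (0.05,  0.04),    # fluent but unfaithful
}

# E^ = argmax_E P(F|E) * P(E)
best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
```

The disfluent candidate has the highest translation-model score, yet the language model vetoes it: the product favors the candidate that is both reasonably faithful and fluent.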


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏ᵢ₌₁ᴺ 1 / PLM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)
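The perplexity formula can be sketched for the unigram case (the formula above conditions on the full history; a unigram model keeps the example short, and the probabilities here are illustrative):

```python
import math

def perplexity(lm, words):
    """PP = (prod_i 1/P(w_i))^(1/N), computed in log space for stability."""
    n = len(words)
    log_sum = sum(math.log(lm[w]) for w in words)
    return math.exp(-log_sum / n)

# A toy unigram model and a short test sequence.
lm = {"the": 0.5, "house": 0.25, "is": 0.25}
pp = perplexity(lm, ["the", "house", "the", "is"])
```

Lower perplexity means the model finds the reference text less surprising, which is exactly how the fitness of O-based vs. T-based LMs is compared in the tables that follow.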


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No Punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       –
T-DE         122.50        7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44        4.17


ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20       –
T-DE         107.48        6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72        3.89


ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        –
T-DE         38.67         4.81
T-FR         34.75        14.46
T-IT         36.85         9.30
T-NL         39.44         2.91


ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         –
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


MACHINE TRANSLATION RESULTS

DE to EN           FR to EN           IT to EN           NL to EN
LM     BLEU        LM     BLEU        LM     BLEU        LM     BLEU
MIX    21.43       MIX    28.67       MIX    25.41       MIX    24.20
O-EN   21.10       O-EN   27.98       O-EN   24.69       O-EN   23.40
T-DE   21.90       T-DE   28.01       T-DE   24.62       T-DE   24.26
T-FR   21.16       T-FR   29.14       T-FR   25.37       T-FR   23.56
T-IT   21.29       T-IT   28.75       T-IT   25.96       T-IT   23.87
T-NL   21.20       T-NL   28.11       T-NL   24.77       T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers, particularly in Great Britain, who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task     S → T    T → S
FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S → T    T → S
250K            34.35    31.33
500K            35.21    32.38
750K            36.12    32.90
1M              35.73    33.07
1.25M           36.24    33.23
1.5M            36.43    33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S → T    T → S
250K            27.74    26.58
500K            29.15    27.19
750K            29.43    27.63
1M              29.94    27.88
1.25M           30.63    27.84
1.5M            29.89    27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System      MIX      MIX-EO   MIX-FO
S → T       35.21    35.21    35.73
UNION       35.27    35.36    35.94
PPLMIN-1    35.46    35.59    36.26
PPLMIN-2    35.75    35.65    36.20
CrEnt       35.54    35.45    36.75
PplRatio    35.59    35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System      MIX      MIX-FO   MIX-EO
S → T       29.15    29.15    29.94
UNION       29.27    29.44    30.01
PPLMIN-1    29.64    29.94    29.65
PPLMIN-2    29.50    30.45    30.12
CrEnt       29.47    29.45    30.44
PplRatio    29.65    29.62    30.34


The Power of Interference


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist             Dist_w
Feature               EN T    FR T     EN T    FR T
POS-trigrams          1.02    1.16     1.07    1.34
positional tokens     1.10    1.15     1.30    1.37
function words        1.32    1.82     1.43    1.90
cohesive markers      1.95    1.95     2.15    2.03
feature combination   1.00    1.46     1.13    1.58
random tree           2.00             1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)   F1

ARA     80    0    2    1    3    4    1    0    4    2    3    80.8   80.0  80.4
CHI      3   80    0    1    1    0    6    7    1    0    1    88.9   80.0  84.2
FRE      2    2   81    5    1    2    1    0    3    0    3    86.2   81.0  83.5
GER      1    1    1   93    0    0    0    1    1    0    2    87.7   93.0  90.3
HIN      2    0    0    1   77    1    0    1    5    9    4    74.8   77.0  75.9
ITA      2    0    3    1    1   87    1    0    3    0    2    82.1   87.0  84.5
JPN      2    1    1    2    0    1   87    5    0    0    1    78.4   87.0  82.5
KOR      1    5    2    0    1    0    9   81    1    0    0    80.2   81.0  80.6
SPA      2    0    2    0    1    8    2    1   78    1    5    77.2   78.0  77.6
TEL      0    1    0    0   18    1    2    1    1   73    3    85.9   73.0  78.9
TUR      4    0    2    2    0    2    2    4    4    0   80    76.9   80.0  78.4


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best-performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics: Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.


Introduction Corpus-based Translation Studies

COMPUTATIONAL INVESTIGATION OF TRANSLATIONESE

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Frequency O   Frequency T   Ratio   LL   Weight

,          37.83         49.79      0.76    T     T1
(           0.42          0.72      0.58    T     T2
'           1.94          2.53      0.77    T     T3
)           0.40          0.72      0.56    T     T4
            0.30          0.30      1.00    —     —
[           0.01          0.02      0.45    T     —
]           0.01          0.02      0.44    T     —
"           0.33          0.22      1.46    O     O7
            0.22          0.17      1.25    O     O6
.          38.20         34.60      1.10    O     O5
            1.20          1.17      1.03    —     O4
            0.84          0.83      1.01    —     O3
            1.57          1.11      1.41    O     O2
-           2.68          2.25      1.19    O     O1


Identification of Translationese


METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification


TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL. (2012))


TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
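The pipeline above can be sketched end to end. The corpus here is invented, and scikit-learn's CountVectorizer and LinearSVC stand in for the original feature extraction and the Weka SVM; this is an illustration of the workflow, not the actual experiment:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A toy O/T corpus (invented sentences); the real experiments use
# ~2,000-token Europarl chunks and richer feature sets.
texts = [
    "we want to see countries produce the best product",
    "the committee agreed to proceed with the vote",
    "moreover it is therefore necessary to stress this point",
    "moreover we must therefore in fact support the proposal",
] * 10
labels = ["O", "O", "T", "T"] * 10

# Token counts feed a linear-kernel SVM; ten-fold cross-validation
# estimates accuracy, mirroring the evaluation setup described above.
classifier = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(classifier, texts, labels, cv=10)
print(round(scores.mean(), 2))
```

On this trivially separable toy data the cross-validated accuracy is essentially perfect; the interesting question in the slides is which feature families achieve this on real O/T chunks.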


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al., 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al., 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn, 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using SVM with a default linear kernel
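The chunking step can be sketched as follows, assuming pre-tokenized sentences; the greedy packing policy is an assumption, and the original partitioning may differ in detail:

```python
def chunk_corpus(sentences, chunk_size=2000):
    """Greedily pack tokenized sentences into chunks of roughly
    chunk_size tokens, always ending on a sentence boundary."""
    chunks, current, n_tokens = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        n_tokens += len(sentence)
        if n_tokens >= chunk_size:
            chunks.append(current)
            current, n_tokens = [], 0
    if current:  # keep the trailing, smaller chunk
        chunks.append(current)
    return chunks

# 300 ten-token sentences yield one full 2,000-token chunk
# plus a 1,000-token remainder.
sentences = [["token"] * 10 for _ in range(300)]
sizes = [sum(len(s) for s in c) for c in chunk_corpus(sentences)]
print(sizes)  # [2000, 1000]
```

Each chunk then becomes one classification instance, represented by a feature vector.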


Translation Studies Hypotheses


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
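The three lexical-variety measures follow directly from token counts; a minimal sketch:

```python
import math

def lexical_variety(tokens):
    """Compute the three TTR variants: V/N, log V / log N, and
    100 * log N / (1 - V1/V).  The third is undefined when every
    type is a hapax (V1 == V)."""
    n = len(tokens)
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    v = len(counts)                                  # number of types
    v1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))

tokens = "the cat sat on the mat with the hat".split()
ttr1, ttr2, ttr3 = lexical_variety(tokens)
print(ttr1, ttr2, ttr3)
```

In the example, N = 9, V = 7 and V1 = 6, so the first measure is 7/9; in the real experiments the input is a ~2,000-token chunk.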


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu ('the Israeli Prime Minister Benjamin Netanyahu'); O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
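The cohesive-marker feature reduces to counting a fixed list of markers per chunk. A simplified sketch; the marker list here is a small sample of the full list, and the whitespace matching is cruder than a real tokenizer:

```python
MARKERS = ["because", "but", "hence", "in fact", "therefore", "moreover"]

def cohesive_marker_frequencies(text, markers=MARKERS):
    """Per-token frequency of each cohesive marker; multi-word markers
    like 'in fact' are matched on the space-joined, lowercased text."""
    words = text.lower().split()
    padded = " " + " ".join(words) + " "
    return {m: padded.count(" " + m + " ") / len(words) for m in markers}

freqs = cohesive_marker_frequencies(
    "Moreover we must therefore act because in fact the vote matters")
print(freqs["moreover"], freqs["in fact"], freqs["but"])
```

Each marker's frequency becomes one dimension of the chunk's feature vector.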


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ('Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?')

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ('I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?')

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
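Both PMI features follow from unigram and bigram counts. A minimal sketch using within-chunk maximum-likelihood estimates (an assumption; the probabilities could also be estimated from a larger reference corpus):

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI over all bigram occurrences in a chunk, plus the
    number of bigram occurrences whose PMI is above 0 (threshold PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    total, above = 0.0, 0
    for (w1, w2), count in bigrams.items():
        # PMI(w1, w2) = log2 p(w1 w2) / (p(w1) p(w2))
        pmi = math.log2((count / n_bi) /
                        ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        total += pmi * count
        if pmi > 0:
            above += count
    return total / n_bi, above

avg, above = pmi_features(["a", "b", "a", "b"])
print(avg, above)
```

Highly associated bigrams (fixed expressions) push the threshold count up, which is exactly what the threshold-PMI classifier exploits.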


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German
NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets 〈w1, w2, w3〉, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
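Two of these feature families are easy to sketch: character n-grams and positional token frequency (simplified raw counts, without the normalization a real feature vector would need):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts (n=3 gives the char-trigram features)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Count which tokens occupy the first, second, antepenultimate,
    penultimate, and last positions of each tokenized sentence."""
    feats = Counter()
    for toks in sentences:
        if len(toks) < 5:  # positions only well-defined for 5+ tokens
            continue
        for name, idx in [("first", 0), ("second", 1),
                          ("antepenultimate", -3),
                          ("penultimate", -2), ("last", -1)]:
            feats[(name, toks[idx])] += 1
    return feats

trigrams = char_ngrams("translationese")
feats = positional_tokens([["But", "we", "must", "act", "now", "."]])
print(trigrams.most_common(3), feats)
```

A feature like ("first", "But") is exactly the kind of positional signal discussed later: sentence-initial 'But' turns out to typify O.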


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


Supervised Classification


RESULTS SANITY CHECK

Category   Feature          Accuracy (%)

Sanity     Token unigrams   100
           Token bigrams    100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category         Feature                 Accuracy (%)

Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category        Feature                Accuracy (%)

Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)

Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)

Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011); they pick up both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams yield an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But'; there are 2.25 times more cases of such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001)

This is in fact standardization rather than interference


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)

Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)

Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Frequency O   Frequency T   Ratio   LL   Weight

,          37.83         49.79      0.76    T     T1
(           0.42          0.72      0.58    T     T2
'           1.94          2.53      0.77    T     T3
)           0.40          0.72      0.56    T     T4
            0.30          0.30      1.00    —     —
[           0.01          0.02      0.45    T     —
]           0.01          0.02      0.44    T     —
"           0.33          0.22      1.46    O     O7
            0.22          0.17      1.25    O     O6
.          38.20         34.60      1.10    O     O5
            1.20          1.17      1.03    —     O4
            0.84          0.83      1.01    —     O3
            1.57          1.11      1.41    O     O2
-           2.68          2.25      1.19    O     O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus   EUR   HAN   LIT    TED

FW                 96.3  98.1  97.3   97.7
char-trigrams      98.8  97.1  99.5  100.0
POS-trigrams       98.5  97.2  98.7   92.0
contextual FW      95.2  96.8  94.1   86.3
cohesive markers   83.6  86.9  78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION (USING FUNCTION WORDS)

train \ test   EUR   HAN   LIT   X-validation

EUR            —     60.8  56.2  96.3
HAN            59.7  —     58.7  98.1
LIT            64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION (USING FUNCTION WORDS)

train \ test   EUR   HAN   LIT   X-validation

EUR + HAN      —     —     63.8  94.0
EUR + LIT      —     64.1  —     92.9
HAN + LIT      59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus   EUR   HAN   LIT   TED

FW                 88.6  88.9  78.8  87.5
char-trigrams      72.1  63.8  70.3  78.6
POS-trigrams       96.9  76.0  70.7  76.1
contextual FW      92.9  93.2  68.2  67.0
cohesive markers   63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot labelthose classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

D_JS(X, Ci) = 2^√(JSD(P_X ‖ P_Ci))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
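A minimal sketch of this labeling procedure, using the additive smoothing above and reading the slide's distance as 2**sqrt(JSD) (any monotone transform of JSD yields the same decision); the toy marker words and cluster chunks are hypothetical:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.001):
    """Additive smoothing: p(w) = (tf(w) + eps) / (N + eps * |V|)."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (Lin 1991), base-2 logs."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dist(p, q):
    # One plausible reading of the slide's distance; smaller = closer.
    return 2 ** math.sqrt(jsd(p, q))

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Assign 'O'/'T' to clusters C1 and C2 by cross-wise proximity."""
    if dist(p_o, p_c1) * dist(p_t, p_c2) < dist(p_o, p_c2) * dist(p_t, p_c1):
        return ("O", "T")
    return ("T", "O")

# Toy marker texts: subject pronouns stand in for O-markers,
# reflexives and connectives for T-markers (all hypothetical).
vocab = {"i", "he", "itself", "moreover"}
p_o = unigram_lm("i he i he i".split(), vocab)
p_t = unigram_lm("itself moreover itself moreover itself".split(), vocab)
p_c1 = unigram_lm("itself moreover itself he".split(), vocab)  # looks translated
p_c2 = unigram_lm("i he he i moreover".split(), vocab)         # looks original
print(label_clusters(p_o, p_t, p_c1, p_c2))  # -> ('T', 'O')
```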


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                          EUR   HAN   LIT   TED

FW                                       88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams        91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW        95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers     94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus mix              EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus mix   EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

Flat                  92.5     60.7     77.5         66.8
Two-phase             91.3     79.4     85.3         67.5


Applications for Machine Translation


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

— Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model.
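The argmax can be sketched with toy, hypothetical log-probability tables (a real system scores candidates with a phrase table and an n-gram LM over a huge search space):

```python
# Hypothetical toy scores (log-probabilities) for a fixed foreign sentence F;
# the candidate strings and all numbers are invented for illustration.
translation_model = {  # log P(F | E)
    "this is serious": -2.1,
    "this is grave":   -1.8,
    "it is serious":   -2.5,
}
language_model = {     # log P(E)
    "this is serious": -4.0,
    "this is grave":   -6.5,
    "it is serious":   -4.2,
}

def best_translation(candidates):
    """argmax_E  log P(F|E) + log P(E): the noisy-channel objective."""
    return max(candidates, key=lambda e: translation_model[e] + language_model[e])

print(best_translation(translation_model))  # -> 'this is serious'
```

Note how the LM overrules the slightly better TM score of "this is grave": fluency and faithfulness are traded off.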


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al 2002).
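A simplified sentence-level sketch of the BLEU computation (clipped n-gram precisions combined with a brevity penalty); real implementations add smoothing and corpus-level aggregation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty
    (single reference, no smoothing -- a teaching sketch, not sacreBLEU)."""
    logs = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(1, len(candidate) - n + 1)
        if clipped == 0:          # any empty precision zeroes the score
            return 0.0
        logs.append(math.log(clipped / total))
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(logs) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))  # -> 1.0
```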


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … w_{i−1}) )^(1/N)

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al 2002)
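The perplexity computation can be sketched for the simplest case, a unigram LM with additive smoothing; the toy corpus, vocabulary, and smoothing constant are assumptions for illustration:

```python
import math
from collections import Counter

def make_unigram_lm(train_tokens, vocab, eps=0.1):
    """Additively smoothed unigram model: p(w) = (tf(w)+eps) / (N+eps*|V|)."""
    tf = Counter(train_tokens)
    denom = len(train_tokens) + eps * len(vocab)
    return lambda w: (tf[w] + eps) / denom

def perplexity(lm, test_tokens):
    """PP = exp(-(1/N) * sum_i log P(w_i)); unigrams have no history."""
    n = len(test_tokens)
    return math.exp(-sum(math.log(lm(w)) for w in test_tokens) / n)

train = "we support free trade but we must learn from past mistakes".split()
vocab = set(train) | {"<unk>"}
lm = make_unigram_lm(train, vocab)
print(perplexity(lm, ["we", "support", "trade"]))
```

A lower perplexity means the model finds the reference text less "surprising", which is exactly how the LM fitness comparisons below are read.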


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50   93.00   69.36   66.47
O-EN        468.09  103.74   79.57   76.79
T-DE        443.14   88.48   64.99   62.07
T-FR        460.98   99.90   76.23   73.38
T-IT        465.89  102.31   78.50   75.67
T-NL        457.02   97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05   99.04   75.60   72.68
O-EN        500.56  115.48   91.14   88.31
T-DE        486.78  108.50   84.39   81.41
T-FR        463.58   94.59   71.24   68.37
T-IT        476.05  102.69   79.23   76.36
T-NL        490.09  110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99   88.46   67.35   64.40
O-EN        415.47   99.92   79.27   76.34
T-DE        404.64   95.22   73.73   70.85
T-FR        395.99   89.44   68.38   65.54
T-IT        384.55   81.90   60.85   57.91
T-NL        411.58   98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89   90.73   69.05   66.08
O-EN        448.11  100.17   78.23   75.46
T-DE        437.68   93.67   71.54   68.57
T-FR        445.00   97.32   75.59   72.55
T-IT        448.11  100.19   78.06   75.19
T-NL        423.13   83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
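These abstraction levels can be sketched as token-rewriting operations; the tiny POS lookup below is hypothetical, standing in for a real tagger:

```python
import string

# Hypothetical POS lookup; a real pipeline would run a POS tagger.
POS = {"parliament": "NN", "report": "NN", "supports": "VBZ",
       "the": "DT", "this": "DT", "strongly": "RB"}

def abstract(tokens, mode):
    """Return a copy of `tokens` at one of the abstraction levels above."""
    if mode == "no-punct":
        return [t for t in tokens if t not in string.punctuation]
    if mode == "no-nouns":
        return [POS[t] if POS.get(t, "").startswith("NN") else t for t in tokens]
    if mode == "pos-only":
        return [POS.get(t, "UNK") for t in tokens]
    return list(tokens)

sent = "the parliament strongly supports this report .".split()
print(abstract(sent, "no-punct"))
print(abstract(sent, "no-nouns"))   # nouns replaced by their POS tag
```

LMs are then trained on the abstracted corpora, so the perplexity gaps below cannot be attributed to punctuation habits, named entities, or content words alone.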


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50       7.16
T-FR         99.52      24.58
T-IT        112.71      14.58
T-NL        126.44       4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX          93.88      18.51
O-EN        115.20      —
T-DE        107.48       6.70
T-FR         88.96      22.77
T-IT         99.17      13.91
T-NL        110.72       3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67        4.81
T-FR        34.75       14.46
T-IT        36.85        9.30
T-NL        39.44        2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores by language model, per translation task:

LM     DE→EN  FR→EN  IT→EN  NL→EN

MIX    21.43  28.67  25.41  24.20
O-EN   21.10  27.98  24.69  23.40
T-DE   21.90  28.01  24.62  24.26
T-FR   21.16  29.14  25.37  23.56
T-IT   21.29  28.75  25.96  23.87
T-NL   21.20  28.11  24.77  24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?



Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term.

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S

FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES

Union: simple concatenation of corpora

Two phrase tables: train a phrase table for each subset and pass both to MOSES

Phrase-table interpolation using perplexity minimization (Sennrich 2012)

Add a feature in the phrase table indicating the direction of translation
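Interpolation with perplexity minimization, in the spirit of Sennrich (2012), can be sketched on a toy scale; the probability tables and dev-set counts below are invented for illustration:

```python
import math

# Toy translation probabilities p(target | source) from two bi-text subsets
# (hypothetical numbers, not from the experiments reported here).
p_st = {("grave", "serious"): 0.7, ("grave", "grave"): 0.3}  # S -> T subset
p_ts = {("grave", "serious"): 0.4, ("grave", "grave"): 0.6}  # T -> S subset

# Dev-set phrase pairs on which the interpolation weight is tuned.
dev = [("grave", "serious"), ("grave", "serious"), ("grave", "grave")]

def interpolate(lam):
    """Linear interpolation of the two tables with weight lam on p_st."""
    return {k: lam * p_st[k] + (1 - lam) * p_ts[k] for k in p_st}

def perplexity(table, pairs):
    return math.exp(-sum(math.log(table[p]) for p in pairs) / len(pairs))

# Grid search for the weight that minimizes dev perplexity.
best = min((perplexity(interpolate(l / 10), dev), l / 10) for l in range(11))
print(best)  # (minimal dev perplexity, chosen lambda)
```

Real systems interpolate full phrase tables feature-by-feature and optimize the weights numerically rather than by grid search.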


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System    MIX    MIX-EO  MIX-FO
S → T     35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System    MIX    MIX-FO  MIX-EO
S → T     29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL

LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, with POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

Feature              Dist (EN T)  Dist (FR T)  Distw (EN T)  Distw (FR T)
POS-trigrams         1.02         1.16         1.07          1.34
positional tokens    1.10         1.15         1.30          1.37
function words       1.32         1.82         1.43          1.90
cohesive markers     1.95         1.95         2.15          2.03
feature combination  1.00         1.46         1.13          1.58
random tree          2.00                      1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016).

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 111 112

Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008, 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 112 112


Identification of Translationese

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 11 112

Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 12 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 13 112

Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
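The training-and-testing loop above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual setup: it uses a mistake-driven perceptron as a stand-in for the SVM used in these experiments, and the function-word feature list and training chunks are hypothetical toy data.

```python
from collections import Counter

# Hypothetical feature list; the real feature sets are described later.
FEATURES = ["the", "of", "in", "moreover", "therefore", "however"]

def vectorize(chunk):
    """Represent a text chunk as normalized frequencies of the features."""
    tokens = chunk.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[f] / n for f in FEATURES]

def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of (chunk, label) pairs, label +1 for T and -1 for O."""
    w = [0.0] * len(FEATURES)
    b = 0.0
    for _ in range(epochs):
        for chunk, y in data:
            x = vectorize(chunk)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # mistake-driven weight update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, chunk):
    """Apply the learned weights to an unseen chunk."""
    x = vectorize(chunk)
    return "T" if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else "O"
```

The learned weight vector plays the same role as the SVM feature weights mentioned above: inspecting it shows which features pull a chunk towards O or T.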

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 14 112

Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION)APPLICATIONS

Determine the genderage of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 15 112

Identification of Translationese

IDENTIFYING TRANSLATIONESEUSING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 16 112

Identification of Translationese

EXPERIMENTAL SETUP

Corpus EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al 2009), using SVM with a default linear kernel

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 17 112

Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 18 112

Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 19 112

Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 20 112

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

    V/N        log(V)/log(N)        100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH In characters

SYLLABLE RATIO The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH In words

LEXICAL DENSITY The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS The frequencies of the N most frequent words, N = 5, 10, 50
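The lexical-variety measures above are simple to compute. The sketch below assumes whitespace-tokenized input and implements the three TTR variants from the slide (the third is Honoré's statistic); `ttr_measures` is an illustrative helper name, not part of the original code.

```python
import math
from collections import Counter

def ttr_measures(tokens):
    """The three type-token ratio variants from the slide:
    V/N, log(V)/log(N), and 100*log(N)/(1 - V1/V),
    where V = #types, N = #tokens, V1 = #types occurring once.
    (The third measure is undefined when every type is a hapax.)"""
    n = len(tokens)
    counts = Counter(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))
```

On a toy chunk such as `"a b a c a b d".split()` the three values are roughly 0.571, 0.712, and 389.2; translated chunks are expected to score lower on lexical variety than originals.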

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 21 112

Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T der israelische Ministerpräsident Benjamin Netanjahu ('the Israeli Prime Minister Benjamin Netanyahu'); O Merkel

EXPLICIT NAMING The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING The average length (in tokens) of proper nouns

COHESIVE MARKERS The frequencies of several cohesive markers (because, but, hence, in fact, therefore, …)
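Two of these features can be sketched directly over POS-tagged text. The helper below is illustrative only: it assumes Penn-Treebank-style tags on (token, tag) pairs, and the function name is hypothetical.

```python
def explicitation_features(tagged):
    """tagged: list of (token, POS) pairs, Penn-style tags assumed.
    Returns (explicit naming, single naming):
    explicit naming = ratio of personal pronouns to proper nouns;
    single naming = count of proper nouns with no proper-noun neighbor."""
    proper_tags = ("NNP", "NNPS")
    pronouns = sum(1 for _, t in tagged if t == "PRP")
    proper = sum(1 for _, t in tagged if t in proper_tags)
    explicit_naming = pronouns / proper if proper else 0.0
    single = 0
    for i, (_, t) in enumerate(tagged):
        if t in proper_tags:
            left = tagged[i - 1][1] if i > 0 else ""
            right = tagged[i + 1][1] if i < len(tagged) - 1 else ""
            if left not in proper_tags and right not in proper_tags:
                single += 1
    return explicit_naming, single
```

For example, in "He met Angela Merkel in Berlin", only Berlin counts as a single naming, since Angela and Merkel are proper-noun neighbors.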

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 22 112

Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ('Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?')

T Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ('I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?')

CONTRACTIONS Ratio of contracted forms to their counterpart full forms

AVERAGE PMI The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI The number of bigrams in the chunk whose PMI is above 0
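Both PMI features follow from the standard definition PMI(w1, w2) = log2 p(w1, w2) / (p(w1) p(w2)). A minimal sketch, under the simplifying assumption that probabilities are maximum-likelihood estimates from the chunk itself (a larger reference corpus could be used instead) and that the average is taken over bigram types:

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI over the bigram types in a chunk, and the number of
    bigrams whose PMI exceeds 0 (the threshold-PMI feature).
    Assumes the chunk has at least one bigram."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    nb = n - 1  # number of bigram tokens
    pmis = []
    for (w1, w2), c in bigrams.items():
        p_xy = c / nb
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        pmis.append(math.log2(p_xy / (p_x * p_y)))
    return sum(pmis) / len(pmis), sum(1 for v in pmis if v > 0)
```

Highly associated bigrams such as stand trial push the threshold count up, which is why this feature separates O (more fixed expressions) from T.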

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 23 112

Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE doch is under-represented in translation into German

NEGATIVE One is is over-represented in translations from German into English (triggered by Man ist)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 24 112

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
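Two of the interference feature families above are cheap to extract. The sketch below is illustrative (whitespace tokenization, hypothetical helper names), but the counting logic mirrors the feature definitions:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams: a shallow proxy for morphology (affixes)
    and for short function words embedded in the character stream."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Count tokens in the first, second, antepenultimate,
    penultimate, and last positions of each sentence."""
    positions = {"first": 0, "second": 1, "antepenultimate": -3,
                 "penultimate": -2, "last": -1}
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        if len(tokens) < 5:  # too short for five distinct positions
            continue
        for name, idx in positions.items():
            counts[(name, tokens[idx])] += 1
    return counts
```

For instance, the positional feature ("first", "But") directly captures the sentence-initial But discussed in the analysis below.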

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 25 112

Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS The frequency of the occurrences of pronouns in the chunk

PUNCTUATION - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams
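The three punctuation-frequency variants can be sketched as follows. This is one reading of the (partly garbled) slide, with an assumed mark list and hypothetical function name; the exact normalizations in the original work may differ.

```python
from collections import Counter

# Assumed punctuation inventory; the slide's list is partly illegible.
MARKS = [",", ".", "?", "!", ":", ";", "-", "(", ")", "[", "]", "'", '"']

def punctuation_features(tokens):
    """Three frequency variants per mark:
    chunk:  count normalized by chunk length,
    words:  count normalized by non-punctuation tokens,
    punct:  count normalized by all punctuation marks in the chunk."""
    counts = Counter(t for t in tokens if t in MARKS)
    n = len(tokens)
    p = sum(counts.values())          # total punctuation marks
    words = max(n - p, 1)
    feats = {}
    for m in MARKS:
        feats["chunk:" + m] = counts[m] / n
        feats["words:" + m] = counts[m] / words
        feats["punct:" + m] = counts[m] / max(p, 1)
    return feats
```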

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 26 112

Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 27 112

Supervised Classification

RESULTS SANITY CHECK

Category: Sanity

Feature | Accuracy (%)
Token unigrams | 100
Token bigrams | 100

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 28 112

Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category: Simplification

Feature | Accuracy (%)
TTR (1) | 72
TTR (2) | 72
TTR (3) | 76
Mean word rank (1) | 69
Mean word rank (2) | 77
N most frequent words | 64
Mean word length | 66
Syllable ratio | 61
Lexical density | 53
Mean sentence length | 65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 29 112

Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%; syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 30 112

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER lsquoLANGUAGErsquo)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 31 112

Supervised Classification Explicitation

RESULTS EXPLICITATION

Category: Explicitation

Feature | Accuracy (%)
Cohesive markers | 81
Explicit naming | 58
Single naming | 56
Mean multiple naming | 54

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 32 112

Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 33 112

Supervised Classification Normalization

RESULTS NORMALIZATION

Category: Normalization

Feature | Accuracy (%)
Repetitions | 55
Contractions | 50
Average PMI | 52
Threshold PMI | 66

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 34 112

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features perform very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 35 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVETHRESHOLD ACCORDING TO lsquoLANGUAGErsquo)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 36 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 37 112

Supervised Classification Interference

RESULTS INTERFERENCE

Category: Interference

Feature | Accuracy (%)
POS unigrams | 90
POS bigrams | 97
POS trigrams | 98
Character unigrams | 85
Character bigrams | 98
Character trigrams | 100
Prefixes and suffixes | 80
Contextual function words | 100
Positional token frequency | 97

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 38 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch both affixes and function words

For example, typical trigrams in O are -ion and all, whereas trigrams typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 39 112

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 40 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny 2001)

This is in fact standardization rather than interference

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 41 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 42 112

Supervised Classification Interference

RESULTS REDUCED PARAMETER SPACE(300 MOST FREQUENT FEATURES)

Category: Interference (300 most frequent features)

Feature | Accuracy (%)
POS bigrams | 96
POS trigrams | 96
Character bigrams | 95
Character trigrams | 96
Positional token frequency | 93

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 43 112

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous

Feature | Accuracy (%)
Function words | 96
Punctuation (1) | 81
Punctuation (2) | 85
Punctuation (3) | 80
Pronouns | 77
Ratio of passive forms to all verbs | 65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 44 112

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function-words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 45 112

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark | Freq. O | Freq. T | Ratio (O/T) | LL | Weight
, | 37.83 | 49.79 | 0.76 | T | T1
( | 0.42 | 0.72 | 0.58 | T | T2
' | 1.94 | 2.53 | 0.77 | T | T3
) | 0.40 | 0.72 | 0.56 | T | T4
  | 0.30 | 0.30 | 1.00 | — | —
[ | 0.01 | 0.02 | 0.45 | T | —
] | 0.01 | 0.02 | 0.44 | T | —
" | 0.33 | 0.22 | 1.46 | O | O7
  | 0.22 | 0.17 | 1.25 | O | O6
. | 38.20 | 34.60 | 1.10 | O | O5
  | 1.20 | 1.17 | 1.03 | — | O4
  | 0.84 | 0.83 | 1.01 | — | O3
  | 1.57 | 1.11 | 1.41 | O | O2
- | 2.68 | 2.25 | 1.19 | O | O1

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 46 112

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 47 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 48 112

Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001 and 2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001 to 2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 49 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Feature \ Corpus | EUR | HAN | LIT | TED

FW | 96.3 | 98.1 | 97.3 | 97.7
Char-trigrams | 98.8 | 97.1 | 99.5 | 100.0
POS-trigrams | 98.5 | 97.2 | 98.7 | 92.0
Contextual FW | 95.2 | 96.8 | 94.1 | 86.3
Cohesive markers | 83.6 | 86.9 | 78.6 | 81.8

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 50 112

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATIONUSING FUNCTION WORDS

Train \ Test | EUR | HAN | LIT | X-validation

EUR | — | 60.8 | 56.2 | 96.3
HAN | 59.7 | — | 58.7 | 98.1
LIT | 64.3 | 61.5 | — | 97.3

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 51 112

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATIONUSING FUNCTION WORDS

Train \ Test | EUR | HAN | LIT | X-validation

EUR + HAN | — | — | 63.8 | 94.0
EUR + LIT | — | 64.1 | — | 92.9
HAN + LIT | 59.8 | — | — | 96.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 52 112

Unsupervised Classification Unsupervised Classification

CLUSTERINGASSUMING GOLD LABELS

Feature \ Corpus | EUR | HAN | LIT | TED

FW | 88.6 | 88.9 | 78.8 | 87.5
Char-trigrams | 72.1 | 63.8 | 70.3 | 78.6
POS-trigrams | 96.9 | 76.0 | 70.7 | 76.1
Contextual FW | 92.9 | 93.2 | 68.2 | 67.0
Cohesive markers | 63.1 | 81.2 | 67.1 | 63.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 53 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

    p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
    p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

    DJS(X, Ci) = 2 √( JSD(PX || PCi) )
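The divergence computation and the labeling decision can be sketched in a few lines. This is an illustrative implementation, not the original code: it assumes the smoothed distributions are already available as aligned probability vectors, and the function names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i = 0 vanish)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (Lin 1991): symmetric and bounded.
    Smoothing (the epsilon above) keeps q strictly positive in practice."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label two clusters as O/T by the product-of-distances rule."""
    d = lambda x, c: 2 * math.sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"
```

Because the rule multiplies both distances, a cluster is labeled O only when it is close to the O-markers and the other cluster is close to the T-markers.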

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 54 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

    label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
                "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 55 112

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

Method \ Corpus | EUR | HAN | LIT | TED

FW | 88.6 | 88.9 | 78.8 | 87.5
FW + char-trigrams + POS-trigrams | 91.1 | 86.2 | 78.2 | 90.9
FW + POS-trigrams + contextual FW | 95.8 | 89.8 | 72.3 | 86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers | 94.1 | 91.0 | 79.2 | 88.6

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 56 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 57 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 58 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot easily be generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 59 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

Method \ Corpora | EUR+HAN | EUR+LIT | EUR+HAN+LIT | HAN+LIT

KMeans: accuracy by domain | 93.7 | 99.5 | 99.8 | 92.2
XMeans: generated number of clusters | 2 | 2 | 3 | 3
XMeans: accuracy by domain | 93.6 | 99.5 | — | 92.2

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 60 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 61 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Method \ Corpora | EUR+HAN | EUR+LIT | EUR+HAN+LIT | HAN+LIT

Flat | 92.5 | 60.7 | 77.5 | 66.8
Two-phase | 91.3 | 79.4 | 85.3 | 67.5

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 62 112

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 63 112

Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 64 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 65 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

    Ê = arg max_E P(E | F)
      = arg max_E [ P(F | E) × P(E) ] / P(F)
      = arg max_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

    Ê = arg max_{E ∈ English}  P(F | E)  ×  P(E)
                               (translation model × language model)
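The decoder's objective is just a sum of the two log-scores per candidate. A toy illustration, with hypothetical candidates and log-probabilities (real decoders search a huge candidate space rather than enumerating it):

```python
import math

def noisy_channel_best(candidates, tm_logprob, lm_logprob):
    """Return the candidate E maximizing log P(F|E) + log P(E),
    the noisy-channel objective computed in log space."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])

# Hypothetical scores: both candidates fit the source equally well
# (same translation-model score), but the language model prefers
# the fluent word order.
candidates = ["house the", "the house"]
tm = {"house the": math.log(0.2), "the house": math.log(0.2)}
lm = {"house the": math.log(0.01), "the house": math.log(0.2)}
```

Here the language model acts as the fluency component: it breaks the tie between translations that are equally faithful to the source.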

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 66 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 67 112

Applications for Machine Translation Language Models

LANGUAGE MODELSRESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 68 112

Applications for Machine Translation Language Models

LANGUAGE MODELSMETHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … wi−1) )^{1/N}

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)
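The perplexity formula above can be computed directly in log space. A minimal sketch with an unsmoothed MLE unigram model over a toy corpus (the helper names are illustrative; real experiments use smoothed higher-order models):

```python
import math
from collections import Counter

def make_unigram(corpus_tokens):
    """MLE unigram model; no smoothing, so unseen words would fail.
    Returns a function giving log2 P(w | history); a unigram model
    simply ignores the history."""
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    return lambda history, w: math.log2(counts[w] / n)

def perplexity(logprob_fn, tokens):
    """Perplexity: the inverse probability of the text, normalized by
    its length; computed in log space for numerical stability."""
    total = sum(logprob_fn(tokens[:i], w) for i, w in enumerate(tokens))
    return 2 ** (-total / len(tokens))
```

Lower perplexity means the model fits the reference text better, which is how the tables that follow compare LMs built from O and from the various Ts.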

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 69 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang | 1-gram | 2-gram | 3-gram | 4-gram
Mix | 451.50 | 93.00 | 69.36 | 66.47
O-EN | 468.09 | 103.74 | 79.57 | 76.79
T-DE | 443.14 | 88.48 | 64.99 | 62.07
T-FR | 460.98 | 99.90 | 76.23 | 73.38
T-IT | 465.89 | 102.31 | 78.50 | 75.67
T-NL | 457.02 | 97.34 | 73.54 | 70.56

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 70 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang | 1-gram | 2-gram | 3-gram | 4-gram
Mix | 472.05 | 99.04 | 75.60 | 72.68
O-EN | 500.56 | 115.48 | 91.14 | 88.31
T-DE | 486.78 | 108.50 | 84.39 | 81.41
T-FR | 463.58 | 94.59 | 71.24 | 68.37
T-IT | 476.05 | 102.69 | 79.23 | 76.36
T-NL | 490.09 | 110.67 | 86.61 | 83.55

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 71 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang | 1-gram | 2-gram | 3-gram | 4-gram
Mix | 395.99 | 88.46 | 67.35 | 64.40
O-EN | 415.47 | 99.92 | 79.27 | 76.34
T-DE | 404.64 | 95.22 | 73.73 | 70.85
T-FR | 395.99 | 89.44 | 68.38 | 65.54
T-IT | 384.55 | 81.90 | 60.85 | 57.91
T-NL | 411.58 | 98.78 | 76.98 | 73.94

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 72 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang | 1-gram | 2-gram | 3-gram | 4-gram
Mix | 434.89 | 90.73 | 69.05 | 66.08
O-EN | 448.11 | 100.17 | 78.23 | 75.46
T-DE | 437.68 | 93.67 | 71.54 | 68.57
T-FR | 445.00 | 97.32 | 75.59 | 72.55
T-IT | 448.11 | 100.19 | 78.06 | 75.19
T-NL | 423.13 | 83.99 | 62.17 | 59.09

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 73 112

Applications for Machine Translation Language Models

LANGUAGE MODELSABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 74 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation

Orig. Lang | Perplexity | Improvement (%)
MIX | 105.91 | 19.73
O-EN | 131.94 | —
T-DE | 122.50 | 7.16
T-FR | 99.52 | 24.58
T-IT | 112.71 | 14.58
T-NL | 126.44 | 4.17

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 75 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named-entity abstraction

Orig. Lang | Perplexity | Improvement (%)
MIX | 93.88 | 18.51
O-EN | 115.20 | —
T-DE | 107.48 | 6.70
T-FR | 88.96 | 22.77
T-IT | 99.17 | 13.91
T-NL | 110.72 | 3.89

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 76 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction

Orig. Lang | Perplexity | Improvement (%)
MIX | 36.02 | 11.34
O-EN | 40.62 | —
T-DE | 38.67 | 4.81
T-FR | 34.75 | 14.46
T-IT | 36.85 | 9.30
T-NL | 39.44 | 2.91

ABSTRACTION RESULTS

Part-of-speech abstraction
Orig Lang   Perplexity   Improvement (%)
MIX           7.99        2.66
O-EN          8.20
T-DE          8.08        1.47
T-FR          7.89        3.77
T-IT          8.00        2.47
T-NL          8.11        1.11

MACHINE TRANSLATION RESULTS

DE to EN
LM      BLEU
MIX     21.43
O-EN    21.10
T-DE    21.90
T-FR    21.16
T-IT    21.29
T-NL    21.20

FR to EN
LM      BLEU
MIX     28.67
O-EN    27.98
T-DE    28.01
T-FR    29.14
T-IT    28.75
T-NL    28.11

IT to EN
LM      BLEU
MIX     25.41
O-EN    24.69
T-DE    24.62
T-FR    25.37
T-IT    25.96
T-NL    24.77

NL to EN
LM      BLEU
MIX     24.20
O-EN    23.40
T-DE    24.26
T-FR    23.56
T-IT    23.87
T-NL    24.52

DOES SIZE MATTER?

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O Even when it is something of valuable which has been pointed out by all the members of the European Council

T It is nevertheless something of a valuable which has been pointed out by all the members of the European Council

TRANSLATION EXAMPLES: VERBS

SOURCE Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE On ne dit rien non plus sur la responsabilité des fabricants, notamment en grande-bretagne, qui ont été les premiers responsables

O We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be adapted to the unique properties of translated text?

TRANSLATION MODELS: DIRECTION MATTERS

Task     S → T    T → S
FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S → T    T → S
250K            34.35    31.33
500K            35.21    32.38
750K            36.12    32.90
1M              35.73    33.07
1.25M           36.24    33.23
1.5M            36.43    33.73

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S → T    T → S
250K            27.74    26.58
500K            29.15    27.19
750K            29.43    27.63
1M              29.94    27.88
1.25M           30.63    27.84
1.5M            29.89    27.83

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase-table interpolation using perplexity minimization (Sennrich 2012)
A feature in the phrase table indicating the direction of translation
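The perplexity-minimization idea can be illustrated on a toy example: choose the interpolation weight between two phrase tables that minimizes perplexity on held-out phrase pairs. The tables and dev set below are invented for illustration; a real system would interpolate full Moses phrase tables:

```python
import math

# Two toy phrase tables (p(target | source)) and a held-out dev set of phrase
# pairs; all values here are invented.
table_st = {("maison", "house"): 0.7, ("maison", "home"): 0.3}  # trained on S→T data
table_ts = {("maison", "house"): 0.4, ("maison", "home"): 0.6}  # trained on T→S data
dev = [("maison", "house"), ("maison", "house"), ("maison", "home")]

def dev_perplexity(lam):
    """Perplexity on dev of the lam-weighted interpolation of the two tables."""
    log_sum = 0.0
    for pair in dev:
        p = lam * table_st.get(pair, 1e-9) + (1 - lam) * table_ts.get(pair, 1e-9)
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(dev))

# Grid-search the interpolation weight that minimizes dev perplexity.
best_lam = min((i / 100 for i in range(101)), key=dev_perplexity)
```

Since the dev set here favors "house" over "home" by 2:1, the search gives most of the weight to the first table.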

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34

The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 93 112

The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments
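A toy sketch of the unsupervised step, with average-linkage agglomerative clustering over invented feature vectors standing in for POS-trigram frequencies (the vectors and language labels are illustrative only):

```python
import math

# Invented feature vectors: Romance pair similar, Germanic pair similar.
vectors = {
    "T-FR": [0.30, 0.10, 0.60],
    "T-IT": [0.28, 0.12, 0.58],
    "T-DE": [0.10, 0.55, 0.20],
    "T-NL": [0.12, 0.50, 0.22],
}

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster(vecs):
    """Merge the two closest clusters until one remains; return merge order."""
    clusters = [({name}, [vec]) for name, vec in vecs.items()]
    merges = []
    while len(clusters) > 1:
        best_d, best_i, best_j = float("inf"), -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [dist(u, v) for u in clusters[i][1] for v in clusters[j][1]]
                d = sum(ds) / len(ds)  # average linkage
                if d < best_d:
                    best_d, best_i, best_j = d, i, j
        names = clusters[best_i][0] | clusters[best_j][0]
        members = clusters[best_i][1] + clusters[best_j][1]
        merges.append(names)
        clusters = [c for k, c in enumerate(clusters) if k not in (best_i, best_j)]
        clusters.append((names, members))
    return merges

merges = cluster(vectors)  # the Romance pair merges first, then the Germanic pair
```

The merge order is what a dendrogram visualizes; with real translationese features, it approximates the typological tree of the source languages.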

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist             DistW
Feature               EN T    FR T     EN T    FR T
POS-trigrams          1.02    1.16     1.07    1.34
positional tokens     1.10    1.15     1.30    1.37
function words        1.32    1.82     1.43    1.90
cohesive markers      1.95    1.95     2.15    2.03
feature combination   1.00    1.46     1.13    1.58
random tree           2.00             1.95

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated by the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)    F1
ARA    80    0    2    1    3    4    1    0    4    2    3     80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1     88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3     86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2     87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4     74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2     82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1     78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0     80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5     77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3     85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80     76.9   80.0   78.4

Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010, 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. URL http://www.aclweb.org/anthology/D11-1034.

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Identification of Translationese

METHODOLOGY

Corpus-based approach

Text classification with (supervised) machine-learning techniques

Feature design

Evaluation: ten-fold cross-validation

Unsupervised classification

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL (2012))

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al 2013)

IDENTIFYING TRANSLATIONESE: USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al (2010) Ilisei and Inkpen (2011) Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al (2015) Avner et al (2016)

EXPERIMENTAL SETUP

Corpus EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al 2009), using SVM with a default linear kernel
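A rough sketch of this pipeline: 2,000-token chunks, feature extraction, and ten-fold cross-validation. A nearest-centroid classifier over function-word frequencies stands in for Weka's linear-kernel SVM, and the two toy "corpora" are generated, not EUROPARL:

```python
import random
from collections import Counter

random.seed(0)
# Toy "chunks": O chunks over-use "indeed", T chunks over-use "moreover".
O_chunks = [["the", "of", "and"] * 20 + ["indeed"] * random.randint(0, 5)
            for _ in range(50)]
T_chunks = [["the", "of", "and"] * 20 + ["moreover"] * random.randint(3, 8)
            for _ in range(50)]
VOCAB = ["the", "of", "and", "indeed", "moreover"]

def features(chunk):
    counts = Counter(chunk)
    return [counts[w] / len(chunk) for w in VOCAB]  # normalized frequencies

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(x, c_o, c_t):
    d_o = sum((a - b) ** 2 for a, b in zip(x, c_o))
    d_t = sum((a - b) ** 2 for a, b in zip(x, c_t))
    return "O" if d_o <= d_t else "T"

data = ([(features(c), "O") for c in O_chunks] +
        [(features(c), "T") for c in T_chunks])
random.shuffle(data)

# Ten-fold cross-validation: each chunk is tested exactly once.
correct = 0
for fold in range(10):
    test = data[fold::10]
    train = [d for i, d in enumerate(data) if i % 10 != fold]
    c_o = centroid([x for x, y in train if y == "O"])
    c_t = centroid([x for x, y in train if y == "T"])
    correct += sum(predict(x, c_o, c_t) == y for x, y in test)
accuracy = correct / len(data)
```

With an SVM and the real feature sets, this is the procedure behind the accuracy figures reported in the following slides.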

Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 19 112

Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N)/(1 - V1/V)

MEAN WORD LENGTH In characters

SYLLABLE RATIO The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH In words

LEXICAL DENSITY The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK The average rank of words in a frequency-ordered list of the 6000 most frequent English words

MOST FREQUENT WORDS The frequencies of the N most frequent words, N = 5, 10, 50
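The three lexical-variety measures can be computed directly from a chunk. A sketch assuming whitespace tokenization (the sample chunk is invented; the third measure assumes the chunk contains at least one repeated type):

```python
import math

def ttr_measures(tokens):
    """The three TTR variants over one chunk."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    V = len(counts)                                 # number of types
    N = len(tokens)                                 # number of tokens
    V1 = sum(1 for c in counts.values() if c == 1)  # hapax legomena
    return (V / N,
            math.log(V) / math.log(N),
            100 * math.log(N) / (1 - V1 / V))

chunk = "the house is small and the house is big".split()
ttr1, ttr2, ttr3 = ttr_measures(chunk)
```

Lower variety (smaller ratios) is what the simplification hypothesis predicts for translated chunks.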

Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T O israelische Ministerpräsident Benjamin Netanjahu Merkel

EXPLICIT NAMING The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING The average length (in tokens) of proper nouns

COHESIVE MARKERS The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
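A sketch of three of these features over a toy POS-tagged chunk; the tag set, pronoun list, and marker list are illustrative subsets, not the actual resources used:

```python
COHESIVE = {"because", "but", "hence", "therefore"}  # illustrative subset
PRONOUNS = {"he", "she", "it", "they"}

def explicitation_features(tagged):
    """tagged: list of (word, POS) pairs for one chunk."""
    words = [w.lower() for w, _ in tagged]
    propers = [i for i, (_, pos) in enumerate(tagged) if pos == "NNP"]
    n_pronouns = sum(w in PRONOUNS for w in words)
    # explicit naming: personal pronouns per proper noun
    explicit_naming = n_pronouns / max(len(propers), 1)
    # single naming: proper nouns with no proper-noun neighbor
    single = sum(1 for i in propers
                 if i - 1 not in propers and i + 1 not in propers)
    cohesive = sum(w in COHESIVE for w in words) / len(words)
    return explicit_naming, single / len(words), cohesive

sent = [("He", "PRP"), ("met", "VBD"), ("Angela", "NNP"), ("Merkel", "NNP"),
        ("because", "IN"), ("she", "PRP"), ("asked", "VBD")]
naming, single_naming, markers = explicitation_features(sent)
```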

Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS Ratio of contracted forms to their counterpart full forms

AVERAGE PMI The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI The number of bigrams in the chunk whose PMI is above 0
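The two PMI features can be sketched as follows, estimating unigram and bigram probabilities from the chunk itself (a real implementation might estimate them from a larger reference corpus; here the averaging is over bigram types):

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI over bigram types, and how many bigrams have PMI > 0."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    pmis = []
    for (w1, w2), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        pmis.append(math.log2(p_xy / (p_x * p_y)))  # PMI of this bigram
    return sum(pmis) / len(pmis), sum(p > 0 for p in pmis)

avg_pmi, n_positive = pmi_features("to be or not to be".split())
```

High-PMI bigrams correspond to conventional collocations, which the normalization hypothesis predicts are over-used in translation.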

Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE doch is under-represented in translation to German

NEGATIVE One is is over-represented in translations from German to English (triggered by Man ist)

INTERFERENCE

POS n-GRAMS Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
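Two of these feature families, POS trigrams and positional token frequency, can be sketched as follows (tags and sentences are toy examples, and only three of the five sentence positions are shown):

```python
from collections import Counter

def pos_trigrams(tags):
    """Counts of consecutive POS-tag triples in one chunk."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def positional_tokens(sentences):
    """Which tokens occur in the first, second, and last sentence positions."""
    freq = Counter()
    for sent in sentences:
        if len(sent) >= 3:
            freq[("first", sent[0])] += 1
            freq[("second", sent[1])] += 1
            freq[("last", sent[-1])] += 1
    return freq

tri = pos_trigrams(["DT", "NN", "VBD", "DT", "NN"])
pos_freq = positional_tokens([["One", "is", "happy", "."],
                              ["One", "must", "try", "."]])
```

A sentence-initial "One is", as in the negative-interference example above, would surface here as a high count for ("first", "One").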

Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS Frequency of the occurrences of pronouns in the chunk

PUNCTUATION . - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams

Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100
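As an illustration of the supervised setup, here is a minimal sketch: a multinomial naive Bayes classifier over token unigrams, standing in for whatever learner was actually used (the slides report accuracies only, so the learner, the smoothing and the toy data below are all assumptions):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    # Multinomial naive Bayes with add-one smoothing over token unigrams.
    vocab = {w for d in docs for w in d.split()}
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    return vocab, counts, priors

def predict_nb(model, doc):
    vocab, counts, priors = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        lp = math.log(prior / n)
        for w in doc.split():
            if w in vocab:
                lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy chunks standing in for EUROPARL text, with hypothetical O/T labels.
model = train_nb(["we must hope", "let us hope", "i agree", "i nevertheless agree"],
                 ["T", "T", "O", "O"])
```

In the real experiments, chunks of parliamentary text take the place of these toy strings, and accuracy is estimated by cross-validation.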


RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies
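Two of the simplification measures can be sketched directly, under the assumption that the external reference frequency list is available as a word-to-rank mapping (the names and numbers below are illustrative):

```python
def type_token_ratio(tokens):
    # TTR: distinct word types divided by the number of tokens.
    return len(set(tokens)) / len(tokens)

def mean_word_rank(tokens, rank):
    # Average frequency rank of the chunk's tokens, where `rank` maps a word
    # to its position in a frequency list built from an external reference
    # corpus. Assumption: unknown words get the worst rank plus one.
    worst = max(rank.values()) + 1
    return sum(rank.get(t, worst) for t in tokens) / len(tokens)

rank = {"the": 1, "of": 2, "we": 3, "hope": 4}   # toy reference frequency list
toks = "we hope the negotiations succeed".split()
```

Lower mean word rank indicates heavier reliance on frequent vocabulary, one operationalization of simplification.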


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER ‘LANGUAGE’)


RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


ANALYSIS EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001):

English has far more highly associated bigrams than translations. O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended
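The threshold-PMI feature can be sketched as follows; the actual threshold and corpus statistics are not given in the deck, so the values below are toy numbers:

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    # Pointwise mutual information of adjacent word pairs:
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log2((c / n_bi) / ((uni[x] / n_uni) * (uni[y] / n_uni)))
        for (x, y), c in bi.items()
    }

def above_threshold(tokens, threshold):
    # The "threshold PMI" feature: how many bigrams are highly associated.
    return sum(1 for v in pmi_bigrams(tokens).values() if v > threshold)

toks = "stand trial or stand firm".split()
scores = pmi_bigrams(toks)
```

Counting bigrams above a fixed PMI threshold approximates how rich a chunk is in fixed expressions.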


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE A THRESHOLD, ACCORDING TO ‘LANGUAGE’)


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams yield an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny 2001)

This is in fact standardization rather than interference


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are ‘illegal’ in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark ‘.’ (in effect indicating sentence length) is a strong feature of O; ‘,’ is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  O/T   LL  Weight
,     37.83    49.79    0.76  T   T1
(      0.42     0.72    0.58  T   T2
’      1.94     2.53    0.77  T   T3
)      0.40     0.72    0.56  T   T4
—      0.30     0.30    1.00  —   —
[      0.01     0.02    0.45  T   —
]      0.01     0.02    0.44  T   —
”      0.33     0.22    1.46  O   O7
—      0.22     0.17    1.25  O   O6
.     38.20    34.60    1.10  O   O5
—      1.20     1.17    1.03  —   O4
—      0.84     0.83    1.01  —   O3
—      1.57     1.11    1.41  O   O2
-      2.68     2.25    1.19  O   O1

Unsupervised Classification

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2 × √(JSD(PX ‖ PCi))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label
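The labeling procedure above can be sketched as follows, using the smoothed unigram models and (here) the square root of JSD as the distance d; the marker sets and chunks are toy data:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.01):
    # Additively smoothed unigram model, as on the slide:
    # p(w) = (tf(w) + eps) / (N + eps * |V|)
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    # Jensen-Shannon divergence (base-2 logs) between two distributions
    # given as dicts over the same vocabulary.
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    # Assign "O"/"T" so that each cluster sits closer to its class LM;
    # d(X, C) here is sqrt(JSD), which is a metric.
    d = lambda a, b: math.sqrt(jsd(unigram_lm(a, vocab), unigram_lm(b, vocab)))
    if d(o_markers, c1) * d(t_markers, c2) < d(o_markers, c2) * d(t_markers, c1):
        return "O", "T"
    return "T", "O"
```

With marker lists extracted from held-out O and T texts, the same rule labels both clusters at once.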

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets


CLUSTERING CONSENSUS

method \ corpus                                                   EUR   HAN   LIT   TED
FW                                                                88.6  88.9  78.8  87.5
FW, char-trigrams, POS-trigrams                                   91.1  86.2  78.2  90.9
FW, POS-trigrams, contextual FW                                   95.8  89.8  72.3  86.3
FW, char-trigrams, POS-trigrams, contextual FW, cohesive markers  94.1  91.0  79.2  88.6


SENSITIVITY ANALYSIS


MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach
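The two-phase idea can be sketched with a minimal 2-means implementation on synthetic vectors, where one dimension carries a strong domain signal and another a weak O/T signal. Everything below is an illustrative assumption, not the actual feature space:

```python
def kmeans2(vecs, iters=25):
    # Minimal 2-means (Lloyd's algorithm); centroids start at the first
    # and last vectors, a simplification that keeps the sketch deterministic.
    cents = [list(vecs[0]), list(vecs[-1])]
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in vecs:
            d0 = sum((a - b) ** 2 for a, b in zip(v, cents[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(v, cents[1]))
            groups[0 if d0 <= d1 else 1].append(v)
        cents = [[sum(col) / len(g) for col in zip(*g)] if g else c
                 for g, c in zip(groups, cents)]
    return groups

# Dimension 0: strong domain signal; dimension 1: weak O/T signal.
data = ([[0.0, x] for x in (0.1, 0.2, 0.8, 0.9)] +    # domain A: two O, two T
        [[10.0, x] for x in (0.1, 0.2, 0.8, 0.9)])    # domain B: two O, two T

domain_a, domain_b = kmeans2(data)   # phase 1: clusters split by domain only
o_a, t_a = kmeans2(domain_a)         # phase 2: within one domain, O vs. T emerges
```

Clustering the mixed data directly recovers only the domain split; clustering again within each domain cluster lets the weaker translationese signal surface, which is the intuition behind the two-phase approach.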


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5

Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 64 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
“When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.’”

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el

The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al. 2002)


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … wi−1) )^{1/N}

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al. 2002)
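As a toy illustration of the perplexity computation, here is a smoothed unigram version; the smoothing scheme and data are assumptions, and the actual experiments used higher-order models:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, eps=0.1):
    # PP = product over the N test tokens of p(w)^(-1/N), computed in
    # log space for stability. Additive smoothing over the joint vocabulary
    # (an assumption; the slides do not specify the smoothing scheme).
    tf = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    denom = len(train_tokens) + eps * len(vocab)
    logp = sum(math.log2((tf[w] + eps) / denom) for w in test_tokens)
    return 2 ** (-logp / len(test_tokens))

ref = "we support free trade but we must learn".split()
pp_t = unigram_perplexity("we must support free trade".split(), ref)
pp_o = unigram_perplexity("entirely unrelated vocabulary here".split(), ref)
```

A model trained on text resembling the reference assigns it higher probability and therefore lower perplexity, which is exactly the comparison the following tables make between O- and T-based LMs.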


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50  93.00   69.36   66.47
O-EN        468.09  103.74  79.57   76.79
T-DE        443.14  88.48   64.99   62.07
T-FR        460.98  99.90   76.23   73.38
T-IT        465.89  102.31  78.50   75.67
T-NL        457.02  97.34   73.54   70.56


PERPLEXITY RESULTS

French-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05  99.04   75.60   72.68
O-EN        500.56  115.48  91.14   88.31
T-DE        486.78  108.50  84.39   81.41
T-FR        463.58  94.59   71.24   68.37
T-IT        476.05  102.69  79.23   76.36
T-NL        490.09  110.67  86.61   83.55


PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99  88.46   67.35   64.40
O-EN        415.47  99.92   79.27   76.34
T-DE        404.64  95.22   73.73   70.85
T-FR        395.99  89.44   68.38   65.54
T-IT        384.55  81.90   60.85   57.91
T-NL        411.58  98.78   76.98   73.94


PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89  90.73   69.05   66.08
O-EN        448.11  100.17  78.23   75.46
T-DE        437.68  93.67   71.54   68.57
T-FR        445.00  97.32   75.59   72.55
T-IT        448.11  100.19  78.06   75.19
T-NL        423.13  83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50      7.16
T-FR        99.52       24.58
T-IT        112.71      14.58
T-NL        126.44      4.17


ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         93.88       18.51
O-EN        115.20      —
T-DE        107.48      6.70
T-FR        88.96       22.77
T-IT        99.17       13.91
T-NL        110.72      3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67       4.81
T-FR        34.75       14.46
T-IT        36.85       9.30
T-NL        39.44       2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


MACHINE TRANSLATION RESULTS

DE to EN            FR to EN            IT to EN            NL to EN
LM     BLEU         LM     BLEU         LM     BLEU         LM     BLEU
MIX    21.43        MIX    28.67        MIX    25.41        MIX    24.20
O-EN   21.10        O-EN   27.98        O-EN   24.69        O-EN   23.40
T-DE   21.90        T-DE   28.01        T-DE   24.62        T-DE   24.26
T-FR   21.16        T-FR   29.14        T-FR   25.37        T-FR   23.56
T-IT   21.29        T-IT   28.75        T-IT   25.96        T-IT   23.87
T-NL   21.20        T-NL   28.11        T-NL   24.77        T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term.

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term.


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation
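The simplest interpolation variant can be sketched as a linear combination of the translation probabilities of two phrase tables. In the cited work the weight is tuned by perplexity minimization (Sennrich 2012); here the weight and the entries are toy values:

```python
def interpolate_phrase_tables(pt_direct, pt_reverse, lam):
    # Linearly interpolate translation probabilities from a phrase table
    # trained on S→T-translated data and one trained on T→S data.
    # Phrases are (source, target) pairs; missing entries count as 0.
    phrases = set(pt_direct) | set(pt_reverse)
    return {ph: lam * pt_direct.get(ph, 0.0) + (1 - lam) * pt_reverse.get(ph, 0.0)
            for ph in phrases}

# Hypothetical phrase-table fragments (probabilities, not real estimates).
pt_st = {("chat", "cat"): 0.9, ("chien", "dog"): 0.8}
pt_ts = {("chat", "cat"): 0.7, ("chat", "chat"): 0.1}
merged = interpolate_phrase_tables(pt_st, pt_ts, 0.6)
```

A real system would interpolate all four standard phrase-table scores and renormalize; this sketch only shows the combination step.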


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S → T     35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S → T     29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34

The Power of Interference

THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

     IT  FR  ES  DE  FI
IT   99  93  91  75  63
FR   90  98  90  83  70
ES   85  90  98  82  69
DE   73  74  74  98  70
FI   58  70  64  81  99
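Cross-classification can be sketched with a crude nearest-profile classifier over POS trigrams: profiles are learned from language L1's O and T chunks and then applied to chunks translated from L2. The tags below are toy data, purely illustrative:

```python
from collections import Counter

def trigram_profile(tag_seqs):
    # Normalized POS-trigram distribution over a set of chunks.
    prof = Counter()
    for tags in tag_seqs:
        prof.update(zip(tags, tags[1:], tags[2:]))
    total = sum(prof.values())
    return {k: v / total for k, v in prof.items()}

def classify(chunk_tags, o_profile, t_profile):
    # Assign the label whose profile overlaps most with the chunk's trigrams
    # (a simple stand-in for the trained classifiers reported in the deck).
    p = trigram_profile([chunk_tags])
    sim = lambda q: sum(p[k] * q.get(k, 0.0) for k in p)
    return "O" if sim(o_profile) > sim(t_profile) else "T"

# Profiles learned from L1 data (toy); later applied to L2 chunks.
o_prof = trigram_profile([["PRP", "VBP", "RB", "JJ"]])
t_prof = trigram_profile([["MD", "VB", "VBN", "IN"]])
```

The degree to which such L1-trained profiles transfer to L2 data is exactly what the accuracy matrix above measures.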


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations into English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments
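Hierarchical clustering of 'translationese dialects' can be sketched with a minimal single-linkage agglomerative procedure over pairwise distances between translations; the language codes and distance values below are assumed toy numbers, not the deck's measurements:

```python
def agglomerate(items, dist):
    # Single-linkage agglomerative clustering; returns a nested-tuple tree.
    # `dist` maps frozenset({a, b}) to a distance between leaf items.
    clusters = {i: frozenset([i]) for i in items}  # tree node -> its leaf set
    def linkage(a, b):
        return min(dist[frozenset([x, y])]
                   for x in clusters[a] for y in clusters[b])
    while len(clusters) > 1:
        nodes = list(clusters)
        a, b = min(((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
                   key=lambda p: linkage(*p))
        clusters[(a, b)] = clusters.pop(a) | clusters.pop(b)
    return next(iter(clusters))

# Toy pairwise distances between translations from four source languages.
d = {frozenset(p): v for p, v in [(("ES", "IT"), 1.0), (("ES", "DE"), 4.0),
                                  (("IT", "DE"), 4.5), (("ES", "FI"), 8.0),
                                  (("IT", "FI"), 8.5), (("DE", "FI"), 6.0)]}
tree = agglomerate(["ES", "IT", "DE", "FI"], d)
```

Comparing such a tree against a gold language typology is what the evaluation tables below quantify.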


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (“GOLD” TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                     Dist           Distw
Feature              EN T   FR T   EN T   FR T
POS-trigrams         1.02   1.16   1.07   1.34
positional tokens    1.10   1.15   1.30   1.37
function words       1.32   1.82   1.43   1.90
cohesive markers     1.95   1.95   2.15   2.03
feature combination  1.00   1.46   1.13   1.58
random tree          2.00          1.95


NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4

Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation “universals” should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the ‘fingerprints’ of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models trained on texts translated in the same direction as that of the SMT task are better than ones trained on texts translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamaki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, August 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

EXAMPLE (FROM HE ET AL. (2012))


Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
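The pipeline above can be sketched end-to-end in a few lines. This is a minimal stand-in, not the talk's actual setup: the slides use Weka with a linear-kernel SVM, whereas here a simple perceptron plays the generic linear learner, and the function-word list and toy O/T chunks are invented for illustration.

```python
# Minimal sketch of the supervised pipeline: chunks are mapped to normalized
# function-word frequency vectors and a linear classifier is trained on
# labeled instances. FUNCTION_WORDS and the training chunks are toy examples.
FUNCTION_WORDS = ["the", "of", "in", "moreover", "therefore", "but"]

def features(chunk):
    """Normalized frequency of each function word in the chunk."""
    tokens = chunk.lower().split()
    n = len(tokens) or 1
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

def train_perceptron(data, epochs=20):
    """data: list of (vector, label) pairs, label in {-1, +1}."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # mistake-driven update
                b += y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy training chunks: +1 = translated (T), -1 = original (O).
train = [
    (features("moreover the committee must therefore act"), 1),
    (features("moreover we therefore agree in the house"), 1),
    (features("but we saw the result of the vote"), -1),
    (features("but nobody in the room said a word"), -1),
]
model = train_perceptron(train)
```

On this toy data the learned weights put positive mass on cohesive markers like moreover and therefore (T indicators) and negative mass on but (an O indicator), mirroring the feature-weight inspection described above.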


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al. 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al. 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al. 2009), using SVM with a default linear kernel
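The chunking step described above can be sketched as follows. Only the roughly-2,000-token target and the sentence-boundary constraint come from the slides; the greedy grouping is an assumed implementation detail.

```python
# Sketch of the partitioning step: sentences (token lists) are grouped
# greedily into chunks of roughly `chunk_size` tokens, with every chunk
# ending on a sentence boundary.
def chunk_corpus(sentences, chunk_size=2000):
    chunks, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= chunk_size:  # close the chunk at this sentence boundary
            chunks.append(current)
            current = []
    if current:  # trailing, smaller-than-target chunk
        chunks.append(current)
    return chunks
```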


Translation Studies Hypotheses


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS


FEATURES SHOULD

1. Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2. Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3. Be easy to interpret, yielding insights regarding the differences between original and translated texts


SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types, N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N;  log(V)/log(N);  100 · log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
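The three lexical-variety measures can be computed directly from a chunk's tokens; a sketch, in which returning infinity for the degenerate case V1 = V (where the third measure's denominator vanishes) is an assumption:

```python
import math
from collections import Counter

# The three TTR measures listed above: V = number of types, N = number of
# tokens, V1 = number of types occurring exactly once in the chunk.
def ttr_measures(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    ttr1 = v / n                              # V/N
    ttr2 = math.log(v) / math.log(n)          # log(V)/log(N)
    ttr3 = (100 * math.log(n) / (1 - v1 / v)  # 100*log(N)/(1 - V1/V)
            if v1 < v else float("inf"))
    return ttr1, ttr2, ttr3
```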


EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu ('the Israeli Prime Minister Benjamin Netanyahu'); O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ('Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?')

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ('I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?'; the repeated kaltes Herz is rendered by two different expressions)

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI: Number of bigrams in the chunk whose PMI is above 0
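The two PMI features can be sketched as follows. Estimating the unigram and bigram probabilities from the chunk itself (maximum likelihood), rather than from a larger reference corpus, is a simplifying assumption.

```python
import math
from collections import Counter

# Average PMI (Church and Hanks 1990) over all bigrams in a chunk, and the
# number of bigrams whose PMI is above 0 (the "threshold PMI" feature).
def pmi_features(tokens):
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = n - 1  # number of bigram occurrences
    pmis = []
    for (w1, w2), count in bigrams.items():
        p_xy = count / total
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        pmis.append(math.log2(p_xy / (p_x * p_y)))
    avg_pmi = sum(pmis) / len(pmis)
    threshold_pmi = sum(1 for v in pmis if v > 0)
    return avg_pmi, threshold_pmi
```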


INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by German Man ist)


POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
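Two of these feature families can be sketched compactly; the (position, token) feature naming is an illustrative assumption, not the scheme used in the experiments.

```python
from collections import Counter

# Character trigram counts (a shallow proxy for morphology and function
# words) and positional token frequency (which tokens occupy the five
# sentence positions listed above).
def char_trigrams(text):
    """Count all overlapping character trigrams in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def positional_tokens(sentences):
    """sentences: list of token lists. Count tokens appearing in the first,
    second, antepenultimate, penultimate, and last sentence positions."""
    positions = {"first": 0, "second": 1, "antepenultimate": -3,
                 "penultimate": -2, "last": -1}
    feats = Counter()
    for sent in sentences:
        if len(sent) < 5:  # need five distinct positions
            continue
        for name, idx in positions.items():
            feats[(name, sent[idx])] += 1
    return feats
```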


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


Supervised Classification


RESULTS: SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


RESULTS: SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has a lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


RESULTS: EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


RESULTS: NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS: NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny 2001)

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


RESULTS: INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001)

This is, in fact, standardization rather than interference


Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


RESULTS: MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

          Frequency
Mark      O      T   Ratio  LL  Weight
,     37.83  49.79    0.76   T      T1
(      0.42   0.72    0.58   T      T2
'      1.94   2.53    0.77   T      T3
)      0.40   0.72    0.56   T      T4
—      0.30   0.30    1.00   —       —
[      0.01   0.02    0.45   T       —
]      0.01   0.02    0.44   T       —
"      0.33   0.22    1.46   O      O7
—      0.22   0.17    1.25   O      O6
.     38.20  34.60    1.10   O      O5
—      1.20   1.17    1.03   —      O4
—      0.84   0.83    1.01   —      O3
—      1.57   1.11    1.41   O      O2
-      2.68   2.25    1.19   O      O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2^√(JSD(PX || PCi))


The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O" if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1); "T" otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
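The labeling procedure above can be sketched as follows; the marker lists, the smoothing constant, and the toy cluster contents are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of the cluster-labeling step: smoothed unigram language models are
# built over the O-marker and T-marker word lists, clusters are compared to
# them via the Jensen-Shannon divergence (Lin 1991), and the cross-product
# rule assigns the labels.
def unigram_lm(tokens, vocab, eps=0.001):
    """p(w) = (tf(w) + eps) / (len(tokens) + eps * |V|) for every w in V."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence of two distributions on the same vocabulary."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1_tokens, c2_tokens):
    """Decide which cluster is O and which is T via the cross-product rule."""
    vocab = set(o_markers) | set(t_markers) | set(c1_tokens) | set(c2_tokens)
    P = {name: unigram_lm(toks, vocab)
         for name, toks in [("O", o_markers), ("T", t_markers),
                            ("C1", c1_tokens), ("C2", c2_tokens)]}
    d = lambda x, c: 2 ** math.sqrt(jsd(P[x], P[c]))  # distance D_JS
    if d("O", "C1") * d("T", "C2") < d("O", "C2") * d("T", "C1"):
        return {"C1": "O", "C2": "T"}
    return {"C1": "T", "C2": "O"}
```

Because the comparison is a product of monotone transforms of the divergence, the decision depends only on which labeling makes each cluster closer to its class.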


CLUSTERING CONSENSUS AMONG FEATURE SETS

method \ corpus                                                       EUR   HAN   LIT   TED
FW                                                                    88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                     91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                     95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers  94.1  91.0  79.2  88.6


SENSITIVITY ANALYSIS


copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 58 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


method \ corpora                 EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain       93.7      99.5      99.8          92.2
XMeans: # of clusters generated  2         2         3             3
XMeans: accuracy by domain       93.6      99.5      –             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach
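One plausible reading of the two approaches, sketched with a minimal k-means on synthetic vectors (the deterministic farthest-point initialization and the toy data are assumptions, not the actual translationese features): the flat approach clusters all chunks at once, while the two-phase approach first separates domains and then splits each domain cluster into two.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return assign

def two_phase(X, n_domains):
    """Phase 1: cluster chunks by domain; phase 2: split each domain into two."""
    domains = kmeans(X, n_domains)
    labels = np.empty(len(X), dtype=int)
    for j in range(n_domains):
        idx = np.flatnonzero(domains == j)
        labels[idx] = kmeans(X[idx], 2)
    return domains, labels
```

On data where the domain signal dwarfs the O/T signal (as on the slides), the flat approach with two clusters recovers the domains, not translationese; the two-phase variant gets a chance to find O/T within each domain.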


method \ corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation



APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

— Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

    Ê = argmax_E P(E | F)
      = argmax_E [P(F | E) × P(E)] / P(F)
      = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

    Ê = argmax_{E ∈ English}  P(F | E)          ×  P(E)
                              [translation model]  [language model]


A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)
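The noisy-channel objective Ê = argmax P(F | E) × P(E) can be illustrated over a toy candidate set; all probabilities below are made-up numbers, not a real translation or language model:

```python
# Toy translation model P(F|E) and language model P(E) (hypothetical values).
tm = {("maison", "house"): 0.8, ("maison", "home"): 0.6}
lm = {"house": 0.3, "home": 0.1}

def decode(f, candidates):
    """argmax_E P(F|E) * P(E) -- the noisy-channel objective."""
    return max(candidates, key=lambda e: tm[(f, e)] * lm[e])

best = decode("maison", ["house", "home"])  # 0.8*0.3 = 0.24 beats 0.6*0.1 = 0.06
```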


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / PLM(wi | w1 … wi−1) )^(1/N)
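Computed in log space for numerical stability, the formula reduces to the exponential of the average negative log-probability; a minimal sketch, taking the per-token probabilities some LM assigns to a reference text as input:

```python
import math

def perplexity(probs):
    """PP = (prod 1/p_i)^(1/N), computed in log space for stability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Per-token probabilities an LM might assign to a 4-token reference text:
pp = perplexity([0.1, 0.2, 0.05, 0.1])  # (1/1e-4)^(1/4) = 10.0
```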

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)
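For reference, the core of BLEU is modified n-gram precision combined with a brevity penalty; the sketch below is a sentence-level, unsmoothed illustration, not the corpus-level metric of Papineni et al.:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU: modified n-gram precision + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```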


PERPLEXITY RESULTS

German-to-English translations
Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56


French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55


Italian to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94


Dutch to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No Punctuation
Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       –
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17


Named Entity Abstraction
Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       –
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89


Noun Abstraction
Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        –
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91


Part-of-speech Abstraction
Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         –
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11


MACHINE TRANSLATION RESULTS

BLEU scores by language model, per translation task:

LM      DE→EN   FR→EN   IT→EN   NL→EN
MIX     21.43   28.67   25.41   24.20
O-EN    21.10   27.98   24.69   23.40
T-DE    21.90   28.01   24.62   24.26
T-FR    21.16   29.14   25.37   23.56
T-IT    21.29   28.75   25.96   23.87
T-NL    21.20   28.11   24.77   24.52


DOES SIZE MATTER?



TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of the translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
- Union: simple concatenation of corpora
- Two phrase tables: train a phrase table for each subset and pass both to MOSES
- Phrase-table interpolation using perplexity minimization (Sennrich 2012)
- Add a feature in the phrase table indicating the direction of translation
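The core of the perplexity-minimization technique named above can be illustrated in isolation: choose the interpolation weight between two probability tables that minimizes perplexity on a held-out sample. The two tables and the dev data below are toy numbers, not actual phrase tables:

```python
import math

def interp_perplexity(lam, p1, p2, dev):
    """Perplexity of the interpolated model lam*p1 + (1-lam)*p2 on dev items."""
    logsum = sum(math.log(lam * p1[x] + (1 - lam) * p2[x]) for x in dev)
    return math.exp(-logsum / len(dev))

def tune_lambda(p1, p2, dev, steps=101):
    """Grid-search the interpolation weight that minimizes dev perplexity."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: interp_perplexity(lam, p1, p2, dev))
```

With a dev sample dominated by events that p1 models well, the tuned weight shifts toward p1; in the adaptation setting, this is how the S→T subset's statistics come to dominate the interpolated model.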


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference



THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1 TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                          Dist             Dist_w
Feature               EN T    FR T     EN T    FR T
POS-trigrams          1.02    1.16     1.07    1.34
positional tokens     1.10    1.15     1.30    1.37
function words        1.32    1.82     1.43    1.90
cohesive markers      1.95    1.95     2.15    2.03
feature combination   1.00    1.46     1.13    1.58
random tree               2.00             1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated by the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)   F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4
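The P/R/F1 columns follow from the confusion matrix in the usual way (rows = true class, columns = predicted class); a small sketch:

```python
import numpy as np

def prf(conf):
    """Per-class precision, recall, and F1 from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    tp = np.diag(conf).astype(float)
    precision = tp / conf.sum(axis=0)   # correct / all predicted as this class
    recall = tp / conf.sum(axis=1)      # correct / all truly this class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```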


Conclusion



CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best-performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

- Translation
- Non-native speakers
- Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Identification of Translationese

TEXT CLASSIFICATION WITH MACHINE LEARNING

Supervised machine learning: a training corpus lists instances of both classes

Each instance in the two classes is represented by a set of numeric features that are extracted from the instances

A generic machine-learning algorithm is trained to distinguish between feature vectors representative of one class and those representative of the other

The trained classifier is tested on an 'unseen' text: the test set

Classifiers assign weights to the features
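A minimal linear classifier makes the weight-assignment idea concrete; this perceptron is a stand-in for the SVM actually used on the slides, and the two-dimensional "feature vectors" are invented:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Learn a weight vector w and bias b separating two classes (y in {-1, +1})."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified -> move boundary
                w += yi * xi
                b += yi
    return w, b

# Toy feature vectors (e.g., normalized counts of two function words): O vs. T
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
pred = np.sign(X @ w + b)
```

The learned weights play the role described above: a large positive weight marks a feature indicative of one class, a large negative weight the other.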


TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al 2013)


Identification of Translationese

IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al (2015) Avner et al (2016)


Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al 2009), using SVM with a default linear kernel
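The chunking step can be sketched as follows (a possible implementation: the sentence lists and the 2,000-token target follow the slide, everything else is an assumption):

```python
# Greedily accumulate sentences until a chunk reaches ~2000 tokens,
# always closing the chunk at a sentence boundary.
def chunk_sentences(sentences, target=2000):
    """sentences: list of token lists; returns chunks of ~target tokens."""
    chunks, current, size = [], [], 0
    for sent in sentences:
        current.append(sent)
        size += len(sent)
        if size >= target:           # close the chunk at a sentence boundary
            chunks.append([tok for s in current for tok in s])
            current, size = [], 0
    if current:                      # trailing, shorter-than-target chunk
        chunks.append([tok for s in current for tok in s])
    return chunks
```

Chunks therefore end slightly above the target size, never in the middle of a sentence.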


Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text as simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three type-token ratio (TTR) measures, where V is the number of types and N the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N        log(V) / log(N)        100 × log(N) / (1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences delimited by consonants or spaces in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
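The three lexical-variety measures can be computed per chunk as follows (a sketch; tokenization is naive and the chunk is assumed to be a list of tokens):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR measures defined above, computed for one chunk."""
    counts = Counter(tokens)
    N = len(tokens)                                  # number of tokens
    V = len(counts)                                  # number of types
    V1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
    ttr1 = V / N
    ttr2 = math.log(V) / math.log(N)
    # Note: the third measure is undefined when every type is a hapax (V1 == V)
    ttr3 = 100 * math.log(N) / (1 - V1 / V)
    return ttr1, ttr2, ttr3
```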


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

O: Merkel
T: israelische Ministerpräsident Benjamin Netanjahu

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

(The German original repeats "a cold heart"; the French translation varies: "an unfeeling heart" vs. "a heart of ice")

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI: Number of bigrams in the chunk whose PMI is above 0
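The two PMI features can be sketched as follows, with PMI(a, b) = log2(p(a, b) / (p(a) p(b))) estimated from chunk-internal counts (a simplification; the experiments may have estimated probabilities from a larger corpus):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI of every bigram in a chunk, from chunk-internal counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)        # unigram sample size
    B = len(tokens) - 1    # bigram sample size
    pmi = {}
    for (a, b), c in bigrams.items():
        p_ab = c / B
        p_a, p_b = unigrams[a] / N, unigrams[b] / N
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

def avg_pmi(tokens):
    pmi = bigram_pmi(tokens)
    return sum(pmi.values()) / len(pmi)

def threshold_pmi(tokens, threshold=0.0):
    return sum(1 for v in bigram_pmi(tokens).values() if v > threshold)
```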


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets 〈w1, w2, w3〉 where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
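Two of these extractors can be sketched as follows (the tags and sentences are toy inputs; a real setup would run a POS tagger first):

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Counts of n-grams over a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def positional_tokens(sentences):
    """Counts of tokens in the first, second, antepenultimate,
    penultimate, and last sentence positions."""
    positions = {"first": 0, "second": 1, "antepenult": -3,
                 "penult": -2, "last": -1}
    counts = Counter()
    for sent in sentences:
        if len(sent) < 5:  # too short to have five distinct positions
            continue
        for name, idx in positions.items():
            counts[(name, sent[idx])] += 1
    return counts
```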


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ' ' " "

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny 2001)

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                    Accuracy (%)
Interference  POS unigrams               90
              POS bigrams                97
              POS trigrams               98
              Character unigrams         85
              Character bigrams          98
              Character trigrams         100
              Prefixes and suffixes      80
              Contextual function words  100
              Positional token frequency 97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words

For example, typical trigrams in O are -ion and all, whereas -ble and the are typical of T

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001)

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                    Accuracy (%)
Interference  POS bigrams                96
              POS trigrams               96
              Character bigrams          95
              Character trigrams         96
              Positional token frequency 93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                             Accuracy (%)
Miscellaneous  Function words                      96
               Punctuation (1)                     81
               Punctuation (2)                     85
               Punctuation (3)                     80
               Pronouns                            77
               Ratio of passive forms to all verbs 65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (in effect indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq (O)  Freq (T)  Ratio  LL  Weight
,     37.83     49.79     0.76   T   T1
(     0.42      0.72      0.58   T   T2
'     1.94      2.53      0.77   T   T3
)     0.40      0.72      0.56   T   T4
      0.30      0.30      1.00   —   —
[     0.01      0.02      0.45   T   —
]     0.01      0.02      0.44   T   —
"     0.33      0.22      1.46   O   O7
      0.22      0.17      1.25   O   O6
.     38.20     34.60     1.10   O   O5
      1.20      1.17      1.03   —   O4
      0.84      0.83      1.01   —   O3
      1.57      1.11      1.41   O   O2
-     2.68      2.25      1.19   O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modalities (written vs. spoken), registers, genres, dates, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between 2001 and 2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning 2001-2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           –     60.8  56.2  96.3
HAN           59.7  –     58.7  98.1
LIT           64.3  61.5  –     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     –     –     63.8  94.0
EUR + LIT     –     64.1  –     92.9
HAN + LIT     59.8  –     –     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2 √( JSD(PX ‖ PCi) )


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
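The labeling step can be sketched as follows. The marker lists below are toy stand-ins for the actual Om/Tm lists, and the constant factor 2 in the slide's distance is dropped since it does not affect the comparison:

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, eps=1e-3):
    """Additively smoothed unigram distribution over vocab."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence of two distributions over the same vocab."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1, c2):
    """Assign 'O'/'T' labels to two clusters by proximity to the marker models."""
    vocab = set(o_markers) | set(t_markers) | set(c1) | set(c2)
    P = {name: unigram_model(toks, vocab)
         for name, toks in [("O", o_markers), ("T", t_markers),
                            ("C1", c1), ("C2", c2)]}
    d = lambda x, c: math.sqrt(jsd(P[x], P[c]))  # JS distance
    if d("O", "C1") * d("T", "C2") < d("O", "C2") * d("T", "C1"):
        return {"C1": "O", "C2": "T"}
    return {"C1": "T", "C2": "O"}
```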


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                          EUR   HAN   LIT   TED
FW                                       88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams        91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW        95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
   + contextual FW + cohesive markers    94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     –            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el

The best translation:

Ê = arg max_E P(E | F)
  = arg max_E P(F | E) × P(E) / P(F)
  = arg max_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = arg max_{E ∈ English}  P(F | E)  ×  P(E)
                           (translation model)  (language model)


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation BLEU scores (Papineni et al 2002)
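A toy instance of the noisy-channel decision rule: pick the candidate E maximizing P(F | E) × P(E). All probabilities here are invented for illustration; real systems score millions of hypotheses inside a decoder.

```python
def best_translation(candidates, tm, lm):
    """candidates: English strings; tm[e] stands for P(F|e), lm[e] for P(e)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])

# Invented numbers: the translation model alone cannot decide the word
# order, but the language model can.
tm = {"the house": 0.6, "house the": 0.6, "a house": 0.2}      # P(F | E)
lm = {"the house": 0.05, "house the": 0.001, "a house": 0.04}  # P(E)
```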


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)
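The perplexity formula above can be sketched for a smoothed unigram LM (a simplification; the experiments used n-gram models over Europarl):

```python
import math
from collections import Counter

def unigram_lm(train_tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over vocab."""
    counts = Counter(train_tokens)
    denom = len(train_tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

def perplexity(lm, test_tokens):
    """PP = (prod 1/P(w_i))^(1/N), computed in log space for stability."""
    N = len(test_tokens)
    log_sum = sum(-math.log(lm[w]) for w in test_tokens)
    return math.exp(log_sum / N)
```

Lower perplexity means the model fits the reference text better, which is exactly how the LM comparisons below are read.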


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50  93.00   69.36   66.47
O-EN        468.09  103.74  79.57   76.79
T-DE        443.14  88.48   64.99   62.07
T-FR        460.98  99.90   76.23   73.38
T-IT        465.89  102.31  78.50   75.67
T-NL        457.02  97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05  99.04   75.60   72.68
O-EN        500.56  115.48  91.14   88.31
T-DE        486.78  108.50  84.39   81.41
T-FR        463.58  94.59   71.24   68.37
T-IT        476.05  102.69  79.23   76.36
T-NL        490.09  110.67  86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99  88.46   67.35   64.40
O-EN        415.47  99.92   79.27   76.34
T-DE        404.64  95.22   73.73   70.85
T-FR        395.99  89.44   68.38   65.54
T-IT        384.55  81.90   60.85   57.91
T-NL        411.58  98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89  90.73   69.05   66.08
O-EN        448.11  100.17  78.23   75.46
T-DE        437.68  93.67   71.54   68.57
T-FR        445.00  97.32   75.59   72.55
T-IT        448.11  100.19  78.06   75.19
T-NL        423.13  83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      –
T-DE        122.50      7.16
T-FR        99.52       24.58
T-IT        112.71      14.58
T-NL        126.44      4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         93.88       18.51
O-EN        115.20      –
T-DE        107.48      6.70
T-FR        88.96       22.77
T-IT        99.17       13.91
T-NL        110.72      3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       –
T-DE        38.67       4.81
T-FR        34.75       14.46
T-IT        36.85       9.30
T-NL        39.44       2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        –
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

LM     DE→EN  FR→EN  IT→EN  NL→EN
MIX    21.43  28.67  25.41  24.20
O-EN   21.10  27.98  24.69  23.40
T-DE   21.90  28.01  24.62  24.26
T-FR   21.16  29.14  25.37  23.56
T-IT   21.29  28.75  25.96  23.87
T-NL   21.20  28.11  24.77  24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?



Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be built that is adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2; ACCURACY IN %)

     IT   FR   ES   DE   FI
IT   99   93   91   75   63
FR   90   98   90   83   70
ES   85   90   98   82   69
DE   73   74   74   98   70
FI   58   70   64   81   99
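The cross-classification protocol can be sketched with a deliberately trivial nearest-centroid classifier; all vectors below are invented two-dimensional stand-ins for the real feature vectors, not actual corpus features:

```python
# Sketch of cross-classification: fit on O vs. T-from-L1, test on T-from-L2.

def centroid(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def classify(x, c_orig, c_trans):
    """Label a chunk by its nearer centroid (squared Euclidean distance)."""
    d = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return "T" if d(c_trans) < d(c_orig) else "O"

# Train: originals vs. translations from L1 (toy values).
o_train = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15]]
t_l1 = [[0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
c_o, c_t = centroid(o_train), centroid(t_l1)

# Test: translations from a different source language, L2; interference
# from a related language keeps them on the T side of the boundary.
t_l2 = [[0.7, 0.8], [0.75, 0.9]]
predictions = [classify(x, c_o, c_t) for x in t_l2]
```

The actual experiments use an SVM over rich feature vectors; the point here is only the train-on-L1, test-on-L2 evaluation scheme.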


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compared to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
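The clustering step can be sketched as average-linkage agglomerative clustering over per-language feature vectors; the three-dimensional "POS-trigram profiles" for four hypothetical source languages below are invented, and real experiments use much richer vectors:

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def linkage(ci, cj, vectors):
    """Average linkage: mean pairwise distance between two clusters."""
    ds = [dist(vectors[a], vectors[b]) for a in ci for b in cj]
    return sum(ds) / len(ds)

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = {i: (i,) for i in range(len(vectors))}
    merges, nxt = [], len(vectors)
    while len(clusters) > 1:
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], vectors))
        clusters[nxt] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j))
        nxt += 1
    return merges

names = ["FR", "IT", "DE", "NL"]      # hypothetical source languages
profiles = [[0.30, 0.10, 0.05],       # FR
            [0.29, 0.11, 0.05],       # IT: close to FR
            [0.10, 0.30, 0.20],       # DE
            [0.11, 0.28, 0.22]]       # NL: close to DE
merges = agglomerate(profiles)
```

With these toy profiles the Romance pair merges first, then the Germanic pair, mirroring how interference makes translations from related source languages cluster together.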


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                      Dist           Dist_w
Feature               EN T    FR T   EN T    FR T
POS trigrams          1.02    1.16   1.07    1.34
positional tokens     1.10    1.15   1.30    1.37
function words        1.32    1.82   1.43    1.90
cohesive markers      1.95    1.95   2.15    2.03
feature combination   1.00    1.46   1.13    1.58

random tree: 2.00 (Dist), 1.95 (Dist_w)


NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al. 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)   F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0  80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0  84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0  83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0  90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0  75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0  84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0  82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0  80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0  77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0  78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0  78.4


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: The CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Identification of Translationese

TEXT CLASSIFICATION (AUTHORSHIP ATTRIBUTION): APPLICATIONS

Determine the gender/age of the author (Koppel et al. 2003)

Tell Shakespeare from Marlowe (Juola 2006)

Identify suicide letters

Spot plagiarism

Filter out spam

Identify the native language of non-native authors (Tetreault et al. 2013)


IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al. (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al. 2009), using SVM with a default linear kernel
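The chunking step above can be sketched as follows; `chunk` is a hypothetical helper, not the actual experimental pipeline, and the "sentences" are toy data:

```python
# Partition a tokenized, sentence-split text into chunks of at least
# `size` tokens, always closing a chunk at a sentence boundary.

def chunk(sentences, size=2000):
    chunks, current, count = [], [], 0
    for sent in sentences:
        current.append(sent)
        count += len(sent)
        if count >= size:        # close the chunk only after a full sentence
            chunks.append(current)
            current, count = [], 0
    if current:                   # keep the trailing partial chunk
        chunks.append(current)
    return chunks

# Ten toy "sentences" of 7 tokens each, chunked at roughly 20 tokens.
sentences = [["tok"] * 7 for _ in range(10)]
chunks = chunk(sentences, size=20)
```

Each resulting chunk would then be mapped to a feature vector and passed to the classifier.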


Translation Studies Hypotheses


HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences delimited by consonants or spaces in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
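The three TTR measures can be computed per chunk as in this sketch (toy chunk, not EUROPARL data):

```python
import math

def ttr_measures(tokens):
    """Return (V/N, log V / log N, 100 * log N / (1 - V1/V)) for one chunk."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)  # hapax legomena
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))  # undefined if every type is a hapax

toks = "the cat sat on the mat with the hat".split()
vn, log_ttr, third = ttr_measures(toks)
```

Here V = 7, N = 9, and V1 = 6, so V/N ≈ 0.78; the third measure blows up when almost all types occur once, which is why it is usually computed over reasonably long chunks.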


EXPLICITATION

EXAMPLE (EXPLICITATION)

T: der israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
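Two of these features can be sketched over a POS-tagged chunk; the marker list is abbreviated, the tagged chunk is toy data, and the tags follow the Penn Treebank convention:

```python
MARKERS = {"because", "but", "hence", "therefore", "moreover"}

def explicit_naming(tagged):
    """Ratio of personal pronouns (PRP) to proper nouns (NNP/NNPS)."""
    pronouns = sum(1 for _, tag in tagged if tag == "PRP")
    propers = sum(1 for _, tag in tagged if tag in ("NNP", "NNPS"))
    return pronouns / propers if propers else 0.0

def cohesive_marker_freq(tagged):
    """Share of tokens that are cohesive markers."""
    return sum(1 for word, _ in tagged if word.lower() in MARKERS) / len(tagged)

chunk = [("Moreover", "RB"), ("he", "PRP"), ("met", "VBD"),
         ("Merkel", "NNP"), ("because", "IN"), ("she", "PRP"),
         ("asked", "VBD")]
naming = explicit_naming(chunk)        # 2 pronouns / 1 proper noun
markers = cohesive_marker_freq(chunk)  # 2 markers / 7 tokens
```

In the experiments, one such value per marker (rather than a single aggregate) is typically used as a feature.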


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
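The two PMI features can be sketched as follows, with PMI(w1, w2) = log( p(w1, w2) / (p(w1) p(w2)) ) estimated from counts over the chunk itself (a simplification; real experiments may use a larger reference corpus, and the chunk below is toy data):

```python
import math

def pmi_features(tokens):
    """Return (average bigram PMI, number of bigrams with PMI > 0)."""
    n = len(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = {}, {}
    for t in tokens:
        uni[t] = uni.get(t, 0) + 1
    for b in bigrams:
        bi[b] = bi.get(b, 0) + 1

    def pmi(b):
        p_b = bi[b] / len(bigrams)
        p1, p2 = uni[b[0]] / n, uni[b[1]] / n
        return math.log(p_b / (p1 * p2))

    scores = [pmi(b) for b in bigrams]
    return sum(scores) / len(scores), sum(1 for s in scores if s > 0)

tokens = "stand idly by and stand firm".split()
avg_pmi, n_above = pmi_features(tokens)
```

On this tiny chunk every bigram is highly associated; the hypothesis is that original English chunks contain more such above-threshold bigrams than translated ones.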


INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translations to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
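Two of these feature families can be sketched directly (toy inputs; the real features are aggregated per 2,000-token chunk, and only a subset of the sentence positions is shown):

```python
def char_trigrams(text):
    """Count overlapping character trigrams, modeling shallow morphology."""
    counts = {}
    for i in range(len(text) - 2):
        g = text[i:i + 3]
        counts[g] = counts.get(g, 0) + 1
    return counts

def positional_tokens(sentences):
    """Count tokens in the first, second, penultimate, and last positions."""
    counts = {}
    for sent in sentences:
        for pos, tok in (("first", sent[0]), ("second", sent[1]),
                         ("penult", sent[-2]), ("last", sent[-1])):
            key = (pos, tok.lower())
            counts[key] = counts.get(key, 0) + 1
    return counts

sents = [["But", "we", "must", "act", "."],
         ["However", ",", "we", "can", "wait", "."]]
pos_counts = positional_tokens(sents)
tri = char_trigrams("the translation")
```

Sentence-initial "But", for instance, surfaces here as the feature ("first", "but"), exactly the kind of signal the positional classifier exploits.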


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronoun occurrences in the chunk

PUNCTUATION: . , ? ! : ; - ( ) [ ] ‘ ’ " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams
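The three punctuation-frequency variants can be sketched on a toy token list (the mark inventory below is an assumption, abbreviated from the list above):

```python
PUNCT = set(",.;:!?()[]-\"'")

def punctuation_features(tokens):
    """Per mark: (n / tokens, raw count n, n / total punctuation)."""
    total_punct = sum(1 for t in tokens if t in PUNCT)
    feats = {}
    for mark in sorted(set(t for t in tokens if t in PUNCT)):
        n = tokens.count(mark)
        feats[mark] = (n / len(tokens),   # (1) normalized by chunk length
                       n,                 # (2) non-normalized count
                       n / total_punct)   # (3) share of all punctuation
    return feats

tokens = ["Well", ",", "we", "agree", ",", "mostly", "."]
feats = punctuation_features(tokens)
```

Each variant yields one feature vector per chunk (one dimension per mark), and the three variants were evaluated as separate classifiers.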


Supervised Classification


RESULTS: SANITY CHECK

Category   Feature          Accuracy (%)
Sanity     Token unigrams   100
           Token bigrams    100


RESULTS: SIMPLIFICATION

Category         Feature                 Accuracy (%)
Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65


ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


RESULTS: EXPLICITATION

Category        Feature                Accuracy (%)
Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


RESULTS: NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


ANALYSIS: NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


RESULTS: INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is, in fact, standardization rather than interference.


ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS REDUCED PARAMETER SPACE(300 MOST FREQUENT FEATURES)

Category: Interference

Feature                      Accuracy (%)
POS bigrams                  96
POS trigrams                 96
Character bigrams            95
Character trigrams           96
Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous

Feature                              Accuracy (%)
Function words                       96
Punctuation (1)                      81
Punctuation (2)                      85
Punctuation (3)                      80
Pronouns                             77
Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English
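The underlying features are plain per-token frequencies of punctuation marks; a minimal sketch, with an invented toy chunk:

```python
def punct_freq(tokens, marks=(".", ",")):
    """Per-token frequency of selected punctuation marks in a chunk."""
    n = len(tokens)
    return {m: tokens.count(m) / n for m in marks}

# Toy tokenized chunk (invented for illustration)
tokens = "we must , however , act now .".split()
print(punct_freq(tokens))  # → {'.': 0.125, ',': 0.25}
```

A classifier over these two numbers alone is what reaches the 79% figure above.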


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,       37.83     49.79    0.76    T    T1
(        0.42      0.72    0.58    T    T2
'        1.94      2.53    0.77    T    T3
)        0.40      0.72    0.56    T    T4
;        0.30      0.30    1.00    —    —
[        0.01      0.02    0.45    T    —
]        0.01      0.02    0.44    T    —
"        0.33      0.22    1.46    O    O7
!        0.22      0.17    1.25    O    O6
.       38.20     34.60    1.10    O    O5
:        1.20      1.17    1.03    —    O4
?        0.84      0.83    1.01    —    O3
'        1.57      1.11    1.41    O    O2
-        2.68      2.25    1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature corpus EUR HAN LIT TED

FW 963 981 973 977char-trigrams 988 971 995 1000POS-trigrams 985 972 987 920contextual FW 952 968 941 863cohesive markers 836 869 786 818


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR     HAN     LIT     X-validation
EUR            —       60.8    56.2    96.3
HAN            59.7    —       58.7    98.1
LIT            64.3    61.5    —       97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR     HAN     LIT     X-validation
EUR + HAN      —       —       63.8    94.0
EUR + LIT      —       64.1    —       92.9
HAN + LIT      59.8    —       —       96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus    EUR     HAN     LIT     TED
FW                  88.6    88.9    78.8    87.5
char-trigrams       72.1    63.8    70.3    78.6
POS-trigrams        96.9    76.0    70.7    76.1
contextual FW       92.9    93.2    68.2    67.0
cohesive markers    63.1    81.2    67.1    63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)
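The smoothed marker model above is easy to sketch in code; the counts and the epsilon value are invented for illustration:

```python
def unigram_prob(word, marker_counts, vocab_size, eps=1e-3):
    """Add-epsilon smoothed unigram probability, following
    p(w | M) = (tf(w) + eps) / (|M| + eps * |V|)."""
    total = sum(marker_counts.values())
    return (marker_counts.get(word, 0) + eps) / (total + eps * vocab_size)

# Invented toy counts of O-markers in a text chunk
o_counts = {"but": 3, "the": 5}
print(unigram_prob("but", o_counts, vocab_size=1000))
```

Unseen words receive a small non-zero probability, which keeps the divergence computation below well defined.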

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = √(2 · JSD(PX || PCi))



CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
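The labeling rule can be sketched as follows; the toy distributions stand in for the marker-based language models described above, and all numbers are invented:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2), skipping zero-probability terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def djs(p, q):
    return math.sqrt(2 * jsd(p, q))

def label_clusters(p_o, p_t, c1, c2):
    """Label c1/c2 as O/T by comparing products of cross-distances."""
    if djs(p_o, c1) * djs(p_t, c2) < djs(p_o, c2) * djs(p_t, c1):
        return "O", "T"
    return "T", "O"

# Toy marker distributions: c1 resembles O, c2 resembles T
p_o, p_t = [0.7, 0.3], [0.2, 0.8]
c1, c2 = [0.65, 0.35], [0.25, 0.75]
print(label_clusters(p_o, p_t, c1, c2))  # → ('O', 'T')
```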


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                  EUR     HAN     LIT     TED
FW                                                               88.6    88.9    78.8    87.5
FW + char-trigrams + POS-trigrams                                91.1    86.2    78.2    90.9
FW + POS-trigrams + contextual FW                                95.8    89.8    72.3    86.3
FW + char-trigrams + POS-trigrams + contextual FW + coh. mark.   94.1    91.0    79.2    88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

corpora:                          EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      –             92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach
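The two-phase idea can be sketched with a tiny k-means; the 2-D points, the deterministic initialization, and the function names are all invented for illustration and do not reproduce the actual experimental setup:

```python
import math

def kmeans2(points, iters=20):
    """Minimal 2-means with deterministic initialization
    (first and last point), for illustration only."""
    centers = [points[0], points[-1]]
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            near = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[near].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return groups

def two_phase(points):
    """Phase 1: cluster the mixed corpus by (dominant) domain;
    phase 2: split each domain cluster in two, hopefully O vs. T."""
    domains = kmeans2(points)
    return [kmeans2(list(d)) for d in domains if len(d) >= 2]

# Two 'domains' (x near 0 vs. x near 10), each with two subgroups (y = 0 vs. y = 1)
pts = [(0, 0), (0.1, 0), (0, 1), (0.1, 1),
       (10, 0), (10.1, 0), (10, 1), (10.1, 1)]
print(sorted(len(g) for g in kmeans2(pts)))  # → [4, 4]
```

The flat approach instead clusters all chunks at once into 2×(number of domains) clusters.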


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

corpora:     EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat         92.5      60.7      77.5          66.8
Two-phase    91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode."

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

E* = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E in English} P(F | E) × P(E)

where P(F | E) is the translation model and P(E) is the language model.
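The decomposition above can be illustrated with a toy scorer; all candidate strings and probabilities below are invented:

```python
import math

def best_translation(candidates, tm_logprob, lm_logprob):
    """Noisy-channel choice: maximize log P(F | E) + log P(E)."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])

# Both candidates are equally faithful; the language model breaks the tie
tm_logprob = {"the house": math.log(0.4), "house the": math.log(0.4)}
lm_logprob = {"the house": math.log(0.05), "house the": math.log(0.0001)}
print(best_translation(["the house", "house the"], tm_logprob, lm_logprob))
# → the house
```

This is why the choice of language model (original vs. translated training texts) matters, as the following slides show.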


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus)

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)
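The perplexity formula can be computed from per-token log-probabilities; the probabilities below are invented, not produced by an actual LM:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (base 2):
    2 ** (-(1/N) * sum(log2 p_i)), equivalent to the N-th-root formula."""
    n = len(logprobs)
    return 2 ** (-sum(logprobs) / n)

# Invented per-token probabilities assigned by some LM
probs = [0.25, 0.5, 0.125, 0.25]
print(perplexity([math.log2(p) for p in probs]))  # → 4.0
```

Lower perplexity means the model finds the reference text less surprising, which is the comparison made in the tables that follow.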


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
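The abstraction levels above amount to simple preprocessing of tagged text; this sketch uses invented tags and covers only some of the modes (named-entity abstraction is omitted for brevity):

```python
def abstract_tokens(tagged, mode):
    """Replace or drop tokens according to the abstraction mode.
    `tagged` is a list of (word, POS) pairs; modes mirror the slide's list."""
    out = []
    for word, pos in tagged:
        if mode == "no-punct" and pos == "PUNCT":
            continue  # drop punctuation entirely
        if mode == "no-nouns" and pos.startswith("NN"):
            out.append(pos)  # keep only the tag for nouns
        elif mode == "pos-only":
            out.append(pos)  # keep only tags for all tokens
        else:
            out.append(word)
    return out

sent = [("Moses", "NNP"), ("decodes", "VBZ"), (".", "PUNCT")]
print(abstract_tokens(sent, "no-nouns"))  # → ['NNP', 'decodes', '.']
```

Language models are then trained on the abstracted corpora and compared by perplexity, as in the tables below.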


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       —
T-DE         122.50        7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44        4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20       —
T-DE         107.48        6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72        3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        —
T-DE         38.67         4.81
T-FR         34.75        14.46
T-IT         36.85         9.30
T-NL         39.44         2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         —
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be built that is adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task     S → T    T → S
FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S → T    T → S
250K            34.35    31.33
500K            35.21    32.38
750K            36.12    32.90
1M              35.73    33.07
1.25M           36.24    33.23
1.5M            36.43    33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S → T    T → S
250K            27.74    26.58
500K            29.15    27.19
750K            29.43    27.63
1M              29.94    27.88
1.25M           30.63    27.84
1.5M            29.89    27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation
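Perplexity-minimization interpolation (in the spirit of Sennrich 2012) can be sketched as a one-parameter search; the per-token probabilities are invented, and real phrase-table interpolation operates on phrase-level translation probabilities rather than this toy setup:

```python
import math

def interpolated_ppl(lam, p_a, p_b):
    """Dev-set perplexity under the mixture lam * p_a + (1 - lam) * p_b,
    where p_a and p_b are per-token probabilities from two models."""
    logs = [math.log2(lam * a + (1 - lam) * b) for a, b in zip(p_a, p_b)]
    return 2 ** (-sum(logs) / len(logs))

def best_lambda(p_a, p_b, steps=100):
    """Grid search for the interpolation weight minimizing perplexity."""
    return min((i / steps for i in range(1, steps)),
               key=lambda lam: interpolated_ppl(lam, p_a, p_b))

# Model A fits this (invented) dev set better, so the weight leans to A
p_a = [0.5, 0.4, 0.6, 0.5]
p_b = [0.1, 0.2, 0.1, 0.1]
print(best_lambda(p_a, p_b))  # → 0.99
```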


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System      MIX      MIX-EO   MIX-FO
S → T       35.21    35.21    35.73
UNION       35.27    35.36    35.94
PPLMIN-1    35.46    35.59    36.26
PPLMIN-2    35.75    35.65    36.20
CrEnt       35.54    35.45    36.75
PplRatio    35.59    35.78    36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System      MIX      MIX-FO   MIX-EO
S → T       29.15    29.15    29.94
UNION       29.27    29.44    30.01
PPLMIN-1    29.64    29.94    29.65
PPLMIN-2    29.50    30.45    30.12
CrEnt       29.47    29.45    30.44
PplRatio    29.65    29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99
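A minimal sketch of the protocol, with a nearest-centroid stand-in for the actual classifier and invented 2-feature vectors:

```python
def centroid(vectors):
    return [sum(x) / len(vectors) for x in zip(*vectors)]

def train(o_vecs, t_vecs):
    """'Train' on O and T chunks translated from one source language (L1)."""
    return centroid(o_vecs), centroid(t_vecs)

def predict(model, vec):
    """Assign a chunk to the nearer centroid."""
    c_o, c_t = model
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return "O" if dist(c_o) <= dist(c_t) else "T"

# Train on L1 chunks, then test on a chunk translated from a different L2
model = train(o_vecs=[[0.9, 0.1], [0.8, 0.2]], t_vecs=[[0.2, 0.8], [0.1, 0.9]])
print(predict(model, [0.15, 0.85]))  # → T
```

Accuracy on L2 test chunks then fills one cell of the table above: high when L1 and L2 are related, low otherwise.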


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 = 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist            Distw
Feature               EN T    FR T    EN T    FR T
POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree           2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)   F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best-performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamaki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. ISSN 1931-0145. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Identification of Translationese

IDENTIFYING TRANSLATIONESE USING TEXT CLASSIFICATION

Baroni and Bernardini (2006)

van Halteren (2008)

Kurokawa et al (2009)

Ilisei et al. (2010); Ilisei and Inkpen (2011); Ilisei (2013)

Koppel and Ordan (2011)

Popescu (2011)

Volansky et al. (2015); Avner et al. (2016)


Identification of Translationese

EXPERIMENTAL SETUP

Corpus: EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al. 2009), using SVM with a default linear kernel
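The chunking step above can be sketched as follows; a minimal sketch, assuming sentence-split, whitespace-tokenized input (the function name and the greedy strategy are illustrative, not prescribed by the slides):

```python
def chunk_corpus(sentences, chunk_size=2000):
    """Greedily group sentences into chunks of at least `chunk_size`
    tokens, always ending a chunk on a sentence boundary."""
    chunks, current, n_tokens = [], [], 0
    for sent in sentences:
        current.append(sent)
        n_tokens += len(sent.split())
        if n_tokens >= chunk_size:
            chunks.append(" ".join(current))
            current, n_tokens = [], 0
    if current:  # keep the final, possibly smaller, chunk
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only close at sentence ends, each chunk is slightly longer than `chunk_size` rather than exactly 2,000 tokens.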


Translation Studies Hypotheses


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston 1983; Vanderauwera 1985; Baker 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka 1986; Øverås 1998; Baker 1993)

NORMALIZATION: Efforts to standardize texts (Toury 1995); "a strong preference for conventional grammaticality" (Baker 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury 1979)

MISCELLANEOUS

Translation Studies Hypotheses

FEATURES SHOULD

1. Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2. Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3. Be easy to interpret, yielding insights regarding the differences between original and translated texts

Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

    V/N        log(V)/log(N)        100 × log(N) / (1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
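The three lexical-variety measures can be computed per chunk as below; a minimal sketch under the definitions on this slide (V types, N tokens, V1 hapax legomena), with the function name assumed:

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR variants: V/N, log V / log N, and
    100 * log N / (1 - V1/V), where V1 counts types occurring
    exactly once (the third is undefined if every type is a hapax)."""
    counts = Counter(tokens)
    N, V = len(tokens), len(counts)
    V1 = sum(1 for c in counts.values() if c == 1)
    return (V / N,
            math.log(V) / math.log(N),
            100 * math.log(N) / (1 - V1 / V))
```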


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu — O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
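The two naming features can be sketched over POS-tagged chunks; Penn Treebank tags and the function names are assumptions for illustration:

```python
def explicit_naming(tagged):
    """Ratio of personal pronouns to proper nouns in a POS-tagged
    chunk (list of (word, tag) pairs, Penn tags assumed)."""
    pronouns = sum(1 for _, t in tagged if t == "PRP")
    propers = sum(1 for _, t in tagged if t in ("NNP", "NNPS"))
    return pronouns / propers if propers else 0.0

def single_naming(tagged):
    """Per-token frequency of proper nouns with no adjacent proper noun."""
    tags = [t for _, t in tagged]
    proper = lambda t: t in ("NNP", "NNPS")
    singles = sum(
        1 for i, t in enumerate(tags)
        if proper(t)
        and (i == 0 or not proper(tags[i - 1]))
        and (i == len(tags) - 1 or not proper(tags[i + 1]))
    )
    return singles / len(tags)
```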


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

(The German original repeats "ein kaltes Herz"; the French translation varies the expression.)

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
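The two PMI features can be sketched as follows; for simplicity this estimates all probabilities from the chunk itself, which is an assumption (the original features may use corpus-level counts):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information of each adjacent bigram
    (Church and Hanks 1990), estimated from the chunk."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (w1, w2): math.log2((c / n_bi) /
                            ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        for (w1, w2), c in bigrams.items()
    }

def average_pmi(tokens):
    vals = bigram_pmi(tokens).values()
    return sum(vals) / len(vals)

def threshold_pmi(tokens, threshold=0.0):
    """Number of bigram types in the chunk whose PMI exceeds the threshold."""
    return sum(1 for pmi in bigram_pmi(tokens).values() if pmi > threshold)
```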


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: 'One is' is over-represented in translations from German to English (triggered by 'Man ist')

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
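The positional-token-frequency feature can be sketched as below; whitespace tokenization and the function name are assumptions:

```python
from collections import Counter

def positional_tokens(sentences):
    """Counts of which tokens occupy the first, second, antepenultimate,
    penultimate, and last positions of each sentence."""
    slots = Counter()
    for sent in sentences:
        toks = sent.split()
        if len(toks) < 5:  # need five distinct positions
            continue
        for name, tok in zip(
            ("first", "second", "antepenultimate", "penultimate", "last"),
            (toks[0], toks[1], toks[-3], toks[-2], toks[-1]),
        ):
            slots[(name, tok)] += 1
    return slots
```

Counts such as `("first", "But")` directly capture the sentence-initial 'But' effect discussed later in these slides.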


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ' ' " "
1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams

Supervised Classification


Supervised Classification

RESULTS: SANITY CHECK

Category   Feature          Accuracy (%)
Sanity     Token unigrams   100
           Token bigrams    100

Supervised Classification Simplification

RESULTS: SIMPLIFICATION

Category         Feature                 Accuracy (%)
Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65

Supervised Classification Simplification

ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')

Supervised Classification Explicitation

RESULTS: EXPLICITATION

Category        Feature                Accuracy (%)
Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54

Supervised Classification Explicitation

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times

Supervised Classification Normalization

RESULTS: NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66

Supervised Classification Normalization

ANALYSIS: NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001):

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97

Supervised Classification Interference

ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)

Supervised Classification Interference

ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001)

This is in fact standardization rather than interference

Supervised Classification Interference

ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)
Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93

Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category        Feature                               Accuracy (%)
Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65

Supervised Classification Miscellaneous

ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors:

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,      37.83     49.79     0.76    T    T1
(       0.42      0.72     0.58    T    T2
'       1.94      2.53     0.77    T    T3
)       0.40      0.72     0.56    T    T4
…       0.30      0.30     1.00    —    —
[       0.01      0.02     0.45    T    —
]       0.01      0.02     0.44    T    —
"       0.33      0.22     1.46    O    O7
…       0.22      0.17     1.25    O    O6
.      38.20     34.60     1.10    O    O5
…       1.20      1.17     1.03    —    O4
…       0.84      0.83     1.01    —    O3
…       1.57      1.11     1.41    O    O2
-       2.68      2.25     1.19    O    O1

Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

Unsupervised Classification Motivation

DATASETS

EUROPARL: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus   EUR    HAN    LIT    TED
FW                 96.3   98.1   97.3   97.7
char-trigrams      98.8   97.1   99.5   100.0
POS-trigrams       98.5   97.2   98.7   92.0
contextual FW      95.2   96.8   94.1   86.3
cohesive markers   83.6   86.9   78.6   81.8

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0

Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus   EUR    HAN    LIT    TED
FW                 88.6   88.9   78.8   87.5
char-trigrams      72.1   63.8   70.3   78.6
POS-trigrams       96.9   76.0   70.7   76.1
contextual FW      92.9   93.2   68.2   67.0
cohesive markers   63.1   81.2   67.1   63.0
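For the FW setting above, each chunk is represented as a vector of normalized function-word frequencies before clustering; a minimal sketch, where the word list is a tiny illustrative subset (an assumption, not the actual list used):

```python
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "but"]  # toy subset

def fw_vector(tokens, fw=FUNCTION_WORDS):
    """Normalized function-word frequency vector for one chunk --
    the kind of representation fed to a clustering algorithm."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n for w in fw]
```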


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

    p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
    p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

    DJS(X, Ci) = 2^√(JSD(PX ‖ PCi))

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

    label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
                "T"   otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting over several feature sets
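The labeling rule can be sketched directly over the marker language models and cluster distributions; the distance transform 2^√JSD follows the slides as best reconstructed, and is in any case an assumption (any monotone transform of JSD yields the same labeling decision):

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence in bits; 0*log(0) terms are skipped
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (Lin 1991)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (_kl(p, m) + _kl(q, m)) / 2

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label clusters C1, C2 as 'O'/'T' by comparing the two possible
    pairings of the marker models (p_o, p_t) with the cluster
    distributions (p_c1, p_c2)."""
    d = lambda x, c: 2 ** math.sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"
```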


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                         EUR    HAN    LIT    TED
FW                                                                      88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                                       91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                                       95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers    94.1   91.0   79.2   88.6

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus                   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      —             92.2

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat              92.5      60.7      77.5          66.8
Two-phase         91.3      79.4      85.3          67.5

Applications for Machine Translation


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T̂ of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el

The best translation:

    Ê = argmax_E P(E | F)
      = argmax_E P(F | E) × P(E) / P(F)
      = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

    Ê = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                              [translation model]  [language model]
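Over an enumerated candidate set, the argmax can be illustrated directly; the toy probability tables `tm` and `lm` below are assumptions for illustration (real decoders search a vastly larger space):

```python
import math

def best_translation(f, candidates, tm, lm):
    """Noisy-channel decoding: argmax_E P(F | E) * P(E),
    computed in log space for numerical stability.
    `tm[(f, e)]` approximates P(F | E); `lm[e]` approximates P(E)."""
    return max(candidates,
               key=lambda e: math.log(tm[(f, e)]) + math.log(lm[e]))
```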


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus)

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al. 2002)

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 ... wN) = (∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1))^(1/N)

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al. 2002)
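The perplexity computation can be sketched as follows; `logprob` is an assumed callable returning log2 P(w | history):

```python
import math

def perplexity(logprob, tokens):
    """Perplexity of a language model over w1..wN: the N-th root of
    the product of 1 / P(w_i | w_1..w_{i-1}), computed in log space."""
    N = len(tokens)
    total = sum(logprob(tokens[:i], w) for i, w in enumerate(tokens))
    return 2 ** (-total / N)
```

For a uniform model over a four-word vocabulary, the perplexity is exactly 4, matching the intuition that perplexity measures the effective branching factor.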


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations
Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations
Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations
Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation
Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named-entity abstraction
Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction
Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction
Orig Lang  Perplexity  Improvement (%)

MIX            7.99         2.66
O-EN           8.20          -
T-DE           8.08         1.47
T-FR           7.89         3.77
T-IT           8.00         2.47
T-NL           8.11         1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN           FR to EN           IT to EN           NL to EN
LM     BLEU        LM     BLEU       LM     BLEU        LM     BLEU

MIX    21.43       MIX    28.67      MIX    25.41       MIX    24.20
O-EN   21.10       O-EN   27.98      O-EN   24.69       O-EN   23.40
T-DE   21.90       T-DE   28.01      T-DE   24.62       T-DE   24.26
T-FR   21.16       T-FR   29.14      T-FR   25.37       T-FR   23.56
T-IT   21.29       T-IT   28.75      T-IT   25.96       T-IT   23.87
T-NL   21.20       T-NL   28.11      T-NL   24.77       T-NL   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?


Applications for Machine Translation Language Models

DOES SIZE MATTER?


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du Conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T   T → S

FR-EN  33.64   30.88
EN-FR  32.11   30.35
DE-EN  26.53   23.67
EN-DE  16.96   16.17
IT-EN  28.70   26.84
EN-IT  23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S → T   T → S

250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S → T   T → S

250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
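The interpolation idea can be sketched in a few lines. This is a toy illustration in the spirit of Sennrich (2012), not the actual Moses pipeline: the two phrase tables, the dev set, and the grid search over the interpolation weight are all invented for the example.

```python
import math

# Toy sketch: linearly interpolate translation probabilities from a table
# trained on S->T data and one trained on T->S data, choosing the weight
# that minimizes perplexity on a (hypothetical) dev set of phrase pairs.

st = {("maison", "house"): 0.7, ("maison", "home"): 0.3}   # S->T-trained table
ts = {("maison", "house"): 0.4, ("maison", "home"): 0.6}   # T->S-trained table

def interp(pair, lam):
    return lam * st.get(pair, 1e-9) + (1 - lam) * ts.get(pair, 1e-9)

def dev_perplexity(dev, lam):
    h = -sum(math.log2(interp(p, lam)) for p in dev) / len(dev)
    return 2 ** h

dev = [("maison", "house")] * 6 + [("maison", "home")] * 4  # dev phrase pairs
best = min((dev_perplexity(dev, l / 10), l / 10) for l in range(11))
print(best[1])  # 0.7 -- weight given to the S->T-trained table
```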


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO

S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO

S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

Train\Test   IT   FR   ES   DE   FI
IT           99   93   91   75   63
FR           90   98   90   83   70
ES           85   90   98   82   69
DE           73   74   74   98   70
FI           58   70   64   81   99
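The cross-classification setup can be mimicked with a toy nearest-centroid classifier. Everything here is an invented stand-in: the two-dimensional vectors play the role of the POS-trigram features, and nearest-centroid replaces the SVM used in the actual experiments.

```python
# Toy sketch of cross-classification (Koppel and Ordan, 2011): an O-vs-T
# classifier fit on chunks translated from L1 is applied, unchanged, to
# chunks translated from L2. Vectors are hypothetical feature values.

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(v, cen_o, cen_t):
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "O" if d(v, cen_o) < d(v, cen_t) else "T"

# "Training": original English vs. English translated from French (L1).
train_o = [[0.9, 0.1], [0.8, 0.2]]
train_t = [[0.2, 0.8], [0.3, 0.7]]
cen_o, cen_t = centroid(train_o), centroid(train_t)

# "Testing": English translated from Italian (L2); interference from a
# related source language keeps these chunks on the T side of the boundary.
test_t_italian = [[0.35, 0.65], [0.25, 0.75]]
print([classify(v, cen_o, cen_t) for v in test_t_italian])  # ['T', 'T']
```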


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL

LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
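A minimal single-linkage agglomerative clustering sketch shows the intended behavior: translationese vectors from closely related source languages merge first. The four languages and their two-dimensional vectors are assumptions for illustration, not the authors' 16-language feature vectors.

```python
# Sketch: single-linkage agglomerative clustering of (hypothetical)
# translationese feature vectors; Romance sources should merge before
# joining the Germanic pair.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

vecs = {"FR": [0.1, 0.9], "IT": [0.15, 0.85], "DE": [0.8, 0.2], "NL": [0.75, 0.3]}

clusters = [{k} for k in vecs]
merges = []
while len(clusters) > 1:
    # find the closest pair of clusters under single linkage
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: min(dist(vecs[a], vecs[b])
                                 for a in clusters[p[0]] for b in clusters[p[1]]))
    merges.append((sorted(clusters[i]), sorted(clusters[j])))
    clusters[i] |= clusters[j]
    del clusters[j]

print(merges[0])  # (['FR'], ['IT']) -- the closest source languages merge first
```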


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                       Dist            Dist_w
Feature              EN T   FR T     EN T   FR T

POS-trigrams         1.02   1.16     1.07   1.34
positional tokens    1.10   1.15     1.30   1.37
function words       1.32   1.82     1.43   1.90
cohesive markers     1.95   1.95     2.15   2.03
feature combination  1.00   1.46     1.13   1.58
random tree              2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al., 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models trained on texts translated in the same direction as the SMT task are better than ones trained on texts translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. ISSN 1931-0145. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. ISSN 0018-9448. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwerea. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.



Identification of Translationese

EXPERIMENTAL SETUP

Corpus EUROPARL (Koehn 2005)

4 million tokens in English (O)

400,000 tokens translated from each of the ten source languages (T): Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish

The corpus is tokenized and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary)

Part-of-speech tagging

Classification with Weka (Hall et al., 2009), using an SVM with a default linear kernel
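The chunking step can be sketched directly: chunks are filled greedily and always closed at a sentence boundary. Sentence sizes and the target length below are toy assumptions; the experiments use roughly 2,000-token chunks.

```python
# Sketch: partition a tokenized corpus into chunks of about `target` tokens,
# always ending a chunk on a sentence boundary.

def make_chunks(sentences, target=2000):
    chunks, cur = [], []
    for sent in sentences:          # each sentence is a list of tokens
        cur.extend(sent)
        if len(cur) >= target:      # close the chunk at this sentence boundary
            chunks.append(cur)
            cur = []
    if cur:
        chunks.append(cur)          # last, possibly short, chunk
    return chunks

sents = [["tok"] * 30 for _ in range(10)]           # ten 30-token sentences
print([len(c) for c in make_chunks(sents, 100)])    # [120, 120, 60]
```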


Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwerea, 1985; Baker, 1993)

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993)

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993)

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979)

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word

MEAN SENTENCE LENGTH: In words

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50
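The three lexical-variety measures can be computed with a short sketch. The tokenization and the toy chunk are assumptions; the third measure is undefined when every type is a hapax, which the code guards against.

```python
import math

# Sketch of the three TTR measures: V/N, log V / log N, and
# 100 * log N / (1 - V1/V), where V = #types, N = #tokens,
# V1 = #types occurring exactly once in the chunk.

def lexical_variety(tokens):
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    ttr1 = v / n
    ttr2 = math.log(v) / math.log(n)
    # guard: measure 3 is undefined when all types are hapaxes (V1 == V)
    ttr3 = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return ttr1, ttr2, ttr3

toks = "the debate on the report on free trade was a long debate".split()
print(round(lexical_variety(toks)[0], 2))  # 0.75
```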


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu / O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...)
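Two of these features can be sketched on a POS-tagged chunk. The tag names, the marker list, and the toy chunk are assumptions, not the actual resources used in the experiments.

```python
# Sketch of two explicitation features: explicit naming (ratio of personal
# pronouns to proper nouns) and cohesive-marker frequency. The tagset and
# the five-marker list below are hypothetical stand-ins.

MARKERS = {"because", "but", "hence", "therefore", "moreover"}

def explicitation_features(tagged):
    pron = sum(1 for _, p in tagged if p == "PRON")
    propn = sum(1 for _, p in tagged if p == "PROPN")
    markers = sum(1 for w, _ in tagged if w.lower() in MARKERS)
    return pron / propn if propn else 0.0, markers / len(tagged)

tagged_chunk = [("Moreover", "ADV"), ("he", "PRON"), ("met", "VERB"),
                ("Netanjahu", "PROPN"), ("because", "SCONJ"),
                ("she", "PRON"), ("asked", "VERB")]
print(explicitation_features(tagged_chunk)[0])  # 2.0
```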


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois : aurait-elle un cœur insensible ? Avez-vous vraiment un cœur de glace ?

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0
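The two PMI features can be sketched as follows, with PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The probabilities here are estimated from the chunk itself for compactness; the actual features would use corpus-level estimates.

```python
import math

# Sketch of the two normalization features: average PMI and threshold PMI
# (the number of bigram occurrences with PMI > 0), estimated from the chunk.

def pmi_features(tokens):
    n = len(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = {}, {}
    for t in tokens:
        uni[t] = uni.get(t, 0) + 1
    for b in bigrams:
        bi[b] = bi.get(b, 0) + 1

    def pmi(x, y):
        return math.log2((bi[(x, y)] / len(bigrams)) /
                         ((uni[x] / n) * (uni[y] / n)))

    vals = [pmi(x, y) for x, y in bigrams]
    avg = sum(vals) / len(vals)
    above0 = sum(1 for v in vals if v > 0)   # "threshold PMI"
    return avg, above0

toks = "stand firm and stand trial and stand firm".split()
print(pmi_features(toks)[1])  # 7
```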


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German

NEGATIVE: "One is" is over-represented in translations from German to English (triggered by "Man ist")


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence
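Two of the interference features can be sketched directly; the tags and example sentence are assumptions, and the positional extractor assumes sentences of at least five tokens.

```python
# Sketch of two interference features: POS-trigram extraction and positional
# tokens (first, second, antepenultimate, penultimate, last in a sentence).

def pos_trigrams(tags):
    return list(zip(tags, tags[1:], tags[2:]))

def positional_tokens(sentence):
    # assumes len(sentence) >= 5
    return {"first": sentence[0], "second": sentence[1],
            "antepenultimate": sentence[-3], "penultimate": sentence[-2],
            "last": sentence[-1]}

sent = ["I", "must", "say", "this", "very", "clearly", "."]
print(positional_tokens(sent)["penultimate"])   # clearly
print(pos_trigrams(["PRON", "VERB", "VERB"]))   # [('PRON', 'VERB', 'VERB')]
```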


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)

Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)

Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)

Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
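The threshold-PMI feature counts bigrams whose pointwise mutual information (Church and Hanks, 1990) exceeds a cutoff. A minimal sketch; the token sequence and the threshold value are invented for illustration:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI of each adjacent word pair: log2 of p(w1,w2) / (p(w1) * p(w2))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (w1, w2): math.log2((c / n_bi) /
                            ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        for (w1, w2), c in bigrams.items()
    }

tokens = "stand firm and stand trial and stand firm".split()
pmi = bigram_pmi(tokens)
# The feature value is the number of bigrams above a (hypothetical) threshold.
high = [bg for bg, score in pmi.items() if score > 0.5]
```

Highly associated pairs such as ("stand", "firm") score well above independent co-occurrence.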


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE') [figure omitted]


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE) [figure omitted]


RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams make an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
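POS-trigram features are simply relative frequencies of tag triples over a tagged chunk. A minimal sketch; the tag sequence below is hand-supplied for illustration, whereas the experiments used an automatic tagger:

```python
from collections import Counter

def pos_trigram_features(tags):
    """Relative frequency of each POS trigram in a pre-tagged chunk."""
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(trigrams.values())
    return {tg: c / total for tg, c in trigrams.items()}

# Hypothetical tags for "measures must be taken immediately":
tags = ["NNS", "MD", "VB", "VBN", "RB"]
feats = pos_trigram_features(tags)
# ("MD", "VB", "VBN") is the modal + base verb + participle pattern typical of T.
```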


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts) [figure omitted]


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference.


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS MISCELLANEOUS

The function-words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but it is highly dependent on the source language.

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

      Frequency
Mark  O      T      O/T ratio  LL  Weight
,     37.83  49.79  0.76       T   T1
(      0.42   0.72  0.58       T   T2
'      1.94   2.53  0.77       T   T3
)      0.40   0.72  0.56       T   T4
…      0.30   0.30  1.00       –   –
[      0.01   0.02  0.45       T   –
]      0.01   0.02  0.44       T   –
"      0.33   0.22  1.46       O   O7
…      0.22   0.17  1.25       O   O6
.     38.20  34.60  1.10       O   O5
…      1.20   1.17  1.03       –   O4
…      0.84   0.83  1.01       –   O3
…      1.57   1.11  1.41       O   O2
-      2.68   2.25  1.19       O   O1


Unsupervised Classification



SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           –     60.8  56.2  96.3
HAN           59.7  –     58.7  98.1
LIT           64.3  61.5  –     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     –     –     63.8  94.0
EUR + LIT     –     64.1  –     92.9
HAN + LIT     59.8  –     –     96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0
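The clustering step can be sketched with a minimal 2-means over feature vectors, a pure-Python stand-in for the clustering toolkit actually used; the toy vectors below are hypothetical function-word frequencies:

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def two_means(vectors, iters=20, seed=0):
    """A minimal 2-means clusterer; returns a cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(vectors, 2)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid, then re-estimate centroids.
        assign = [min((0, 1), key=lambda k: dist2(v, centroids[k])) for v in vectors]
        for k in (0, 1):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Two hypothetical O-like and two T-like chunks in a 2-feature space.
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
assign = two_means(vectors)
```

Note that the algorithm yields an unlabeled split; deciding which cluster is O and which is T is the labeling problem addressed next.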


CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot labelthose classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

  p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
  p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

  DJS(X, Ci) = √(JSD(PX ‖ PCi))


The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

  label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
              "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
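The labeling rule can be sketched as follows, using base-2 JSD; the marker distributions and cluster distributions below are invented for illustration:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2); q must be positive where p is."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (Lin, 1991): symmetric and bounded."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(x, c):
    """Distance form of the JSD."""
    return math.sqrt(jsd(x, c))

# Hypothetical unigram distributions over three marker words.
o_class, t_class = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
c1, c2 = [0.65, 0.25, 0.10], [0.15, 0.25, 0.60]

label_c1 = ("O" if d_js(o_class, c1) * d_js(t_class, c2) <
                   d_js(o_class, c2) * d_js(t_class, c1) else "T")
```

Here cluster C1 sits close to the O-marker distribution, so the product test labels it "O" and C2 gets the complementary label.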


CLUSTERING CONSENSUS

method \ corpus                        EUR   HAN   LIT   TED
FW                                     88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams      91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW      95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers   94.1  91.0  79.2  88.6


SENSITIVITY ANALYSIS [figures omitted]



MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering to three clusters yieldsthe same results


MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: # of clusters generated  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     –            92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method     EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat       92.5     60.7     77.5         66.8
Two-phase  91.3     79.4     85.3         67.5


Applications for Machine Translation



APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"
(Warren Weaver, 1955)

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

  Ê = argmax_E P(E | F)
    = argmax_E P(F | E) × P(E) / P(F)
    = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

  Ê = argmax_{E ∈ English} P(F | E) × P(E)
      with P(F | E) the translation model and P(E) the language model


A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

  PP(LM; w1 w2 ... wN) = ( ∏_{i=1..N} 1 / PLM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
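The perplexity computation can be sketched for a unigram model; the probabilities below are hypothetical, and the experiments used n-gram models built with standard toolkits:

```python
import math

def perplexity(lm, tokens):
    """Perplexity of a unigram LM (dict of word probabilities) on a reference text."""
    log_prob = sum(math.log2(lm[w]) for w in tokens)
    return 2 ** (-log_prob / len(tokens))

# Hypothetical unigram model over a three-word vocabulary.
lm = {"the": 0.5, "house": 0.25, "stands": 0.25}
pp = perplexity(lm, ["the", "house", "stands", "the"])
```

A lower perplexity means the model finds the reference text more predictable, which is how the tables below compare LMs built from O and from T.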


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50   93.00   69.36   66.47
O-EN        468.09  103.74   79.57   76.79
T-DE        443.14   88.48   64.99   62.07
T-FR        460.98   99.90   76.23   73.38
T-IT        465.89  102.31   78.50   75.67
T-NL        457.02   97.34   73.54   70.56


French-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05   99.04   75.60   72.68
O-EN        500.56  115.48   91.14   88.31
T-DE        486.78  108.50   84.39   81.41
T-FR        463.58   94.59   71.24   68.37
T-IT        476.05  102.69   79.23   76.36
T-NL        490.09  110.67   86.61   83.55


Italian-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99   88.46   67.35   64.40
O-EN        415.47   99.92   79.27   76.34
T-DE        404.64   95.22   73.73   70.85
T-FR        395.99   89.44   68.38   65.54
T-IT        384.55   81.90   60.85   57.91
T-NL        411.58   98.78   76.98   73.94


Dutch-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89   90.73   69.05   66.08
O-EN        448.11  100.17   78.23   75.46
T-DE        437.68   93.67   71.54   68.57
T-FR        445.00   97.32   75.59   72.55
T-IT        448.11  100.19   78.06   75.19
T-NL        423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      –
T-DE        122.50       7.16
T-FR         99.52      24.58
T-IT        112.71      14.58
T-NL        126.44       4.17


Named-entity abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX          93.88      18.51
O-EN        115.20      –
T-DE        107.48       6.70
T-FR         88.96      22.77
T-IT         99.17      13.91
T-NL        110.72       3.89


Noun abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       –
T-DE        38.67        4.81
T-FR        34.75       14.46
T-IT        36.85        9.30
T-NL        39.44        2.91


Part-of-speech abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        –
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM    BLEU
MIX   21.43
O-EN  21.10
T-DE  21.90
T-FR  21.16
T-IT  21.29
T-NL  21.20

FR to EN
LM    BLEU
MIX   28.67
O-EN  27.98
T-DE  28.01
T-FR  29.14
T-IT  28.75
T-NL  28.11

IT to EN
LM    BLEU
MIX   25.41
O-EN  24.69
T-DE  24.62
T-FR  25.37
T-IT  25.96
T-NL  24.77

NL to EN
LM    BLEU
MIX   24.20
O-EN  23.40
T-DE  24.26
T-FR  23.56
T-IT  23.87
T-NL  24.52


DOES SIZE MATTER? [figures omitted]



TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr Olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of the translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Task: French-to-English
Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Task: English-to-French
Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to MOSES.
Phrase-table interpolation using perplexity minimization (Sennrich, 2012).
Add a feature in the phrase table indicating the direction of translation.


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S → T     35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S → T     29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference



THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test  IT  FR  ES  DE  FI
IT            99  93  91  75  63
FR            90  98  90  83  70
ES            85  90  98  82  69
DE            73  74  74  98  70
FI            58  70  64  81  99


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE) [figure omitted]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE) [figure omitted]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS, ENGLISH (LEFT) AND FRENCH (RIGHT)) [figure omitted]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist           Dist_w
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4
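The per-language precision, recall, and F1 in the rightmost columns follow from each row and column of the confusion matrix. A minimal sketch on a hypothetical 2x2 matrix (rows are true classes, columns are predictions):

```python
def prf(conf, i):
    """Precision, recall, and F1 for class i of a square confusion matrix."""
    tp = conf[i][i]
    fp = sum(row[i] for row in conf) - tp   # predicted i, but another true class
    fn = sum(conf[i]) - tp                  # true class i, predicted otherwise
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical two-class confusion matrix.
conf = [[80, 20],
        [10, 90]]
p, r, f1 = prf(conf, 0)
```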


Conclusion



CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016).

The features of machine translationese.

Robust identification of machine translation output.

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.



Translation Studies Hypotheses

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993).

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993).

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993).

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979).

MISCELLANEOUS


Translation Studies Hypotheses

FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text.

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts.


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:
V/N, log(V)/log(N), and 100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters.

SYLLABLE RATIO: The number of vowel sequences delimited by consonants or spaces in a word.

MEAN SENTENCE LENGTH: In words.

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs.

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words.

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50.
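The three lexical-variety measures can be sketched in a few lines (a toy illustration, not the original experimental code; the function name and whitespace tokenization are my assumptions):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR variants for one chunk: V/N, log V / log N,
    and 100 * log N / (1 - V1/V). The chunk must contain at least
    one repeated type, otherwise the third measure divides by zero."""
    counts = Counter(tokens)
    n = len(tokens)                                 # number of tokens, N
    v = len(counts)                                 # number of types, V
    v1 = sum(1 for c in counts.values() if c == 1)  # hapax types, V1
    return (v / n,
            math.log(v) / math.log(n),
            100 * math.log(n) / (1 - v1 / v))
```

For example, `lexical_variety("the cat sat on the mat".split())` yields the three measures for a six-token chunk with five types.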


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns.

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
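The two naming features can be computed from POS-tagged input roughly as follows (a minimal sketch with a hypothetical function name; Penn Treebank tags PRP and NNP are assumed for personal pronouns and proper nouns):

```python
def explicitation_features(tagged):
    """tagged: one chunk as a list of (token, Penn-Treebank-tag) pairs.
    Explicit naming = personal pronouns / proper nouns;
    single naming  = frequency of proper nouns with no proper-noun neighbor."""
    n = len(tagged)
    pronouns = sum(1 for _, tag in tagged if tag == "PRP")
    propers = sum(1 for _, tag in tagged if tag == "NNP")
    explicit_naming = pronouns / propers if propers else 0.0
    single = 0
    for i, (_, tag) in enumerate(tagged):
        if tag != "NNP":
            continue
        prev_is_nnp = i > 0 and tagged[i - 1][1] == "NNP"
        next_is_nnp = i < n - 1 and tagged[i + 1][1] == "NNP"
        if not prev_is_nnp and not next_is_nnp:
            single += 1
    return explicit_naming, single / n
```

On a chunk like "He met Angela Merkel and Obama", only "Obama" counts as a single naming, since "Angela Merkel" is a multi-token proper noun.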


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ("Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?")

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ("I sometimes think: could she have an unfeeling heart? Do you really have a heart of ice?")

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
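The two PMI features might be computed as below (a simplified sketch: the function name is mine, and probabilities are estimated from the chunk itself, whereas a large reference corpus could equally supply the counts):

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI over the chunk's bigram occurrences, and the number of
    bigram occurrences whose PMI exceeds 0 (the threshold-PMI feature)."""
    unigram_counts = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(pairs)
    n_uni, n_bi = len(tokens), len(pairs)

    def pmi(w1, w2):
        # PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) )
        p_bigram = bigram_counts[(w1, w2)] / n_bi
        p_w1 = unigram_counts[w1] / n_uni
        p_w2 = unigram_counts[w2] / n_uni
        return math.log2(p_bigram / (p_w1 * p_w2))

    scores = [pmi(w1, w2) for w1, w2 in pairs]
    average_pmi = sum(scores) / len(scores)
    threshold_pmi = sum(1 for s in scores if s > 0)
    return average_pmi, threshold_pmi
```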


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs. negative interference.

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words, and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
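Three of these feature families can be extracted with a few lines of code (a hedged sketch: the function name, the input format, and the restriction to trigrams only are my simplifications):

```python
from collections import Counter

def interference_features(sentences):
    """sentences: list of sentences, each a list of (token, POS-tag) pairs.
    Returns Counters for POS trigrams, character trigrams, and the token
    occupying each of the five sentence positions named above."""
    pos_trigrams, char_trigrams, positional = Counter(), Counter(), Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        for i in range(len(tags) - 2):
            pos_trigrams[tuple(tags[i:i + 3])] += 1
        for token, _ in sent:
            for i in range(len(token) - 2):
                char_trigrams[token[i:i + 3]] += 1
        if len(sent) >= 5:  # the five positions are distinct only then
            for name, idx in [("first", 0), ("second", 1),
                              ("antepenultimate", -3),
                              ("penultimate", -2), ("last", -1)]:
                positional[(name, sent[idx][0])] += 1
    return pos_trigrams, char_trigrams, positional
```

A sentence tagged CC PRP MD VB VBN, for instance, contributes one count to the MD+VB+VBN trigram discussed later in the analysis.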


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011).

PRONOUNS: The frequency of the occurrences of pronouns in the chunk.

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1 The normalized frequency of each punctuation mark in the chunk.
2 A non-normalized notion of frequency: n/tokens.
3 n/p, where p is the total number of punctuation marks in the chunk.

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams.
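The three punctuation-frequency variants could be sketched as follows (one plausible reading of the definitions above; the mark inventory and function name are my assumptions):

```python
from collections import Counter

PUNCTUATION = list("?!:;-()[]'\",.")  # assumed inventory of marks

def punctuation_features(tokens):
    """Per chunk: (1) each mark's count / number of tokens,
    (2) total punctuation count / number of tokens,
    (3) each mark's count / total punctuation count."""
    counts = Counter(t for t in tokens if t in PUNCTUATION)
    n_tokens = len(tokens)
    p_total = sum(counts.values())
    per_token = {m: counts[m] / n_tokens for m in PUNCTUATION}
    overall = p_total / n_tokens
    share = {m: (counts[m] / p_total if p_total else 0.0)
             for m in PUNCTUATION}
    return per_token, overall, share
```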


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference.


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  Ratio  LL  Weight
,       37.83    49.79   0.76  T   T1
(        0.42     0.72   0.58  T   T2
'        1.94     2.53   0.77  T   T3
)        0.40     0.72   0.56  T   T4
         0.30     0.30   1.00  —   —
[        0.01     0.02   0.45  T   —
]        0.01     0.02   0.44  T   —
"        0.33     0.22   1.46  O   O7
         0.22     0.17   1.25  O   O6
.       38.20    34.60   1.10  O   O5
         1.20     1.17   1.03  —   O4
         0.84     0.83   1.01  —   O3
         1.57     1.11   1.41  O   O2
-        2.68     2.25   1.19  O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2·√(JSD(PX || PCi))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O" if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1), and "T" otherwise.

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets.
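The labeling procedure can be sketched end-to-end (a toy illustration: the marker samples, ε value, and function names are mine; note that taking square roots preserves the comparison between the two products of distances, so the labels do not depend on the exact constant):

```python
import math

def unigram_lm(markers, vocab, eps=0.001):
    """Additively smoothed unigram LM over a marker-word sample (Om or Tm)."""
    counts = {}
    for w in markers:
        counts[w] = counts.get(w, 0) + 1
    total = len(markers)
    return {w: (counts.get(w, 0) + eps) / (total + eps * len(vocab))
            for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between distributions over one vocab."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_lm, t_lm, c1_lm, c2_lm):
    """Assign 'O'/'T' to clusters C1, C2 by the product-of-distances rule."""
    d = lambda x, c: 2 * math.sqrt(jsd(x, c))
    if d(o_lm, c1_lm) * d(t_lm, c2_lm) < d(o_lm, c2_lm) * d(t_lm, c1_lm):
        return "O", "T"
    return "T", "O"
```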


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                       EUR   HAN   LIT   TED
FW                                                                    88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                     91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                     95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers  94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)
                         (translation model × language model)
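Given explicit candidate scores, the argmax above is a one-liner (a toy sketch with hypothetical probability tables; a real decoder searches an exponential candidate space rather than enumerating it):

```python
import math

def best_translation(candidates, translation_model, language_model):
    """Noisy-channel selection: pick the E maximizing P(F|E) * P(E),
    scored in log space. Both models are toy dicts mapping a candidate
    English sentence to a probability (hypothetical values)."""
    return max(candidates,
               key=lambda e: math.log(translation_model[e]) +
                             math.log(language_model[e]))
```

With equal translation-model scores, the language model breaks the tie in favor of the more fluent candidate.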


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1..N} 1 / P_LM(wi | w1 ... w_{i−1}) )^(1/N)
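Equivalently, perplexity is the exponentiated average negative log-probability per token. A minimal sketch of this formula (the function name and toy probabilities are illustrative):

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities the LM assigns each token
    # given its history.  PP = exp(-(1/N) * sum_i log P(w_i | history)),
    # which equals the N-th root of the product of inverse probabilities.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning probability 1/4 to each of four tokens has perplexity 4.
print(round(perplexity([math.log(0.25)] * 4), 6))  # -> 4.0
```

Lower perplexity means the model finds the reference text less "surprising", which is why it serves as the fitness measure in the experiments below.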

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).


PERPLEXITY RESULTS

German-to-English translations:

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations:

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations:

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations:

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation:

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17


ABSTRACTION RESULTS

Named-entity abstraction:

Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89


ABSTRACTION RESULTS

Noun abstraction:

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91


ABSTRACTION RESULTS

Part-of-speech abstraction:

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11


MACHINE TRANSLATION RESULTS

DE-to-EN:
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR-to-EN:
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT-to-EN:
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL-to-EN:
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 82 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task     S→T     T→S
FR-EN    33.64   30.88
EN-FR    32.11   30.35
DE-EN    26.53   23.67
EN-DE    16.96   16.17
IT-EN    28.70   26.84
EN-IT    23.81   21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given a bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to MOSES.
Phrase-table interpolation using perplexity minimization (Sennrich, 2012).
Add a feature to the phrase table indicating the direction of translation.


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System      MIX     MIX-EO   MIX-FO
S→T         35.21   35.21    35.73
UNION       35.27   35.36    35.94
PPLMIN-1    35.46   35.59    36.26
PPLMIN-2    35.75   35.65    36.20
CrEnt       35.54   35.45    36.75
PplRatio    35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System      MIX     MIX-FO   MIX-EO
S→T         29.15   29.15    29.94
UNION       29.27   29.44    30.01
PPLMIN-1    29.64   29.94    29.65
PPLMIN-2    29.50   30.45    30.12
CrEnt       29.47   29.45    30.44
PplRatio    29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

        IT   FR   ES   DE   FI
IT      99   93   91   75   63
FR      90   98   90   83   70
ES      85   90   98   82   69
DE      73   74   74   98   70
FI      58   70   64   81   99


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                       Dist             Dist_w
Feature                EN T    FR T     EN T    FR T
POS-trigrams           1.02    1.16     1.07    1.34
positional tokens      1.10    1.15     1.30    1.37
function words         1.32    1.82     1.43    1.90
cohesive markers       1.95    1.95     2.15    2.03
feature combination    1.00    1.46     1.13    1.58
random tree            2.00             1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Translation Studies Hypotheses

HYPOTHESES

SIMPLIFICATION: Rendering complex linguistic features in the source text into simpler features in the target (Blum-Kulka and Levenston, 1983; Vanderauwera, 1985; Baker, 1993).

EXPLICITATION: The tendency to spell out in the target text utterances that are more implicit in the source (Blum-Kulka, 1986; Øverås, 1998; Baker, 1993).

NORMALIZATION: Efforts to standardize texts (Toury, 1995); "a strong preference for conventional grammaticality" (Baker, 1993).

INTERFERENCE: The fingerprints of the source language on the translation output (Toury, 1979).

MISCELLANEOUS


FEATURES SHOULD

1 Reflect frequent linguistic characteristics we would expect to be present in the two types of text.

2 Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3 Be easy to interpret, yielding insights regarding the differences between original and translated texts.


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three type-token ratio (TTR) measures, where V is the number of types, N is the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N    log(V)/log(N)    100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters.

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or spaces in a word.

MEAN SENTENCE LENGTH: In words.

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs.

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words.

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50.
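The lexical-variety measures are straightforward to compute from a tokenized chunk. A sketch; the function name and the edge-case handling for the third TTR variant (undefined when every type is a hapax) are my own choices, not from the slides:

```python
import math
from collections import Counter

def lexical_variety(tokens):
    # V = number of types, N = number of tokens, V1 = hapax legomena
    # (types occurring exactly once), as defined in the TTR measures.
    N = len(tokens)
    counts = Counter(tokens)
    V = len(counts)
    V1 = sum(1 for c in counts.values() if c == 1)
    ttr1 = V / N
    ttr2 = math.log(V) / math.log(N)
    # Third variant blows up when V1 == V; report infinity in that case.
    ttr3 = 100 * math.log(N) / (1 - V1 / V) if V1 < V else float("inf")
    mean_word_length = sum(len(t) for t in tokens) / N
    return ttr1, ttr2, ttr3, mean_word_length

f = lexical_variety("the cat saw the dog".split())
print(round(f[0], 2), round(f[3], 2))  # -> 0.8 3.0
```

In the classification experiments each feature is computed per fixed-size chunk, so N is roughly constant across instances.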


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu — O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns.

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
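These counts are easy to derive from a POS-tagged chunk. A minimal sketch, assuming Penn-Treebank-style tags (PRP for personal pronouns, NNP/NNPS for proper nouns); the function name and the tiny tagged example are hypothetical:

```python
def explicit_naming(tagged):
    # tagged: list of (token, POS) pairs.
    # Explicit naming = ratio of personal pronouns to proper nouns,
    # modeling the tendency of translations to spell out pronouns.
    pronouns = sum(1 for _, pos in tagged if pos == "PRP")
    propers = sum(1 for _, pos in tagged if pos in ("NNP", "NNPS"))
    return pronouns / propers if propers else 0.0

sent = [("He", "PRP"), ("met", "VBD"), ("Angela", "NNP"),
        ("Merkel", "NNP"), ("in", "IN"), ("Berlin", "NNP")]
print(explicit_naming(sent))  # -> 0.3333333333333333
```

Single naming and mean multiple naming follow the same pattern: scan the tag sequence for maximal runs of proper-noun tags and count runs of length one, or average the run lengths.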


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
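Pointwise mutual information can be estimated directly from chunk counts. A sketch under the assumption that both features are computed over bigram types, with probabilities estimated from the chunk itself (the slides do not fix these details):

```python
import math
from collections import Counter

def pmi_features(tokens):
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )  (Church and Hanks, 1990).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    pmis = []
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        pmis.append(math.log2(p_xy / (p_x * p_y)))
    average_pmi = sum(pmis) / len(pmis)
    threshold_pmi = sum(1 for p in pmis if p > 0)  # bigrams with PMI above 0
    return average_pmi, threshold_pmi

avg, above = pmi_features(["a", "b", "a", "b"])
```

High-PMI bigrams are conventional collocations, so a high average PMI is one quantitative signature of the normalization hypothesis.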


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs. negative interference.

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets 〈w1, w2, w3〉 where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
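Two of these interference features can be sketched directly; the helper names are illustrative:

```python
from collections import Counter

def pos_ngrams(tags, n):
    # Frequencies of POS n-grams -- a shallow proxy for syntactic structure.
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def positional_tokens(sentence):
    # Tokens occupying the five sentence positions listed above.
    return {"first": sentence[0], "second": sentence[1],
            "antepenultimate": sentence[-3], "penultimate": sentence[-2],
            "last": sentence[-1]}

print(pos_ngrams(["DT", "NN", "VBZ", "DT", "NN"], 2).most_common(1))
# -> [(('DT', 'NN'), 2)]
print(positional_tokens(["I", "do", "not", "like", "it", "."])["penultimate"])
# -> it
```

Because both features abstract away from content words, they satisfy the content-independence requirement while still capturing source-language word-order preferences.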


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ' ' " "

1 The normalized frequency of each punctuation mark in the chunk
2 A non-normalized notion of frequency: n/tokens
3 n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams.
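Any of the feature sets above can be fed into a chunk classifier. The experiments that follow use supervised classifiers over such vectors; as a toy stand-in (the original work uses SVM classifiers and the full Koppel-and-Ordan function-word list, both assumptions swapped out here), a nearest-centroid classifier over function-word frequencies can be sketched as:

```python
from collections import Counter

# A tiny illustrative subset; the actual studies use a much larger
# function-word list (Koppel and Ordan, 2011).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]

def fw_vector(tokens):
    """Normalized function-word frequencies for one chunk."""
    counts = Counter(t.lower() for t in tokens)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]

def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(chunk_tokens, centroid_o, centroid_t):
    """Label a chunk O or T by its nearer class centroid."""
    v = fw_vector(chunk_tokens)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "O" if dist(v, centroid_o) <= dist(v, centroid_t) else "T"
```

Training amounts to averaging the vectors of labeled O and T chunks; an SVM replaces the centroid comparison in the actual experiments.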


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
Sanity    Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features perform very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they capture both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors

The full stop (actually indicating sentence length) is a strong feature of O; the comma is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  Ratio  LL  Weight
,     3.783    4.979    0.76   T   T1
(     0.42     0.72     0.58   T   T2
'     1.94     2.53     0.77   T   T3
)     0.40     0.72     0.56   T   T4
      0.30     0.30     1.00   —   —
[     0.01     0.02     0.45   T   —
]     0.01     0.02     0.44   T   —
"     0.33     0.22     1.46   O   O7
      0.22     0.17     1.25   O   O6
.     3.820    3.460    1.10   O   O5
      1.20     1.17     1.03   —   O4
      0.84     0.83     1.01   —   O3
      1.57     1.11     1.41   O   O2
-     2.68     2.25     1.19   O   O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009.

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus   EUR   HAN   LIT   TED
FW                 96.3  98.1  97.3  97.7
char-trigrams      98.8  97.1  99.5  100.0
POS-trigrams       98.5  97.2  98.7  92.0
contextual FW      95.2  96.8  94.1  86.3
cohesive markers   83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus   EUR   HAN   LIT   TED
FW                 88.6  88.9  78.8  87.5
char-trigrams      72.1  63.8  70.3  78.6
POS-trigrams       96.9  76.0  70.7  76.1
contextual FW      92.9  93.2  68.2  67.0
cohesive markers   63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot labelthose classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2·√(JSD(PX ∥ PCi))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
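The labeling rule above can be sketched directly. This is an illustrative reimplementation under the assumption that the distance is the 2·√JSD form shown on the previous slide; only the cross-comparison logic matters here.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence of two distributions on the same support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label clusters 1 and 2 as O/T by cross-comparing JSD-based distances."""
    d = lambda x, c: 2 * math.sqrt(jsd(x, c))  # assumed distance form
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"
```

Here p_o and p_t are the unigram marker models of O and T, and p_c1, p_c2 are the corresponding distributions estimated from the two clusters.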


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                     EUR   HAN   LIT   TED
FW                                                                  88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                   91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                   95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers 94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain      93.7     99.5     99.8         92.2
XMeans: # of generated clusters 2        2        3            3
XMeans: accuracy by domain      93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

Ê = arg maxE P(E | F)
  = arg maxE P(F | E) × P(E) / P(F)
  = arg maxE P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = arg maxE∈English P(F | E) × P(E)
               [translation model] [language model]
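In log space, the noisy-channel objective is a simple argmax over hypotheses. The sketch below assumes a toy enumerable candidate set and hypothetical `tm_logprob`/`lm_logprob` scoring functions; a real decoder searches incrementally rather than enumerating full sentences.

```python
def best_translation(candidates, tm_logprob, lm_logprob):
    """Noisy-channel selection: argmax_E log P(F|E) + log P(E).

    `candidates` stands in for the decoder's search space, and the two
    scoring functions are assumed stand-ins for a trained translation
    model and language model.
    """
    return max(candidates, key=lambda e: tm_logprob(e) + lm_logprob(e))
```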


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translatedfrom other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model LM to a reference corpus w1 w2 … wN is evaluated using perplexity:

PP(LM; w1 … wN) = ( ∏i=1..N 1 / PLM(wi | w1 … wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al 2002)
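The perplexity formula translates directly into code. A minimal sketch, assuming a `prob(w, history)` callback that returns a smoothed, nonzero conditional probability:

```python
import math

def perplexity(prob, tokens):
    """Perplexity of a language model over a token sequence.

    `prob(w, history)` is assumed to return P_LM(w | history); a real LM
    applies smoothing so that no probability is zero.
    """
    log_sum = 0.0
    for i, w in enumerate(tokens):
        log_sum += -math.log2(prob(w, tokens[:i]))
    return 2 ** (log_sum / len(tokens))
```

For example, a uniform model over a four-word vocabulary yields perplexity 4 on any sequence of those words.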


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50  93.00   69.36   66.47
O-EN        468.09  103.74  79.57   76.79
T-DE        443.14  88.48   64.99   62.07
T-FR        460.98  99.90   76.23   73.38
T-IT        465.89  102.31  78.50   75.67
T-NL        457.02  97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05  99.04   75.60   72.68
O-EN        500.56  115.48  91.14   88.31
T-DE        486.78  108.50  84.39   81.41
T-FR        463.58  94.59   71.24   68.37
T-IT        476.05  102.69  79.23   76.36
T-NL        490.09  110.67  86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99  88.46   67.35   64.40
O-EN        415.47  99.92   79.27   76.34
T-DE        404.64  95.22   73.73   70.85
T-FR        395.99  89.44   68.38   65.54
T-IT        384.55  81.90   60.85   57.91
T-NL        411.58  98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89  90.73   69.05   66.08
O-EN        448.11  100.17  78.23   75.46
T-DE        437.68  93.67   71.54   68.57
T-FR        445.00  97.32   75.59   72.55
T-IT        448.11  100.19  78.06   75.19
T-NL        423.13  83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation
Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50      7.16
T-FR        99.52       24.58
T-IT        112.71      14.58
T-NL        126.44      4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         93.88       18.51
O-EN        115.20      —
T-DE        107.48      6.70
T-FR        88.96       22.77
T-IT        99.17       13.91
T-NL        110.72      3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67       4.81
T-FR        34.75       14.46
T-IT        36.85       9.30
T-NL        39.44       2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction
Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally and this is serious in the report by Mr Olivier it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du Conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term.

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S → T  T → S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset  S → T  T → S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset  S → T  T → S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
- Union: simple concatenation of corpora
- Two phrase tables: train a phrase table for each subset and pass both to MOSES
- Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
- Add a feature in the phrase table indicating the direction of translation


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System    MIX    MIX-EO  MIX-FO
S → T     35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System    MIX    MIX-FO  MIX-EO
S → T     29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference


The Power of Interference

THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test  IT  FR  ES  DE  FI
IT            99  93  91  75  63
FR            90  98  90  83  70
ES            85  90  98  82  69
DE            73  74  74  98  70
FI            58  70  64  81  99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations into English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist           Dist_w
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models trained on texts translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016).

The features of machine translationese

Robust identification of machine translation output

More generally the similarities and differences among variousinterference phenomena

TranslationNon-native speakersLearnersrsquo productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Translation Studies Hypotheses

FEATURES SHOULD

1. Reflect frequent linguistic characteristics we would expect to be present in the two types of text.

2. Be content-independent, indicating formal and stylistic differences between the texts that are not derived from differences in contents, domain, genre, etc.

3. Be easy to interpret, yielding insights regarding the differences between original and translated texts.


Translation Studies Hypotheses Simplification

SIMPLIFICATION

LEXICAL VARIETY: Three different type-token ratio (TTR) measures, where V is the number of types and N is the number of tokens per chunk; V1 is the number of types occurring only once in the chunk:

V/N        log(V)/log(N)        100 × log(N)/(1 − V1/V)

MEAN WORD LENGTH: In characters.

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or space in a word.

MEAN SENTENCE LENGTH: In words.

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs.

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words.

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50.
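As an illustration, the lexical-variety measures above can be computed from a tokenized chunk as follows (a minimal sketch; the function names are mine, not from the slides):

```python
import math
from collections import Counter

def ttr_measures(tokens):
    """The three type-token ratio (TTR) variants listed above:
    V = number of types, N = number of tokens, V1 = hapax types."""
    n = len(tokens)
    counts = Counter(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    return (
        v / n,                             # V / N
        math.log(v) / math.log(n),         # log V / log N
        100 * math.log(n) / (1 - v1 / v),  # 100·log N / (1 − V1/V); undefined if every type is a hapax
    )

def mean_word_length(tokens):
    """Mean word length, in characters."""
    return sum(len(t) for t in tokens) / len(tokens)
```

Such statistics are computed per fixed-size chunk, so the values are comparable across texts.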


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T: israelische Ministerpräsident Benjamin Netanjahu
O: Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns.

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
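Two of these features can be sketched directly, assuming Penn-Treebank-style POS tags; the marker list below is a small illustrative subset, not the full inventory used in the experiments:

```python
def explicit_naming(tagged):
    """Ratio of personal pronouns to proper nouns (Penn tags assumed)."""
    prp = sum(1 for _, tag in tagged if tag == "PRP")
    nnp = sum(1 for _, tag in tagged if tag in ("NNP", "NNPS"))
    return prp / nnp if nnp else 0.0

# Illustrative subset; multi-word markers such as "in fact" would
# additionally require n-gram matching.
COHESIVE = {"because", "but", "hence", "therefore", "moreover", "besides", "thus"}

def cohesive_marker_freq(tokens):
    """Frequency of cohesive markers, normalized by chunk length."""
    return sum(1 for t in tokens if t.lower() in COHESIVE) / len(tokens)
```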


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace?

(The German original repeats "ein kaltes Herz"; the French translation avoids the repetition, varying between "un cœur insensible" and "un cœur de glace".)

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
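The two PMI features can be sketched as follows; for illustration the probabilities are estimated from the chunk itself, whereas in practice they would come from a much larger reference corpus:

```python
import math
from collections import Counter

def bigram_pmis(tokens):
    """PMI of each bigram type: log2 p(x,y) / (p(x)·p(y))."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    pmi = {}
    for (w1, w2), c in bi.items():
        p_xy = c / (n - 1)
        pmi[(w1, w2)] = math.log2(p_xy / ((uni[w1] / n) * (uni[w2] / n)))
    return pmi

def average_pmi(tokens):
    """The AVERAGE PMI feature: mean PMI over bigram types in the chunk."""
    pmi = bigram_pmis(tokens)
    return sum(pmi.values()) / len(pmi)

def threshold_pmi(tokens):
    """The THRESHOLD PMI feature: number of bigrams with PMI > 0."""
    return sum(1 for v in bigram_pmis(tokens).values() if v > 0)
```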


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs. negative interference.

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translations to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
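Two of these feature families can be sketched as follows (a minimal illustration; the names are mine):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, a shallow proxy for morphology:
    they capture both affixes (-ion, -ble) and short function words."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Tokens in the first, second, antepenultimate, penultimate,
    and last positions of each (tokenized) sentence."""
    feats = Counter()
    for sent in sentences:
        if len(sent) < 5:
            continue  # the five positions would overlap in shorter sentences
        for name, idx in [("first", 0), ("second", 1),
                          ("antepenultimate", -3), ("penultimate", -2),
                          ("last", -1)]:
            feats[(name, sent[idx])] += 1
    return feats
```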


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011).

PRONOUNS: Frequency of the occurrences of pronouns in the chunk.

PUNCTUATION: - ( ) [ ] ' ' " "

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams.
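End-to-end, the supervised setup maps each chunk to a feature vector and trains a classifier on chunks whose translation direction is known. The slides do not tie the results above to one particular learner, so the nearest-centroid classifier below is only an illustrative stand-in:

```python
from collections import Counter

def featurize(tokens, vocab):
    """Normalized frequency vector over a fixed feature vocabulary
    (e.g., a function-word list), computed per chunk."""
    c = Counter(tokens)
    n = len(tokens)
    return [c[w] / n for w in vocab]

def train_centroids(X, y):
    """Mean feature vector per class label (here, 'O' and 'T')."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def predict(centroids, x):
    """Assign the label of the nearest centroid (squared Euclidean)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(centroids[lab], x))
```

Accuracy figures like those reported below would then come from cross-validation over labeled chunks.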


Supervised Classification


Supervised Classification

RESULTS SANITY CHECK

Category   Feature          Accuracy (%)

Sanity     Token unigrams   100
           Token bigrams    100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category         Feature                 Accuracy (%)

Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has a lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category        Feature                Accuracy (%)

Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)

Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in Europarl; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001):

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)

Interference   POS unigrams                  90
               POS bigrams                   97
               POS trigrams                  98
               Character unigrams            85
               Character bigrams             98
               Character trigrams           100
               Prefixes and suffixes         80
               Contextual function words    100
               Positional token frequency    97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams is an extremely cheap and efficient classifier.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)

Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)

Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors:

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks, 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark    O       T       Ratio   LL   Weight
,       37.83   49.79   0.76    T    T1
(        0.42    0.72   0.58    T    T2
'        1.94    2.53   0.77    T    T3
)        0.40    0.72   0.56    T    T4
         0.30    0.30   1.00    —    —
[        0.01    0.02   0.45    T    —
]        0.01    0.02   0.44    T    —
"        0.33    0.22   1.46    O    O7
         0.22    0.17   1.25    O    O6
.       38.20   34.60   1.10    O    O5
         1.20    1.17   1.03    —    O4
         0.84    0.83   1.01    —    O3
         1.57    1.11   1.41    O    O2
-        2.68    2.25   1.19    O    O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not be generalized to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED

FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5  100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation

EUR              –     60.8   56.2       96.3
HAN             59.7    –     58.7       98.1
LIT             64.3   61.5    –         97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation

EUR + HAN        –      –     63.8       94.0
EUR + LIT        –     64.1    –         92.9
HAN + LIT       59.8    –      –         96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED

FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2·√JSD(PX || PCi)


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
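The labeling procedure can be sketched as follows; the ε value and the toy marker counts are illustrative, while the smoothed unigram models, the JSD-based distance, and the product decision rule follow the definitions above:

```python
import math

def smoothed_lm(counts, vocab, eps=1e-3):
    """Add-epsilon unigram model: p(w) = (tf(w)+ε) / (total+ε·|V|)."""
    total = sum(counts.values())
    denom = total + eps * len(vocab)
    return {w: (counts.get(w, 0) + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Label two clusters 'O'/'T' by the product rule above."""
    P = {"O": smoothed_lm(o_markers, vocab),
         "T": smoothed_lm(t_markers, vocab),
         "C1": smoothed_lm(c1, vocab),
         "C2": smoothed_lm(c2, vocab)}
    def d(x, c):
        return 2 * math.sqrt(jsd(P[x], P[c]))
    if d("O", "C1") * d("T", "C2") < d("O", "C2") * d("T", "C1"):
        return "O", "T"
    return "T", "O"
```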


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                          EUR    HAN    LIT    TED

FW                                       88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams        91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW        95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
    + contextual FW + cohesive markers   94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering to three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus mix               EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain          93.7      99.5        99.8        92.2
XMeans: generated # of clusters        2         2           3           3
XMeans: accuracy by domain          93.6      99.5          –          92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus mix    EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat                     92.5      60.7        77.5        66.8
Two-phase                91.3      79.4        85.3        67.5


Applications for Machine Translation


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f_1 ... f_m into an English sentence E = e_1 ... e_l. The best translation:

\hat{E} = \arg\max_E P(E \mid F) = \arg\max_E \frac{P(F \mid E) \times P(E)}{P(F)} = \arg\max_E P(F \mid E) \times P(E)

The noisy channel thus requires two components, a translation model and a language model:

\hat{E} = \arg\max_{E \in \text{English}} \underbrace{P(F \mid E)}_{\text{Translation model}} \times \underbrace{P(E)}_{\text{Language model}}
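A toy rendering of this decision rule. The probability tables below are invented for illustration; a real decoder searches a huge hypothesis space rather than enumerating candidates:

```python
def decode(f, candidates, tm, lm):
    """Noisy-channel decision rule: argmax over E of P(F|E) * P(E)."""
    return max(candidates, key=lambda e: tm.get((f, e), 0.0) * lm.get(e, 0.0))

# Invented toy tables: both candidates are equally "faithful" under the
# translation model, so the language model decides in favor of fluent order.
tm = {("la maison", "the house"): 0.8, ("la maison", "house the"): 0.8}
lm = {"the house": 0.6, "house the": 0.01}
best = decode("la maison", ["the house", "house the"], tm, lm)
```

This is exactly why the choice of LM training material (original vs. translated text) can change the decoder's output, as the experiments below show.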


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w_1 w_2 \ldots w_N) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P_{LM}(w_i \mid w_1 \ldots w_{i-1})}}

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
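A minimal, add-alpha-smoothed unigram illustration of the perplexity measure above. This only shows the computation; the actual experiments use higher-order smoothed models:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha unigram LM on a test text (sketch)."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values())

    def prob(w):
        return (counts[w] + alpha) / (total + alpha * len(vocab))

    # exp of the average negative log-probability == the N-th root formula.
    neg_log_likelihood = -sum(math.log(prob(w)) for w in test_tokens)
    return math.exp(neg_log_likelihood / len(test_tokens))

pp = unigram_perplexity(["a", "b", "a", "b"], ["a", "b"])
```

Lower perplexity means the LM fits the reference corpus better, which is the criterion used in the tables that follow.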


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50   93.00   69.36   66.47
O-EN        468.09  103.74   79.57   76.79
T-DE        443.14   88.48   64.99   62.07
T-FR        460.98   99.90   76.23   73.38
T-IT        465.89  102.31   78.50   75.67
T-NL        457.02   97.34   73.54   70.56


PERPLEXITY RESULTS

French to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05   99.04   75.60   72.68
O-EN        500.56  115.48   91.14   88.31
T-DE        486.78  108.50   84.39   81.41
T-FR        463.58   94.59   71.24   68.37
T-IT        476.05  102.69   79.23   76.36
T-NL        490.09  110.67   86.61   83.55


PERPLEXITY RESULTS

Italian to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99   88.46   67.35   64.40
O-EN        415.47   99.92   79.27   76.34
T-DE        404.64   95.22   73.73   70.85
T-FR        395.99   89.44   68.38   65.54
T-IT        384.55   81.90   60.85   57.91
T-NL        411.58   98.78   76.98   73.94


PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89   90.73   69.05   66.08
O-EN        448.11  100.17   78.23   75.46
T-DE        437.68   93.67   71.54   68.57
T-FR        445.00   97.32   75.59   72.55
T-IT        448.11  100.19   78.06   75.19
T-NL        423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang  Perplexity  Improvement (%, over O-EN)
MIX         105.91      19.73
O-EN        131.94      –
T-DE        122.50       7.16
T-FR         99.52      24.58
T-IT        112.71      14.58
T-NL        126.44       4.17


ABSTRACTION RESULTS

Named-entity abstraction

Orig. Lang  Perplexity  Improvement (%, over O-EN)
MIX          93.88      18.51
O-EN        115.20      –
T-DE        107.48       6.70
T-FR         88.96      22.77
T-IT         99.17      13.91
T-NL        110.72       3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang  Perplexity  Improvement (%, over O-EN)
MIX         36.02       11.34
O-EN        40.62       –
T-DE        38.67        4.81
T-FR        34.75       14.46
T-IT        36.85        9.30
T-NL        39.44        2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang  Perplexity  Improvement (%, over O-EN)
MIX         7.99        2.66
O-EN        8.20        –
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


MACHINE TRANSLATION RESULTS

BLEU scores by language model, per translation task:

LM    DE→EN  FR→EN  IT→EN  NL→EN
MIX   21.43  28.67  25.41  24.20
O-EN  21.10  27.98  24.69  23.40
T-DE  21.90  28.01  24.62  24.26
T-FR  21.16  29.14  25.37  23.56
T-IT  21.29  28.75  25.96  23.87
T-NL  21.20  28.11  24.77  24.52


DOES SIZE MATTER?

[Figures omitted in the text extraction]



TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can a translation model be built that is adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S

FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
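The perplexity-minimization idea can be sketched as choosing the interpolation weight between two probability tables so as to minimize perplexity on a development set. Sennrich's method fits the weights efficiently; this brute-force grid search with invented toy tables is only illustrative:

```python
import math

def interpolate(p1, p2, lam):
    """Linear interpolation of two probability tables."""
    keys = set(p1) | set(p2)
    return {k: lam * p1.get(k, 0.0) + (1 - lam) * p2.get(k, 0.0) for k in keys}

def perplexity(model, dev_events):
    """Perplexity of a table over a list of observed events."""
    return math.exp(-sum(math.log(model[e]) for e in dev_events) / len(dev_events))

def best_lambda(p1, p2, dev_events):
    """Grid-search the weight minimizing dev-set perplexity."""
    grid = [i / 10 for i in range(1, 10)]  # interior weights avoid log(0)
    return min(grid, key=lambda lam: perplexity(interpolate(p1, p2, lam), dev_events))

# Toy tables: the dev set looks like p1, so a high weight on p1 should win.
lam = best_lambda({"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}, ["a", "a", "a"])
```

The same logic applies whether the two tables come from S→T and T→S subsets of a bi-text or from in-domain and out-of-domain corpora.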


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1 TEST ON L2)

Train\Test  IT  FR  ES  DE  FI
IT          99  93  91  75  63
FR          90  98  90  83  70
ES          85  90  98  82  69
DE          73  74  74  98  70
FI          58  70  64  81  99


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

[Figures omitted in the text extraction]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

[Figure omitted in the text extraction]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

[Figures omitted in the text extraction]


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                     Dist           DistW
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree: 2.00 (Dist), 1.95 (DistW)


NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3   80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1   88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3   86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2   87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4   74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2   82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1   78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0   80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5   77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3   85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80   76.9   80.0   78.4


OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models built from texts translated in the same direction as that of the SMT task are better than ones built from texts translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. doi: 10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. doi: 10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. doi: 10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. doi: 10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. doi: 10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, pages 937–944, Manchester, UK, August 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


SIMPLIFICATION

LEXICAL VARIETY: Three type-token ratio (TTR) measures, where V is the number of types and N the number of tokens per chunk, and V1 is the number of types occurring only once in the chunk:

V/N,   log(V)/log(N),   100 × log(N)/(1 − V1/V)
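The three lexical-variety measures, computed per chunk. A sketch only: the third measure is undefined for a chunk in which every type is a hapax (the denominator becomes zero):

```python
import math
from collections import Counter

def lexical_variety(tokens):
    """The three TTR measures: V/N, log(V)/log(N), 100*log(N)/(1 - V1/V)."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)  # types occurring only once
    return v / n, math.log(v) / math.log(n), 100 * math.log(n) / (1 - v1 / v)

t1, t2, t3 = lexical_variety("a a b b c".split())
```

In the experiments each measure becomes one feature of the chunk's feature vector.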

MEAN WORD LENGTH: In characters.

SYLLABLE RATIO: The number of vowel sequences that are delimited by consonants or a space in a word.

MEAN SENTENCE LENGTH: In words.

LEXICAL DENSITY: The frequency of tokens that are not nouns, adjectives, adverbs, or verbs.

MEAN WORD RANK: The average rank of words in a frequency-ordered list of the 6,000 most frequent English words.

MOST FREQUENT WORDS: The frequencies of the N most frequent words, N = 5, 10, 50.


EXPLICITATION

EXAMPLE (EXPLICITATION)

T Oisraelische Ministerprasident Benjamin Netanjahu Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns.

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
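The naming features reduce to simple counts over POS-tagged tokens. A sketch of explicit naming; the Penn-treebank-style tag names are an assumption here, not necessarily the tagger used in the experiments:

```python
def explicit_naming_ratio(tagged):
    """Ratio of personal pronouns to proper nouns over (token, tag) pairs.
    Tag names (PRP, NNP, ...) follow the Penn treebank convention."""
    pronouns = sum(1 for _, t in tagged if t in ("PRP", "PRP$"))
    propers = sum(1 for _, t in tagged if t in ("NNP", "NNPS"))
    # A chunk with no proper nouns at all has an undefined ratio.
    return pronouns / propers if propers else float("inf")

ratio = explicit_naming_ratio([("He", "PRP"), ("met", "VBD"), ("Merkel", "NNP")])
```

Single naming and mean multiple naming are analogous counts over runs of adjacent proper-noun tags.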


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS, BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois : aurait-elle un cœur insensible ? Avez-vous vraiment un cœur de glace ?

CONTRACTIONS: The ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
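Both PMI features can be computed from chunk-internal counts. A sketch, not the exact published recipe: probabilities are raw MLE estimates, and the average here is taken over bigram types:

```python
import math
from collections import Counter

def pmi_features(tokens):
    """Average PMI over bigram types, and the count of bigrams with PMI > 0.
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from the chunk."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1

    def pmi(w1, w2):
        p_xy = bigrams[(w1, w2)] / n_bi
        return math.log2(p_xy / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

    scores = [pmi(w1, w2) for (w1, w2) in bigrams]
    return sum(scores) / len(scores), sum(1 for s in scores if s > 0)

avg_pmi, above_zero = pmi_features(["a", "b", "a", "b"])
```

High average PMI indicates many conventionalized collocations, which the normalization hypothesis predicts to be more frequent in translated text.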


INTERFERENCE

Fingerprints of the source text: the "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs negative interference

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German.

NEGATIVE: "One is" is over-represented in translations from German to English (triggered by "Man ist").


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
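These interference features are all countable from a POS-tagged sentence. A sketch of the POS-trigram and positional-token extractors; the tag set and feature-name scheme are illustrative, not the experiments' exact encoding:

```python
from collections import Counter

def interference_features(tagged):
    """POS-trigram and positional-token counts for one (token, tag) sentence."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    feats = Counter()
    # POS trigrams: every window of three consecutive tags.
    for i in range(len(tags) - 2):
        feats["pos3:" + "_".join(tags[i:i + 3])] += 1
    # Positional tokens: which word occupies each distinguished position.
    positions = [("first", 0), ("second", 1),
                 ("antepenult", -3), ("penult", -2), ("last", -1)]
    for name, idx in positions:
        if -len(words) <= idx < len(words):
            feats[name + ":" + words[idx]] += 1
    return feats

feats = interference_features([("the", "DT"), ("cat", "NN"),
                               ("sat", "VB"), ("down", "RB")])
```

Chunk-level vectors are then just sums of such sentence-level counters, which is the representation fed to the classifiers below.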


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: The frequency of pronouns in the chunk

PUNCTUATION: . , ? ! : ; - ( ) [ ] ' ' " "

1. The normalized frequency of each punctuation mark in the chunk (n/tokens)
2. A non-normalized notion of frequency: the raw count n
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams
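The three punctuation variants differ only in the normalizer. A sketch of all three (illustrative only; the mark inventory here is an assumption):

```python
from collections import Counter

MARKS = [".", ",", "?", "!", ":", ";", "-", "(", ")", "[", "]", "'", '"']

def punctuation_features(tokens):
    """Variant 1: frequency normalized by chunk length; variant 2: raw
    counts; variant 3: frequency relative to all punctuation in the chunk."""
    counts = Counter(t for t in tokens if t in MARKS)
    n_tokens = len(tokens)
    n_punct = sum(counts.values()) or 1   # avoid division by zero
    v1 = {m: counts[m] / n_tokens for m in MARKS}
    v2 = {m: counts[m] for m in MARKS}
    v3 = {m: counts[m] / n_punct for m in MARKS}
    return v1, v2, v3

tokens = "we support free trade , but we must learn . yes !".split()
v1, v2, v3 = punctuation_features(tokens)
```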


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words (and therefore attesting to many fixed expressions), performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has 'stand idly', 'stand firm', 'stand trial', etc.; conversely, T includes poorly associated bigrams such as 'stand unamended'


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words

For example, typical trigrams in O are '-ion' and 'all', whereas '-ble' and 'the' are typical of T

Part-of-speech trigrams constitute an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., 'must be taken', 'should be given', 'can be used'


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97 accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001)

This is in fact standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function-words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark    O      T      Ratio  LL  Weight
,       37.83  49.79  0.76   T   T1
(        0.42   0.72  0.58   T   T2
'        1.94   2.53  0.77   T   T3
)        0.40   0.72  0.56   T   T4
;        0.30   0.30  1.00   —   —
[        0.01   0.02  0.45   T   —
]        0.01   0.02  0.44   T   —
"        0.33   0.22  1.46   O   O7
!        0.22   0.17  1.25   O   O6
.       38.20  34.60  1.10   O   O5
:        1.20   1.17  1.03   —   O4
?        0.84   0.83  1.01   —   O3
'        1.57   1.11  1.41   O   O2
-        2.68   2.25  1.19   O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED
FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation
EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED
FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

    p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)

    p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

    DJS(X, Ci) = 2 · √(JSD(PX ∥ PCi))
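The labeling computation is straightforward to sketch. The marker lists and the cluster below are toy data; the code follows the add-ε unigram model and the JSD-based distance described above (an illustration, not the original implementation):

```python
import math
from collections import Counter

def smoothed_lm(tokens, vocab, eps=0.001):
    """Add-epsilon unigram model: p(w) = (tf(w) + eps) / (N + eps * |V|)."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) over a shared vocabulary."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

vocab = {"but", "the", "of", "moreover"}
o_markers = "but the the of".split()      # words typical of O (toy data)
t_markers = "moreover the of of".split()  # words typical of T (toy data)
cluster = "but but the of".split()        # an unlabeled cluster

d_o = jsd(smoothed_lm(o_markers, vocab), smoothed_lm(cluster, vocab))
d_t = jsd(smoothed_lm(t_markers, vocab), smoothed_lm(cluster, vocab))
label = "O" if d_o < d_t else "T"
```

The cluster here over-uses 'but' and so ends up closer to the O-marker model.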


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

    label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
                "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting over several feature sets


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                    EUR   HAN   LIT   TED
FW                                                 88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                  91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                  95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers               94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering to three clusters yieldsthe same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

— Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

    E* = argmax_E P(E | F)
       = argmax_E P(F | E) × P(E) / P(F)
       = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

    E* = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                               (translation model)  (language model)
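The decision rule can be illustrated with a toy decoder that scores an explicit candidate list; all probabilities here are hypothetical:

```python
import math

def best_translation(source, candidates, tm, lm):
    """Noisy-channel scoring: pick E maximizing log P(F|E) + log P(E)."""
    return max(candidates,
               key=lambda e: math.log(tm[(source, e)]) + math.log(lm[e]))

# hypothetical model scores for one source word
tm = {("maison", "house"): 0.6, ("maison", "home"): 0.4}  # P(F|E)
lm = {"house": 0.02, "home": 0.05}                        # P(E)
choice = best_translation("maison", ["house", "home"], tm, lm)
```

Here the language model overrides the translation model's preference (0.6 × 0.02 < 0.4 × 0.05), which is exactly the interaction the noisy channel is designed to capture.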


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)
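Computed in log space to avoid underflow, this definition amounts to the following sketch (a hypothetical uniform model stands in for a real LM):

```python
import math

def perplexity(logprob, tokens):
    """PP = 2 ** (-(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}))."""
    n = len(tokens)
    total = sum(logprob(tokens[:i], w) for i, w in enumerate(tokens))
    return 2 ** (-total / n)

# a hypothetical LM that assigns every token probability 1/4
uniform = lambda history, word: math.log2(0.25)
pp = perplexity(uniform, "we support free trade".split())
# a uniform model over 4 outcomes has perplexity exactly 4
```

Lower perplexity on held-out text means the LM fits the reference corpus better.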

Train SMT systems (Koehn et al., 2007) using different LMs, and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         451.50   93.00   69.36   66.47
O-EN        468.09  103.74   79.57   76.79
T-DE        443.14   88.48   64.99   62.07
T-FR        460.98   99.90   76.23   73.38
T-IT        465.89  102.31   78.50   75.67
T-NL        457.02   97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         472.05   99.04   75.60   72.68
O-EN        500.56  115.48   91.14   88.31
T-DE        486.78  108.50   84.39   81.41
T-FR        463.58   94.59   71.24   68.37
T-IT        476.05  102.69   79.23   76.36
T-NL        490.09  110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         395.99   88.46   67.35   64.40
O-EN        415.47   99.92   79.27   76.34
T-DE        404.64   95.22   73.73   70.85
T-FR        395.99   89.44   68.38   65.54
T-IT        384.55   81.90   60.85   57.91
T-NL        411.58   98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang  1-gram  2-gram  3-gram  4-gram
Mix         434.89   90.73   69.05   66.08
O-EN        448.11  100.17   78.23   75.46
T-DE        437.68   93.67   71.54   68.57
T-FR        445.00   97.32   75.59   72.55
T-IT        448.11  100.19   78.06   75.19
T-NL        423.13   83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation

Orig. Lang  Perplexity  Improvement (%)
MIX         105.91      19.73
O-EN        131.94      —
T-DE        122.50       7.16
T-FR         99.52      24.58
T-IT        112.71      14.58
T-NL        126.44       4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX          93.88      18.51
O-EN        115.20      —
T-DE        107.48       6.70
T-FR         88.96      22.77
T-IT         99.17      13.91
T-NL        110.72       3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         36.02       11.34
O-EN        40.62       —
T-DE        38.67        4.81
T-FR        34.75       14.46
T-IT        36.85        9.30
T-NL        39.44        2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang  Perplexity  Improvement (%)
MIX         7.99        2.66
O-EN        8.20        —
T-DE        8.08        1.47
T-FR        7.89        3.77
T-IT        8.00        2.47
T-NL        8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du Conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
- Union: simple concatenation of the corpora
- Two phrase tables: train a phrase table for each subset and pass both to MOSES
- Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
- An additional feature in the phrase table indicating the direction of translation


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

     IT  FR  ES  DE  FI
IT   99  93  91  75  63
FR   90  98  90  83  70
ES   85  90  98  82  69
DE   73  74  74  98  70
FI   58  70  64  81  99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist           Distw
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

     ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA   80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI    3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE    2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER    1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN    2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA    2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN    2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR    1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA    2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL    0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR    4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models trained on texts translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016).

The features of machine translationese.

Robust identification of machine translation output.

More generally, the similarities and differences among various interference phenomena:

Translation; non-native speakers; learners' productions.


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, August 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. doi: 10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-based Study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. doi: 10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. doi: 10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Translation Studies Hypotheses Explicitation

EXPLICITATION

EXAMPLE (EXPLICITATION)

T / O: israelische Ministerpräsident Benjamin Netanjahu; Merkel

EXPLICIT NAMING: The ratio of personal pronouns to proper nouns; this models the tendency to spell out pronouns.

SINGLE NAMING: The frequency of proper nouns consisting of a single token, not having an additional proper noun as a neighbor.

MEAN MULTIPLE NAMING: The average length (in tokens) of proper nouns.

COHESIVE MARKERS: The frequencies of several cohesive markers (because, but, hence, in fact, therefore, ...).
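As a rough illustration, the first and last of these features can be computed from POS-tagged text. The sketch below is an assumption about the setup, not the original implementation: the Penn Treebank tag names and the tiny marker list are illustrative only.

```python
# Sketch of two explicitation features over a chunk given as (token, POS)
# pairs in Penn Treebank tags; tags, markers, and data are illustrative.

COHESIVE_MARKERS = ["because", "but", "hence", "in fact", "therefore"]

def explicit_naming(tagged):
    """Ratio of personal pronouns (PRP) to proper nouns (NNP/NNPS)."""
    pronouns = sum(1 for _, t in tagged if t == "PRP")
    propers = sum(1 for _, t in tagged if t in ("NNP", "NNPS"))
    return pronouns / propers if propers else 0.0

def cohesive_marker_freq(text, marker):
    """Frequency of one cohesive marker, normalized by token count."""
    tokens = text.lower().split()
    n = len(tokens)
    # multi-word markers are counted as substrings, single words as tokens
    count = text.lower().count(marker) if " " in marker else tokens.count(marker)
    return count / n if n else 0.0

tagged = [("He", "PRP"), ("met", "VBD"), ("Merkel", "NNP"),
          ("because", "IN"), ("she", "PRP"), ("asked", "VBD")]
print(explicit_naming(tagged))  # 2 pronouns / 1 proper noun = 2.0
```

A chunk's explicitation feature vector would simply concatenate such values for every marker in the list.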


Translation Studies Hypotheses Normalization

NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS; BEN-ARI (1998))
O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz? ('Sometimes I think: do you perhaps have a cold heart? Do you really have a cold heart?')

T: Je pense quelquefois: aurait-elle un cœur insensible? Avez-vous vraiment un cœur de glace? ('I sometimes think: might she have an unfeeling heart? Do you really have a heart of ice?'; the translation avoids the repetition)

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
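The two PMI features can be sketched as follows. For self-containment this sketch estimates the probabilities from the chunk itself, which is an assumption; a large external corpus could equally supply the counts.

```python
import math
from collections import Counter

def pmi_features(tokens, threshold=0.0):
    """Average PMI of all bigrams in a chunk, plus the number of bigrams
    whose PMI exceeds the threshold (a sketch of the two PMI features)."""
    bigrams = list(zip(tokens, tokens[1:]))
    uni, bi = Counter(tokens), Counter(bigrams)
    n_uni, n_bi = len(tokens), len(bigrams)
    pmis = []
    for (w1, w2) in bigrams:
        p_xy = bi[(w1, w2)] / n_bi          # joint bigram probability
        p_x, p_y = uni[w1] / n_uni, uni[w2] / n_uni
        pmis.append(math.log2(p_xy / (p_x * p_y)))
    average_pmi = sum(pmis) / len(pmis)
    above_threshold = sum(1 for v in pmis if v > threshold)
    return average_pmi, above_threshold
```

Highly associated pairs such as "stand firm" get large positive PMI, so original English, with its many fixed expressions, tends to score higher on the threshold feature.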


Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs. negative interference.

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translation to German.

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist).


INTERFERENCE

POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩, where at least two of the elements are function words, and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
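Two of these feature families can be sketched in a few lines. The whitespace tokenization and the skipping of very short sentences are simplifications assumed here, not details taken from the original experiments.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts: a shallow proxy for morphology."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Counts of tokens in the first, second, antepenultimate,
    penultimate, and last sentence positions."""
    feats = Counter()
    for sent in sentences:
        toks = sent.split()
        if len(toks) < 5:
            continue  # positions would overlap in very short sentences
        for name, tok in [("first", toks[0]), ("second", toks[1]),
                          ("antepenult", toks[-3]), ("penult", toks[-2]),
                          ("last", toks[-1])]:
            feats[(name, tok)] += 1
    return feats
```

A chunk is then represented by the (normalized) counts of, say, its most frequent character trigrams and positional tokens.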


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011).

PRONOUNS: Frequency of the occurrences of pronouns in the chunk.

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk.
2. A non-normalized notion of frequency: n/tokens.
3. n/p, where p is the total number of punctuation marks in the chunk.

RATIO OF PASSIVE FORMS TO ALL VERBS.

SANITY CHECK: Token unigrams and bigrams.
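Taken together, these feature sets feed a supervised O-vs-T classifier. As a minimal, self-contained illustration, here is a nearest-centroid stand-in for the actual classifiers used in the experiments; the marker list and chunks below are invented.

```python
# Toy supervised O-vs-T classification over function-word frequencies,
# using a nearest-centroid rule (a stand-in, not the original classifier).

def freq_vector(text, markers):
    """Normalized frequency of each marker word in the chunk."""
    toks = text.lower().split()
    return [toks.count(m) / len(toks) for m in markers]

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(vec, cent_o, cent_t):
    """Assign the label of the nearer centroid (squared Euclidean)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "O" if d(vec, cent_o) < d(vec, cent_t) else "T"

markers = ["moreover"]
cent_o = centroid([freq_vector("we must act now", markers)])
cent_t = centroid([freq_vector("moreover we must act", markers)])
print(classify(freq_vector("moreover moreover we act", markers),
               cent_o, cent_t))  # T
```

Any off-the-shelf learner (the experiments below report accuracies from standard classifiers under cross-validation) can replace the centroid rule without changing the feature extraction.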


Supervised Classification


RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.
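The slides do not spell out how the three TTR variants differ, so the sketch below implements only the plainest definitions of three simplification features; the vowel-cluster approximation of syllables is an assumption made for self-containment.

```python
import re

def ttr(tokens):
    """Plain type/token ratio (the simplest of the TTR variants)."""
    return len(set(tokens)) / len(tokens)

def mean_word_length(tokens):
    """Average token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens)

def syllable_ratio(tokens):
    """Syllables per token, approximating syllables as vowel clusters."""
    syllables = sum(len(re.findall(r"[aeiouy]+", t.lower())) for t in tokens)
    return syllables / len(tokens)
```

Lower TTR, shorter words, and fewer syllables per word would all point toward the simplification hypothesis for a chunk.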


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                     Accuracy (%)
Interference  POS unigrams                90
              POS bigrams                 97
              POS trigrams                98
              Character unigrams          85
              Character bigrams           98
              Character trigrams          100
              Prefixes and suffixes       80
              Contextual function words   100
              Positional token frequency  97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they pick up on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams constitute an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                     Accuracy (%)
Interference  POS bigrams                 96
              POS trigrams                96
              Character bigrams           95
              Character trigrams          96
              Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  O (freq)  T (freq)  Ratio  LL  Weight
,     3.783     4.979     0.76   T   T1
(     0.42      0.72      0.58   T   T2
'     1.94      2.53      0.77   T   T3
)     0.40      0.72      0.56   T   T4
…     0.30      0.30      1.00   —   —
[     0.01      0.02      0.45   T   —
]     0.01      0.02      0.44   T   —
"     0.33      0.22      1.46   O   O7
…     0.22      0.17      1.25   O   O6
.     3.820     3.460     1.10   O   O5
…     1.20      1.17      1.03   —   O4
…     0.84      0.83      1.01   —   O3
…     1.57      1.11      1.41   O   O2
-     2.68      2.25      1.19   O   O1

(… marks a row whose punctuation symbol did not survive the text extraction)


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED

FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation

EUR           –     60.8  56.2  96.3
HAN           59.7  –     58.7  98.1
LIT           64.3  61.5  –     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation

EUR + HAN     –     –     63.8  94.0
EUR + LIT     –     64.1  –     92.9
HAN + LIT     59.8  –     –     96.0


CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus  EUR   HAN   LIT   TED

FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε · |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε · |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

D_JS(X, Ci) = 2 ^ √(JSD(P_X ‖ P_Ci))
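The cluster-labeling step can be sketched as follows. The representation of language models as plain dicts over a shared vocabulary, and the toy distributions in the example, are assumptions made for self-containment.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) of two distributions
    given as dicts mapping words to probabilities."""
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in set(p) | set(q)}
    def kl(a, b):
        return sum(pa * math.log2(pa / b[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance(marker_lm, cluster_lm):
    """D_JS(X, C) = 2 ** sqrt(JSD), as on the slide."""
    return 2 ** math.sqrt(jsd(marker_lm, cluster_lm))

def label_clusters(o_lm, t_lm, c1, c2):
    """Assign "O"/"T" so that the product of distances is minimized."""
    if distance(o_lm, c1) * distance(t_lm, c2) < \
       distance(o_lm, c2) * distance(t_lm, c1):
        return "O", "T"
    return "T", "O"
```

Because the decision compares products of monotone transforms of JSD, any monotone variant of the distance yields the same labeling.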


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets.


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                    EUR   HAN   LIT   TED

FW                                                 88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                  91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                  95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers               94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

[Two figure-only slides; the sensitivity plots are not preserved in this transcript.]

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpora                 EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

KMeans: accuracy by domain       93.7     99.5     99.8         92.2
XMeans: generated # of clusters  2        2        3            3
XMeans: accuracy by domain       93.6     99.5     –            92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

(Warren Weaver, 1955)

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                          (translation model × language model)
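The decision rule can be illustrated on a toy candidate set. Everything below is invented for illustration: a real decoder searches a vastly larger hypothesis space, and the model scores come from estimated probability tables, not hand-picked numbers.

```python
# Toy illustration of the noisy-channel decision rule
# E* = argmax_E P(F|E) * P(E), over a tiny hypothetical candidate set.

def best_translation(f, candidates, tm, lm):
    """tm[(f, e)] stands in for P(F|E), lm[e] for P(E)."""
    return max(candidates, key=lambda e: tm.get((f, e), 0.0) * lm.get(e, 0.0))

tm = {("maison", "house"): 0.8, ("maison", "home"): 0.7}
lm = {"house": 0.04, "home": 0.05}
print(best_translation("maison", ["house", "home"], tm, lm))  # home
```

Note how the language model can overturn the translation model's preference (0.7 × 0.05 = 0.035 beats 0.8 × 0.04 = 0.032), which is exactly why the choice of language-model training data matters in what follows.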


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).


LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)
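The formula can be checked with a minimal implementation, computed in log space for numerical stability; the toy unigram model below ignores the history and uses invented probabilities.

```python
import math

def perplexity(lm_prob, words):
    # PP = (prod_i 1 / P(w_i | history))^(1/N), computed in log space.
    n = len(words)
    log_sum = sum(math.log(lm_prob(words[:i], w)) for i, w in enumerate(words))
    return math.exp(-log_sum / n)

# Hypothetical unigram LM: the history argument is ignored.
probs = {"the": 0.4, "house": 0.3, "blue": 0.2, "is": 0.1}
pp = perplexity(lambda history, w: probs[w], ["the", "house", "is", "blue"])
```

A uniform model over k word types has perplexity exactly k, a handy sanity check.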

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)


PERPLEXITY RESULTS

German to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          451.50   93.00   69.36   66.47
O-EN         468.09  103.74   79.57   76.79
T-DE         443.14   88.48   64.99   62.07
T-FR         460.98   99.90   76.23   73.38
T-IT         465.89  102.31   78.50   75.67
T-NL         457.02   97.34   73.54   70.56


PERPLEXITY RESULTS

French to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          472.05   99.04   75.60   72.68
O-EN         500.56  115.48   91.14   88.31
T-DE         486.78  108.50   84.39   81.41
T-FR         463.58   94.59   71.24   68.37
T-IT         476.05  102.69   79.23   76.36
T-NL         490.09  110.67   86.61   83.55


PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99   88.46   67.35   64.40
O-EN         415.47   99.92   79.27   76.34
T-DE         404.64   95.22   73.73   70.85
T-FR         395.99   89.44   68.38   65.54
T-IT         384.55   81.90   60.85   57.91
T-NL         411.58   98.78   76.98   73.94


PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89   90.73   69.05   66.08
O-EN         448.11  100.17   78.23   75.46
T-DE         437.68   93.67   71.54   68.57
T-FR         445.00   97.32   75.59   72.55
T-IT         448.11  100.19   78.06   75.19
T-NL         423.13   83.99   62.17   59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech
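The four abstraction levels can be sketched as successive transformations of a POS-tagged chunk; the sentence, tag set, and the "NE" placeholder below are illustrative assumptions.

```python
# One toy tagged sentence; "NE" is an assumed placeholder for named entities.
tagged = [("Moreover", "ADV"), (",", "PUNCT"), ("the", "DET"),
          ("committee", "NOUN"), ("approved", "VERB"),
          ("Strasbourg", "PROPN"), (".", "PUNCT")]

# No punctuation: drop punctuation tokens.
no_punct = [(w, t) for w, t in tagged if t != "PUNCT"]
# No named entities: replace them with a placeholder.
no_entities = [("NE", t) if t == "PROPN" else (w, t) for w, t in no_punct]
# No nouns: replace (proper) nouns with their POS tag.
no_nouns = [(t, t) if t in ("NOUN", "PROPN") else (w, t) for w, t in no_entities]
# No words: keep only the POS sequence.
pos_only = [t for _, t in tagged]
```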


ABSTRACTION RESULTS

No punctuation

Orig. Lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94      —
T-DE         122.50       7.16
T-FR          99.52      24.58
T-IT         112.71      14.58
T-NL         126.44       4.17


ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX           93.88      18.51
O-EN         115.20      —
T-DE         107.48       6.70
T-FR          88.96      22.77
T-IT          99.17      13.91
T-NL         110.72       3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62       —
T-DE         38.67        4.81
T-FR         34.75       14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20        —
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11


MACHINE TRANSLATION RESULTS

DE to EN:
LM    BLEU
MIX   21.43
O-EN  21.10
T-DE  21.90
T-FR  21.16
T-IT  21.29
T-NL  21.20

FR to EN:
LM    BLEU
MIX   28.67
O-EN  27.98
T-DE  28.01
T-FR  29.14
T-IT  28.75
T-NL  28.11

IT to EN:
LM    BLEU
MIX   25.41
O-EN  24.69
T-DE  24.62
T-FR  25.37
T-IT  25.96
T-NL  24.77

NL to EN:
LM    BLEU
MIX   24.20
O-EN  23.40
T-DE  24.26
T-FR  23.56
T-IT  23.87
T-NL  24.52


DOES SIZE MATTER


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du Conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to Moses.
Phrase-table interpolation using perplexity minimization (Sennrich, 2012).
An additional feature in the phrase table indicating the direction of translation.
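The interpolation technique can be sketched as a grid search for the mixture weight that minimizes perplexity on a development set, in the spirit of Sennrich (2012); the phrase probabilities, development pairs, floor value, and coarse grid below are all invented for illustration.

```python
import math

def interp(p_st, p_ts, lam, key):
    # Linear interpolation of two phrase tables; 1e-9 is an assumed floor
    # for phrase pairs missing from one table.
    return lam * p_st.get(key, 1e-9) + (1 - lam) * p_ts.get(key, 1e-9)

def dev_perplexity(p_st, p_ts, lam, dev_pairs):
    log_sum = sum(math.log(interp(p_st, p_ts, lam, pair)) for pair in dev_pairs)
    return math.exp(-log_sum / len(dev_pairs))

# Invented P(f | e) entries for the S->T-trained and T->S-trained tables.
p_st = {("maison", "house"): 0.7, ("bleue", "blue"): 0.4}
p_ts = {("maison", "house"): 0.3, ("bleue", "blue"): 0.9}
dev = [("maison", "house"), ("bleue", "blue")]

# Pick the weight with the lowest development-set perplexity.
best_lam = min((l / 10 for l in range(11)),
               key=lambda l: dev_perplexity(p_st, p_ts, l, dev))  # 0.5
```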


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

     IT  FR  ES  DE  FI
IT   99  93  91  75  63
FR   90  98  90  83  70
ES   85  90  98  82  69
DE   73  74  74  98  70
FI   58  70  64  81  99
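The protocol can be sketched with a deliberately simple stand-in classifier — a nearest-centroid rule over two invented features; the actual studies used stronger learners and much richer feature sets.

```python
# Train an O-vs-T model on chunks translated from L1, then apply it to a
# chunk translated from L2. The 2-d feature vectors are invented, e.g.
# (cohesive-marker rate, pronoun rate).
def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def classify(v, centroids):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(v, centroids[label]))

train_L1 = {"O": [[0.1, 0.5], [0.2, 0.6]],
            "T": [[0.6, 0.1], [0.7, 0.2]]}
centroids = {label: centroid(vs) for label, vs in train_L1.items()}

chunk_from_L2 = [0.65, 0.15]                 # translated from another source
label = classify(chunk_from_L2, centroids)   # "T"
```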


CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
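The clustering step can be sketched with a toy single-linkage procedure over an invented distance matrix; the labels and distances are illustrative stand-ins for distances between feature vectors of translationese varieties.

```python
# Single-linkage agglomerative clustering: repeatedly merge the two
# clusters whose closest members are nearest.
def single_linkage(dist, labels):
    clusters = [[l] for l in labels]
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist[a][b]
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Invented distances: IT and ES close, DE far from both.
dist = {"IT": {"IT": 0, "ES": 1, "DE": 5},
        "ES": {"IT": 1, "ES": 0, "DE": 4},
        "DE": {"IT": 5, "ES": 4, "DE": 0}}
merges = single_linkage(dist, ["IT", "ES", "DE"])  # IT+ES merge first
```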


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist            Dist_w
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

     ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA   80   0   2   1   3   4   1   0   4   2   3   80.8   80.0  80.4
CHI    3  80   0   1   1   0   6   7   1   0   1   88.9   80.0  84.2
FRE    2   2  81   5   1   2   1   0   3   0   3   86.2   81.0  83.5
GER    1   1   1  93   0   0   0   1   1   0   2   87.7   93.0  90.3
HIN    2   0   0   1  77   1   0   1   5   9   4   74.8   77.0  75.9
ITA    2   0   3   1   1  87   1   0   3   0   2   82.1   87.0  84.5
JPN    2   1   1   2   0   1  87   5   0   0   1   78.4   87.0  82.5
KOR    1   5   2   0   1   0   9  81   1   0   0   80.2   81.0  80.6
SPA    2   0   2   0   1   8   2   1  78   1   5   77.2   78.0  77.6
TEL    0   1   0   0  18   1   2   1   1  73   3   85.9   73.0  78.9
TUR    4   0   2   2   0   2   2   4   4   0  80   76.9   80.0  78.4


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:
Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. URL http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities 30(1):98–118, April 2015.


NORMALIZATION

REPETITIONS: The frequency of content words (words tagged as nouns, verbs, adjectives, or adverbs) that occur more than once in a chunk.

EXAMPLE (REPETITIONS, BEN-ARI (1998))

O: Manchmal denke ich: Haben Sie vielleicht ein kaltes Herz? Haben Sie wirklich ein kaltes Herz?

T: Je pense quelquefois : aurait-elle un cœur insensible ? Avez-vous vraiment un cœur de glace ?

CONTRACTIONS: Ratio of contracted forms to their counterpart full forms.

AVERAGE PMI: The average PMI (Church and Hanks, 1990) of all bigrams in the chunk.

THRESHOLD PMI: The number of bigrams in the chunk whose PMI is above 0.
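The two PMI features can be sketched as follows; for brevity the probabilities are estimated from the chunk itself, whereas in practice the counts could come from a larger corpus.

```python
import math
from collections import Counter

def pmi_features(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    def pmi(w1, w2):
        p_xy = bigrams[(w1, w2)] / n_bi
        return math.log2(p_xy / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
    scores = [pmi(w1, w2) for (w1, w2) in bigrams]  # one score per bigram type
    avg_pmi = sum(scores) / len(scores)             # AVERAGE PMI
    above_zero = sum(1 for s in scores if s > 0)    # THRESHOLD PMI
    return avg_pmi, above_zero

avg, above = pmi_features(["stand", "firm", "stand", "trial"])
```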


INTERFERENCE

Fingerprints of the source text: the "source language shining through" (Teich, 2003).

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O.

Positive vs. negative interference.

EXAMPLE (INTERFERENCE)
POSITIVE: doch is under-represented in translations to German.

NEGATIVE: "One is" is over-represented in translations from German to English (triggered by "Man ist").


POS n-GRAMS: Unigrams, bigrams, and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams, and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate, and last positions in a sentence.
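Two of these feature families can be sketched in a few lines; the tags, tokens, and the short-sentence cutoff below are illustrative assumptions.

```python
from collections import Counter

def pos_trigrams(pos_tags):
    # POS trigram counts, modeling local syntactic structure.
    return Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))

def positional_tokens(sentences):
    # Frequency of the tokens occupying the first, second, antepenultimate,
    # penultimate, and last positions of each sentence.
    feats = Counter()
    for sent in sentences:
        if len(sent) < 5:   # skip sentences where the positions overlap
            continue
        for name, idx in [("first", 0), ("second", 1), ("antepenult", -3),
                          ("penult", -2), ("last", -1)]:
            feats[(name, sent[idx])] += 1
    return feats

tri = pos_trigrams(["DET", "NOUN", "VERB", "DET", "NOUN"])
pos_feats = positional_tokens([["i", "do", "not", "know", "."]])
```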


MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011).

PRONOUNS: The frequency of pronouns in the chunk.

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency, n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams.

Supervised Classification

RESULTS: SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


RESULTS: SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has a lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category        Feature                Accuracy (%)
Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 1.75 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

English originals have far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
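The threshold-PMI feature can be sketched as follows, assuming the standard formulation PMI(x, y) = log2 [p(x, y) / (p(x) p(y))] (Church and Hanks 1990); the corpus and the threshold here are toy values, not those of the study.

```python
import math
from collections import Counter

def bigrams_above_pmi(tokens, threshold):
    # return the bigrams whose pointwise mutual information exceeds threshold
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    high = []
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        if math.log2(p_xy / (p_x * p_y)) > threshold:
            high.append((x, y))
    return high

tokens = "we must stand firm and we must stand trial".split()
print(bigrams_above_pmi(tokens, threshold=3.0))  # [('firm', 'and')]
```

Counting how many bigrams in a chunk clear the threshold gives the single number used as the classification feature.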


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
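Extracting the POS-trigram features is itself trivial; the sketch below shows the feature-extraction step only, with a hand-set tag sequence (any linear classifier, e.g. an SVM as used in this line of work, can consume the resulting dictionaries).

```python
from collections import Counter

def pos_trigram_features(pos_tags):
    # relative frequency of each POS trigram in a chunk
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(trigrams.values())
    return {t: c / total for t, c in trigrams.items()}

# hand-tagged toy chunk containing the MD+VB+VBN pattern noted above
# (as in "action must be taken in the house")
tags = ["NN", "MD", "VB", "VBN", "IN", "DT", "NN"]
feats = pos_trigram_features(tags)
print(feats[("MD", "VB", "VBN")])  # 1 of 5 trigrams = 0.2
```

The appeal of this feature set is that it needs only a POS tagger, no lexical resources, which is why it is described above as extremely cheap.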


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is in fact standardization rather than interference


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)
Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)
Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,      37.83     49.79     0.76    T    T1
(       0.42      0.72     0.58    T    T2
'       1.94      2.53     0.77    T    T3
)       0.40      0.72     0.56    T    T4
(?)     0.30      0.30     1.00    —    —
[       0.01      0.02     0.45    T    —
]       0.01      0.02     0.44    T    —
"       0.33      0.22     1.46    O    O7
(?)     0.22      0.17     1.25    O    O6
.      38.20     34.60     1.10    O    O5
(?)     1.20      1.17     1.03    —    O4
(?)     0.84      0.83     1.01    —    O3
;       1.57      1.11     1.41    O    O2
-       2.68      2.25     1.19    O    O1

(Rows marked (?) lost their punctuation mark in extraction.)


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus   EUR    HAN    LIT    TED
FW                 96.3   98.1   97.3   97.7
char-trigrams      98.8   97.1   99.5   100.0
POS-trigrams       98.5   97.2   98.7   92.0
contextual FW      95.2   96.8   94.1   86.3
cohesive markers   83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0


CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus   EUR    HAN    LIT    TED
FW                 88.6   88.9   78.8   87.5
char-trigrams      72.1   63.8   70.3   78.6
POS-trigrams       96.9   76.0   70.7   76.1
contextual FW      92.9   93.2   68.2   67.0
cohesive markers   63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T (with add-ε smoothing):

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2·√(JSD(PX ‖ PCi))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
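The labeling rule above can be sketched directly. The marker-word distributions below are toy two-bin histograms (mass on O-markers vs. T-markers), not the Europarl/Hansard estimates; and since the square root is monotone, comparing raw JSD values preserves the decision.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits, skipping zero-probability bins
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: average KL to the mixture distribution
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(p_o, p_t, p_c1, p_c2):
    # label C1 "O" iff D(O,C1) x D(T,C2) < D(O,C2) x D(T,C1)
    if jsd(p_o, p_c1) * jsd(p_t, p_c2) < jsd(p_o, p_c2) * jsd(p_t, p_c1):
        return "O", "T"
    return "T", "O"

p_o, p_t = [0.7, 0.3], [0.3, 0.7]      # class marker distributions (toy)
p_c1, p_c2 = [0.65, 0.35], [0.35, 0.65]  # cluster distributions (toy)
print(label_clusters(p_o, p_t, p_c1, p_c2))  # ('O', 'T')
```

With perfect labeling, as reported above, several clusterings produced with different feature sets can then be combined by majority vote.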


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                     EUR    HAN    LIT    TED
FW                                                  88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                   91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                   95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW
  + cohesive markers                                94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpora                  EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      —             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

E* = argmax_E P(E | F)
   = argmax_E [P(F | E) × P(E)] / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E ∈ English} P(F | E) × P(E)
                          (translation model × language model)


FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus)

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al. 2002)
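The noisy-channel decision rule can be illustrated on a toy scale. All probabilities below are hand-set, hypothetical numbers; a real system estimates them from parallel and monolingual corpora and searches a vastly larger candidate space.

```python
# Toy illustration of the noisy-channel rule E* = argmax_E P(F|E) x P(E),
# for one fixed foreign sentence F and three candidate translations.

translation_model = {   # P(F | E): how well each candidate explains F
    "this is serious": 0.4,
    "this is grave": 0.5,
    "serious is this": 0.4,
}
language_model = {      # P(E): fluency of each candidate in English
    "this is serious": 0.010,
    "this is grave": 0.004,
    "serious is this": 0.0001,
}

best = max(translation_model,
           key=lambda e: translation_model[e] * language_model[e])
print(best)  # this is serious
```

Note how the language model overrules the slightly better-fitting literal candidate: this interaction is exactly why the choice of LM training data, examined next, matters.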


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM, w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … wi−1) )^(1/N)

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al 2002)
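The perplexity formula above can be sketched for the unigram case (so the history w1 … wi−1 is ignored). The model probabilities are toy values; the experiments used full n-gram models trained on Europarl.

```python
import math

def perplexity(lm, tokens):
    # PP = (prod_i 1 / P(w_i)) ** (1/N), computed in log space for stability
    log_sum = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))

lm = {"the": 0.5, "house": 0.25, "eu": 0.25}
print(perplexity(lm, ["the", "house", "the", "eu"]))  # 64 ** 0.25, about 2.83
```

Lower perplexity means the model finds the reference text less surprising, which is the sense in which translated-text LMs "fit" translated references better in the tables that follow.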


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       —
T-DE         122.50        7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44        4.17


ABSTRACTION RESULTS

Named-entity abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20       —
T-DE         107.48        6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72        3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        —
T-DE         38.67         4.81
T-FR         34.75        14.46
T-IT         36.85         9.30
T-NL         39.44         2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         —
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference


THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations into English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist           Distw
Feature               EN T   FR T   EN T   FR T
POS-trigrams          1.02   1.16   1.07   1.34
positional tokens     1.10   1.15   1.30   1.37
function words        1.32   1.82   1.43   1.90
cohesive markers      1.95   1.95   2.15   2.03
feature combination   1.00   1.46   1.13   1.58
random tree           2.00          1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)   R (%)   F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8    80.0    80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9    80.0    84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2    81.0    83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7    93.0    90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8    77.0    75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1    87.0    84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4    87.0    82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2    81.0    80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2    78.0    77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9    73.0    78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9    80.0    78.4


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. doi: http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 108 112

Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, August 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1–2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a ‘minority’ literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Translation Studies Hypotheses Interference

INTERFERENCE

Fingerprints of the source text: “source language shining through” (Teich 2003)

Not necessarily a mark of bad translation; rather, a different distribution of elements in T and O

Positive vs. negative interference

EXAMPLE (INTERFERENCE)

POSITIVE: doch is under-represented in translations to German

NEGATIVE: One is is over-represented in translations from German to English (triggered by Man ist)

Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams and trigrams of POS tags, modeling variations in syntactic structure

CHARACTER n-GRAMS: Unigrams, bigrams and trigrams of characters, modeling shallow morphology

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets 〈w1, w2, w3〉 where at least two of the elements are function words and at most one is a POS tag

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate and last positions in a sentence

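Two of these feature families can be sketched in a few lines of Python (an illustrative sketch, not the authors' implementation; the function names and the five-token minimum for positional features are my assumptions):

```python
from collections import Counter

def char_trigrams(text):
    """Counts of character trigrams (shallow morphology: affixes, function words)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def positional_tokens(sentences):
    """Frequency of tokens in the first, second, antepenultimate,
    penultimate and last positions of each (pre-tokenized) sentence."""
    slots = ("first", "second", "antepenult", "penult", "last")
    counts = Counter()
    for toks in sentences:
        if len(toks) < 5:  # need five distinct positions (my assumption)
            continue
        for slot, tok in zip(slots, (toks[0], toks[1], toks[-3], toks[-2], toks[-1])):
            counts[(slot, tok)] += 1
    return counts
```

Each chunk's counts would then be normalized into a feature vector for the classifier.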

Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: The list of Koppel and Ordan (2011)

PRONOUNS: Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: Token unigrams and bigrams

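One plausible reading of the three punctuation-frequency variants, as a sketch (the mark inventory, names, and the exact normalizations are illustrative assumptions):

```python
from collections import Counter

PUNCT = list(".,;:?!()[]\"'-")  # illustrative inventory, not the authors' exact list

def punct_features(tokens):
    """Three punctuation-frequency variants for one chunk:
    per-token frequency, raw count, and share of all punctuation."""
    n = Counter(t for t in tokens if t in PUNCT)
    total_punct = sum(n.values()) or 1
    per_token = {m: n[m] / len(tokens) for m in PUNCT}
    raw = {m: n[m] for m in PUNCT}
    per_punct = {m: n[m] / total_punct for m in PUNCT}
    return per_token, raw, per_punct
```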

Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 27 112

Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)

Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category  Feature  Accuracy (%)

Simplification
  TTR (1)                72
  TTR (2)                72
  TTR (3)                76
  Mean word rank (1)     69
  Mean word rank (2)     77
  N most frequent words  64
  Mean word length       66
  Syllable ratio         61
  Lexical density        53
  Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy)

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%)

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies

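The two best-performing simplification features can be sketched as follows (an assumption-laden sketch: tokenization, case handling, and the external frequency list are left abstract, and the default rank for unseen words is my choice):

```python
def type_token_ratio(tokens):
    """TTR: distinct word forms over total tokens (variant 1; the slide's
    variants 2-3 differ in details such as case-folding, ignored here)."""
    return len(set(tokens)) / len(tokens)

def mean_word_rank(tokens, rank):
    """Average rank of the chunk's tokens in an external frequency list
    `rank` (word -> rank, 1 = most frequent); unseen words get a default
    rank just below the list."""
    default = max(rank.values()) + 1
    return sum(rank.get(t, default) for t in tokens) / len(tokens)
```

Lower mean word rank indicates a preference for more frequent (more "habitual") vocabulary.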

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER ‘LANGUAGE’)


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category  Feature  Accuracy (%)

Explicitation
  Cohesive markers      81
  Explicit naming       58
  Single naming         56
  Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times


Supervised Classification Normalization

RESULTS NORMALIZATION

Category  Feature  Accuracy (%)

Normalization
  Repetitions    55
  Contractions   50
  Average PMI    52
  Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001)

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended

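The threshold-PMI feature can be sketched as follows (the threshold value, names, and bigram counting scheme are illustrative assumptions; the PMI definition follows Church and Hanks 1990, cited above):

```python
import math
from collections import Counter

def pmi_above_threshold(tokens, threshold=2.0):
    """Count bigram types whose pointwise mutual information exceeds a
    threshold: PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    High-PMI bigrams approximate fixed expressions like 'stand idly'."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    nb = n - 1  # number of bigram tokens
    count = 0
    for (x, y), c in bi.items():
        pmi = math.log2((c / nb) / ((uni[x] / n) * (uni[y] / n)))
        if pmi > threshold:
            count += 1
    return count
```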

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO ‘LANGUAGE’)


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category  Feature  Accuracy (%)

Interference
  POS unigrams                90
  POS bigrams                 97
  POS trigrams                98
  Character unigrams          85
  Character bigrams           98
  Character trigrams          100
  Prefixes and suffixes       80
  Contextual function words   100
  Positional token frequency  97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the

Part-of-speech trigrams are an extremely cheap and efficient classifier

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word ‘But’: there are 2.25 times more such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with ‘But’, and translators are conservative in their lexical choices (Kenny 2001)

This is, in fact, standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are ‘illegal’ in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category  Feature  Accuracy (%)

Interference
  POS bigrams                 96
  POS trigrams                96
  Character bigrams           95
  Character trigrams          96
  Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category  Feature  Accuracy (%)

Miscellaneous
  Function words                       96
  Punctuation (1)                      81
  Punctuation (2)                      85
  Punctuation (3)                      80
  Pronouns                             77
  Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark ‘.’ (actually indicating sentence length) is a strong feature of O; ‘,’ is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark  Freq. O  Freq. T  Ratio  LL  Weight
,     37.83    49.79    0.76   T   T1
(      0.42     0.72    0.58   T   T2
’      1.94     2.53    0.77   T   T3
)      0.40     0.72    0.56   T   T4
…      0.30     0.30    1.00   —   —
[      0.01     0.02    0.45   T   —
]      0.01     0.02    0.44   T   —
”      0.33     0.22    1.46   O   O7
…      0.22     0.17    1.25   O   O6
.     38.20    34.60    1.10   O   O5
…      1.20     1.17    1.03   —   O4
…      0.84     0.83    1.01   —   O3
…      1.57     1.11    1.41   O   O2
-      2.68     2.25    1.19   O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament, 2001–2006

The Canadian Hansard: transcripts of the Canadian Parliament, 2001–2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus  EUR   HAN   LIT   TED

FW                96.3  98.1  97.3  97.7
char-trigrams     98.8  97.1  99.5  100.0
POS-trigrams      98.5  97.2  98.7  92.0
contextual FW     95.2  96.8  94.1  86.3
cohesive markers  83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATIONUSING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation

EUR           —     60.8  56.2  96.3
HAN           59.7  —     58.7  98.1
LIT           64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATIONUSING FUNCTION WORDS

train \ test  EUR   HAN   LIT   X-validation

EUR + HAN     —     —     63.8  94.0
EUR + LIT     —     64.1  —     92.9
HAN + LIT     59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERINGASSUMING GOLD LABELS

feature \ corpus  EUR   HAN   LIT   TED

FW                88.6  88.9  78.8  87.5
char-trigrams     72.1  63.8  70.3  78.6
POS-trigrams      96.9  76.0  70.7  76.1
contextual FW     92.9  93.2  68.2  67.0
cohesive markers  63.1  81.2  67.1  63.0

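The clustering step can be illustrated with a bare-bones 2-means (a toy stand-in for the KMeans/XMeans tools used in these experiments; real runs operate on high-dimensional, normalized feature vectors):

```python
import random

def two_means(vectors, iters=20, seed=0):
    """Minimal 2-means: pick two random centers, then alternate between
    assigning each vector to its nearest center and recomputing centers."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, 2)
    for _ in range(iters):
        clusters = [[], []]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[d.index(min(d))].append(v)
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters
```

The output is two unlabeled groups; deciding which group is O and which is T is the labeling problem addressed next.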

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot labelthose classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = √ JSD(PX || PCi)


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = “O”  if DJS(O, C1) · DJS(T, C2) < DJS(O, C2) · DJS(T, C1)
            “T”  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets

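The labeling procedure above can be sketched end to end (a sketch under my own framing: marker sets and clusters are represented as word-count dictionaries over a shared vocabulary, and the smoothing constant is arbitrary):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def smoothed_lm(counts, vocab, eps=1e-3):
    """Add-epsilon unigram model: p(w | X) = (tf(w) + eps) / (|X| + eps*|V|)."""
    total = sum(counts.values()) + eps * len(vocab)
    return [(counts.get(w, 0) + eps) / total for w in vocab]

def label_clusters(o_marks, t_marks, c1, c2, vocab):
    """Label c1 as O iff d(O,c1)*d(T,c2) < d(O,c2)*d(T,c1),
    with d the Jensen-Shannon distance (sqrt of JSD)."""
    po, pt = smoothed_lm(o_marks, vocab), smoothed_lm(t_marks, vocab)
    p1, p2 = smoothed_lm(c1, vocab), smoothed_lm(c2, vocab)
    d = lambda p, q: math.sqrt(max(jsd(p, q), 0.0))
    return ("O", "T") if d(po, p1) * d(pt, p2) < d(po, p2) * d(pt, p1) else ("T", "O")
```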

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                       EUR   HAN   LIT   TED

FW                                                                    88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                                     91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                                     95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers  94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

corpora:                   EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

KMeans
  accuracy by domain       93.7     99.5     99.8         92.2
XMeans
  generated # of clusters  2        2        3            3
  accuracy by domain       93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT

Flat              92.5     60.7     77.5         66.8
Two-phase         91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

“When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.’”

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ··· fm into an English sentence E = e1 ··· el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) · P(E) / P(F)
  = argmax_E P(F | E) · P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) · P(E)
                         (translation model · language model)
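The final argmax can be illustrated in miniature (a toy sketch: the candidate set and the two probability tables stand in for a real decoder, translation model, and language model):

```python
import math

def best_translation(candidates, tm_prob, lm_prob):
    """Noisy-channel decoding in miniature: E-hat = argmax_E P(F|E) * P(E),
    computed in log space to avoid underflow. tm_prob and lm_prob map each
    candidate English sentence to its (toy) model probability."""
    return max(candidates, key=lambda e: math.log(tm_prob[e]) + math.log(lm_prob[e]))
```

A fluent but unfaithful candidate (high LM, low TM) can lose to a faithful and fluent one, which is the point of combining the two models.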

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al. 2002)

Applications for Machine Translation Language Models

LANGUAGE MODELSRESEARCH QUESTIONS

Test the fitness of language models compiled from translated textsvs the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translatedfrom other languages

Test if language models compiled from translated texts are betterfor MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELSMETHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al 2007) using different LMs andevaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)

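The perplexity formula is straightforward to compute in log space (a sketch; `probs` stands for the per-token probabilities an LM assigns to the reference corpus, which a real toolkit would supply):

```python
import math

def perplexity(probs):
    """PP = (prod_i 1 / P(w_i | history))^(1/N), computed as
    2 ** (-(1/N) * sum_i log2 P(w_i | history)) for numerical stability."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)
```

Lower perplexity means a better fit, which is how the tables below compare LMs trained on original vs. translated texts.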

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          451.50   93.00   69.36   66.47
O-EN         468.09  103.74   79.57   76.79
T-DE         443.14   88.48   64.99   62.07
T-FR         460.98   99.90   76.23   73.38
T-IT         465.89  102.31   78.50   75.67
T-NL         457.02   97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          472.05   99.04   75.60   72.68
O-EN         500.56  115.48   91.14   88.31
T-DE         486.78  108.50   84.39   81.41
T-FR         463.58   94.59   71.24   68.37
T-IT         476.05  102.69   79.23   76.36
T-NL         490.09  110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99   88.46   67.35   64.40
O-EN         415.47   99.92   79.27   76.34
T-DE         404.64   95.22   73.73   70.85
T-FR         395.99   89.44   68.38   65.54
T-IT         384.55   81.90   60.85   57.91
T-NL         411.58   98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89   90.73   69.05   66.08
O-EN         448.11  100.17   78.23   75.46
T-DE         437.68   93.67   71.54   68.57
T-FR         445.00   97.32   75.59   72.55
T-IT         448.11  100.19   78.06   75.19
T-NL         423.13   83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELSABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94      —
T-DE         122.50       7.16
T-FR          99.52      24.58
T-IT         112.71      14.58
T-NL         126.44       4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX           93.88      18.51
O-EN         115.20      —
T-DE         107.48       6.70
T-FR          88.96      22.77
T-IT          99.17      13.91
T-NL         110.72       3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62       —
T-DE         38.67        4.81
T-FR         34.75       14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig Lang   Perplexity   Improvement (%)
MIX           7.99        2.66
O-EN          8.20        —
T-DE          8.08        1.47
T-FR          7.89        3.77
T-IT          8.00        2.47
T-NL          8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores by language model:

LM      DE→EN    FR→EN    IT→EN    NL→EN
MIX     21.43    28.67    25.41    24.20
O-EN    21.10    27.98    24.69    23.40
T-DE    21.90    28.01    24.62    24.26
T-FR    21.16    29.14    25.37    23.56
T-IT    21.29    28.75    25.96    23.87
T-NL    21.20    28.11    24.77    24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?



Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term.

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T   T → S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S → T   T → S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S → T   T → S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given a bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to Moses.
Phrase-table interpolation using perplexity minimization (Sennrich, 2012).
Add a feature in the phrase table indicating the direction of translation.
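The perplexity-minimization idea can be illustrated in a toy setting. This is a sketch, not the Moses/Sennrich implementation: two invented phrase tables p(target | source) are linearly interpolated, and the weight is chosen by grid search to minimize perplexity on held-out phrase pairs.

```python
import math

# Invented toy phrase tables p(target | source), one estimated per subset.
pt_st = {("maison", "house"): 0.7, ("maison", "home"): 0.3}
pt_ts = {("maison", "house"): 0.4, ("maison", "home"): 0.6}

def interpolate(lam):
    """Linear interpolation of the two tables with weight lam on pt_st."""
    keys = set(pt_st) | set(pt_ts)
    return {k: lam * pt_st.get(k, 0.0) + (1 - lam) * pt_ts.get(k, 0.0) for k in keys}

def dev_perplexity(table, dev_pairs):
    """Perplexity of a table over held-out phrase pairs."""
    log_p = sum(math.log(table[p]) for p in dev_pairs)
    return math.exp(-log_p / len(dev_pairs))

# Pick the weight that minimizes perplexity on the held-out pairs.
dev = [("maison", "house"), ("maison", "house"), ("maison", "home")]
best_ppl, best_lam = min((dev_perplexity(interpolate(l / 10), dev), l / 10)
                         for l in range(11))
print(best_lam)  # 0.9
```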


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX     MIX-EO   MIX-FO
S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System     MIX     MIX-FO   MIX-EO
S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1 TEST ON L2)

        IT   FR   ES   DE   FI
IT      99   93   91   75   63
FR      90   98   90   83   70
ES      85   90   98   82   69
DE      73   74   74   98   70
FI      58   70   64   81   99
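The train-on-L1/test-on-L2 protocol can be sketched generically. The toy classifier below (token-frequency centroids) and the two-language data set stand in for the real setup (an SVM over POS trigrams and the EUROPARL corpora); all chunks are invented, with T chunks overusing cohesive markers as in the corpus study.

```python
from collections import Counter

def train(o_chunks, t_chunks):
    """Token-frequency profiles for O and T (a stand-in for a real classifier)."""
    c_o, c_t = Counter(), Counter()
    for ch in o_chunks:
        c_o.update(ch)
    for ch in t_chunks:
        c_t.update(ch)
    return c_o, c_t

def classify(model, chunk):
    c_o, c_t = model
    return "O" if sum(c_o[w] for w in chunk) >= sum(c_t[w] for w in chunk) else "T"

def cross_classification(data):
    """Train O-vs-T on L1, test on L2, for every ordered language pair."""
    acc = {}
    for l1, (o1, t1) in data.items():
        model = train(o1, t1)
        for l2, (o2, t2) in data.items():
            correct = sum(classify(model, c) == "O" for c in o2) + \
                      sum(classify(model, c) == "T" for c in t2)
            acc[(l1, l2)] = correct / (len(o2) + len(t2))
    return acc

data = {"FR": ([["but", "we", "agree"]], [["moreover", "we", "agree"]]),
        "DE": ([["but", "they", "agree"]], [["therefore", "moreover", "they", "agree"]])}
print(cross_classification(data)[("FR", "DE")])  # 1.0
```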


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
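The intuition behind the clustering result can be sketched minimally: translations from typologically close languages have closer feature vectors, so they merge first under agglomerative clustering. The vectors below are invented, not the real POS-trigram frequencies.

```python
import math

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def first_merge(vectors):
    """The pair of labels merged first by agglomerative clustering,
    i.e., the closest pair of feature vectors."""
    labels = list(vectors)
    pairs = ((dist(vectors[a], vectors[b]), (a, b))
             for i, a in enumerate(labels) for b in labels[i + 1:])
    return min(pairs)[1]

# Invented trigram-frequency vectors: the two Romance-source translations
# resemble each other more than either resembles the Finnish-source one.
feats = {"T-FR": [0.31, 0.12, 0.05],
         "T-IT": [0.29, 0.13, 0.06],
         "T-FI": [0.18, 0.25, 0.14]}
print(first_merge(feats))  # ('T-FR', 'T-IT')
```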


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist            Dist_w
Feature               EN T    FR T    EN T    FR T
POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree           2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)
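The CMU-Haifa system combined many feature families; the following is only a toy illustration of the underlying idea, using one of those families (character trigrams) and invented one-sentence "essays" exhibiting stereotypical L1 interference.

```python
from collections import Counter

def char_trigrams(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train_nli(essays_by_l1):
    """One normalized character-trigram profile per native language."""
    models = {}
    for l1, essays in essays_by_l1.items():
        counts = Counter()
        for essay in essays:
            counts.update(char_trigrams(essay))
        total = sum(counts.values())
        models[l1] = {g: c / total for g, c in counts.items()}
    return models

def predict_l1(models, essay):
    """Pick the L1 whose profile best matches the essay's trigrams."""
    grams = char_trigrams(essay)
    return max(models, key=lambda l1: sum(models[l1].get(g, 0.0) * c
                                          for g, c in grams.items()))

# Invented training essays with stereotypical interference errors.
train = {"FRE": ["the informations are importants"],
         "GER": ["we make a photo together"]}
print(predict_l1(train_nli(train), "these informations are importants"))  # FRE
```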


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)   F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models trained on corpora translated in the same direction as that of the SMT task are better than ones trained on corpora translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics: Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Translation Studies Hypotheses Interference

INTERFERENCE

POS n-GRAMS: Unigrams, bigrams and trigrams of POS tags, modeling variations in syntactic structure.

CHARACTER n-GRAMS: Unigrams, bigrams and trigrams of characters, modeling shallow morphology.

PREFIXES AND SUFFIXES: The number of words in a chunk that begin or end with each member of a list of prefixes/suffixes, respectively.

CONTEXTUAL FUNCTION WORDS: The frequencies in the chunk of triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words and at most one is a POS tag.

POSITIONAL TOKEN FREQUENCY: The frequency of tokens appearing in the first, second, antepenultimate, penultimate and last positions in a sentence.
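Two of these feature families can be sketched directly; this assumes tokenization is already done, and the example sentence is taken from the slides' own illustration of sentence-initial But.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, a cheap proxy for shallow morphology."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def positional_tokens(sentences):
    """Which token occupies the first, second, antepenultimate,
    penultimate and last positions of each sentence."""
    slots = ("first", "second", "antepenult", "penult", "last")
    feats = Counter()
    for sent in sentences:
        toks = sent.split()
        if len(toks) < 5:
            continue
        for slot, tok in zip(slots, (toks[0], toks[1], toks[-3], toks[-2], toks[-1])):
            feats[slot, tok] += 1
    return feats

feats = positional_tokens(["But we must learn from past mistakes ."])
print(feats["first", "But"])  # 1
print(char_ngrams("translation", 3).most_common(1))
```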


Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS The list of Koppel and Ordan (2011)

PRONOUNS Frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK Token unigrams and bigrams


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category   Feature          Accuracy (%)
Sanity     Token unigrams   100
           Token bigrams    100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category         Feature                 Accuracy (%)
Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.
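The two best simplification features can be sketched as follows; the rank dictionary is an invented stand-in for the external reference frequency list described above.

```python
def type_token_ratio(tokens):
    """Distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def mean_word_rank(tokens, rank):
    """Average frequency rank of the chunk's tokens, where rank comes from
    a large external reference corpus; unseen words get the worst rank."""
    worst = max(rank.values()) + 1
    return sum(rank.get(t, worst) for t in tokens) / len(tokens)

# Invented reference ranks: a chunk leaning on frequent words scores a
# lower (better-ranked) mean than a lexically richer chunk.
rank = {"the": 1, "of": 2, "house": 120, "dwelling": 4200}
t_chunk = "the house of the house".split()
o_chunk = "the dwelling of the house".split()
print(type_token_ratio(t_chunk))                                      # 0.6
print(mean_word_rank(t_chunk, rank) < mean_word_rank(o_chunk, rank))  # True
```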


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category        Feature                Accuracy (%)
Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny, 2001):

Original English has far more highly associated bigrams than translations. O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
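The threshold-PMI feature can be sketched like this; the formula is the standard pointwise mutual information, while the tiny corpus (built from the slide's own examples) and the threshold are invented.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI of each adjacent bigram: log p(w1, w2) / (p(w1) p(w2))."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    return {p: math.log((c / nb) / ((uni[p[0]] / n) * (uni[p[1]] / n)))
            for p, c in bi.items()}

def bigrams_above(tokens, threshold):
    """The threshold-PMI feature: how many bigram types exceed the threshold."""
    return sum(1 for v in bigram_pmi(tokens).values() if v > threshold)

toks = "stand idly by and stand firm and stand trial".split()
print(bigrams_above(toks, 2.0))  # 1  (only 'idly by' clears the threshold here)
```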


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 39 112

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)
Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)
Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq (O)  Freq (T)  Ratio  LL  Weight
,      37.83     49.79     0.76   T   T1
(       0.42      0.72     0.58   T   T2
'       1.94      2.53     0.77   T   T3
)       0.40      0.72     0.56   T   T4
–       0.30      0.30     1.00   —   —
[       0.01      0.02     0.45   T   —
]       0.01      0.02     0.44   T   —
"       0.33      0.22     1.46   O   O7
–       0.22      0.17     1.25   O   O6
.      38.20     34.60     1.10   O   O5
?       1.20      1.17     1.03   —   O4
!       0.84      0.83     1.01   —   O3
–       1.57      1.11     1.41   O   O2
-       2.68      2.25     1.19   O   O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001 and 2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001 to 2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR   HAN   LIT   TED
FW                  96.3  98.1  97.3  97.7
char-trigrams       98.8  97.1  99.5  100.0
POS-trigrams        98.5  97.2  98.7  92.0
contextual FW       95.2  96.8  94.1  86.3
cohesive markers    83.6  86.9  78.6  81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation
EUR            —     60.8  56.2  96.3
HAN            59.7  —     58.7  98.1
LIT            64.3  61.5  —     97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train        held-out test   X-validation
EUR + HAN    LIT: 63.8       94.0
EUR + LIT    HAN: 64.1       92.9
HAN + LIT    EUR: 59.8       96.0


Unsupervised Classification

CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus    EUR   HAN   LIT   TED
FW                  88.6  88.9  78.8  87.5
char-trigrams       72.1  63.8  70.3  78.6
POS-trigrams        96.9  76.0  70.7  76.1
contextual FW       92.9  93.2  68.2  67.0
cohesive markers    63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε · |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε · |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991)

D_JS(X, Ci) = 2 · √(JSD(P_X ‖ P_Ci))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label
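The whole labeling procedure can be sketched in a few lines; the marker word lists, the ε value, and the toy inputs below are illustrative assumptions, while the add-ε model, the 2·√JSD distance, and the product rule follow the slides:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=1e-3):
    # p(w) = (tf(w) + eps) / (N + eps * |V|), the add-eps model above
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q, vocab):
    # Jensen-Shannon divergence (base-2) between two distributions
    def kl(a, b):
        return sum(a[w] * math.log(a[w] / b[w], 2) for w in vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1_tokens, c2_tokens):
    """Assign "O"/"T" labels to two clusters by the product rule."""
    vocab = set(o_markers) | set(t_markers) | set(c1_tokens) | set(c2_tokens)
    O, T = unigram_lm(o_markers, vocab), unigram_lm(t_markers, vocab)
    C1, C2 = unigram_lm(c1_tokens, vocab), unigram_lm(c2_tokens, vocab)
    d = lambda x, c: 2 * math.sqrt(jsd(x, c, vocab))
    if d(O, C1) * d(T, C2) < d(O, C2) * d(T, C1):
        return "O", "T"
    return "T", "O"

# Toy marker lists and cluster contents (assumptions for illustration):
print(label_clusters(["but", "however"], ["moreover", "thus"],
                     ["but", "but", "however"], ["thus", "moreover"]))
```

Since C1's words match the O-markers, the rule labels C1 as "O" and C2 as "T".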

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets
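With reliably labeled clusters from each feature set, combining them per chunk is straightforward; this is a minimal sketch of such majority voting (the label sequences are made-up inputs):

```python
from collections import Counter

def majority_vote(labelings):
    """Combine per-chunk labels from several feature-set clusterings.

    `labelings` holds one label sequence per feature set; each chunk
    receives the label predicted by most feature sets."""
    return [Counter(chunk_labels).most_common(1)[0][0]
            for chunk_labels in zip(*labelings)]

votes = majority_vote([["O", "T", "T"],    # e.g. FW clustering
                       ["O", "O", "T"],    # e.g. char-trigrams
                       ["T", "O", "T"]])   # e.g. POS-trigrams
print(votes)  # ['O', 'O', 'T']
```

An odd number of feature sets avoids ties between the two labels.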


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                          EUR   HAN   LIT   TED
FW                                       88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams        91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW        95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers     94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

corpora:                        EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans: accuracy by domain      93.7     99.5     99.8         92.2
XMeans: generated # of clusters 2        2        3            3
XMeans: accuracy by domain      93.6     99.5     —            92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora   EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
Flat               92.5     60.7     77.5         66.8
Two-phase          91.3     79.4     85.3         67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

— Warren Weaver, 1955

The noisy channel model

The best translation T̂ of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

  Ê = argmax_E P(E | F)
    = argmax_E P(F | E) × P(E) / P(F)
    = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

  Ê = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                            translation model   language model
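The noisy-channel decision rule can be illustrated with a toy reranker; the candidate strings and all model scores below are made-up numbers, and real systems work with log probabilities exactly as done here:

```python
def best_translation(candidates, tm_logprob, lm_logprob, source):
    """Noisy-channel selection: argmax_E P(F|E) * P(E), in log space."""
    return max(candidates,
               key=lambda e: tm_logprob(source, e) + lm_logprob(e))

# Toy stand-in models (assumptions for illustration only).
tm_scores = {("das haus", "the house"): -0.2,   # faithful
             ("das haus", "a building"): -1.0}  # less faithful
tm = lambda f, e: tm_scores.get((f, e), -10.0)
lm = lambda e: -0.5 * len(e.split())            # toy fluency model

print(best_translation(["the house", "a building"], tm, lm, "das haus"))
# -> "the house"  (-0.2 - 1.0 beats -1.0 - 1.0)
```

The decoder's job is to search this argmax over the huge space of possible sentences rather than a fixed candidate list.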


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity

PP(LM; w1 w2 … wN) = ( ∏_{i=1..N} 1 / P_LM(wi | w1 … wi−1) )^(1/N)
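Computed in log space to avoid underflow, perplexity can be sketched as follows; the uniform toy model is an assumption used only to make the arithmetic checkable by hand:

```python
import math

def perplexity(lm_prob, tokens):
    """PP = (prod 1/P(w_i | history))^(1/N), computed via log2."""
    log_sum = sum(math.log2(lm_prob(tokens[:i], w))
                  for i, w in enumerate(tokens))
    return 2 ** (-log_sum / len(tokens))

# Toy uniform model over a 4-word vocabulary (an assumption):
uniform = lambda history, w: 0.25
print(perplexity(uniform, ["we", "support", "free", "trade"]))  # 4.0
```

A uniform model over V words has perplexity exactly |V|, which is why lower perplexity indicates a better fit to the reference corpus.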

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          451.50   93.00   69.36   66.47
O-EN         468.09  103.74   79.57   76.79
T-DE         443.14   88.48   64.99   62.07
T-FR         460.98   99.90   76.23   73.38
T-IT         465.89  102.31   78.50   75.67
T-NL         457.02   97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          472.05   99.04   75.60   72.68
O-EN         500.56  115.48   91.14   88.31
T-DE         486.78  108.50   84.39   81.41
T-FR         463.58   94.59   71.24   68.37
T-IT         476.05  102.69   79.23   76.36
T-NL         490.09  110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          395.99   88.46   67.35   64.40
O-EN         415.47   99.92   79.27   76.34
T-DE         404.64   95.22   73.73   70.85
T-FR         395.99   89.44   68.38   65.54
T-IT         384.55   81.90   60.85   57.91
T-NL         411.58   98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.  1-gram  2-gram  3-gram  4-gram
Mix          434.89   90.73   69.05   66.08
O-EN         448.11  100.17   78.23   75.46
T-DE         437.68   93.67   71.54   68.57
T-FR         445.00   97.32   75.59   72.55
T-IT         448.11  100.19   78.06   75.19
T-NL         423.13   83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELSABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation

Orig. Lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94      —
T-DE         122.50       7.16
T-FR          99.52      24.58
T-IT         112.71      14.58
T-NL         126.44       4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named entity abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX           93.88      18.51
O-EN         115.20      —
T-DE         107.48       6.70
T-FR          88.96      22.77
T-IT          99.17      13.91
T-NL         110.72       3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62       —
T-DE         38.67        4.81
T-FR         34.75       14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20        —
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN         FR to EN         IT to EN         NL to EN
LM     BLEU      LM     BLEU      LM     BLEU      LM     BLEU
MIX    21.43     MIX    28.67     MIX    25.41     MIX    24.20
O-EN   21.10     O-EN   27.98     O-EN   24.69     O-EN   23.40
T-DE   21.90     T-DE   28.01     T-DE   24.62     T-DE   24.26
T-FR   21.16     T-FR   29.14     T-FR   25.37     T-FR   23.56
T-IT   21.29     T-IT   28.75     T-IT   25.96     T-IT   23.87
T-NL   21.20     T-NL   28.11     T-NL   24.77     T-NL   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O Finally which is serious in the report of Mr Olivier is that itproposes a constitution tripotage

T Finally and this is serious in the report by Mr olivier it is because itproposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T   T → S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S → T   T → S
250K           34.35   31.33
500K           35.21   32.38
750K           36.12   32.90
1M             35.73   33.07
1.25M          36.24   33.23
1.5M           36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S → T   T → S
250K           27.74   26.58
500K           29.15   27.19
750K           29.43   27.63
1M             29.94   27.88
1.25M          30.63   27.84
1.5M           29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX    MIX-EO  MIX-FO
S → T      35.21  35.21   35.73
UNION      35.27  35.36   35.94
PPLMIN-1   35.46  35.59   36.26
PPLMIN-2   35.75  35.65   36.20
CrEnt      35.54  35.45   36.75
PplRatio   35.59  35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System     MIX    MIX-FO  MIX-EO
S → T      29.15  29.15   29.94
UNION      29.27  29.44   30.01
PPLMIN-1   29.64  29.94   29.65
PPLMIN-2   29.50  30.45   30.12
CrEnt      29.47  29.45   30.44
PplRatio   29.65  29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT  FR  ES  DE  FI
IT             99  93  91  75  63
FR             90  98  90  83  70
ES             85  90  98  82  69
DE             73  74  74  98  70
FI             58  70  64  81  99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                         Dist           Dist_w
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA    80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI     3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE     2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER     1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN     2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA     2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN     2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR     1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA     2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL     0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR     4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models built from texts translated in the same direction as the SMT task are better than ones built from texts translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky Noam Ordan Ella Rabinovich Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, 2009.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa Core patterns of lexical use in a comparable corpus of English lexical prose Meta 43(4)557ndash570 December1998

Sara Laviosa Corpus-based translation studies theory findings applications Approaches to translation studies Rodopi2002 ISBN 9789042014879

Gennadi Lembersky Noam Ordan and Shuly Wintner Language models for machine translation Original vs translatedtexts In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing pages363ndash374 Edinburgh Scotland UK July 2011 Association for Computational Linguistics URLhttpwwwaclweborganthologyD11-1034


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008, 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Translation Studies Hypotheses Miscellaneous

MISCELLANEOUS

FUNCTION WORDS: the list of Koppel and Ordan (2011)

PRONOUNS: the frequency of the occurrences of pronouns in the chunk

PUNCTUATION: - ( ) [ ] ‘ ’ “ ”

1. The normalized frequency of each punctuation mark in the chunk
2. A non-normalized notion of frequency: n/tokens
3. n/p, where p is the total number of punctuation marks in the chunk
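The exact definitions of the three variants do not fully survive the slide layout, so the sketch below implements three reasonable normalizations of per-mark counts (by chunk length, by non-punctuation tokens, and by the total punctuation count p). The mark inventory and the assumption that punctuation arrives as separate tokens are both illustrative choices, not the original feature set.

```python
from collections import Counter

# Assumed mark inventory (illustrative, not the original list).
MARKS = ["-", "(", ")", "[", "]", "'", '"', ",", "."]

def punctuation_features(tokens):
    """Three frequency variants for one chunk of pre-tokenized text:
    v1: count of each mark / number of tokens,
    v2: count of each mark / number of non-punctuation tokens,
    v3: count of each mark / total number of punctuation marks p."""
    counts = Counter(t for t in tokens if t in MARKS)
    n = len(tokens)
    p = sum(counts.values())
    v1 = {m: counts[m] / n for m in MARKS}
    v2 = {m: counts[m] / (n - p) for m in MARKS}
    v3 = {m: (counts[m] / p if p else 0.0) for m in MARKS}
    return v1, v2, v3
```

Each chunk then contributes one number per mark and per variant to the feature vector.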

RATIO OF PASSIVE FORMS TO ALL VERBS

SANITY CHECK: token unigrams and bigrams


Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS SANITY CHECK

Category   Feature          Accuracy (%)
Sanity     Token unigrams   100
           Token bigrams    100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category         Feature                 Accuracy (%)
Simplification   TTR (1)                 72
                 TTR (2)                 72
                 TTR (3)                 76
                 Mean word rank (1)      69
                 Mean word rank (2)      77
                 N most frequent words   64
                 Mean word length        66
                 Syllable ratio          61
                 Lexical density         53
                 Mean sentence length    65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.
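Two of the simplification features above can be sketched directly; this is a minimal version assuming whitespace-tokenized text, showing only the basic type/token ratio (the slide evaluates three TTR variants whose exact definitions are not recoverable here) and mean word length.

```python
def type_token_ratio(tokens):
    """Type/token ratio: distinct word types divided by token count.
    Lower values suggest a poorer, 'simplified' vocabulary."""
    return len(set(tokens)) / len(tokens)

def mean_word_length(tokens):
    """Average token length in characters, an indirect measure of lexical variety."""
    words = [t for t in tokens if t.isalpha()]
    return sum(len(w) for w in words) / len(words)
```

In the supervised setup, each 2000-token-style chunk yields one value per feature, and the classifier is trained on those values.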


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER ‘LANGUAGE’)


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category        Feature                Accuracy (%)
Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.
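Ratios like the ones above are simple relative-frequency comparisons; a minimal sketch, assuming the two corpora are available as plain token lists:

```python
def frequency_ratio(word, t_tokens, o_tokens):
    """Relative frequency of `word` in translated text (T) divided by
    its relative frequency in original text (O)."""
    f_t = t_tokens.count(word) / len(t_tokens)
    f_o = o_tokens.count(word) / len(o_tokens)
    return f_t / f_o
```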


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
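Threshold PMI can be sketched as counting the bigrams whose pointwise mutual information exceeds a threshold. For compactness this toy version estimates the probabilities from the chunk itself; the actual feature would estimate them from a larger corpus.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, threshold=0.0):
    """Count bigram types whose PMI exceeds `threshold`, where
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    nb = n - 1  # number of bigram positions
    high = 0
    for (x, y), c in bigrams.items():
        pmi = math.log2((c / nb) / ((unigrams[x] / n) * (unigrams[y] / n)))
        if pmi > threshold:
            high += 1
    return high
```

The per-chunk count of high-PMI bigrams is the feature value fed to the classifier.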


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO ‘LANGUAGE’)


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas trigrams typical of T are -ble and the.

Part-of-speech trigrams yield an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
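A hedged sketch of the POS-trigram representation: each chunk becomes a vector of normalized POS-trigram frequencies over a fixed trigram vocabulary. The tags are assumed to come from an external POS tagger, and the studies feed such vectors to a discriminative classifier; only the feature extraction is shown here.

```python
from collections import Counter

def pos_trigram_features(pos_tags, vocab):
    """Relative frequency of each POS trigram in `vocab` within one chunk.
    `pos_tags` is the tag sequence produced by an external tagger."""
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(trigrams.values()) or 1
    return [trigrams[t] / total for t in vocab]
```

For instance, a chunk rich in modal constructions would show a high value for the (MD, VB, VBN) dimension.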


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with only a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)
Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)
Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL    Weight
,       37.83     49.79     0.76   T     T1
(        0.42      0.72     0.58   T     T2
'        1.94      2.53     0.77   T     T3
)        0.40      0.72     0.56   T     T4
         0.30      0.30     1.00   n/a   n/a
[        0.01      0.02     0.45   T     n/a
]        0.01      0.02     0.44   T     n/a
"        0.33      0.22     1.46   O     O7
         0.22      0.17     1.25   O     O6
.       38.20     34.60     1.10   O     O5
         1.20      1.17     1.03   n/a   O4
         0.84      0.83     1.01   n/a   O3
         1.57      1.11     1.41   O     O2
-        2.68      2.25     1.19   O     O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED
FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION: USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR                   60.8   56.2   96.3
HAN            59.7          58.7   98.1
LIT            64.3   61.5          97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION: USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN                    63.8   94.0
EUR + LIT             64.1          92.9
HAN + LIT      59.8                 96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED
FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

    p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
    p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

    D_JS(X, Ci) = 2^√(JSD(P_X || P_Ci))
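The marker language models and the JSD-based distance can be sketched as follows. This is a toy version: treating |Om| as the total count of marker tokens is an assumption, as is the 2^√JSD form of the distance.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, eps=1e-3):
    """Add-epsilon unigram model over a marker vocabulary:
    p(w) = (tf(w) + eps) / (N + eps * |V|), with N the marker-token count."""
    tf = Counter(t for t in tokens if t in vocab)
    denom = sum(tf.values()) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two distributions
    defined over the same vocabulary."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance(p_class, p_cluster):
    """Distance used for cluster labeling (assumed form: 2 ** sqrt(JSD))."""
    return 2 ** math.sqrt(jsd(p_class, p_cluster))
```

Labeling then compares the cross-products of these distances for the two clusters, as in the rule on the next slide.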


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

    label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
                "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                             EUR    HAN    LIT    TED
FW                                          88.6   88.9   78.8   87.5
FW, char-trigrams, POS-trigrams             91.1   86.2   78.2   90.9
FW, POS-trigrams, contextual FW             95.8   89.8   72.3   86.3
FW, char-trigrams, POS-trigrams,
  contextual FW, cohesive markers           94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus                   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      n/a           92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat              92.5      60.7      77.5          66.8
Two-phase         91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

    "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

    Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

    E* = argmax_E P(E | F)
       = argmax_E P(F | E) × P(E) / P(F)
       = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

    E* = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                               (translation model)  (language model)
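For a toy candidate list, the noisy-channel choice reduces to an argmax over summed log-probabilities; the two score tables below stand in for externally trained translation and language models (real decoders search a huge hypothesis space instead of enumerating candidates).

```python
def best_translation(candidates, tm_logprob, lm_logprob):
    """Noisy-channel choice over a (toy) candidate list:
    argmax_E  log P(F | E) + log P(E)."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])
```

A faithful but disfluent candidate can thus lose to a slightly less faithful candidate that the language model strongly prefers.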


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

    PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... w_{i-1}) )^(1/N)
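Equivalently, given per-token log2-probabilities assigned by an LM, perplexity is two raised to the average negative log-likelihood:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log2-probabilities:
    PP = 2 ** ( -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}) )."""
    return 2 ** (-sum(logprobs) / len(logprobs))
```

A uniform model assigning probability 1/4 to every token therefore has perplexity 4.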

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation
Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94
T-DE         122.50        7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44        4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named entity abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20
T-DE         107.48        6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72        3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62
T-DE         38.67         4.81
T-FR         34.75        14.46
T-IT         36.85         9.30
T-NL         39.44         2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores by language model:

LM     DE-EN   FR-EN   IT-EN   NL-EN
MIX    21.43   28.67   25.41   24.20
O-EN   21.10   27.98   24.69   23.40
T-DE   21.90   28.01   24.62   24.26
T-FR   21.16   29.14   25.37   23.56
T-IT   21.29   28.75   25.96   23.87
T-NL   21.20   28.11   24.77   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O Finally which is serious in the report of Mr Olivier is that itproposes a constitution tripotage

T Finally and this is serious in the report by Mr olivier it is because itproposes a constitution tripotage

SOURCE C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O Even when it is something of valuable which has been pointed out byall the members of the European Council

T It is nevertheless something of a valuable which has been pointedout by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE Une telle Europe serait un gage de paix et marquerait le refusde tout nationalisme ethnique

O Such a Europe would be a show of peace and would the rejection ofany ethnic nationalism

T Such a Europe would be a show of peace and would mark therefusal of all ethnic nationalism

SOURCE Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O Your report Mrs Sudre its emphasis quite rightly on the need toact in the long term

T Your report Mrs Sudre places the emphasis quite rightly on theneed to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O We do not say nothing more on the responsibility of themanufacturers particularly in Britain which were the first responsible

T We do not say anything either on the responsibility of themanufacturers particularly in great Britain who were the firstresponsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
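Perplexity-minimization interpolation can be sketched, in a toy form, as a grid search for the interpolation weight that maximizes the probability of a development set of phrase pairs. Real implementations (e.g. Sennrich, 2012) interpolate full phrase tables feature-wise and optimize the weights analytically; the names and grid search here are illustrative.

```python
import math

def interpolate(pt_a, pt_b, weight):
    """Linear interpolation of two phrase tables (phrase pair -> probability)."""
    keys = set(pt_a) | set(pt_b)
    return {k: weight * pt_a.get(k, 0.0) + (1 - weight) * pt_b.get(k, 0.0)
            for k in keys}

def best_weight(pt_a, pt_b, dev_pairs, grid=11):
    """Pick the weight that maximizes dev-set log-probability
    (equivalently, minimizes dev-set perplexity)."""
    def score(w):
        table = interpolate(pt_a, pt_b, w)
        # Floor probabilities to avoid log(0) for unseen pairs.
        return sum(math.log(max(table.get(pair, 0.0), 1e-12)) for pair in dev_pairs)
    return max((i / (grid - 1) for i in range(grid)), key=score)
```

If the development set matches the S→T subset, the search pushes the weight toward that subset's table, which is exactly the adaptation effect exploited here.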


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99

(rows: training language; columns: test language; accuracy in %)

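The train-on-L1/test-on-L2 setup can be illustrated with a minimal sketch: a nearest-centroid classifier (a simple stand-in for the SVMs typically used in such experiments) trained on toy POS-trigram profiles from one source language and applied to another. All counts are invented:

```python
from collections import Counter

def centroid(samples):
    """Mean feature vector of a list of sparse count vectors."""
    total = Counter()
    for s in samples:
        total.update(s)
    return {f: c / len(samples) for f, c in total.items()}

def sq_dist(v, c):
    """Squared Euclidean distance between sparse vectors."""
    feats = set(v) | set(c)
    return sum((v.get(f, 0) - c.get(f, 0)) ** 2 for f in feats)

def classify(v, cen_o, cen_t):
    return "O" if sq_dist(v, cen_o) < sq_dist(v, cen_t) else "T"

# Invented POS-trigram counts; interference from one Romance L1 (FR)
# is assumed to resemble interference from another (ES).
train_o = [{"DT NN VB": 5, "IN DT NN": 2}, {"DT NN VB": 4, "IN DT NN": 3}]
train_t_fr = [{"NN IN NN": 5, "DT JJ NN": 4}, {"NN IN NN": 4, "DT JJ NN": 5}]
cen_o, cen_t = centroid(train_o), centroid(train_t_fr)

test_t_es = {"NN IN NN": 4, "DT JJ NN": 3}   # "translated from Spanish"
test_o = {"DT NN VB": 5, "IN DT NN": 2}      # original English
```

The classifier trained on FR-translated vs. original English still recognizes the ES-translated chunk as T, mirroring the high off-diagonal numbers for related languages in the table.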

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.

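A minimal sketch of the unsupervised step: average-link agglomerative clustering over toy POS-trigram vectors of "English translated from X". The profiles are invented, assuming Romance sources leave similar fingerprints on the English target:

```python
import math
from itertools import combinations

def cosine_dist(u, v):
    """Cosine distance between sparse count vectors."""
    feats = set(u) | set(v)
    dot = sum(u.get(f, 0.0) * v.get(f, 0.0) for f in feats)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm

def cluster(vectors):
    """Average-link agglomerative clustering; returns the merge history."""
    clusters = [frozenset([name]) for name in vectors]
    merges = []
    while len(clusters) > 1:
        def link(a, b):  # mean pairwise distance between two clusters
            return sum(cosine_dist(vectors[x], vectors[y])
                       for x in a for y in b) / (len(a) * len(b))
        a, b = min(combinations(clusters, 2), key=lambda p: link(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

# Invented POS-trigram profiles of English translated from four sources.
vecs = {
    "FR": {"DT JJ NN": 9, "NN IN NN": 7, "MD VB VBN": 1},
    "ES": {"DT JJ NN": 8, "NN IN NN": 8, "MD VB VBN": 1},
    "DE": {"MD VB VBN": 9, "VBN IN DT": 6, "DT JJ NN": 1},
    "FI": {"NN NN NN": 9, "VBN IN DT": 2, "MD VB VBN": 2},
}
history = cluster(vecs)
```

With these toy vectors the first merge joins FR and ES, the analogue of the language-family structure recovered in the "tree of languages" slides below.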

The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                       Dist            Distw
Feature                EN T    FR T    EN T    FR T

POS-trigrams           1.02    1.16    1.07    1.34
positional tokens      1.10    1.15    1.30    1.37
function words         1.32    1.82    1.43    1.90
cohesive markers       1.95    1.95    2.15    2.03
feature combination    1.00    1.46    1.13    1.58
random tree            2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA     80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI      3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE      2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER      1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN      2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA      2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN      2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR      1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA      2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL      0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR      4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Supervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Supervised Classification

RESULTS: SANITY CHECK

Category   Feature          Accuracy (%)

Sanity     Token unigrams   100
           Token bigrams    100


Supervised Classification Simplification

RESULTS: SIMPLIFICATION

Category         Feature                  Accuracy (%)

Simplification   TTR (1)                  72
                 TTR (2)                  72
                 TTR (3)                  76
                 Mean word rank (1)       69
                 Mean word rank (2)       77
                 N most frequent words    64
                 Mean word length         66
                 Syllable ratio           61
                 Lexical density          53
                 Mean sentence length     65


Supervised Classification Simplification

ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has a lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.

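The simplification features above are cheap to compute; a minimal sketch with invented text chunks (the hypothesis being that translated text shows lower lexical variety):

```python
def ttr(tokens):
    """Type-token ratio: distinct word forms over total tokens."""
    return len(set(tokens)) / len(tokens)

def mean_word_length(tokens):
    return sum(len(t) for t in tokens) / len(tokens)

def mean_sentence_length(sentences):
    """Average number of tokens per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)

# Invented chunks: the "translated" one repeats its lexical choices.
o_chunk = "judges wrote several sharp dissents criticizing procedural flaws".split()
t_chunk = "the judges wrote the texts that the judges wrote before".split()
```

In the full experiments these statistics are extracted per 2,000-token chunk and fed to a supervised classifier, rather than compared directly.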

Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS: EXPLICITATION

Category        Feature                Accuracy (%)

Explicitation   Cohesive markers       81
                Explicit naming        58
                Single naming          56
                Mean multiple naming   54


Supervised Classification Explicitation

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS: NORMALIZATION

Category        Feature         Accuracy (%)

Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


Supervised Classification Normalization

ANALYSIS: NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick on highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.

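The threshold-PMI feature can be sketched as follows: estimate PMI(w1, w2) = log2(p(w1, w2) / (p(w1) p(w2))) from bigram counts, then count the bigrams above a threshold. The toy token sequence is invented, built around one recurring collocation:

```python
import math
from collections import Counter

def pmi_table(tokens):
    """PMI of adjacent bigrams, from maximum-likelihood counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {bg: math.log2((c / n_bi) /
                          ((unigrams[bg[0]] / n_uni) * (unigrams[bg[1]] / n_uni)))
            for bg, c in bigrams.items()}

def above_threshold(pmis, threshold):
    """The 'threshold PMI' feature: how many bigrams are highly associated."""
    return sum(1 for v in pmis.values() if v > threshold)

# Invented token sequence; "stand trial" recurs as a fixed expression.
tokens = "stand trial today or stand firm and stand trial again they stand trial".split()
pmis = pmi_table(tokens)
```

On real chunks the count of bigrams above the threshold serves as a single classifier feature; O chunks are reported to yield higher counts than T chunks.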

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category       Feature                      Accuracy (%)

Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


Supervised Classification Interference

ANALYSIS: INTERFERENCE

The interference-based classifiers are the best performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams yield an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference.


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)

Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category        Feature                               Accuracy (%)

Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


Supervised Classification Miscellaneous

ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

          Frequency
Mark      O        T       Ratio   LL   Weight

,         37.83    49.79   0.76    T    T1
(          0.42     0.72   0.58    T    T2
'          1.94     2.53   0.77    T    T3
)          0.40     0.72   0.56    T    T4
(?)        0.30     0.30   1.00    —    —
[          0.01     0.02   0.45    T    —
]          0.01     0.02   0.44    T    —
"          0.33     0.22   1.46    O    O7
(?)        0.22     0.17   1.25    O    O6
.         38.20    34.60   1.10    O    O5
(?)        1.20     1.17   1.03    —    O4
(?)        0.84     0.83   1.01    —    O3
(?)        1.57     1.11   1.41    O    O2
-          2.68     2.25   1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not be generalized to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

Literary classics written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED

FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED

FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

D_JS(X, Ci) = 2 ^ √(JSD(P_X || P_Ci))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets.

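The labeling procedure above can be sketched end to end: add-ε unigram models over a marker vocabulary, base-2 JSD, and the cross-product decision rule. Marker sets, counts, and clusters below are invented toy values (the real markers come from 600-chunk Europarl/Hansard samples):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence with base-2 logs."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (Lin, 1991)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def unigram_lm(counts, vocab, eps=0.01):
    """Add-epsilon unigram model over the marker vocabulary."""
    total = sum(counts.get(w, 0) for w in vocab)
    return [(counts.get(w, 0) + eps) / (total + eps * len(vocab)) for w in vocab]

def d_js(x_counts, c_counts, vocab):
    return 2 ** math.sqrt(jsd(unigram_lm(x_counts, vocab),
                              unigram_lm(c_counts, vocab)))

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Label the two clusters with the cross-product rule from the slide."""
    if d_js(o_markers, c1, vocab) * d_js(t_markers, c2, vocab) < \
       d_js(o_markers, c2, vocab) * d_js(t_markers, c1, vocab):
        return "O", "T"
    return "T", "O"

# Invented marker profiles and cluster counts.
vocab = ["but", "well", "moreover", "thus"]
o_markers = {"but": 8, "well": 5}
t_markers = {"moreover": 9, "thus": 6}
cluster1 = {"moreover": 7, "thus": 5, "but": 1}
cluster2 = {"but": 7, "well": 4, "moreover": 1}
labels = label_clusters(o_markers, t_markers, cluster1, cluster2, vocab)
```

Because D_JS is monotone in the divergence, the rule simply assigns each cluster to the class whose marker distribution it is closer to, with both proximities weighed jointly.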

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                        EUR    HAN    LIT    TED

FW                                                                     88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                                      91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                                      95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers   94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpora                  EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      —             92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 · · · fm into an English sentence E = e1 · · · el. The best translation:

Ê = arg max_E P(E | F)
  = arg max_E P(F | E) × P(E) / P(F)
  = arg max_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = arg max_{E ∈ English} P(F | E) × P(E)
                          (translation model) (language model)
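The argmax can be sketched with invented probabilities (a toy ranking, not a real decoder; the French-English entries and their values are hypothetical):

```python
# Toy noisy-channel ranking: choose the E maximizing P(F | E) * P(E).
# All probabilities below are invented for illustration.
p_f_given_e = {("maison", "the house"): 0.8,   # translation model P(F | E)
               ("maison", "the home"): 0.7}
p_e = {"the house": 0.05, "the home": 0.02}    # language model P(E)

def score(f, e):
    # unnormalized posterior: P(F | E) * P(E); P(F) is constant over E
    return p_f_given_e[(f, e)] * p_e[e]

best = max(p_e, key=lambda e: score("maison", e))
# "the house" wins: 0.8 * 0.05 = 0.040 beats 0.7 * 0.02 = 0.014
```

Dropping P(F) is what licenses the last step of the derivation above: it does not depend on the candidate E being ranked.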


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation Language Models

LANGUAGE MODELS
RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


Applications for Machine Translation Language Models

LANGUAGE MODELS
METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... w_{i-1}) )^{1/N}
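The N-th root of a product of N inverse probabilities is best computed in log space to avoid underflow; a minimal sketch (the uniform toy model is an assumption for illustration):

```python
import math

def perplexity(lm_prob, tokens):
    """PP(LM; w1..wN) = (prod_i 1/P(wi | w1..w_{i-1}))^(1/N), in log space."""
    log_sum = sum(math.log(lm_prob(tokens[:i], w))
                  for i, w in enumerate(tokens))
    return math.exp(-log_sum / len(tokens))

# A uniform model over a 4-word vocabulary assigns P = 0.25 everywhere,
# so its perplexity is exactly 4 on any text over that vocabulary.
uniform = lambda history, w: 0.25
pp = perplexity(uniform, ["a", "b", "c", "d", "a", "b"])
```

Lower perplexity means the model finds the reference corpus less "surprising", which is how the fitness comparisons in the tables below are read.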

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations
Orig. Lang.  1-gram   2-gram  3-gram  4-gram
Mix          451.50    93.00   69.36   66.47
O-EN         468.09   103.74   79.57   76.79
T-DE         443.14    88.48   64.99   62.07
T-FR         460.98    99.90   76.23   73.38
T-IT         465.89   102.31   78.50   75.67
T-NL         457.02    97.34   73.54   70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang.  1-gram   2-gram  3-gram  4-gram
Mix          472.05    99.04   75.60   72.68
O-EN         500.56   115.48   91.14   88.31
T-DE         486.78   108.50   84.39   81.41
T-FR         463.58    94.59   71.24   68.37
T-IT         476.05   102.69   79.23   76.36
T-NL         490.09   110.67   86.61   83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang.  1-gram   2-gram  3-gram  4-gram
Mix          395.99    88.46   67.35   64.40
O-EN         415.47    99.92   79.27   76.34
T-DE         404.64    95.22   73.73   70.85
T-FR         395.99    89.44   68.38   65.54
T-IT         384.55    81.90   60.85   57.91
T-NL         411.58    98.78   76.98   73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang.  1-gram   2-gram  3-gram  4-gram
Mix          434.89    90.73   69.05   66.08
O-EN         448.11   100.17   78.23   75.46
T-DE         437.68    93.67   71.54   68.57
T-FR         445.00    97.32   75.59   72.55
T-IT         448.11   100.19   78.06   75.19
T-NL         423.13    83.99   62.17   59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS
ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
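The four abstraction levels can be sketched on a toy pre-tagged sentence (the tags and named-entity flags are illustrative assumptions, not the actual preprocessing pipeline):

```python
# Each token: (word, POS tag, is_named_entity) -- invented annotations.
sentence = [("The", "DT", False), ("Commission", "NN", True),
            ("must", "MD", False), ("act", "VB", False), (".", "PUNCT", False)]

# 1. No punctuation: drop punctuation tokens.
no_punct = [(w, t, ne) for w, t, ne in sentence if t != "PUNCT"]
# 2. No named entities: replace NE words with a placeholder.
no_ne = [("NE" if ne else w, t, ne) for w, t, ne in no_punct]
# 3. No nouns: replace nouns with their POS tag.
no_nouns = [(t if t.startswith("NN") else w, t, ne) for w, t, ne in no_ne]
# 4. No words, only parts of speech.
pos_only = [t for _, t, _ in sentence]
```

Each level discards more lexical material, so the perplexity comparisons below increasingly measure structure rather than vocabulary.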


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation
Orig. Lang.  Perplexity  Improvement (%)
MIX            105.91        19.73
O-EN           131.94          --
T-DE           122.50         7.16
T-FR            99.52        24.58
T-IT           112.71        14.58
T-NL           126.44         4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX             93.88        18.51
O-EN           115.20          --
T-DE           107.48         6.70
T-FR            88.96        22.77
T-IT            99.17        13.91
T-NL           110.72         3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX             36.02        11.34
O-EN            40.62          --
T-DE            38.67         4.81
T-FR            34.75        14.46
T-IT            36.85         9.30
T-NL            39.44         2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction
Orig. Lang.  Perplexity  Improvement (%)
MIX              7.99         2.66
O-EN             8.20          --
T-DE             8.08         1.47
T-FR             7.89         3.77
T-IT             8.00         2.47
T-NL             8.11         1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN              FR to EN              IT to EN              NL to EN
LM      BLEU          LM      BLEU          LM      BLEU          LM      BLEU
MIX     21.43         MIX     28.67         MIX     25.41         MIX     24.20
O-EN    21.10         O-EN    27.98         O-EN    24.69         O-EN    23.40
T-DE    21.90         T-DE    28.01         T-DE    24.62         T-DE    24.26
T-FR    21.16         T-FR    29.14         T-FR    25.37         T-FR    23.56
T-IT    21.29         T-IT    28.75         T-IT    25.96         T-IT    23.87
T-NL    21.20         T-NL    28.11         T-NL    24.77         T-NL    24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

DOES SIZE MATTER


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES
COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES
VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES
INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS
RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS
DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS
DIRECTION MATTERS

Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS
DIRECTION MATTERS

Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS
DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation, using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
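The interpolation technique can be sketched as a grid search for the mixture weight that minimizes perplexity on a held-out set (a toy stand-in for Sennrich's optimizer; all phrase pairs and probabilities below are invented):

```python
import math

# Hypothetical phrase translation probabilities from the two subsets.
pt_same_dir  = {"chat ||| cat": 0.7, "maison ||| house": 0.6}  # S->T subset
pt_other_dir = {"chat ||| cat": 0.4, "maison ||| house": 0.8}  # T->S subset

# Phrase pairs observed in a (toy) held-out development set.
dev_phrases = ["chat ||| cat", "chat ||| cat", "maison ||| house"]

def dev_perplexity(weight):
    """Perplexity of the interpolated table p = w*p1 + (1-w)*p2 on the dev set."""
    logs = [math.log(weight * pt_same_dir[p] + (1 - weight) * pt_other_dir[p])
            for p in dev_phrases]
    return math.exp(-sum(logs) / len(logs))

# Grid search over interpolation weights 0.0, 0.1, ..., 1.0.
best_w = min((w / 10 for w in range(11)), key=dev_perplexity)
```

With these toy numbers the search prefers the same-direction table outright (weight 1.0), echoing the direction-matters results; real dev sets typically yield an intermediate weight.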


Applications for Machine Translation Translation Models

TRANSLATION MODELS
ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO  MIX-FO
S→T        35.21   35.21   35.73
UNION      35.27   35.36   35.94
PPLMIN-1   35.46   35.59   36.26
PPLMIN-2   35.75   35.65   36.20
CrEnt      35.54   35.45   36.75
PplRatio   35.59   35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS
ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO  MIX-EO
S→T        29.15   29.15   29.94
UNION      29.27   29.44   30.01
PPLMIN-1   29.64   29.94   29.65
PPLMIN-2   29.50   30.45   30.12
CrEnt      29.47   29.45   30.44
PplRatio   29.65   29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations into English from L1 differ from translations into English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1 TEST ON L2)

     IT   FR   ES   DE   FI
IT   99   93   91   75   63
FR   90   98   90   83   70
ES   85   90   98   82   69
DE   73   74   74   98   70
FI   58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations into English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                       Dist            Dist_w
Feature                EN T    FR T    EN T    FR T
POS-trigrams           1.02    1.16    1.07    1.34
positional tokens      1.10    1.15    1.30    1.37
function words         1.32    1.82    1.43    1.90
cohesive markers       1.95    1.95    2.15    2.03
feature combination    1.00    1.46    1.13    1.58
random tree            2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

     ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA   80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI    3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE    2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER    1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN    2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA    2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN    2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR    1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA    2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL    0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR    4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models trained on texts translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:
Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. ISSN 1931-0145. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. ISSN 0018-9448. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.



Supervised Classification

RESULTS SANITY CHECK

Category  Feature         Accuracy (%)
Sanity    Token unigrams  100
          Token bigrams   100


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category        Feature                Accuracy (%)
Simplification  TTR (1)                72
                TTR (2)                72
                TTR (3)                76
                Mean word rank (1)     69
                Mean word rank (2)     77
                N most frequent words  64
                Mean word length       66
                Syllable ratio         61
                Lexical density        53
                Mean sentence length   65


Supervised Classification Simplification

ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly on chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (N top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but rather from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (N most frequent words) has a lower predictive power (64%).

While mean sentence length is much above chance level (65%), the results are contrary to common assumptions in Translation Studies.
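Several of these simplification features are one-liners over a tokenized chunk; a minimal sketch (whitespace tokenization and the toy chunk are simplifying assumptions):

```python
# Toy text chunk; real experiments use fixed-size chunks of corpus text.
text = "we support free trade but we must learn from past mistakes"
tokens = text.split()

# Type-token ratio: distinct word forms over running words.
ttr = len(set(tokens)) / len(tokens)

# Mean word length in characters, an indirect measure of lexical variety.
mean_word_len = sum(len(t) for t in tokens) / len(tokens)

# Mean sentence length in tokens (this toy chunk is a single sentence).
sentences = [tokens]
mean_sent_len = sum(len(s) for s in sentences) / len(sentences)
```

Lower TTR and shorter words in T than in O are what the simplification hypothesis predicts; the classifiers above test how well each such number separates the two classes.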


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category       Feature               Accuracy (%)
Explicitation  Cohesive Markers      81
               Explicit naming       58
               Single naming         56
               Mean multiple naming  54


Supervised Classification Explicitation

ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.
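The signal behind this classifier is just a relative-frequency comparison; a toy sketch (the two "texts" and their counts are invented, and the real feature set comprises 40 markers over corpus-size samples):

```python
def marker_rate(text, marker):
    """Relative frequency of a cohesive marker in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return tokens.count(marker) / len(tokens)

# Invented O and T snippets; T overuses connectives, as the slide reports.
o_text = "we want the best product for the best price"
t_text = "moreover we want the best product and thus the best price"

markers = ["moreover", "thus", "besides"]
rates = {m: (marker_rate(o_text, m), marker_rate(t_text, m)) for m in markers}
# In this toy pair, "moreover" and "thus" appear only in T.
```

Feeding such per-marker rates into a standard classifier is what yields the 81% accuracy figure above.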


Supervised Classification Normalization

RESULTS NORMALIZATION

Category       Feature        Accuracy (%)
Normalization  Repetitions    55
               Contractions   50
               Average PMI    52
               Threshold PMI  66


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore to attest to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny 2001)

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended
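The threshold-PMI feature can be sketched as follows, with PMI over adjacent word pairs as in Church and Hanks (1990); the threshold value in the usage is illustrative:

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (x, y): math.log2((c / n_bi) /
                          ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }

def n_above_threshold(tokens, threshold):
    """The feature itself: how many bigram types are highly associated."""
    return sum(1 for v in pmi_bigrams(tokens).values() if v > threshold)
```

A chunk of original English would be expected to yield a higher count than a translated chunk of the same size.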


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE A THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                      Accuracy (%)
Interference  POS unigrams                 90
              POS bigrams                  97
              POS trigrams                 98
              Character unigrams           85
              Character bigrams            98
              Character trigrams           100
              Prefixes and suffixes        80
              Contextual function words    100
              Positional token frequency   97


Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they capture both affixes and function words

For example, typical trigrams in O are -ion and all, whereas trigrams typical of T are -ble and the

Part-of-speech trigrams make an extremely cheap and efficient classifier

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used
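Extracting the POS-trigram feature vector from a tagged sentence is trivial; a sketch over Penn Treebank-style tags (the pre-tagged input is illustrative — a real pipeline would run a tagger first):

```python
from collections import Counter

def pos_trigrams(tags):
    """Counts of consecutive POS-tag triples; each triple is one feature."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# "measures must be taken" -> NNS MD VB VBN, containing the MD+VB+VBN trigram
tags = ["NNS", "MD", "VB", "VBN"]
feats = pos_trigrams(tags)
```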


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)


Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O
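Positional token frequency records which tokens occupy fixed slots of a sentence (e.g., first, second, last); a minimal sketch of the feature extraction:

```python
from collections import Counter

def positional_tokens(sentences):
    """One sparse feature vector per chunk: keys like ('first', 'But')
    count how often each token occupies each position."""
    feats = Counter()
    for sent in sentences:
        if len(sent) >= 2:
            feats[("first", sent[0])] += 1
            feats[("second", sent[1])] += 1
            feats[("last", sent[-1])] += 1
    return feats
```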

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001)

This is, in fact, standardization rather than interference


Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category      Feature                      Accuracy (%)
Interference  POS bigrams                  96
              POS trigrams                 96
              Character bigrams            95
              Character trigrams           96
              Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category       Feature                              Accuracy (%)
Miscellaneous  Function words                       96
               Punctuation (1)                      81
               Punctuation (2)                      85
               Punctuation (3)                      80
               Pronouns                             77
               Ratio of passive forms to all verbs  65


Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English


Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark    O       T       Ratio   LL   Weight
,       37.83   49.79   0.76    T    T1
(       0.42    0.72    0.58    T    T2
'       1.94    2.53    0.77    T    T3
)       0.40    0.72    0.56    T    T4
…       0.30    0.30    1.00    —    —
[       0.01    0.02    0.45    T    —
]       0.01    0.02    0.44    T    —
"       0.33    0.22    1.46    O    O7
…       0.22    0.17    1.25    O    O6
.       38.20   34.60   1.10    O    O5
…       1.20    1.17    1.03    —    O4
…       0.84    0.83    1.01    —    O3
…       1.57    1.11    1.41    O    O2
-       2.68    2.25    1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament, 2001–2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning 2001–2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED
FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation
EUR             —      60.8   56.2   96.3
HAN             59.7   —      58.7   98.1
LIT             64.3   61.5   —      97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train        left-out test    X-validation
EUR + HAN    63.8             94.0
EUR + LIT    64.1             92.9
HAN + LIT    59.8             96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED
FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = 2 √ JSD(PX || PCi)


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting of several feature sets


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                EUR    HAN    LIT    TED
FW                                                             88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                              91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                              95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers   94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus                   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      —             92.2


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus    EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = arg max_E P(E | F)
  = arg max_E P(F | E) × P(E) / P(F)
  = arg max_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = arg max_{E ∈ English}  P(F | E)  ×  P(E)
                           [translation model]  [language model]
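With toy probability tables (the numbers, words, and candidate set below are purely illustrative), the noisy-channel decision reduces to an argmax:

```python
def best_translation(source, candidates, tm, lm):
    """arg max over candidates of P(F | E) * P(E): the noisy-channel rule."""
    return max(candidates, key=lambda e: tm.get((source, e), 0.0) * lm.get(e, 0.0))

# hypothetical translation-model and language-model probabilities
tm = {("maison", "house"): 0.8, ("maison", "home"): 0.7}
lm = {"house": 0.5, "home": 0.4}
```

A real decoder searches a huge hypothesis space instead of a fixed candidate list, but the objective it maximizes is the same product.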


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)
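BLEU compares clipped candidate n-gram counts against a reference; a simplified single-reference sketch (the actual metric is corpus-level and differs in details such as smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counts of all n-grams in the token sequence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of clipped n-gram precisions."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        log_precisions.append(math.log(max(overlap, 1e-9) / max(sum(cand.values()), 1)))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```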


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts


Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … w_{i−1}) )^{1/N}
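The formula transcribes directly into code, for any conditional model P_LM (the uniform toy model below is illustrative):

```python
import math

def perplexity(prob, tokens):
    """PP = ( prod_i 1 / P(w_i | w_1..w_{i-1}) )^(1/N), computed in log space."""
    n = len(tokens)
    log_sum = sum(math.log(prob(tokens[:i], tokens[i])) for i in range(n))
    return math.exp(-log_sum / n)

# toy model: uniform over a 4-word vocabulary, ignoring the history
uniform = lambda history, word: 0.25
```

Lower perplexity means a better fit of the LM to the reference corpus.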

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50   93.00    69.36    66.47
O-EN          468.09   103.74   79.57    76.79
T-DE          443.14   88.48    64.99    62.07
T-FR          460.98   99.90    76.23    73.38
T-IT          465.89   102.31   78.50    75.67
T-NL          457.02   97.34    73.54    70.56


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05   99.04    75.60    72.68
O-EN          500.56   115.48   91.14    88.31
T-DE          486.78   108.50   84.39    81.41
T-FR          463.58   94.59    71.24    68.37
T-IT          476.05   102.69   79.23    76.36
T-NL          490.09   110.67   86.61    83.55


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99   88.46    67.35    64.40
O-EN          415.47   99.92    79.27    76.34
T-DE          404.64   95.22    73.73    70.85
T-FR          395.99   89.44    68.38    65.54
T-IT          384.55   81.90    60.85    57.91
T-NL          411.58   98.78    76.98    73.94


Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89   90.73    69.05    66.08
O-EN          448.11   100.17   78.23    75.46
T-DE          437.68   93.67    71.54    68.57
T-FR          445.00   97.32    75.59    72.55
T-IT          448.11   100.19   78.06    75.19
T-NL          423.13   83.99    62.17    59.09


Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50       7.16
T-FR          99.52        24.58
T-IT          112.71       14.58
T-NL          126.44       4.17


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named-entity abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           93.88        18.51
O-EN          115.20       —
T-DE          107.48       6.70
T-FR          88.96        22.77
T-IT          99.17        13.91
T-NL          110.72       3.89


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67        4.81
T-FR          34.75        14.46
T-IT          36.85        9.30
T-NL          39.44        2.91


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11


Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

LM     DE→EN   FR→EN   IT→EN   NL→EN
MIX    21.43   28.67   25.41   24.20
O-EN   21.10   27.98   24.69   23.40
T-DE   21.90   28.01   24.62   24.26
T-FR   21.16   29.14   25.37   23.56
T-IT   21.29   28.75   25.96   23.87
T-NL   21.20   28.11   24.77   24.52


Applications for Machine Translation Language Models

DOES SIZE MATTER?



Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task     S→T     T→S
FR-EN    33.64   30.88
EN-FR    32.11   30.35
DE-EN    26.53   23.67
EN-DE    16.96   16.17
IT-EN    28.70   26.84
EN-IT    23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase-table interpolation using perplexity minimization (Sennrich 2012)
Add a feature in the phrase table indicating the direction of translation


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist            Distw
Feature               EN T    FR T    EN T    FR T
POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree           2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

     ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)  F1
ARA   80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI    3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE    2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER    1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN    2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA    2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN    2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR    1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA    2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL    0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR    4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models trained on texts translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.

© Shuly Wintner (University of Haifa), Translationese, 12 December 2016

Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010, 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Supervised Classification Simplification

RESULTS SIMPLIFICATION

Category: Simplification

Feature                  Accuracy (%)
TTR (1)                  72
TTR (2)                  72
TTR (3)                  76
Mean word rank (1)       69
Mean word rank (2)       77
N most frequent words    64
Mean word length         66
Syllable ratio           61
Lexical density          53
Mean sentence length     65


ANALYSIS SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%, and syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is well above chance level (65%), the results are contrary to common assumptions in Translation Studies.
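These lexical measures are cheap to compute. As a rough illustration (hypothetical code, not the implementation behind the reported numbers; the tokenization and the rank assigned to out-of-list words are my own assumptions), the main simplification features can be derived as follows:

```python
import re

def simplification_features(text, freq_list=None):
    """Compute a few simplification features: type/token ratio (TTR),
    mean word length, mean sentence length, and (optionally) mean word
    rank relative to an external frequency list."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    feats = {
        "ttr": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }
    if freq_list is not None:
        # freq_list: words ordered by corpus frequency (rank 1 = most frequent)
        rank = {w: i + 1 for i, w in enumerate(freq_list)}
        unseen = len(freq_list) + 1  # assumed rank for out-of-list words
        feats["mean_word_rank"] = sum(rank.get(t, unseen) for t in tokens) / len(tokens)
    return feats
```

A classifier is then trained on such per-chunk feature vectors; chance level is 50%, since O and T chunks are balanced.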


SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS EXPLICITATION

Category: Explicitation

Feature                 Accuracy (%)
Cohesive markers        81
Explicit naming         58
Single naming           56
Mean multiple naming    54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS NORMALIZATION

Category: Normalization

Feature          Accuracy (%)
Repetitions      55
Contractions     50
Average PMI      52
Threshold PMI    66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words, and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
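The threshold-PMI feature counts bigram types whose pointwise mutual information, in the sense of Church and Hanks (1990), exceeds a fixed threshold. A minimal sketch (hypothetical function; estimating probabilities from raw corpus counts is my own assumption, not necessarily the experimental setup):

```python
import math
from collections import Counter

def pmi_above_threshold(tokens, threshold):
    """Count bigram types whose pointwise mutual information exceeds
    `threshold`, where PMI(x, y) = log2( p(x,y) / (p(x) * p(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    count = 0
    for (x, y), c in bigrams.items():
        pxy = c / n_bi                      # joint bigram probability
        px, py = unigrams[x] / n_uni, unigrams[y] / n_uni
        if math.log2(pxy / (px * py)) > threshold:
            count += 1
    return count
```

Under the hypothesis discussed above, this count would be higher for original English (many fixed expressions) than for translated English.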


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category Feature Accuracy ()

Interference

POS unigrams 90POS bigrams 97POS trigrams 98Character unigrams 85Character bigrams 98Character trigrams 100Prefixes and suffixes 80Contextual function words 100Positional token frequency 97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
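A POS-trigram classifier needs only the tag sequence of each chunk. The sketch below (hypothetical code; in practice such vectors would feed a standard supervised learner) maps a tagged chunk onto normalized frequencies over a fixed trigram inventory, e.g. the most frequent trigrams in the training corpus:

```python
from collections import Counter

def pos_trigram_features(pos_tags, top_trigrams):
    """Map a POS-tagged chunk to a vector of normalized frequencies
    over a fixed list of trigram features (e.g. the 300 most frequent
    trigrams in the training corpus)."""
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = max(sum(trigrams.values()), 1)  # avoid division by zero
    return [trigrams[t] / total for t in top_trigrams]
```

A chunk rich in MD+VB+VBN sequences (must be taken, should be given) would score high on that coordinate, the pattern noted above as typical of translations.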


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category: Interference

Feature                       Accuracy (%)
POS bigrams                   96
POS trigrams                  96
Character bigrams             95
Character trigrams            96
Positional token frequency    93


Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous

Feature                                Accuracy (%)
Function words                         96
Punctuation (1)                        81
Punctuation (2)                        85
Punctuation (3)                        80
Pronouns                               77
Ratio of passive forms to all verbs    65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,      37.83     49.79     0.76    T    T1
(       0.42      0.72     0.58    T    T2
'       1.94      2.53     0.77    T    T3
)       0.40      0.72     0.56    T    T4
…       0.30      0.30     1.00    —    —
[       0.01      0.02     0.45    T    —
]       0.01      0.02     0.44    T    —
"       0.33      0.22     1.46    O    O7
…       0.22      0.17     1.25    O    O6
.      38.20     34.60     1.10    O    O5
…       1.20      1.17     1.03    —    O4
…       0.84      0.83     1.01    —    O3
…       1.57      1.11     1.41    O    O2
-       2.68      2.25     1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modalities (written vs. spoken), registers, genres, dates, etc.


DATASETS

Europarl: the proceedings of the European Parliament, 2001–2006.

The Canadian Hansard: transcripts of the Canadian Parliament, 2001–2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


SUPERVISED CLASSIFICATION

feature \ corpus     EUR     HAN     LIT     TED

FW                   96.3    98.1    97.3    97.7
char-trigrams        98.8    97.1    99.5    100.0
POS-trigrams         98.5    97.2    98.7    92.0
contextual FW        95.2    96.8    94.1    86.3
cohesive markers     83.6    86.9    78.6    81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR     HAN     LIT     X-validation

EUR             —       60.8    56.2    96.3
HAN             59.7    —       58.7    98.1
LIT             64.3    61.5    —       97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR     HAN     LIT     X-validation

EUR + HAN       —       —       63.8    94.0
EUR + LIT       —       64.1    —       92.9
HAN + LIT       59.8    —       —       96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus     EUR     HAN     LIT     TED

FW                   88.6    88.9    78.8    87.5
char-trigrams        72.1    63.8    70.3    78.6
POS-trigrams         96.9    76.0    70.7    76.1
contextual FW        92.9    93.2    68.2    67.0
cohesive markers     63.1    81.2    67.1    63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

  p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
  p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

  DJS(X, Ci) = 2 √(JSD(PX || PCi))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported both by C1's proximity to the class X and by C2's proximity to the other class:

  label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
  label(C1) = "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting over several feature sets.
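The labeling procedure can be sketched end to end: build add-ε unigram models over a marker vocabulary, then compare products of JSD-based distances. This is an illustrative reimplementation under assumed data structures (count dictionaries), not the original code:

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (Lin, 1991) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def marker_lm(counts, vocab, eps=1e-3):
    """Add-epsilon unigram model over `vocab`, as in the slides:
    p(w) = (tf(w) + eps) / (total + eps * |V|)."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + eps * len(vocab)
    return [(counts.get(w, 0) + eps) / denom for w in vocab]

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Assign 'O'/'T' labels to two clusters by comparing products of
    JSD-based distances, per the labeling rule above."""
    po, pt = marker_lm(o_markers, vocab), marker_lm(t_markers, vocab)
    p1, p2 = marker_lm(c1, vocab), marker_lm(c2, vocab)
    if jsd(po, p1) * jsd(pt, p2) < jsd(po, p2) * jsd(pt, p1):
        return "O", "T"
    return "T", "O"
```

Since the rule only compares products of distances, the constant factor in the distance definition does not affect the decision.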


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                       EUR     HAN     LIT     TED

FW                                                    88.6    88.9    78.8    87.5
FW + char-trigrams + POS-trigrams                     91.1    86.2    78.2    90.9
FW + POS-trigrams + contextual FW                     95.8    89.8    72.3    86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers                  94.1    91.0    79.2    88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot easily be generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpus                 EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain      93.7      99.5      99.8          92.2
XMeans: generated # clusters    2         2         3             3
XMeans: accuracy by domain      93.6      99.5      —             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus    EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model: the best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ··· fm into an English sentence E = e1 ··· el.

The best translation:

  E* = argmax_E P(E | F)
     = argmax_E [P(F | E) × P(E)] / P(F)
     = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model P(F | E) and a language model P(E):

  E* = argmax_{E ∈ English}  P(F | E) × P(E)


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus).

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

  PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / PLM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
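For a unigram model, the perplexity formula above reduces to the following sketch (hypothetical helper, written for illustration; the actual experiments used higher-order n-gram models with smoothing):

```python
import math

def perplexity(lm, tokens):
    """Perplexity of a unigram language model `lm` (a dict mapping each
    word to its probability) over a token sequence: the geometric mean
    of 1 / p(w_i), computed in log space for numerical stability."""
    log_sum = sum(-math.log2(lm[w]) for w in tokens)
    return 2 ** (log_sum / len(tokens))
```

A model that assigns probability 1/2 to every observed token has perplexity 2: it is exactly as uncertain as a fair coin. Lower perplexity means a better fit to the reference corpus.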


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation.

No named entities.

No nouns.

No words, only parts of speech.


ABSTRACTION RESULTS

No punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17


ABSTRACTION RESULTS

Named-entity abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89


ABSTRACTION RESULTS

Noun abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91


ABSTRACTION RESULTS

Part-of-speech abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM      BLEU
MIX     21.43
O-EN    21.10
T-DE    21.90
T-FR    21.16
T-IT    21.29
T-NL    21.20

FR to EN
LM      BLEU
MIX     28.67
O-EN    27.98
T-DE    28.01
T-FR    29.14
T-IT    28.75
T-NL    28.11

IT to EN
LM      BLEU
MIX     25.41
O-EN    24.69
T-DE    24.62
T-FR    25.37
T-IT    25.96
T-NL    24.77

NL to EN
LM      BLEU
MIX     24.20
O-EN    23.40
T-DE    24.26
T-FR    23.56
T-IT    23.87
T-NL    24.52


DOES SIZE MATTER



TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O Finally which is serious in the report of Mr Olivier is that itproposes a constitution tripotage

T Finally and this is serious in the report by Mr olivier it is because itproposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O Even when it is something of valuable which has been pointed out byall the members of the European Council

T It is nevertheless something of a valuable which has been pointedout by all the members of the European Council


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O Such a Europe would be a show of peace and would the rejection ofany ethnic nationalism

T Such a Europe would be a show of peace and would mark therefusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O Your report Mrs Sudre its emphasis quite rightly on the need toact in the long term

T Your report Mrs Sudre places the emphasis quite rightly on theneed to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O We do not say nothing more on the responsibility of themanufacturers particularly in Britain which were the first responsible

T We do not say anything either on the responsibility of themanufacturers particularly in great Britain who were the firstresponsible


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T   T → S   (BLEU)

FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English (BLEU)

Corpus subset   S → T   T → S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French (BLEU)

Corpus subset   S → T   T → S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to MOSES.
Phrase-table interpolation using perplexity minimization (Sennrich 2012).
Add a feature in the phrase table indicating the direction of translation.
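The perplexity-minimization idea can be illustrated at the language-model level. The sketch below is a toy illustration only (the corpora are invented, and Sennrich's actual method interpolates phrase-table probabilities, not unigram LMs): it grid-searches the mixture weight of two smoothed unigram models so as to minimize perplexity on a held-out set.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.1):
    """Add-epsilon smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def perplexity(lm, tokens):
    """2 ** (mean negative log2-probability) of the held-out tokens."""
    logp = sum(math.log2(lm[w]) for w in tokens)
    return 2 ** (-logp / len(tokens))

def best_interpolation(lm_a, lm_b, dev_tokens, steps=101):
    """Grid-search the mixture weight that minimizes dev-set perplexity."""
    best = None
    for i in range(steps):
        lam = i / (steps - 1)
        mix = {w: lam * lm_a[w] + (1 - lam) * lm_b[w] for w in lm_a}
        ppl = perplexity(mix, dev_tokens)
        if best is None or ppl < best[1]:
            best = (lam, ppl)
    return best

# Toy corpora: model A resembles the held-out text, model B does not.
corpus_a = "we must act now and we must act together".split()
corpus_b = "the committee notes the report of the committee".split()
dev = "we must act together now".split()
vocab = set(corpus_a) | set(corpus_b) | set(dev)
lam, ppl = best_interpolation(unigram_lm(corpus_a, vocab),
                              unigram_lm(corpus_b, vocab), dev)
```

On this toy data the search puts all the weight on the matching model (λ = 1); with real, partially overlapping corpora an intermediate weight wins.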


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English (BLEU)

System     MIX     MIX-EO  MIX-FO
S → T      35.21   35.21   35.73
UNION      35.27   35.36   35.94
PPLMIN-1   35.46   35.59   36.26
PPLMIN-2   35.75   35.65   36.20
CrEnt      35.54   35.45   36.75
PplRatio   35.59   35.78   36.22


Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French (BLEU)

System     MIX     MIX-FO  MIX-EO
S → T      29.15   29.15   29.94
UNION      29.27   29.44   30.01
PPLMIN-1   29.64   29.94   29.65
PPLMIN-2   29.50   30.45   30.12
CrEnt      29.47   29.45   30.44
PplRatio   29.65   29.62   30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from a source language L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare with a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.
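The clustering step can be sketched as follows. All names and POS-trigram profiles below are invented toy data, Jensen-Shannon divergence serves as the distance, and a plain single-linkage agglomerative loop stands in for the clustering algorithm actually used in the experiments.

```python
import math
from collections import Counter
from itertools import combinations

def distribution(trigrams):
    counts = Counter(trigrams)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in set(p) | set(q)}
    kl = lambda a: sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def single_linkage(dist, items):
    """Greedy agglomerative clustering; returns the merges in order."""
    clusters = {i: (name,) for i, name in enumerate(items)}
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: min(dist[x][y]
                                      for x in clusters[ab[0]]
                                      for y in clusters[ab[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges

# Toy translationese profiles; the Romance pair is made similar on purpose.
profiles = {
    "FR": ["DT-NN-IN"] * 6 + ["NN-IN-DT"] * 3 + ["MD-VB-VBN"],
    "ES": ["DT-NN-IN"] * 5 + ["NN-IN-DT"] * 4 + ["MD-VB-VBN"],
    "FI": ["NN-NN-NN"] * 7 + ["MD-VB-VBN"] * 3,
}
langs = sorted(profiles)
dist = {a: {b: jsd(distribution(profiles[a]), distribution(profiles[b]))
            for b in langs} for a in langs}
merges = single_linkage(dist, langs)
```

With such profiles the first merge joins FR and ES, mirroring how translationese-based trees recover genetic proximity of the source languages.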


The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                        Dist            Distw
Feature                 EN T    FR T    EN T    FR T
POS-trigrams            1.02    1.16    1.07    1.34
positional tokens       1.10    1.15    1.30    1.37
function words          1.32    1.82    1.43    1.90
cohesive markers        1.95    1.95    2.15    2.03
feature combination     1.00    1.46    1.13    1.58
random tree             2.00            1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al. 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3   80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1   88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3   86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2   87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4   74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2   82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1   78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0   80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5   77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3   85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80   76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016).

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18–22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.



Supervised Classification Simplification

ANALYSIS: SIMPLIFICATION

Lexical density fails altogether to predict the status of a text, being nearly at chance level (53% accuracy).

The first two TTR measures perform relatively well (72%), and the indirect measures of lexical variety (mean word length, 66%; syllable ratio, 61%) are above chance level.

Mean word rank (77%) is closely related to the feature studied by Laviosa (1998) (n top words), with two differences: our list of frequent items is much larger, and we generate the frequency list not from the corpora under study but from an external, much larger reference corpus.

In contrast, the design that follows Laviosa (1998) more strictly (n most frequent words) has lower predictive power (64%).

While mean sentence length is well above chance level (65%), the results are contrary to common assumptions in Translation Studies.
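A minimal sketch of how such simplification cues can be computed from raw text; the toy tokenizer and the small feature set here are simplifying assumptions, not the exact definitions used in the study.

```python
import re

def simplification_features(text):
    """Type-token ratio, mean word length, and mean sentence length,
    assuming sentences end with '.', '!' or '?' (a toy tokenizer)."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(map(len, tokens)) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }

feats = simplification_features("The cat sat. The cat ran!")
```

Lower TTR and shorter sentences in T than in O would be evidence for the simplification hypothesis; the slide above shows how unevenly these cues actually predict translationese.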


Supervised Classification Simplification

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')


Supervised Classification Explicitation

RESULTS: EXPLICITATION

Category        Feature                 Accuracy (%)
Explicitation   Cohesive markers        81
                Explicit naming         58
                Single naming           56
                Mean multiple naming    54


Supervised Classification Explicitation

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS: NORMALIZATION

Category        Feature         Accuracy (%)
Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66


Supervised Classification Normalization

ANALYSIS: NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
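The threshold-PMI feature can be sketched as follows. The corpus here is a toy fragment; the study computed the statistics against a large external reference corpus.

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """PMI(w1, w2) = log2( p(w1 w2) / (p(w1) * p(w2)) ) over adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {(a, b): math.log2((c / n_bi) /
                              ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
            for (a, b), c in bigrams.items()}

def above_threshold(pmis, threshold):
    """The 'threshold PMI' feature: how many bigram types are highly associated."""
    return sum(1 for v in pmis.values() if v >= threshold)

tokens = "stand idly by while others stand firm and stand trial".split()
pmis = pmi_bigrams(tokens)
```

Texts with many bigrams above the threshold tend to be original; translated text shows fewer strongly associated pairs.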


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category       Feature                      Accuracy (%)
Interference   POS unigrams                 90
               POS bigrams                  97
               POS trigrams                 98
               Character unigrams           85
               Character bigrams            98
               Character trigrams           100
               Prefixes and suffixes        80
               Contextual function words    100
               Positional token frequency   97


Supervised Classification Interference

ANALYSIS: INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, as in must be taken, should be given, can be used.
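A sketch of classification with POS-trigram features. The chunks below are invented, and a simple nearest-centroid rule stands in for the actual supervised learner; only the feature extraction mirrors the text.

```python
from collections import Counter

def trigram_vector(pos_tags):
    """Relative frequencies of POS trigrams in one text chunk."""
    tri = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(tri.values())
    return {t: c / total for t, c in tri.items()}

def centroid(vectors):
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0) for v in vectors) / len(vectors) for k in keys}

def distance(u, v):
    """Squared Euclidean distance over sparse feature vectors."""
    return sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in set(u) | set(v))

def classify(chunk, cent_o, cent_t):
    vec = trigram_vector(chunk)
    return "O" if distance(vec, cent_o) < distance(vec, cent_t) else "T"

# Toy data: T chunks over-use MD-VB-VBN ("must be taken"), O chunks do not.
o_chunks = [["DT", "NN", "VBD", "IN", "DT", "NN"],
            ["DT", "JJ", "NN", "VBD", "RB"]]
t_chunks = [["NN", "MD", "VB", "VBN", "IN", "NN"],
            ["DT", "NN", "MD", "VB", "VBN", "RB"]]
cent_o = centroid([trigram_vector(c) for c in o_chunks])
cent_t = centroid([trigram_vector(c) for c in t_chunks])
label = classify(["PRP", "MD", "VB", "VBN", "IN", "DT", "NN"], cent_o, cent_t)
```

A held-out chunk containing the modal + base verb + participle pattern lands closer to the T centroid, matching the slide's observation.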


Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Positional token frequency yields 97 accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is, in fact, standardization rather than interference.
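Positional token frequency can be sketched as follows; the choice of positions and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def positional_features(sentences):
    """Relative frequency of each token in salient sentence positions
    (e.g., how often a sentence opens with 'but')."""
    feats = Counter()
    for sent in sentences:
        for pos, tok in (("first", sent[0]), ("second", sent[1]),
                         ("last", sent[-1])):
            feats[(pos, tok.lower())] += 1
    return {k: v / len(sentences) for k, v in feats.items()}

sents = [["But", "we", "disagree", "."], ["But", "why", "stop", "?"],
         ["We", "agree", "."]]
feats = positional_features(sents)
```

Here the feature ("first", "but") gets a high value, exactly the kind of cue that separates O from T in the slide above.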


Supervised Classification Interference

ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)
Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93


Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category        Feature                               Accuracy (%)
Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65


Supervised Classification Miscellaneous

ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors

The mark '.' (in effect an indicator of sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


Supervised Classification Miscellaneous

ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark    O       T       Ratio   LL   Weight
,       37.83   49.79   0.76    T    T1
(        0.42    0.72   0.58    T    T2
'        1.94    2.53   0.77    T    T3
)        0.40    0.72   0.56    T    T4
         0.30    0.30   1.00    —    —
[        0.01    0.02   0.45    T    —
]        0.01    0.02   0.44    T    —
"        0.33    0.22   1.46    O    O7
         0.22    0.17   1.25    O    O6
.       38.20   34.60   1.10    O    O5
         1.20    1.17   1.03    —    O4
         0.84    0.83   1.01    —    O3
         1.57    1.11   1.41    O    O2
-        2.68    2.25   1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED

FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR            –      60.8   56.2   96.3
HAN            59.7   –      58.7   98.1
LIT            64.3   61.5   –      97.3


Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR + HAN      –      –      63.8   94.0
EUR + LIT      –      64.1   –      92.9
HAN + LIT      59.8   –      –      96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED

FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes but cannot labelthose classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε · |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε · |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

D_JS(X, Ci) = 2^√(JSD(P_X ‖ P_Ci))


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
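The labeling scheme can be sketched in a few lines. The marker sets, counts, and the value of ε below are invented for illustration; the 2^√JSD distance follows the formula above.

```python
import math

def smoothed_lm(token_counts, vocab, eps=0.01):
    """p(w) = (tf(w) + eps) / (N + eps * |V|), as on the slide."""
    n = sum(token_counts.get(w, 0) for w in vocab)
    return {w: (token_counts.get(w, 0) + eps) / (n + eps * len(vocab))
            for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2); p and q share the same vocabulary."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a: sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def d_js(p_x, p_c):
    # Distance between a class LM and a cluster LM.
    return 2 ** math.sqrt(jsd(p_x, p_c))

def label_clusters(p_o, p_t, p_c1, p_c2):
    """Label C1 'O' iff supported by C1's proximity to O and C2's to T."""
    if d_js(p_o, p_c1) * d_js(p_t, p_c2) < d_js(p_o, p_c2) * d_js(p_t, p_c1):
        return "O", "T"
    return "T", "O"

vocab = ["moreover", "besides", "thus", "but", "however"]
p_o = smoothed_lm({"but": 30, "however": 10}, vocab)  # toy O-marker profile
p_t = smoothed_lm({"moreover": 20, "besides": 10, "thus": 10}, vocab)  # toy T
p_c1 = smoothed_lm({"moreover": 15, "thus": 8, "but": 2}, vocab)
p_c2 = smoothed_lm({"but": 25, "however": 7, "thus": 1}, vocab)
labels = label_clusters(p_o, p_t, p_c1, p_c2)
```

The cluster whose word distribution resembles the T-marker profile receives the label T; the rule is symmetric, so swapping the clusters swaps the labels.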


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                         EUR    HAN    LIT    TED

FW                                      88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams       91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW       95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
   + contextual FW + cohesive markers   94.1   91.0   79.2   88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering into three clusters yields the same results

MIXED-DOMAIN CLASSIFICATION

corpora                           EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans   accuracy by domain         93.7      99.5       99.8         92.2
XMeans   generated # of clusters       2         2          3            3
         accuracy by domain         93.6      99.5         –           92.2

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T independently of their domain

A two-phase approach

A flat approach
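A toy illustration of the two approaches, under the assumption that the domain signal (dimension 0) is much stronger than the translationese signal (dimension 1); the tiny deterministic k-means below is only a stand-in for the clustering actually used in the experiments:

```python
import numpy as np

def kmeans2(X, iters=10):
    # minimal k=2 k-means; first and last points serve as initial centers,
    # which keeps this illustration deterministic
    centers = X[[0, -1]].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def flat(X):
    # one k=2 clustering over the mixed-domain data:
    # tends to split by domain, leaving O vs. T at chance level
    return kmeans2(X)

def two_phase(X):
    # phase 1: cluster by domain; phase 2: re-cluster each domain
    # separately, so the weaker translationese signal can surface
    domain = kmeans2(X)
    ot = np.zeros(len(X), dtype=int)
    for k in (0, 1):
        idx = np.where(domain == k)[0]
        ot[idx] = kmeans2(X[idx])
    return domain, ot

# synthetic chunk vectors: dimension 0 encodes domain (0 vs. 10),
# dimension 1 the weak O/T signal (~0 vs. ~1)
X = np.array([[0.0, 0.0], [0.0, 0.2], [0.0, 1.0], [0.0, 1.2],
              [10.0, 0.0], [10.0, 0.2], [10.0, 1.0], [10.0, 1.2]])
```

On this data, `flat(X)` splits by domain only, while `two_phase(X)` recovers the O/T split within each domain, mirroring the results table below.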

CLUSTERING IN A MIXED-DOMAIN SETUP

corpora       EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat            92.5      60.7       77.5         66.8
Two-phase       91.3      79.4       85.3         67.5

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode."
(Warren Weaver, 1955)

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

  Ê = argmax_E P(E | F)
    = argmax_E [ P(F | E) × P(E) ] / P(F)
    = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

  Ê = argmax_{E ∈ English}   P(F | E)   ×   P(E)
                          (translation model) (language model)
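A toy noisy-channel "decoder" over an enumerated candidate set: each English candidate E is scored by log P(F | E) + log P(E) and the argmax wins. The log-probabilities here are invented for illustration; a real decoder searches a vast hypothesis space instead of enumerating it.

```python
# pick the candidate maximizing log P(F|E) + log P(E)
def best_translation(candidates, tm_logprob, lm_logprob):
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])

tm = {  # log P(F | E): faithfulness to the source F (made-up values)
    "the house is small": -1.0,
    "the house is teeny": -0.6,
    "house the small is": -0.5,
}
lm = {  # log P(E): fluency of the target (made-up values)
    "the house is small": -2.0,
    "the house is teeny": -3.5,
    "house the small is": -6.0,
}
```

Here the word-salad candidate is most "faithful" but very disfluent, and the rare synonym is fluent-ish but less faithful; the combination of the two models selects the candidate that balances both.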

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)
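To make the evaluation metric concrete, a sentence-level BLEU sketch: the geometric mean of clipped n-gram precisions (n = 1..4) times a brevity penalty. Real BLEU (Papineni et al 2002) is corpus-level and supports multiple references and smoothing; this toy version only illustrates the computation.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        # clipped n-gram precision: each candidate n-gram counts at most
        # as often as it appears in the reference
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_precisions.append(math.log(overlap / sum(cand.values())))
    # brevity penalty punishes candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0; a candidate sharing no 4-gram with it scores 0.0.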

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

  PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N}  1 / PLM(wi | w1 … wi−1) )^(1/N)
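In code, computed in log space to avoid underflow; a unigram probability table stands in here for the higher-order LMs of the experiments:

```python
import math

def perplexity(lm, tokens):
    # PP = 2 ** ( -(1/N) * sum_i log2 P(w_i | history) );
    # lm maps a token to its probability (unigram, for simplicity)
    n = len(tokens)
    neg_log = -sum(math.log2(lm[w]) for w in tokens)
    return 2 ** (neg_log / n)
```

Lower perplexity means the model finds the reference text less surprising, i.e. fits it better.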

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)

PERPLEXITY RESULTS

German to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56

French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55

Italian to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94

Dutch to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech
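The abstraction levels above can be sketched as a single function over a POS-tagged sentence. The tag names used here (PUNCT, NE, NN) are illustrative assumptions, not the tagset actually used in the experiments:

```python
def abstract(tagged, ne=False, nouns=False, pos_only=False):
    # tagged: list of (word, pos) pairs; returns progressively abstracted tokens
    out = []
    for word, pos in tagged:
        if pos == "PUNCT":
            continue                      # level 1: no punctuation
        if pos_only:
            out.append(pos)               # level 4: only parts of speech
        elif (ne and pos == "NE") or (nouns and pos == "NN"):
            out.append(pos)               # levels 2-3: mask NEs / nouns
        else:
            out.append(word)
    return out

sent = [("John", "NE"), ("likes", "VB"), ("apples", "NN"), (".", "PUNCT")]
```

Each level removes lexical information, testing how much of the translationese signal survives in increasingly abstract representations.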

ABSTRACTION RESULTS

No Punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17

Named Entity Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89

Noun Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91

Part-of-speech Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11

MACHINE TRANSLATION RESULTS

DE to EN            FR to EN            IT to EN            NL to EN
LM     BLEU         LM     BLEU         LM     BLEU         LM     BLEU
MIX    21.43        MIX    28.67        MIX    25.41        MIX    24.20
O-EN   21.10        O-EN   27.98        O-EN   24.69        O-EN   23.40
T-DE   21.90        T-DE   28.01        T-DE   24.62        T-DE   24.26
T-FR   21.16        T-FR   29.14        T-FR   25.37        T-FR   23.56
T-IT   21.29        T-IT   28.75        T-IT   25.96        T-IT   23.87
T-NL   21.20        T-NL   28.11        T-NL   24.77        T-NL   24.52

DOES SIZE MATTER

[figures not recoverable from the extraction]

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the opposite direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?

TRANSLATION MODELS: DIRECTION MATTERS

Task     S→T     T→S
FR-EN    33.64   30.88
EN-FR    32.11   30.35
DE-EN    26.53   23.67
EN-DE    16.96   16.17
IT-EN    28.70   26.84
EN-IT    23.81   21.28

Task: French-to-English

Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73

Task: English-to-French

Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
- Union: simple concatenation of the corpora
- Two phrase tables: train a phrase table for each subset and pass both to MOSES
- Phrase-table interpolation using perplexity minimization (Sennrich 2012)
- Add a feature in the phrase table indicating the direction of translation
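A toy version of perplexity-minimization interpolation (Sennrich 2012): choose the mixture weight that minimizes the interpolated model's perplexity on a held-out dev sample. The real method interpolates phrase-table features; here the two "models" are simple unigram distributions, an assumption made for brevity.

```python
import math

def interpolate(p1, p2, lam):
    # mixture model: lam * p1 + (1 - lam) * p2
    keys = set(p1) | set(p2)
    return {k: lam * p1.get(k, 0.0) + (1 - lam) * p2.get(k, 0.0) for k in keys}

def best_lambda(p1, p2, dev, steps=21):
    # grid search for the weight minimizing dev-set perplexity
    def ppl(p):
        return 2 ** (-sum(math.log2(p[x]) for x in dev) / len(dev))
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: ppl(interpolate(p1, p2, lam)))
```

If the dev sample matches one subset's distribution much better, the search assigns that subset most of the weight.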

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22

Task: English-to-French

System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)
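A sketch of the cross-classification setup: train an O-vs.-T classifier on chunks translated from L1, then test it on chunks translated from L2. A nearest-centroid classifier stands in for the classifiers typically used, and the 2-d "function word frequency" vectors are synthetic:

```python
import numpy as np

def train_centroid(X, y):
    # nearest-centroid classifier for O (label 0) vs. T (label 1)
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def predict(Z):
        d0 = np.linalg.norm(Z - c0, axis=1)
        d1 = np.linalg.norm(Z - c1, axis=1)
        return (d1 < d0).astype(int)
    return predict

# train on O vs. T-from-L1 ...
X_l1 = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])
y_l1 = np.array([0, 0, 1, 1])
predict = train_centroid(X_l1, y_l1)

# ... then test on O vs. T-from-L2: if L2's interference resembles L1's,
# the classifier transfers, which is the cross-classification effect
X_l2 = np.array([[0.05, 0.05], [0.9, 1.1]])
```

The closer the source languages, the closer the T-chunk distributions, and the better the trained classifier transfers, matching the accuracy pattern in the table below.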

EXAMPLE (TRAIN ON L1, TEST ON L2)

       IT   FR   ES   DE   FI
IT     99   93   91   75   63
FR     90   98   90   83   70
ES     85   90   98   82   69
DE     73   74   74   98   70
FI     58   70   64   81   99

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

[figure not recoverable from the extraction]

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

[figure not recoverable from the extraction]

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

[figure not recoverable from the extraction]

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                       Dist             Dist_w
Feature                EN T    FR T    EN T    FR T
POS-trigrams           1.02    1.16    1.07    1.34
positional tokens      1.10    1.15    1.30    1.37
function words         1.32    1.82    1.43    1.90
cohesive markers       1.95    1.95    2.15    2.03
feature combination    1.00    1.46    1.13    1.58
random tree            2.00            1.95

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated by the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

SIMPLIFICATION

EXAMPLE (MEAN SENTENCE LENGTH PER 'LANGUAGE')

[figure not recoverable from the extraction]

RESULTS: EXPLICITATION

Category        Feature                 Accuracy (%)
Explicitation   Cohesive markers        81
                Explicit naming         58
                Single naming           56
                Mean multiple naming    54

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka 1986; Koppel and Ordan 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)

Normalization   Repetitions     55
                Contractions    50
                Average PMI     52
                Threshold PMI   66

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny 2001):

English has far more highly associated bigrams than translations. O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
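A minimal sketch of the threshold-PMI feature, in the spirit of Church and Hanks (1990); the threshold, minimum count and toy corpus below are illustrative, not the settings used in the experiments:

```python
import math
from collections import Counter

def high_pmi_bigrams(tokens, threshold=3.0, min_count=2):
    """Return bigrams whose pointwise mutual information exceeds a
    threshold; high-PMI bigrams approximate fixed expressions."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    result = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # ignore unreliable one-off bigrams
        # PMI = log2( P(w1 w2) / (P(w1) * P(w2)) )
        pmi = math.log2((count / (n - 1)) /
                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi > threshold:
            result[(w1, w2)] = pmi
    return result
```

Counting how many bigrams pass the threshold in O vs. T chunks gives the "Threshold PMI" feature reported above.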

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS INTERFERENCE

Category       Feature                      Accuracy (%)

Interference   POS unigrams                  90
               POS bigrams                   97
               POS trigrams                  98
               Character unigrams            85
               Character bigrams             98
               Character trigrams           100
               Prefixes and suffixes         80
               Contextual function words    100
               Positional token frequency    97

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they capture both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams constitute an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
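The POS-trigram representation can be sketched as follows; this is an illustrative reconstruction, not the original feature extractor, and in practice the trigram vocabulary would be collected from training data:

```python
from collections import Counter

def pos_trigram_features(pos_tags, vocabulary):
    """Represent a POS-tagged chunk as normalized counts over a fixed
    trigram vocabulary -- the kind of vector fed to an O-vs-T classifier."""
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(trigrams.values()) or 1  # avoid division by zero
    return [trigrams[t] / total for t in vocabulary]
```

For example, the MD+VB+VBN trigram mentioned above ("must be taken") becomes one dimension of the vector.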

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE, IN O AND TEN TS)

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny 2001).

This is, in fact, standardization rather than interference.

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)

Interference   POS bigrams                   96
               POS trigrams                  96
               Character bigrams             95
               Character trigrams            96
               Positional token frequency    93

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)

Miscellaneous   Function words                         96
                Punctuation (1)                        81
                Punctuation (2)                        85
                Punctuation (3)                        80
                Pronouns                               77
                Ratio of passive forms to all verbs    65

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

          Frequency
Mark       O       T     Ratio   LL   Weight

,        37.83   49.79   0.76    T     T1
(         0.42    0.72   0.58    T     T2
'         1.94    2.53   0.77    T     T3
)         0.40    0.72   0.56    T     T4
          0.30    0.30   1.00    —     —
[         0.01    0.02   0.45    T     —
]         0.01    0.02   0.44    T     —
"         0.33    0.22   1.46    O     O7
          0.22    0.17   1.25    O     O6
.        38.20   34.60   1.10    O     O5
          1.20    1.17   1.03    —     O4
          0.84    0.83   1.01    —     O3
          1.57    1.11   1.41    O     O2
-         2.68    2.25   1.19    O     O1

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus     EUR    HAN    LIT    TED

FW                   96.3   98.1   97.3   97.7
char-trigrams        98.8   97.1   99.5  100.0
POS-trigrams         98.5   97.2   98.7   92.0
contextual FW        95.2   96.8   94.1   86.3
cohesive markers     83.6   86.9   78.6   81.8

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation

EUR              —    60.8   56.2       96.3
HAN            59.7     —    58.7       98.1
LIT            64.3   61.5     —        97.3

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test     EUR    HAN    LIT    X-validation

EUR + HAN         —      —   63.8       94.0
EUR + LIT         —    64.1    —        92.9
HAN + LIT       59.8     —     —        96.0

Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus     EUR    HAN    LIT    TED

FW                   88.6   88.9   78.8   87.5
char-trigrams        72.1   63.8   70.3   78.6
POS-trigrams         96.9   76.0   70.7   76.1
contextual FW        92.9   93.2   68.2   67.0
cohesive markers     63.1   81.2   67.1   63.0

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = √( JSD(PX || PCi) )

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets.
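The labeling procedure can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the smoothing constant and marker sets are placeholders, and term frequencies are taken over each token sample directly.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.001):
    """Add-epsilon smoothed unigram model over a marker vocabulary,
    in the spirit of p(w|Om) = (tf(w)+eps) / (|Om| + eps*|V|)."""
    counts = Counter(t for t in tokens if t in vocab)
    denom = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence of two distributions over the same vocab."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a: sum(a[w] * math.log2(a[w] / m[w]) for w in a)
    return 0.5 * kl(p) + 0.5 * kl(q)

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Label clusters c1/c2 as 'O'/'T' by their JSD-based distance to
    the O-marker and T-marker language models."""
    lm = {name: unigram_lm(toks, vocab) for name, toks in
          [("O", o_markers), ("T", t_markers), ("C1", c1), ("C2", c2)]}
    d = lambda x, c: math.sqrt(jsd(lm[x], lm[c]))
    if d("O", "C1") * d("T", "C2") < d("O", "C2") * d("T", "C1"):
        return ("O", "T")
    return ("T", "O")
```

Because the labels are decided jointly, a cluster close to neither class can still be labeled correctly via the other cluster's proximity.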

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                    EUR    HAN    LIT    TED

FW                                                 88.6   88.9   78.8   87.5
FW, char-trigrams, POS-trigrams                    91.1   86.2   78.2   90.9
FW, POS-trigrams, contextual FW                    95.8   89.8   72.3   86.3
FW, char-trigrams, POS-trigrams,
  contextual FW, cohesive markers                  94.1   91.0   79.2   88.6

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS



Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering to three clusters yields the same results.

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus                EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain       93.7      99.5        99.8        92.2
XMeans: generated # clusters      2         2           3           3
XMeans: accuracy by domain       93.6      99.5         —          92.2

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat                92.5      60.7        77.5        66.8
Two-phase           91.3      79.4        85.3        67.5

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el.

The best translation:

Ê = arg maxE P(E | F)
  = arg maxE P(F | E) × P(E) / P(F)
  = arg maxE P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = arg maxE∈English  P(F | E)  ×  P(E)
                      (translation model)  (language model)
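The selection rule can be illustrated over a toy, pre-enumerated candidate set; real decoders search an enormous hypothesis space, and all scores below are made-up log-probabilities:

```python
def best_translation(candidates, tm_logprob, lm_logprob):
    """Noisy-channel selection: choose the candidate E maximizing
    log P(F|E) + log P(E). P(F) is constant over candidates and
    can therefore be ignored."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])
```

A candidate with a slightly weaker translation-model score can still win if the language model finds it much more fluent.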

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus).

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al. 2002).

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts.

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏i=1..N  1 / PLM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al. 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al. 2002).
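The perplexity computation follows directly from the formula; here `logprob` stands for any function returning log2 P(w | history), so the sketch is model-agnostic:

```python
import math

def perplexity(logprob, words):
    """Perplexity of a language model over a test sequence:
    PP = 2 ** ( -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}) ),
    equivalently the N-th root of prod_i 1/P(w_i | history)."""
    log_likelihood = sum(logprob(words[:i], w) for i, w in enumerate(words))
    return 2.0 ** (-log_likelihood / len(words))
```

A model that assigns uniform probability 1/k to every word has perplexity exactly k, which is a handy sanity check.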

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX             105.91         19.73
O-EN            131.94           —
T-DE            122.50          7.16
T-FR             99.52         24.58
T-IT            112.71         14.58
T-NL            126.44          4.17

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX              93.88         18.51
O-EN            115.20           —
T-DE            107.48          6.70
T-FR             88.96         22.77
T-IT             99.17         13.91
T-NL            110.72          3.89

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX              36.02         11.34
O-EN             40.62           —
T-DE             38.67          4.81
T-FR             34.75         14.46
T-IT             36.85          9.30
T-NL             39.44          2.91

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX              7.99           2.66
O-EN             8.20            —
T-DE             8.08           1.47
T-FR             7.89           3.77
T-IT             8.00           2.47
T-NL             8.11           1.11

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM      BLEU
MIX     21.43
O-EN    21.10
T-DE    21.90
T-FR    21.16
T-IT    21.29
T-NL    21.20

FR to EN
LM      BLEU
MIX     28.67
O-EN    27.98
T-DE    28.01
T-FR    29.14
T-IT    28.75
T-NL    28.11

IT to EN
LM      BLEU
MIX     25.41
O-EN    24.69
T-DE    24.62
T-FR    25.37
T-IT    25.96
T-NL    24.77

NL to EN
LM      BLEU
MIX     24.20
O-EN    23.40
T-DE    24.26
T-FR    23.56
T-IT    23.87
T-NL    24.52

Applications for Machine Translation Language Models

DOES SIZE MATTER?

Applications for Machine Translation Language Models

DOES SIZE MATTER?

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how do we build a translation model adapted to the unique properties of the translated text?

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task     S → T    T → S

FR-EN    33.64    30.88
EN-FR    32.11    30.35
DE-EN    26.53    23.67
EN-DE    16.96    16.17
IT-EN    28.70    26.84
EN-IT    23.81    21.28

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset    S → T    T → S
250K             34.35    31.33
500K             35.21    32.38
750K             36.12    32.90
1M               35.73    33.07
1.25M            36.24    33.23
1.5M             36.43    33.73

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset    S → T    T → S
250K             27.74    26.58
500K             29.15    27.19
750K             29.43    27.63
1M               29.94    27.88
1.25M            30.63    27.84
1.5M             29.89    27.83

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:

Union: simple concatenation of the corpora.

Two phrase tables: train a phrase table for each subset and pass both to MOSES.

Phrase table interpolation, using perplexity minimization (Sennrich 2012).

Add a feature in the phrase table indicating the direction of translation.
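The perplexity-minimization idea can be sketched with a toy stand-in that grid-searches a single mixture weight for two unigram probability tables; Sennrich's method interpolates full phrase tables, so everything here (probabilities, dev set, grid size) is illustrative:

```python
import math

def tune_interpolation(p_a, p_b, dev_words, steps=21, floor=1e-9):
    """Grid-search the weight lambda so that the interpolated model
    lambda*p_a + (1-lambda)*p_b minimizes perplexity on a dev set."""
    best = (None, float("inf"))  # (lambda, perplexity)
    for k in range(steps):
        lam = k / (steps - 1)
        ll = sum(math.log2(lam * p_a.get(w, floor) +
                           (1 - lam) * p_b.get(w, floor))
                 for w in dev_words)
        pp = 2.0 ** (-ll / len(dev_words))
        if pp < best[1]:
            best = (lam, pp)
    return best
```

The dev set pulls the weight toward whichever component better matches it, which is exactly how a translationese-matched subset would come to dominate the mixture.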

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System      MIX     MIX-EO   MIX-FO
S → T      35.21    35.21    35.73
UNION      35.27    35.36    35.94
PPLMIN-1   35.46    35.59    36.26
PPLMIN-2   35.75    35.65    36.20
CrEnt      35.54    35.45    36.75
PplRatio   35.59    35.78    36.22

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System      MIX     MIX-FO   MIX-EO
S → T      29.15    29.15    29.94
UNION      29.27    29.44    30.01
PPLMIN-1   29.64    29.94    29.65
PPLMIN-2   29.50    30.45    30.12
CrEnt      29.47    29.45    30.44
PplRatio   29.65    29.62    30.34

The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.

The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT   FR   ES   DE   FI
IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.

The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                        Dist            Distw
Feature               EN T   FR T    EN T   FR T

POS-trigrams          1.02   1.16    1.07   1.34
positional tokens     1.10   1.15    1.30   1.37
function words        1.32   1.82    1.43   1.90
cohesive markers      1.95   1.95    2.15   2.03
feature combination   1.00   1.46    1.13   1.58
random tree           2.00           1.95

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)   R (%)    F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8    80.0    80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9    80.0    84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2    81.0    83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7    93.0    90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8    77.0    75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1    87.0    84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4    87.0    82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2    81.0    80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2    78.0    77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9    73.0    78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9    80.0    78.4

Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Conclusion

CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.

Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al. 2016).

The features of machine translationese.

Robust identification of machine translation output.

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.

Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10-18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.


Supervised Classification: Explicitation

RESULTS EXPLICITATION

Category: Explicitation

Feature                Accuracy (%)
Cohesive markers       81
Explicit naming        58
Single naming          56
Mean multiple naming   54


ANALYSIS EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.

Supervised Classification: Normalization

RESULTS NORMALIZATION

Category: Normalization

Feature         Accuracy (%)
Repetitions     55
Contractions    50
Average PMI     52
Threshold PMI   66


ANALYSIS NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up on highly associated words, and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
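The threshold-PMI feature can be sketched in a few lines of Python. This is a minimal illustration, not the experimental setup: the corpus, threshold value, and function name are invented, and real feature extraction would count bigram types above the threshold per text chunk.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, threshold):
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts.
    # The translationese feature is the number of bigrams clearing the threshold.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    high = {}
    for (x, y), c in bigrams.items():
        pmi = math.log2((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        if pmi > threshold:
            high[(x, y)] = pmi
    return high
```

On a toy token stream where stand and idly co-occur repeatedly, the bigram (stand, idly) clears a modest threshold while unseen pairs are absent.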


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')



EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)

Supervised Classification: Interference

RESULTS INTERFERENCE

Category: Interference

Feature                      Accuracy (%)
POS unigrams                 90
POS bigrams                  97
POS trigrams                 98
Character unigrams           85
Character bigrams            98
Character trigrams           100
Prefixes and suffixes        80
Contextual function words    100
Positional token frequency   97


ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they catch on to both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams constitute an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
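A POS-trigram feature extractor is equally cheap to sketch. The tag sequence is assumed to come from any off-the-shelf tagger; the function name and the top_k cutoff (echoing the reduced 300-feature setting discussed below) are illustrative.

```python
from collections import Counter

def pos_trigram_features(pos_tags, top_k=None):
    # Relative frequency of each POS trigram in one text chunk;
    # optionally keep only the top_k most frequent trigrams.
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(trigrams.values())
    feats = {t: c / total for t, c in trigrams.items()}
    if top_k is not None:
        feats = dict(sorted(feats.items(), key=lambda kv: -kv[1])[:top_k])
    return feats
```

Feeding these per-chunk frequency vectors to any linear classifier reproduces the design of the POS-trigram experiments.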


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word But: there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with But, and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.
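Positional token frequency can be sketched as follows; the choice of positions (first and last two tokens of each sentence) is an assumption for illustration, not necessarily the configuration used in the experiments.

```python
from collections import Counter

def positional_token_features(sentences, positions=(0, 1, -2, -1)):
    # Count how often each token occupies a given sentence position;
    # e.g. feats[(0, "But")] counts sentence-initial "But".
    feats = Counter()
    for sent in sentences:
        if len(sent) < len(positions):
            continue  # too short for the position slots to be distinct
        for pos in positions:
            feats[(pos, sent[pos])] += 1
    return feats
```

Comparing such counts between O and T chunks surfaces exactly the sentence-initial But asymmetry described above.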


ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the 300 most frequent features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category: Interference

Feature                      Accuracy (%)
POS bigrams                  96
POS trigrams                 96
Character bigrams            95
Character trigrams           96
Positional token frequency   93

Supervised Classification: Miscellaneous

RESULTS MISCELLANEOUS

Category: Miscellaneous

Feature                               Accuracy (%)
Function words                        96
Punctuation (1)                       81
Punctuation (2)                       85
Punctuation (3)                       80
Pronouns                              77
Ratio of passive forms to all verbs   65


ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,        37.83     49.79    0.76    T     T1
(         0.42      0.72    0.58    T     T2
'         1.94      2.53    0.77    T     T3
)         0.40      0.72    0.56    T     T4
          0.30      0.30    1.00    —     —
[         0.01      0.02    0.45    T     —
]         0.01      0.02    0.44    T     —
"         0.33      0.22    1.46    O     O7
:         0.22      0.17    1.25    O     O6
.        38.20     34.60    1.10    O     O5
?         1.20      1.17    1.03    —     O4
!         0.84      0.83    1.01    —     O3
;         1.57      1.11    1.41    O     O2
-         2.68      2.25    1.19    O     O1

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

Unsupervised Classification: Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED

FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5  100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test    EUR    HAN    LIT    X-validation

EUR              —     60.8   56.2   96.3
HAN             59.7    —     58.7   98.1
LIT             64.3   61.5    —     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR + HAN       —      —    63.8    94.0
EUR + LIT       —    64.1     —     92.9
HAN + LIT     59.8     —      —     96.0


CLUSTERING (ASSUMING GOLD LABELS)

feature \ corpus    EUR    HAN    LIT    TED

FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0

Unsupervised Classification: Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

D_JS(X, Ci) = 2^√(JSD(P_X || P_Ci))


The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < D_JS(O, C2) × D_JS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets.
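The labeling procedure above can be sketched as follows. This is an illustration under stated assumptions: the distance is taken as √JSD (any monotone transform of JSD leaves the comparison of products unchanged), and the marker words and counts are invented rather than drawn from Europarl or Hansard.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.001):
    # p(w | M) = (tf(w) + eps) / (|M| + eps * |V|): additive smoothing
    # over the marker vocabulary, as in the slide's definition.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def jsd(p, q):
    # Jensen-Shannon divergence with base-2 logs (Lin, 1991).
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a)
    return 0.5 * kl(p) + 0.5 * kl(q)

def label_clusters(p_o, p_t, c1, c2):
    # Label c1 "O" iff c1's distance to the O-markers, combined with
    # c2's distance to the T-markers, is the smaller assignment.
    def d(x, c):
        return math.sqrt(jsd(x, c))
    if d(p_o, c1) * d(p_t, c2) < d(p_o, c2) * d(p_t, c1):
        return ("O", "T")
    return ("T", "O")
```

Because the rule compares cross-products of distances, the labels come out the same regardless of which cluster is presented first.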

Unsupervised Classification: Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                          EUR    HAN    LIT    TED

FW                                                       88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                        91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                        95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers                     94.1   91.0   79.2   88.6

Unsupervised Classification: Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification: Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpus             EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans
  accuracy by domain          93.7      99.5       99.8         92.2
XMeans
  generated # of clusters       2         2          3            3
  accuracy by domain          93.6      99.5        —           92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus    EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat                 92.5      60.7       77.5         66.8
Two-phase            91.3      79.4       85.3         67.5

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

Applications for Machine Translation: Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

— Warren Weaver, 1955

The noisy channel model: the best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el.

The best translation:

E* = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                          [translation model] [language model]
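The noisy-channel argmax can be illustrated over a pre-enumerated candidate list; a real decoder searches this space incrementally rather than listing it, and the scoring callables here are hypothetical stand-ins for the translation and language models.

```python
import math

def best_translation(candidates, tm_logprob, lm_logprob):
    # argmax_E P(F | E) * P(E), computed in log space to avoid underflow.
    return max(candidates, key=lambda e: tm_logprob(e) + lm_logprob(e))
```

Even when the translation model slightly prefers a disfluent candidate, a fluent candidate can win once the language model's score is added in.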


FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus).

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).

Applications for Machine Translation: Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … w_{i-1}) )^{1/N}

Train SMT systems (Koehn et al., 2007) using different LMs, and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
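The perplexity formula can be computed directly in log space; here `logprob` stands in for any language model's conditional log2-probability (an assumed interface, not that of an actual LM toolkit).

```python
import math

def perplexity(logprob, tokens):
    # PP(LM, w_1..w_N) = (prod_i 1 / P(w_i | w_1..w_{i-1}))^(1/N),
    # computed in log space; logprob(history, w) returns log2 P(w | history).
    n = len(tokens)
    total = sum(logprob(tokens[:i], w) for i, w in enumerate(tokens))
    return 2.0 ** (-total / n)
```

As a sanity check, a uniform model over four outcomes has perplexity exactly 4, whatever the test string.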


PERPLEXITY RESULTS

German-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech


ABSTRACTION RESULTS

No Punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX            105.91        19.73
O-EN           131.94          —
T-DE           122.50         7.16
T-FR            99.52        24.58
T-IT           112.71        14.58
T-NL           126.44         4.17


ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX             93.88        18.51
O-EN           115.20          —
T-DE           107.48         6.70
T-FR            88.96        22.77
T-IT            99.17        13.91
T-NL           110.72         3.89


ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX             36.02        11.34
O-EN            40.62          —
T-DE            38.67         4.81
T-FR            34.75        14.46
T-IT            36.85         9.30
T-NL            39.44         2.91


ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX             7.99          2.66
O-EN            8.20           —
T-DE            8.08          1.47
T-FR            7.89          3.77
T-IT            8.00          2.47
T-NL            8.11          1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier, is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.

Applications for Machine Translation: Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how should one build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T    T → S
FR-EN   33.64    30.88
EN-FR   32.11    30.35
DE-EN   26.53    23.67
EN-DE   16.96    16.17
IT-EN   28.70    26.84
EN-IT   23.81    21.28


Task: French-to-English

Corpus subset   S → T    T → S
250K            34.35    31.33
500K            35.21    32.38
750K            36.12    32.90
1M              35.73    33.07
1.25M           36.24    33.23
1.5M            36.43    33.73


Task: English-to-French

Corpus subset   S → T    T → S
250K            27.74    26.58
500K            29.15    27.19
750K            29.43    27.63
1M              29.94    27.88
1.25M           30.63    27.84
1.5M            29.89    27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL Given any bi-text consisting of both S rarr T and T rarr S subsetsimprove translation quality by taking advantage of informationpertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Direction feature: add a feature in the phrase table indicating the direction of translation
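The perplexity-minimization idea can be sketched as follows. Everything here is a toy stand-in: the mini phrase tables, probabilities, and dev set are invented, and a real setup would read Moses phrase-table files; this is only a sketch of the approach attributed to Sennrich (2012), not its actual implementation.

```python
import math

# Hypothetical mini phrase tables, (source, target) -> p(f|e); in a real
# setup these would come from the S->T-trained and T->S-trained subsets.
pt_st = {("chat", "cat"): 0.7, ("chien", "dog"): 0.6}
pt_ts = {("chat", "cat"): 0.5, ("chien", "dog"): 0.8}

def interpolate(pt_a, pt_b, lam):
    """Linear interpolation of two phrase tables: lam*a + (1-lam)*b."""
    keys = set(pt_a) | set(pt_b)
    return {k: lam * pt_a.get(k, 0.0) + (1 - lam) * pt_b.get(k, 0.0)
            for k in keys}

def perplexity(pt, dev_pairs):
    """Perplexity of a phrase table on held-out phrase pairs."""
    log_sum = sum(math.log(pt.get(pair, 1e-9)) for pair in dev_pairs)
    return math.exp(-log_sum / len(dev_pairs))

# Choose the mixture weight minimizing perplexity on a dev set drawn
# from the same translation direction as the MT task.
dev = [("chat", "cat"), ("chien", "dog")]
best_lam = min((l / 10 for l in range(11)),
               key=lambda l: perplexity(interpolate(pt_st, pt_ts, l), dev))
```

A grid search over the mixture weight suffices for this one-parameter sketch; Moses-scale tables interpolate several feature columns at once.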


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX    MIX-EO  MIX-FO
S→T        35.21  35.21   35.73
UNION      35.27  35.36   35.94
PPLMIN-1   35.46  35.59   36.26
PPLMIN-2   35.75  35.65   36.20
CrEnt      35.54  35.45   36.75
PplRatio   35.59  35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX    MIX-FO  MIX-EO
S→T        29.15  29.15   29.94
UNION      29.27  29.44   30.01
PPLMIN-1   29.64  29.94   29.65
PPLMIN-2   29.50  30.45   30.12
CrEnt      29.47  29.45   30.44
PplRatio   29.65  29.62   30.34


The Power of Interference


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan, 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT  FR  ES  DE  FI
IT    99  93  91  75  63
FR    90  98  90  83  70
ES    85  90  98  82  69
DE    73  74  74  98  70
FI    58  70  64  81  99
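The train-on-L1, test-on-L2 protocol can be sketched with toy data. The feature vectors, the "shift" that stands in for interference, and the nearest-centroid classifier are all illustrative assumptions; the published experiments use real corpora and stronger classifiers.

```python
import random

random.seed(0)

def toy_chunk(translated, shift):
    # Toy 10-dim feature vector (think: frequencies of 10 function words);
    # "translationese" shifts the frequencies upward by `shift`.
    vec = [random.gauss(0.0, 1.0) for _ in range(10)]
    return [v + shift for v in vec] if translated else vec

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def classify(c_o, c_t, x):
    d_o = sum((a - b) ** 2 for a, b in zip(x, c_o))
    d_t = sum((a - b) ** 2 for a, b in zip(x, c_t))
    return 1 if d_t < d_o else 0  # 1 = translated (T), 0 = original (O)

# Train an O-vs-T model on chunks "translated from L1" ...
c_o = centroid([toy_chunk(False, 0.0) for _ in range(200)])
c_t = centroid([toy_chunk(True, 1.0) for _ in range(200)])

# ... and test it on chunks "translated from a related L2",
# whose interference signal is assumed slightly weaker.
test = [(toy_chunk(False, 0.0), 0) for _ in range(100)] + \
       [(toy_chunk(True, 0.7), 1) for _ in range(100)]
acc = sum(classify(c_o, c_t, x) == y for x, y in test) / len(test)
```

The closer the L2 signal is to the L1 signal the classifier was trained on, the higher the cross-classification accuracy, mirroring the pattern in the table above.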


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist           Dist_w
Feature               EN T    FR T   EN T    FR T
POS-trigrams          1.02    1.16   1.07    1.34
positional tokens     1.10    1.15   1.30    1.37
function words        1.32    1.82   1.43    1.90
cohesive markers      1.95    1.95   2.15    2.03
feature combination   1.00    1.46   1.13    1.58
random tree           2.00           1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR   P (%)  R (%)  F1
ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4


Conclusion


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models trained on texts translated in the same direction as the SMT task are better than ones trained on texts translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016).

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 08 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255-265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799-825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999-1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557-570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634-639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223-231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279-287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937-944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.


Supervised Classification Explicitation

ANALYSIS: EXPLICITATION

Explicit naming, single naming, and mean multiple naming do not exceed 58% accuracy.

On the other hand, the classifier that uses 40 cohesive markers (Blum-Kulka, 1986; Koppel and Ordan, 2011) achieves 81% accuracy.

Such cohesive markers are far more frequent in T than in O.

For example, moreover is used 17.5 times more frequently in T than in O, thus 4 times more frequently, and besides 3.8 times.


Supervised Classification Normalization

RESULTS: NORMALIZATION

Category        Feature        Accuracy (%)
Normalization   Repetitions    55
                Contractions   50
                Average PMI    52
                Threshold PMI  66


ANALYSIS: NORMALIZATION

None of these features performs very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny, 2001).

Original English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
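The two PMI features can be sketched on a toy corpus. The sentence and the threshold of 0 are arbitrary choices for the demo; the actual experiments compute PMI over large corpora.

```python
import math
from collections import Counter

tokens = "we must stand firm and stand trial while they stand idly by".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information of a bigram (Church and Hanks, 1990)."""
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x = unigrams[w1] / n_uni
    p_y = unigrams[w2] / n_uni
    return math.log2(p_xy / (p_x * p_y))

# The slide's two features: average PMI over all bigrams, and the number
# of bigram types whose PMI exceeds a threshold (0 here, arbitrarily).
avg_pmi = sum(pmi(*b) for b in bigrams) / len(bigrams)
above_threshold = sum(1 for b in bigrams if pmi(*b) > 0)
```

Highly associated pairs such as stand firm score well above the threshold, which is what makes the count of high-PMI bigrams a useful O-vs-T signal.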


NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')


NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification Interference

RESULTS: INTERFERENCE

Category       Feature                     Accuracy (%)
Interference   POS unigrams                90
               POS bigrams                 97
               POS trigrams                98
               Character unigrams          85
               Character bigrams           98
               Character trigrams          100
               Prefixes and suffixes       80
               Contextual function words   100
               Positional token frequency  97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations.

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams yield an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
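Extracting POS-trigram features and counting the MD+VB+VBN pattern can be sketched as follows. The mini tag lexicon is invented for the demo; a real system would run a POS tagger over the text.

```python
from collections import Counter

# Hypothetical mini lexicon (word -> Penn Treebank tag), for the demo only.
LEXICON = {
    "measures": "NNS", "must": "MD", "be": "VB", "taken": "VBN",
    "and": "CC", "support": "NN", "should": "MD", "given": "VBN",
}

def pos_trigrams(words):
    # Map words to tags, then count consecutive tag triples.
    tags = [LEXICON.get(w.lower(), "NN") for w in words]
    return Counter(zip(tags, tags[1:], tags[2:]))

feats = pos_trigrams("Measures must be taken and support should be given".split())
# The modal + base verb + past participle pattern ("must be taken"):
md_vb_vbn = feats[("MD", "VB", "VBN")]
```

Such counts, normalized by chunk length, are the POS-trigram feature vectors fed to the classifier.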


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.


ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                     Accuracy (%)
Interference   POS bigrams                 96
               POS trigrams                96
               Character bigrams           95
               Character trigrams          96
               Positional token frequency  93


Supervised Classification Miscellaneous

RESULTS: MISCELLANEOUS

Category        Feature                              Accuracy (%)
Miscellaneous   Function words                       96
                Punctuation (1)                      81
                Punctuation (2)                      85
                Punctuation (3)                      80
                Pronouns                             77
                Ratio of passive forms to all verbs  65


ANALYSIS: MISCELLANEOUS

The function-words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq O   Freq T   Ratio   LL   Weight
,      37.83    49.79    0.76    T    T1
(       0.42     0.72    0.58    T    T2
'       1.94     2.53    0.77    T    T3
)       0.40     0.72    0.56    T    T4
        0.30     0.30    1.00    —    —
[       0.01     0.02    0.45    T    —
]       0.01     0.02    0.44    T    —
"       0.33     0.22    1.46    O    O7
        0.22     0.17    1.25    O    O6
.      38.20    34.60    1.10    O    O5
        1.20     1.17    1.03    —    O4
        0.84     0.83    1.01    —    O3
        1.57     1.11    1.41    O    O2
-       2.68     2.25    1.19    O    O1


Unsupervised Classification


Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009.

Literary classics written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.


SUPERVISED CLASSIFICATION

feature \ corpus   EUR   HAN   LIT   TED
FW                 96.3  98.1  97.3  97.7
char-trigrams      98.8  97.1  99.5  100.0
POS-trigrams       98.5  97.2  98.7  92.0
contextual FW      95.2  96.8  94.1  86.3
cohesive markers   83.6  86.9  78.6  81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation
EUR            —     60.8  56.2  96.3
HAN            59.7  —     58.7  98.1
LIT            64.3  61.5  —     97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR   HAN   LIT   X-validation
EUR + HAN      —     —     63.8  94.0
EUR + LIT      —     64.1  —     92.9
HAN + LIT      59.8  —     —     96.0


Unsupervised Classification Unsupervised Classification

CLUSTERING: ASSUMING GOLD LABELS

feature \ corpus   EUR   HAN   LIT   TED
FW                 88.6  88.9  78.8  87.5
char-trigrams      72.1  63.8  70.3  78.6
POS-trigrams       96.9  76.0  70.7  76.1
contextual FW      92.9  93.2  68.2  67.0
cohesive markers   63.1  81.2  67.1  63.0


Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε · |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε · |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = √(JSD(PX ‖ PCi))


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
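The majority voting over labelings produced by different feature sets can be sketched as follows; the feature-set names and per-chunk labels are illustrative.

```python
from collections import Counter

# Hypothetical O/T labels assigned to four chunks by three feature sets.
votes = {
    "FW":            ["O", "T", "T", "O"],
    "char-trigrams": ["O", "T", "O", "O"],
    "POS-trigrams":  ["O", "O", "T", "O"],
}

# Per-chunk consensus: the majority label across feature sets.
consensus = [Counter(col).most_common(1)[0][0]
             for col in zip(*votes.values())]
```

With an odd number of feature sets there are no ties, and perfect labeling precision means each voter's "O"/"T" names refer to the same underlying classes.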


Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                      EUR   HAN   LIT   TED
FW                                                   88.6  88.9  78.8  87.5
FW + char-trigrams + POS-trigrams                    91.1  86.2  78.2  90.9
FW + POS-trigrams + contextual FW                    95.8  89.8  72.3  86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers                 94.1  91.0  79.2  88.6


Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

                                EUR+   EUR+   EUR+HAN+  HAN+
method \ corpus                 HAN    LIT    LIT       LIT
KMeans: accuracy by domain      93.7   99.5   99.8      92.2
XMeans: generated # of clusters 2      2      3         3
XMeans: accuracy by domain      93.6   99.5   —         92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


CLUSTERING IN A MIXED-DOMAIN SETUP

                  EUR+   EUR+   EUR+HAN+  HAN+
method \ corpus   HAN    LIT    LIT       LIT
Flat              92.5   60.7   77.5      66.8
Two-phase         91.3   79.4   85.3      67.5


Applications for Machine Translation


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:

"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 65 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f_1 \cdots f_m into an English sentence E = e_1 \cdots e_l. The best translation:

\hat{E} = \arg\max_E P(E \mid F)
        = \arg\max_E \frac{P(F \mid E) \times P(E)}{P(F)}
        = \arg\max_E P(F \mid E) \times P(E)

The noisy channel thus requires two components, a translation model and a language model:

\hat{E} = \arg\max_{E \in \mathrm{English}} \underbrace{P(F \mid E)}_{\text{Translation model}} \times \underbrace{P(E)}_{\text{Language model}}
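The decision rule can be illustrated with toy, hand-assigned probabilities (the candidate list and all numbers below are invented; a real decoder searches a huge hypothesis space rather than scoring a fixed list):

```python
import math

# invented translation-model and language-model probabilities for three
# candidate translations E of one foreign sentence F
tm = {"the house is small": 0.20,   # P(F | E)
      "the home is small": 0.25,
      "house small is the": 0.30}
lm = {"the house is small": 0.10,   # P(E)
      "the home is small": 0.04,
      "house small is the": 0.001}

def best_translation(candidates):
    # E* = argmax_E P(F | E) * P(E), computed in log space
    return max(candidates, key=lambda e: math.log(tm[e]) + math.log(lm[e]))

best = best_translation(list(tm))
```

The fluent candidate wins even though its translation-model score is not the highest: the language model vetoes the scrambled word order.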

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 66 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)
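BLEU's core ingredient is modified (clipped) n-gram precision. A minimal unigram sketch on the classic degenerate example (full BLEU combines precisions for n = 1..4 with a brevity penalty):

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

# a degenerate candidate is not rewarded: precision is 2/7, not 7/7
p = clipped_precision("the the the the the the the", "the cat is on the mat")
```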

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 67 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 68 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w_1 w_2 \cdots w_N) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P_{LM}(w_i \mid w_1, \ldots, w_{i-1})}}
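Equivalently, perplexity is the exponentiated average negative log-probability, which is how it is computed in practice to avoid underflow. A minimal sketch with a hypothetical uniform unigram model, whose perplexity equals the vocabulary size:

```python
import math

def perplexity(logprob_fn, words):
    """PP = exp( -(1/N) * sum_i log P(w_i | w_1 .. w_{i-1}) )."""
    N = len(words)
    total = sum(logprob_fn(words[:i], w) for i, w in enumerate(words))
    return math.exp(-total / N)

# uniform model over a 4-word vocabulary, ignoring the history
uniform = lambda history, w: math.log(1 / 4)
pp = perplexity(uniform, "the house is small".split())  # pp == 4.0
```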

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al., 2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 69 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram

Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 70 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram

Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 71 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram

Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 72 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram

Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 73 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 74 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang.   Perplexity   Improvement (%)

MIX            105.91        19.73
O-EN           131.94          –
T-DE           122.50         7.16
T-FR            99.52        24.58
T-IT           112.71        14.58
T-NL           126.44         4.17

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 75 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang.   Perplexity   Improvement (%)

MIX             93.88        18.51
O-EN           115.20          –
T-DE           107.48         6.70
T-FR            88.96        22.77
T-IT            99.17        13.91
T-NL           110.72         3.89

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 76 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang.   Perplexity   Improvement (%)

MIX             36.02        11.34
O-EN            40.62          –
T-DE            38.67         4.81
T-FR            34.75        14.46
T-IT            36.85         9.30
T-NL            39.44         2.91

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 77 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang.   Perplexity   Improvement (%)

MIX              7.99         2.66
O-EN             8.20          –
T-DE             8.08         1.47
T-FR             7.89         3.77
T-IT             8.00         2.47
T-NL             8.11         1.11

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 78 112

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores by language model:

LM     DE→EN   FR→EN   IT→EN   NL→EN

MIX    21.43   28.67   25.41   24.20
O-EN   21.10   27.98   24.69   23.40
T-DE   21.90   28.01   24.62   24.26
T-FR   21.16   29.14   25.37   23.56
T-IT   21.29   28.75   25.96   23.87
T-NL   21.20   28.11   24.77   24.52

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 79 112

Applications for Machine Translation Language Models

DOES SIZE MATTER?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 80 112

Applications for Machine Translation Language Models

DOES SIZE MATTER?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 81 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage

O: Finally which is serious in the report of Mr Olivier is that it proposes a constitution tripotage

T: Finally and this is serious in the report by Mr olivier it is because it proposes a constitution tripotage

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen

O: Even when it is something of valuable which has been pointed out by all the members of the European Council

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 82 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 83 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 84 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of the translated text?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 85 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S

FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 86 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset   S→T     T→S

250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 87 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset   S→T     T→S

250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 88 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to MOSES
Phrase table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
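The interpolation technique can be sketched as mixing the two phrase tables with a weight λ chosen to minimize perplexity on held-out data. All phrase pairs and probabilities below are invented, and the grid search stands in for the optimization procedure of Sennrich (2012):

```python
import math

# invented phrase probabilities estimated from the S→T and T→S subsets
pt_st = {"maison|house": 0.7, "maison|home": 0.3}
pt_ts = {"maison|house": 0.4, "maison|home": 0.6}
dev = ["maison|house", "maison|house", "maison|home"]  # held-out phrase pairs

def dev_perplexity(lam):
    # perplexity of the interpolated table lam*pt_st + (1-lam)*pt_ts on dev
    logs = [math.log(lam * pt_st[p] + (1 - lam) * pt_ts[p]) for p in dev]
    return math.exp(-sum(logs) / len(logs))

# grid search for the perplexity-minimizing interpolation weight
best_lam = min((l / 100 for l in range(101)), key=dev_perplexity)
```

The chosen weight leans toward the table that better predicts the held-out data, which is exactly what lets the mixture adapt to the direction of translation.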

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 89 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System     MIX     MIX-EO  MIX-FO

S→T        35.21   35.21   35.73
UNION      35.27   35.36   35.94
PPLMIN-1   35.46   35.59   36.26
PPLMIN-2   35.75   35.65   36.20
CrEnt      35.54   35.45   36.75
PplRatio   35.59   35.78   36.22

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 90 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System     MIX     MIX-FO  MIX-EO

S→T        29.15   29.15   29.94
UNION      29.27   29.44   30.01
PPLMIN-1   29.64   29.94   29.65
PPLMIN-2   29.50   30.45   30.12
CrEnt      29.47   29.45   30.44
PplRatio   29.65   29.62   30.34

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 91 112

The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 92 112

The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 93 112

The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2

(Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1 TEST ON L2)

      IT   FR   ES   DE   FI

IT    99   93   91   75   63
FR    90   98   90   83   70
ES    85   90   98   82   69
DE    73   74   74   98   70
FI    58   70   64   81   99

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 94 112

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments
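A minimal sketch of that representation: each translated text becomes a vector of POS-trigram relative frequencies, and pairwise distances between the vectors are what the clustering algorithm consumes. The POS streams below are invented; the intuition is that translations from closer source languages yield closer vectors:

```python
import math
from collections import Counter
from itertools import combinations

def pos_trigram_vector(tags):
    # relative frequencies of POS trigrams (the feature vectors above)
    counts = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

def cosine_distance(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return 1.0 - dot / (norm(u) * norm(v))

# invented POS streams for English translated from three source languages
texts = {"T-FR": "DT NN VBZ DT JJ NN IN DT NN".split(),
         "T-IT": "DT NN VBZ DT JJ NN IN DT JJ NN".split(),
         "T-FI": "NN VBZ JJ NN NN VBZ JJ".split()}
vecs = {k: pos_trigram_vector(v) for k, v in texts.items()}
dist = {pair: cosine_distance(vecs[pair[0]], vecs[pair[1]])
        for pair in combinations(sorted(texts), 2)}
```

Here the T-FR and T-IT streams share most of their trigrams while T-FI shares none, so the Romance pair ends up much closer; hierarchical clustering over such a distance matrix is what produces the typology trees.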

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 95 112

The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 96 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 97 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 98 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                          Dist            Distw
Feature               EN T    FR T    EN T    FR T

POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree               2.00            1.95

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 99 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al., 2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 100 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR P () R () F1

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)  F1

ARA    80    0    2    1    3    4    1    0    4    2    3    80.8   80.0   80.4
CHI     3   80    0    1    1    0    6    7    1    0    1    88.9   80.0   84.2
FRE     2    2   81    5    1    2    1    0    3    0    3    86.2   81.0   83.5
GER     1    1    1   93    0    0    0    1    1    0    2    87.7   93.0   90.3
HIN     2    0    0    1   77    1    0    1    5    9    4    74.8   77.0   75.9
ITA     2    0    3    1    1   87    1    0    3    0    2    82.1   87.0   84.5
JPN     2    1    1    2    0    1   87    5    0    0    1    78.4   87.0   82.5
KOR     1    5    2    0    1    0    9   81    1    0    0    80.2   81.0   80.6
SPA     2    0    2    0    1    8    2    1   78    1    5    77.2   78.0   77.6
TEL     0    1    0    0   18    1    2    1    1   73    3    85.9   73.0   78.9
TUR     4    0    2    2    0    2    2    4    4    0   80    76.9   80.0   78.4

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 101 112

Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 102 112

Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 103 112

Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 104 112

Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 105 112

Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 106 112

Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 107 112

Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. doi: 10.1093/llc/fqu047. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 108 112

Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 08 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 109 112

Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 110 112

Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 111 112

Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 112 112


Supervised Classification Normalization

RESULTS NORMALIZATION

Category        Feature         Accuracy (%)

Normalization   Repetitions        55
                Contractions       50
                Average PMI        52
                Threshold PMI      66

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 34 112

Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features performs very well

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better at 66% accuracy, but contrary to the hypothesis (Kenny, 2001)

Original English has far more highly associated bigrams than translations: O has "stand idly", "stand firm", "stand trial", etc.; conversely, T includes poorly associated bigrams such as "stand unamended"
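The underlying PMI computation can be sketched on a toy corpus (the corpus and the threshold below are illustrative; the experiments estimate PMI over a large corpus and count, per chunk, the bigrams exceeding a threshold):

```python
import math
from collections import Counter

def pmi_table(tokens):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) over adjacent word pairs."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N, B = len(tokens), len(tokens) - 1
    return {(x, y): math.log2((n / B) / ((uni[x] / N) * (uni[y] / N)))
            for (x, y), n in bi.items()}

# toy corpus: "stand trial" always co-occurs, so its PMI is high
corpus = "he will stand trial she will stand trial they run far".split()
pmi = pmi_table(corpus)
above_threshold = [bg for bg, v in pmi.items() if v > 1.0]  # illustrative cutoff
```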

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 35 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE A THRESHOLD, ACCORDING TO 'LANGUAGE')

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 36 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 37 112

Supervised Classification Interference

RESULTS INTERFERENCE

Category      Feature                      Accuracy (%)

Interference  POS unigrams                     90
              POS bigrams                      97
              POS trigrams                     98
              Character unigrams               85
              Character bigrams                98
              Character trigrams              100
              Prefixes and suffixes            80
              Contextual function words       100
              Positional token frequency       97

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 38 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they pick up both affixes and function words

For example, typical trigrams in O are "-ion" and " all", whereas trigrams typical of T are "-ble" and " the"

Part-of-speech trigrams give an extremely cheap and efficient classifier

For example, MD+VB+VBN, e.g., "must be taken", "should be given", "can be used"
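Extracting such a feature is indeed cheap; a minimal sketch counting one POS trigram over an invented tagged stream, normalized per 1000 tokens:

```python
from collections import Counter

def pos_trigram_rate(tags, pattern):
    """Frequency per 1000 tokens of one POS trigram, e.g. the MD+VB+VBN
    pattern behind 'must be taken' / 'can be used'."""
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    return 1000 * trigrams[pattern] / len(tags)

# invented tag stream for a translated chunk containing two MD+VB+VBN hits
tags = "PRP MD VB VBN IN DT NN CC PRP MD VB VBN".split()
rate = pos_trigram_rate(tags, ("MD", "VB", "VBN"))
```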

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 39 112

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 40 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001)

This is, in fact, standardization rather than interference

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 41 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 42 112

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                      Accuracy (%)

Interference   POS bigrams                  96
               POS trigrams                 96
               Character bigrams            95
               Character trigrams           96
               Positional token frequency   93

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 43 112

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category        Feature                               Accuracy (%)

Miscellaneous   Function words                        96
                Punctuation (1)                       81
                Punctuation (2)                       85
                Punctuation (3)                       80
                Pronouns                              77
                Ratio of passive forms to all verbs   65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 44 112

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically

Subject pronouns like I, he and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T

T has about 1.15 times more passive verbs, but this is highly dependent on the source language

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T

In fact, these two features alone yield 79% accuracy

Parentheses are very typical of T, indicating explicitation

Translations from German use many more exclamation marks: 2.76 times more than original English

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 45 112

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   Ratio   LL   Weight
,       37.83     49.79     0.76    T    T1
(        0.42      0.72     0.58    T    T2
'        1.94      2.53     0.77    T    T3
)        0.40      0.72     0.56    T    T4
         0.30      0.30     1.00    —    —
[        0.01      0.02     0.45    T    —
]        0.01      0.02     0.44    T    —
"        0.33      0.22     1.46    O    O7
         0.22      0.17     1.25    O    O6
.       38.20     34.60     1.10    O    O5
         1.20      1.17     1.03    —    O4
         0.84      0.83     1.01    —    O3
         1.57      1.11     1.41    O    O2
-        2.68      2.25     1.19    O    O1

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 46 112

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 47 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 48 112

Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

the Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

literary classics written (or translated) mainly in the 19th century

transcripts of TED and TEDx talks

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 49 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature / corpus    EUR    HAN    LIT    TED

FW                  96.3   98.1   97.3    97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7    92.0
contextual FW       95.2   96.8   94.1    86.3
cohesive markers    83.6   86.9   78.6    81.8

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 50 112

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR             —     60.8   56.2   96.3
HAN            59.7    —     58.7   98.1
LIT            64.3   61.5    —     97.3

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 51 112

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train          test on held-out domain   X-validation

EUR + HAN      63.8 (LIT)                94.0
EUR + LIT      64.1 (HAN)                92.9
HAN + LIT      59.8 (EUR)                96.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 52 112

Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature / corpus    EUR    HAN    LIT    TED

FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 53 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)        p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin 1991):

DJS(X, Ci) = √JSD(PX ‖ PCi)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 54 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus

Labeling precision is 100% in all clustering experiments

This facilitates majority voting over several feature sets
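The labeling step above can be sketched in a few lines. This is a toy illustration under stated assumptions: the marker lists and cluster contents are made up, and `unigram_lm`, `jsd`, and `label_clusters` are hypothetical helper names, not the experiments' actual code.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=1e-3):
    # Add-epsilon unigram LM, mirroring p(w | Om) = (tf(w)+eps) / (|Om| + eps*|V|).
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    # Jensen-Shannon divergence (Lin 1991) between two distributions
    # defined over the same vocabulary.
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a: sum(a[w] * math.log2(a[w] / m[w]) for w in a)
    return 0.5 * kl(p) + 0.5 * kl(q)

def label_clusters(p_o, p_t, p_c1, p_c2):
    # Assign "O"/"T" so that each cluster sits close to its assigned class,
    # using the sqrt-of-JSD distance and the product criterion from the slides.
    d = lambda x, c: math.sqrt(jsd(x, c))
    if d(p_o, p_c1) * d(p_t, p_c2) < d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"

vocab = ["but", "which", "the", "of"]
p_o = unigram_lm(["but"] * 5 + ["which"], vocab)     # toy O-marker model
p_t = unigram_lm(["the"] * 5 + ["of"], vocab)        # toy T-marker model
c1 = unigram_lm(["but"] * 4 + ["which"] * 2, vocab)  # cluster resembling O
c2 = unigram_lm(["the"] * 4 + ["of"] * 2, vocab)     # cluster resembling T
labels = label_clusters(p_o, p_t, c1, c2)
```

The cluster whose marker distribution is closer to the O-marker model receives the label "O", and the other cluster the complementary label.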

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 55 112

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method / corpus                             EUR    HAN    LIT    TED

FW                                          88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams           91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW           95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams
  + contextual FW + cohesive markers        94.1   91.0   79.2   88.6
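Since cluster labeling is 100% precise, the O/T labels produced by different feature sets are directly comparable, and consensus reduces to a per-chunk majority vote. A minimal sketch (the function name and the example label lists are illustrative assumptions):

```python
from collections import Counter

def consensus(label_sets):
    # label_sets: one list of "O"/"T" labels per feature set, aligned by chunk.
    # Returns the majority label for each chunk.
    return [Counter(chunk_labels).most_common(1)[0][0]
            for chunk_labels in zip(*label_sets)]

votes = consensus([
    ["O", "T", "O", "T"],   # e.g. labels from function-word clustering
    ["O", "T", "T", "T"],   # e.g. labels from char-trigram clustering
    ["O", "O", "O", "T"],   # e.g. labels from POS-trigram clustering
])
```

With an odd number of feature sets there are no ties, so the vote is always decisive.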

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 56 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 57 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 58 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains

Hypothesis: domain-specific features overshadow the features of translationese

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese)

Adding the literature corpus and clustering to three clusters yields the same results

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 59 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method / corpora                   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain          93.7      99.5       99.8         92.2
XMeans: generated # of clusters     2         2          3            3
XMeans: accuracy by domain          93.6      99.5       –            92.2

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 60 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts as O vs. T, independently of their domain

A two-phase approach

A flat approach

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 61 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method / corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat                92.5      60.7       77.5         66.8
Two-phase           91.3      79.4       85.3         67.5

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 62 112

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 63 112

Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 64 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode."

Warren Weaver, 1955

The noisy channel model

The best translation T̂ of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 65 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 ⋯ fm into an English sentence E = e1 ⋯ el. The best translation:

Ê = argmax_E P(E | F)
  = argmax_E P(F | E) × P(E) / P(F)
  = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

Ê = argmax_{E ∈ English} P(F | E) × P(E)
    (translation model: P(F | E); language model: P(E))
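The final argmax can be illustrated with a toy candidate set. All probabilities below are made-up numbers for illustration, not estimates from any corpus, and real decoders search over phrase derivations rather than whole-sentence tables.

```python
import math

def best_translation(f, candidates, tm, lm):
    # argmax over E of P(F | E) * P(E), scored in log space for
    # numerical stability (products of small probabilities underflow).
    return max(candidates,
               key=lambda e: math.log(tm[(f, e)]) + math.log(lm[e]))

tm = {("maison", "house"): 0.8, ("maison", "home"): 0.6}  # toy P(F | E)
lm = {"house": 0.3, "home": 0.1}                          # toy P(E)
best = best_translation("maison", ["house", "home"], tm, lm)
```

Here the language model breaks the near-tie between the two candidates in favor of the more fluent (more probable) English side.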

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 66 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al 2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 67 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 68 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / PLM(wi | w1 … wi−1) )^(1/N)
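In log space the definition reduces to exponentiating the negative average log-likelihood. The sketch below assumes a hypothetical `logprob_fn` interface (history, next word → natural-log probability); it is an illustration of the formula, not the evaluation code used in the experiments.

```python
import math

def perplexity(logprob_fn, tokens):
    # PP = (prod_i 1/P(w_i | w_1..w_{i-1}))^(1/N),
    # computed via the average log-likelihood to avoid underflow.
    n = len(tokens)
    log_likelihood = sum(logprob_fn(tokens[:i], w) for i, w in enumerate(tokens))
    return math.exp(-log_likelihood / n)

# Sanity check: a uniform model over a 4-word vocabulary has perplexity 4.
uniform = lambda history, w: math.log(1 / 4)
pp = perplexity(uniform, ["a", "b", "a", "c"])
```

Lower perplexity on held-out reference text means the LM fits that text better, which is exactly the comparison made in the tables that follow.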

Train SMT systems (Koehn et al 2007) using different LMs and evaluate their quality on a reference set

Quality is measured in terms of BLEU scores (Papineni et al 2002)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 69 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German-to-English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 70 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 71 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 72 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 73 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 74 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       —
T-DE         122.50        7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44        4.17

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 75 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20       —
T-DE         107.48        6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72        3.89

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 76 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        —
T-DE         38.67         4.81
T-FR         34.75        14.46
T-IT         36.85         9.30
T-NL         39.44         2.91

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 77 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         —
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 78 112

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

BLEU scores by language model:

LM      DE→EN   FR→EN   IT→EN   NL→EN
MIX     21.43   28.67   25.41   24.20
O-EN    21.10   27.98   24.69   23.40
T-DE    21.90   28.01   24.62   24.26
T-FR    21.16   29.14   25.37   23.56
T-IT    21.29   28.75   25.96   23.87
T-NL    21.20   28.11   24.77   24.52

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 79 112

Applications for Machine Translation Language Models

DOES SIZE MATTER?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 80 112

Applications for Machine Translation Language Models

DOES SIZE MATTER?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 81 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 82 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 83 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 84 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 85 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 86 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 87 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 88 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase table interpolation, using perplexity minimization (Sennrich 2012)
Add a feature to the phrase table indicating the direction of translation
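The interpolation idea can be sketched with a toy weight search. Note the simplifications: Sennrich (2012) fits interpolation weights with EM over full phrase tables, whereas this sketch does a plain grid search over a single weight, and the tables and dev set below are made-up values.

```python
import math

def interpolate(pt1, pt2, lam):
    # Linear interpolation of two phrase tables mapping (source, target) -> prob.
    keys = set(pt1) | set(pt2)
    return {k: lam * pt1.get(k, 0.0) + (1 - lam) * pt2.get(k, 0.0) for k in keys}

def best_weight(pt1, pt2, dev_pairs, grid=9):
    # Choose the weight whose interpolated table gives the dev phrase pairs
    # the lowest perplexity (i.e. the highest average log-probability).
    def ppl(pt):
        return math.exp(-sum(math.log(pt[p]) for p in dev_pairs) / len(dev_pairs))
    return min((l / (grid + 1) for l in range(1, grid + 1)),
               key=lambda lam: ppl(interpolate(pt1, pt2, lam)))

pt_direct = {("maison", "house"): 0.9}   # toy table from S->T-translated bi-text
pt_reverse = {("maison", "house"): 0.1}  # toy table from T->S-translated bi-text
lam = best_weight(pt_direct, pt_reverse, [("maison", "house")])
```

When the dev set matches the translation direction of the first table, the search pushes the weight toward that table, which is the adaptation effect the results tables below quantify.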

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 89 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System      MIX     MIX-EO   MIX-FO

S→T         35.21   35.21    35.73
UNION       35.27   35.36    35.94
PPLMIN-1    35.46   35.59    36.26
PPLMIN-2    35.75   35.65    36.20
CrEnt       35.54   35.45    36.75
PplRatio    35.59   35.78    36.22

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 90 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System      MIX     MIX-FO   MIX-EO

S→T         29.15   29.15    29.94
UNION       29.27   29.44    30.01
PPLMIN-1    29.64   29.94    29.65
PPLMIN-2    29.50   30.45    30.12
CrEnt       29.47   29.45    30.44
PplRatio    29.65   29.62    30.34

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 91 112

The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 92 112

The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2

Translations from two languages are more similar to each other when the two source languages are closer

The native language of ESL speakers can be accurately identified by looking at their English texts

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 93 112

The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan 2011)

EXAMPLE (TRAIN ON L1, TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 94 112

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%)

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages

Texts were represented using feature vectors, similarly to the supervised experiments

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 95 112

The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 96 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 97 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS, ENGLISH (LEFT) AND FRENCH (RIGHT))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 98 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST THE GOLD TREE)

                      Dist              Distw
Feature               EN T    FR T      EN T    FR T
POS-trigrams          1.02    1.16      1.07    1.34
positional tokens     1.10    1.15      1.30    1.37
function words        1.32    1.82      1.43    1.90
cohesive markers      1.95    1.95      2.15    2.03
feature combination   1.00    1.46      1.13    1.58
random tree           2.00              1.95

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 99 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages)

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 100 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

       ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)   R (%)   F1

ARA     80    0    2    1    3    4    1    0    4    2    3    80.8    80.0    80.4
CHI      3   80    0    1    1    0    6    7    1    0    1    88.9    80.0    84.2
FRE      2    2   81    5    1    2    1    0    3    0    3    86.2    81.0    83.5
GER      1    1    1   93    0    0    0    1    1    0    2    87.7    93.0    90.3
HIN      2    0    0    1   77    1    0    1    5    9    4    74.8    77.0    75.9
ITA      2    0    3    1    1   87    1    0    3    0    2    82.1    87.0    84.5
JPN      2    1    1    2    0    1   87    5    0    0    1    78.4    87.0    82.5
KOR      1    5    2    0    1    0    9   81    1    0    0    80.2    81.0    80.6
SPA      2    0    2    0    1    8    2    1   78    1    5    77.2    78.0    77.6
TEL      0    1    0    0   18    1    2    1    1   73    3    85.9    73.0    78.9
TUR      4    0    2    2    0    2    2    4    4    0   80    76.9    80.0    78.4

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 101 112

Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 102 112

Conclusion

CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages

The best performing features are those that attest to the 'fingerprints' of the source on the target

Interference, a pair-specific phenomenon, dominates other manifestations of translationese

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 103 112

Conclusion

CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts

Translation models built from corpora translated in the same direction as that of the SMT task are better than ones translated in the reverse direction

Translation models can be adapted to translationese, thereby improving the quality of SMT

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 104 112

Conclusion

FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 105 112

Conclusion

ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 106 112

Conclusion

MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419-432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98-118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999-1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799-825, December 2012.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 107 112

Conclusion

BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289-295, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30-54, April 2016.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68-78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17-35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399-416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119-139. Longman, 1983.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 108 112

Conclusion

BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33-50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159-164. Association for Computational Linguistics, October 2014.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441-447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 109 112

Conclusion

BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79-86. AAMT, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81-88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557-570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 110 112

Conclusion

BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115. URL http://dx.doi.org/10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 111 112

Conclusion

BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 112 112


Supervised Classification Normalization

ANALYSIS NORMALIZATION

None of these features perform very well.

Repetitions and contractions are rare in EUROPARL; the corpus may not be suited for studying these phenomena.

Threshold PMI, designed to pick up highly associated words and therefore attesting to many fixed expressions, performs considerably better, at 66% accuracy, but contrary to the hypothesis (Kenny, 2001):

English has far more highly associated bigrams than translations: O has stand idly, stand firm, stand trial, etc.; conversely, T includes poorly associated bigrams such as stand unamended.
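The PMI-threshold feature can be sketched as follows (a toy illustration, not the experimental code; the corpus, the threshold value, and the function name are invented for the example):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, threshold=0.0):
    """Return the bigrams whose pointwise mutual information exceeds a threshold.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scored[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return {bg: s for bg, s in scored.items() if s > threshold}

# A fixed expression ("stand trial") repeated in the text yields a high-PMI bigram.
text = ("he must stand trial said the court he must stand trial again "
        "the court said he must not run").split()
high = pmi_bigrams(text, threshold=1.0)
print(("stand", "trial") in high)  # True
```

Counting how many bigrams clear the threshold, separately in O and in T, gives the feature value used by the classifier.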

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 35 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE THRESHOLD, ACCORDING TO 'LANGUAGE')

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 36 112

Supervised Classification Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 37 112

Supervised Classification Interference

RESULTS INTERFERENCE

Category Feature Accuracy ()

Interference

POS unigrams               90
POS bigrams                97
POS trigrams               98
Character unigrams         85
Character bigrams          98
Character trigrams         100
Prefixes and suffixes      80
Contextual function words  100
Positional token frequency 97

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 38 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams is an extremely cheap and efficient classifier.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.
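A minimal sketch of classification by POS-trigram frequencies (the actual experiments used SVM classifiers over much larger feature sets; the nearest-centroid rule, the toy tag sequences, and the function names here are illustrative only):

```python
from collections import Counter

def pos_trigram_vector(tags):
    """Relative frequencies of POS trigrams in a tagged chunk."""
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(tri.values()) or 1
    return {t: c / total for t, c in tri.items()}

def centroid(vectors):
    keys = {k for v in vectors for k in v}
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}

def dot(u, v):
    return sum(u[k] * v.get(k, 0.0) for k in u)

def classify(vec, cent_o, cent_t):
    """Assign the class whose centroid the chunk's vector overlaps most."""
    return "O" if dot(vec, cent_o) >= dot(vec, cent_t) else "T"

# Toy training data: the T chunks overuse the MD+VB+VBN pattern ("must be taken").
o_train = [["PRP", "VBD", "DT", "NN"], ["NNP", "VBZ", "DT", "JJ", "NN"]]
t_train = [["NN", "MD", "VB", "VBN"], ["DT", "NN", "MD", "VB", "VBN", "RB"]]
cent_o = centroid([pos_trigram_vector(x) for x in o_train])
cent_t = centroid([pos_trigram_vector(x) for x in t_train])

label = classify(pos_trigram_vector(["DT", "NN", "MD", "VB", "VBN"]), cent_o, cent_t)
print(label)  # "T": the chunk contains the modal-passive trigram typical of T
```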

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 39 112

Supervised Classification Interference

INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN TS)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 40 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is, in fact, standardization rather than interference.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 41 112

Supervised Classification Interference

ANALYSIS INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 42 112

Supervised Classification Interference

RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category Feature Accuracy

Interference

POS bigrams                96
POS trigrams               96
Character bigrams          95
Character trigrams         96
Positional token frequency 93

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 43 112

Supervised Classification Miscellaneous

RESULTS MISCELLANEOUS

Category Feature Accuracy ()

Miscellaneous

Function words                       96
Punctuation (1)                      81
Punctuation (2)                      85
Punctuation (3)                      80
Pronouns                             77
Ratio of passive forms to all verbs  65

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 44 112

Supervised Classification Miscellaneous

ANALYSIS MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors:

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 45 112

Supervised Classification Miscellaneous

ANALYSIS PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Freq. O   Freq. T   O/T ratio   LL   Weight
,      3.783     4.979     0.76        T    T1
(      0.42      0.72      0.58        T    T2
'      1.94      2.53      0.77        T    T3
)      0.40      0.72      0.56        T    T4
(?)    0.30      0.30      1.00        —    —
[      0.01      0.02      0.45        T    —
]      0.01      0.02      0.44        T    —
"      0.33      0.22      1.46        O    O7
(?)    0.22      0.17      1.25        O    O6
.      3.820     3.460     1.10        O    O5
(?)    1.20      1.17      1.03        —    O4
(?)    0.84      0.83      1.01        —    O3
(?)    1.57      1.11      1.41        O    O2
-      2.68      2.25      1.19        O    O1

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 46 112

Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 47 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 48 112

Unsupervised Classification Motivation

DATASETS

Europarl: the proceedings of the European Parliament between the years 2001–2006.

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001–2009.

Literary classics, written (or translated) mainly in the 19th century.

Transcripts of TED and TEDx talks.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 49 112

Unsupervised Classification Motivation

SUPERVISED CLASSIFICATION

feature \ corpus   EUR    HAN    LIT    TED

FW                 96.3   98.1   97.3   97.7
char-trigrams      98.8   97.1   99.5   100.0
POS-trigrams       98.5   97.2   98.7   92.0
contextual FW      95.2   96.8   94.1   86.3
cohesive markers   83.6   86.9   78.6   81.8

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 50 112

Unsupervised Classification Motivation

PAIRWISE CROSS-DOMAIN CLASSIFICATION, USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 51 112

Unsupervised Classification Motivation

LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION, USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation

EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 52 112

Unsupervised Classification Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus   EUR    HAN    LIT    TED

FW                 88.6   88.9   78.8   87.5
char-trigrams      72.1   63.8   70.3   78.6
POS-trigrams       96.9   76.0   70.7   76.1
contextual FW      92.9   93.2   68.2   67.0
cohesive markers   63.1   81.2   67.1   63.0

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 53 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T.

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε·|V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε·|V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2 · √(JSD(PX ‖ PCi))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 54 112

Unsupervised Classification Cluster Labeling

CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"   if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"   otherwise

C2 is then assigned the complementary label

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
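The labeling procedure can be sketched as follows (a toy illustration: the marker sets, counts, and vocabulary are invented, and the distance used is √JSD — the constant factor in the formula on the previous slide does not affect the comparison):

```python
import math

def smoothed_lm(counts, vocab, eps=0.01):
    """Add-epsilon unigram model: p(w) = (tf(w) + eps) / (N + eps * |V|)."""
    n = sum(counts.values())
    denom = n + eps * len(vocab)
    return {w: (counts.get(w, 0) + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Assign 'O'/'T' so that each cluster sits close to its label's marker model."""
    dist = lambda x, c: math.sqrt(jsd(smoothed_lm(x, vocab), smoothed_lm(c, vocab)))
    if dist(o_markers, c1) * dist(t_markers, c2) < dist(o_markers, c2) * dist(t_markers, c1):
        return "O", "T"
    return "T", "O"

vocab = ["but", "however", "moreover", "also"]
o_m = {"but": 8, "also": 5}          # words over-represented in originals (toy)
t_m = {"however": 7, "moreover": 6}  # words over-represented in translations (toy)
c1 = {"however": 4, "moreover": 3, "also": 1}   # cluster dominated by T-markers
c2 = {"but": 5, "also": 4, "however": 1}        # cluster dominated by O-markers
print(label_clusters(o_m, t_m, c1, c2, vocab))  # ('T', 'O')
```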

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 55 112

Unsupervised Classification Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                        EUR    HAN    LIT    TED

FW                                                                     88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                                      91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                                      95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers   94.1   91.0   79.2   88.6

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 56 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 57 112

Unsupervised Classification Sensitivity Analysis

SENSITIVITY ANALYSIS

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 58 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 59 112

Unsupervised Classification Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

method \ corpus                   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: generated # of clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      –             92.2

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 60 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach: first separate the chunks by domain, then cluster each domain into O and T.

A flat approach: cluster all chunks into O and T directly.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 61 112

Unsupervised Classification Mixed-domain Classification

CLUSTERING IN A MIXED-DOMAIN SETUP

method \ corpus   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat              92.5      60.7      77.5          66.8
Two-phase         91.3      79.4      85.3          67.5

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 62 112

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 63 112

Applications for Machine Translation

APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 64 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation:
"When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model:

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 65 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 · · · fm into an English sentence E = e1 · · · el.

The best translation:

E* = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E ∈ English}  P(F | E)  ×  P(E)
                           (translation model)  (language model)
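The noisy-channel ranking can be illustrated with toy models (all phrases and probabilities are invented; a real decoder searches a vast candidate space rather than a fixed list):

```python
# Toy noisy-channel ranking: score candidate translations E of a foreign
# sentence F by P(F | E) * P(E). All probabilities are illustrative.
tm = {  # translation model P(F | E), hypothetical values
    ("la maison bleue", "the blue house"): 0.6,
    ("la maison bleue", "the house blue"): 0.7,
}
lm = {  # language model P(E), hypothetical values
    "the blue house": 0.01,
    "the house blue": 0.0001,
}

def best_translation(f, candidates):
    return max(candidates, key=lambda e: tm[(f, e)] * lm[e])

# The LM penalizes the disfluent word order even though its TM score is higher.
print(best_translation("la maison bleue", ["the blue house", "the house blue"]))
# -> "the blue house"
```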

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 66 112

Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

A language model, to estimate P(E) (estimated from a monolingual E corpus).

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus).

A decoder that, given F, can produce the most probable E.

Evaluation: BLEU scores (Papineni et al., 2002).

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 67 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 68 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N}  1 / P_LM(wi | w1 ... w_{i−1}) )^{1/N}

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
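Perplexity as a fitness measure can be illustrated with a toy add-ε unigram model (the experiments used higher-order n-gram models; the corpora and the smoothing constant here are invented):

```python
import math
from collections import Counter

def unigram_lm(tokens, eps=0.1):
    """Add-epsilon unigram LM trained on a token list (a stand-in for the
    higher-order n-gram models used in the experiments)."""
    counts = Counter(tokens)
    vocab = set(tokens)
    n = len(tokens)
    return lambda w: (counts.get(w, 0) + eps) / (n + eps * (len(vocab) + 1))

def perplexity(lm, tokens):
    """PP = exp(-(1/N) * sum_i log p(w_i)); lower means a better fit."""
    n = len(tokens)
    return math.exp(-sum(math.log(lm(w)) for w in tokens) / n)

train_t = "the house the council the house the report".split()
train_o = "frankly we never expected such a storm".split()
test = "the house and the council".split()
pp_t = perplexity(unigram_lm(train_t), test)
pp_o = perplexity(unigram_lm(train_o), test)
print(pp_t < pp_o)  # True: the better-matched corpus yields lower perplexity
```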

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 69 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

German to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           451.50    93.00    69.36    66.47
O-EN          468.09   103.74    79.57    76.79
T-DE          443.14    88.48    64.99    62.07
T-FR          460.98    99.90    76.23    73.38
T-IT          465.89   102.31    78.50    75.67
T-NL          457.02    97.34    73.54    70.56

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 70 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

French to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           472.05    99.04    75.60    72.68
O-EN          500.56   115.48    91.14    88.31
T-DE          486.78   108.50    84.39    81.41
T-FR          463.58    94.59    71.24    68.37
T-IT          476.05   102.69    79.23    76.36
T-NL          490.09   110.67    86.61    83.55

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 71 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Italian to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           395.99    88.46    67.35    64.40
O-EN          415.47    99.92    79.27    76.34
T-DE          404.64    95.22    73.73    70.85
T-FR          395.99    89.44    68.38    65.54
T-IT          384.55    81.90    60.85    57.91
T-NL          411.58    98.78    76.98    73.94

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 72 112

Applications for Machine Translation Language Models

PERPLEXITY RESULTS

Dutch to English translations

Orig. Lang.   1-gram   2-gram   3-gram   4-gram
Mix           434.89    90.73    69.05    66.08
O-EN          448.11   100.17    78.23    75.46
T-DE          437.68    93.67    71.54    68.57
T-FR          445.00    97.32    75.59    72.55
T-IT          448.11   100.19    78.06    75.19
T-NL          423.13    83.99    62.17    59.09

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 73 112

Applications for Machine Translation Language Models

LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 74 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

No Punctuation

Orig. Lang.   Perplexity   Improvement (%)
MIX           105.91       19.73
O-EN          131.94       —
T-DE          122.50        7.16
T-FR           99.52       24.58
T-IT          112.71       14.58
T-NL          126.44        4.17

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 75 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Named Entity Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX            93.88       18.51
O-EN          115.20       —
T-DE          107.48        6.70
T-FR           88.96       22.77
T-IT           99.17       13.91
T-NL          110.72        3.89

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 76 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Noun Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           36.02        11.34
O-EN          40.62        —
T-DE          38.67         4.81
T-FR          34.75        14.46
T-IT          36.85         9.30
T-NL          39.44         2.91

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 77 112

Applications for Machine Translation Language Models

ABSTRACTION RESULTS

Part-of-speech Abstraction

Orig. Lang.   Perplexity   Improvement (%)
MIX           7.99         2.66
O-EN          8.20         —
T-DE          8.08         1.47
T-FR          7.89         3.77
T-IT          8.00         2.47
T-NL          8.11         1.11

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 78 112

Applications for Machine Translation Language Models

MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 79 112

Applications for Machine Translation Language Models

DOES SIZE MATTER

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 80 112

Applications for Machine Translation Language Models

DOES SIZE MATTER

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 81 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du Conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 82 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 83 112

Applications for Machine Translation Language Models

TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in Great Britain, who were the first responsible.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 84 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can one build a translation model adapted to the unique properties of translated text?

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 85 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task    S → T   T → S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 86 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S → T   T → S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 87 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S → T   T → S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 88 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S → T and T → S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora.
Two phrase tables: train a phrase table for each subset and pass both to Moses.
Phrase table interpolation, using perplexity minimization (Sennrich, 2012).
Add a feature in the phrase table indicating the direction of translation.
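Perplexity-based interpolation can be sketched as a search over the interpolation weight (a toy stand-in: Sennrich (2012) optimizes the weights directly over full phrase tables, whereas here a simple grid search over an invented two-entry table illustrates the idea):

```python
import math

# Two toy "phrase tables": p(target_phrase | source_phrase), one estimated from
# the S->T subset and one from the T->S subset. All values are hypothetical.
pt_st = {("maison", "house"): 0.8, ("maison", "home"): 0.2}
pt_ts = {("maison", "house"): 0.5, ("maison", "home"): 0.5}

# In-domain development phrase pairs used to tune the interpolation weight.
dev = [("maison", "house"), ("maison", "house"), ("maison", "home")]

def interp(lmbda, pair):
    """Linear interpolation of the two tables with weight lmbda on S->T."""
    return lmbda * pt_st.get(pair, 1e-9) + (1 - lmbda) * pt_ts.get(pair, 1e-9)

def dev_perplexity(lmbda):
    h = -sum(math.log2(interp(lmbda, p)) for p in dev) / len(dev)
    return 2 ** h

# Grid search for the weight that minimizes perplexity on the dev phrases.
best = min((round(l * 0.05, 2) for l in range(21)), key=dev_perplexity)
print(best)  # 0.55 on this toy data: mostly trust the S->T table
```

The same idea extends to tuning one weight per phrase-table feature; the tuned mixture is then passed to the decoder as a single table.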

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 89 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S → T      35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 90 112

Applications for Machine Translation Translation Models

TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S → T      29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 91 112

The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 92 112

The Power of Interference

THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 93 112

The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1 TEST ON L2)

train \ test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 94 112

The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compared to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 95 112

The Power of Interference Clustering of Translations

CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 96 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 97 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 98 112

The Power of Interference Clustering of Translations

THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist            Dist_w
Feature               EN T    FR T    EN T    FR T
POS-trigrams          1.02    1.16    1.07    1.34
positional tokens     1.10    1.15    1.30    1.37
function words        1.32    1.82    1.43    1.90
cohesive markers      1.95    1.95    2.15    2.03
feature combination   1.00    1.46    1.13    1.58
random tree           2.00            1.95

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 99 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al 2013)

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 100 112

The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

      ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)  F1
ARA   80   0    2    1    3    4    1    0    4    2    3      80.8   80.0   80.4
CHI   3    80   0    1    1    0    6    7    1    0    1      88.9   80.0   84.2
FRE   2    2    81   5    1    2    1    0    3    0    3      86.2   81.0   83.5
GER   1    1    1    93   0    0    0    1    1    0    2      87.7   93.0   90.3
HIN   2    0    0    1    77   1    0    1    5    9    4      74.8   77.0   75.9
ITA   2    0    3    1    1    87   1    0    3    0    2      82.1   87.0   84.5
JPN   2    1    1    2    0    1    87   5    0    0    1      78.4   87.0   82.5
KOR   1    5    2    0    1    0    9    81   1    0    0      80.2   81.0   80.6
SPA   2    0    2    0    1    8    2    1    78   1    5      77.2   78.0   77.6
TEL   0    1    0    0    18   1    2    1    1    73   3      85.9   73.0   78.9
TUR   4    0    2    2    0    2    2    4    4    0    80     76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN 0891-2017.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012. URL http://linkinghub.elsevier.com/retrieve/pii/S0165178112000625?showall=true.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001. ISBN 9781900650397.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002. ISBN 9789042014879.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991. ISSN 0018-9448. doi: 10.1109/18.61115.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. ISBN 978-1-905593-44-6. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Supervised Classification: Normalization

NORMALIZATION

EXAMPLE (NUMBER OF BIGRAMS WHOSE PMI IS ABOVE A THRESHOLD, ACCORDING TO 'LANGUAGE')
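The statistic behind this example can be sketched directly: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), counted over adjacent word pairs. This is a minimal sketch, not the original experimental code; the helper names and the toy token stream in the usage note are invented for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    # estimated by relative frequency over the token stream.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log2((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }

def count_above(pmi, threshold):
    # Number of bigram types whose PMI exceeds the threshold,
    # the quantity plotted per 'language' in the example above.
    return sum(1 for v in pmi.values() if v > threshold)
```

For instance, `count_above(bigram_pmi(["strong", "tea", "strong", "tea", "weak", "coffee"]), 1.0)` counts the strongly associated pairs in that toy stream.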


EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification: Interference

RESULTS: INTERFERENCE

Category       Feature                     Accuracy (%)
Interference   POS unigrams                90
               POS bigrams                 97
               POS trigrams                98
               Character unigrams          85
               Character bigrams           98
               Character trigrams          100
               Prefixes and suffixes       80
               Contextual function words   100
               Positional token frequency  97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best performing ones

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas trigrams typical of T are -ble and the.

Part-of-speech trigrams are an extremely cheap and efficient classifier.

For example, MD+VB+VBN, e.g., must be taken, should be given, can be used.
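As a rough sketch of where such a classifier's features come from (the helper names and the toy tag sequences are ours; the actual experiments used full classification pipelines), each text chunk can be mapped to normalized counts of its POS trigrams over a shared vocabulary:

```python
from collections import Counter

def pos_trigram_features(tagged_chunks, top_k=300):
    # Build a trigram vocabulary from the top_k most frequent POS trigrams
    # overall, then represent each chunk by its normalized trigram counts.
    global_counts = Counter()
    per_chunk = []
    for tags in tagged_chunks:
        c = Counter(zip(tags, tags[1:], tags[2:]))
        per_chunk.append(c)
        global_counts.update(c)
    vocab = [t for t, _ in global_counts.most_common(top_k)]
    features = []
    for c in per_chunk:
        total = sum(c.values()) or 1
        features.append([c[t] / total for t in vocab])
    return vocab, features
```

The resulting vectors can then be fed to any off-the-shelf classifier; capping the vocabulary at the most frequent trigrams mirrors the reduced parameter space discussed below.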


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference
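The positional-token-frequency feature family reduces to counting which tokens open and close sentences; a minimal sketch (the function name and whitespace tokenization are our simplifications):

```python
from collections import Counter

def positional_token_frequency(sentences):
    # Count tokens in the first, second, and last sentence positions -
    # e.g., how often a sentence opens with 'But'.
    positions = {"first": Counter(), "second": Counter(), "last": Counter()}
    for sentence in sentences:
        toks = sentence.split()
        if not toks:
            continue
        positions["first"][toks[0]] += 1
        positions["last"][toks[-1]] += 1
        if len(toks) > 1:
            positions["second"][toks[1]] += 1
    return positions
```

Comparing these counters between an O corpus and a T corpus surfaces exactly the sentence-initial 'But' asymmetry described above.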


Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazprom.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category       Feature                     Accuracy (%)
Interference   POS bigrams                 96
               POS trigrams                96
               Character bigrams           95
               Character trigrams          96
               Positional token frequency  93


Supervised Classification: Miscellaneous

RESULTS: MISCELLANEOUS

Category        Feature                              Accuracy (%)
Miscellaneous   Function words                       96
                Punctuation (1)                      81
                Punctuation (2)                      85
                Punctuation (3)                      80
                Pronouns                             77
                Ratio of passive forms to all verbs  65


ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors.

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.
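The punctuation predictors boil down to per-mark frequency profiles and their O/T ratios; a sketch, where the function names and the per-1000-character normalization are our assumptions (the original counts may well be per token):

```python
from collections import Counter

def punctuation_profile(text, marks=",.;:!?()[]'\"-"):
    # Frequency of each punctuation mark per 1000 characters.
    counts = Counter(ch for ch in text if ch in marks)
    n = max(len(text), 1)
    return {m: 1000 * counts[m] / n for m in marks}

def o_t_ratio(profile_o, profile_t, mark):
    # Ratio > 1 suggests an O marker, ratio < 1 a T marker.
    return profile_o[mark] / profile_t[mark] if profile_t[mark] else float("inf")
```

Feeding such profiles, computed over O and T corpora, to a classifier is enough to reproduce the kind of punctuation-only results reported above.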


ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

Mark   Frequency O   Frequency T   Ratio   LL   Weight
,         37.83         49.79      0.76    T    T1
(          0.42          0.72      0.58    T    T2
'          1.94          2.53      0.77    T    T3
)          0.40          0.72      0.56    T    T4
—          0.30          0.30      1.00    —    —
[          0.01          0.02      0.45    T    —
]          0.01          0.02      0.44    T    —
"          0.33          0.22      1.46    O    O7
—          0.22          0.17      1.25    O    O6
.         38.20         34.60      1.10    O    O5
—          1.20          1.17      1.03    —    O4
—          0.84          0.83      1.01    —    O3
—          1.57          1.11      1.41    O    O2
-          2.68          2.25      1.19    O    O1


Unsupervised Classification

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


Unsupervised Classification: Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus    EUR    HAN    LIT    TED
FW                  96.3   98.1   97.3   97.7
char-trigrams       98.8   97.1   99.5   100.0
POS-trigrams        98.5   97.2   98.7   92.0
contextual FW       95.2   96.8   94.1   86.3
cohesive markers    83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0


Unsupervised Classification

CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus    EUR    HAN    LIT    TED
FW                  88.6   88.9   78.8   87.5
char-trigrams       72.1   63.8   70.3   78.6
POS-trigrams        96.9   76.0   70.7   76.1
contextual FW       92.9   93.2   68.2   67.0
cohesive markers    63.1   81.2   67.1   63.0


Unsupervised Classification: Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T

p(w | Om) = (tf(w) + ε) / (|Om| + ε · |V|)

p(w | Tm) = (tf(w) + ε) / (|Tm| + ε · |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2 · √( JSD(P_X || P_Ci) )


The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label.

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets
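The whole labeling scheme fits in a short sketch, assuming add-ε smoothed unigram models and a 2·√JSD distance; the marker sets and counts below are invented for illustration:

```python
import math

def unigram_lm(counts, vocab, eps=1e-3):
    # Add-epsilon smoothed unigram model over a fixed vocabulary:
    # p(w) = (tf(w) + eps) / (total + eps * |V|)
    total = sum(counts.get(w, 0) for w in vocab)
    return [(counts.get(w, 0) + eps) / (total + eps * len(vocab)) for w in vocab]

def jsd(p, q):
    # Jensen-Shannon divergence (base 2) between two distributions.
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(p, q):
    return 2 * math.sqrt(jsd(p, q))

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    # C1 gets "O" iff that assignment is jointly closer than the alternative.
    po, pt = unigram_lm(o_markers, vocab), unigram_lm(t_markers, vocab)
    p1, p2 = unigram_lm(c1, vocab), unigram_lm(c2, vocab)
    if d_js(po, p1) * d_js(pt, p2) < d_js(po, p2) * d_js(pt, p1):
        return "O", "T"
    return "T", "O"
```

Because the decision compares products of distances, it is symmetric in the two clusters, which is what makes the complementary label assignment consistent.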


Unsupervised Classification: Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                                                        EUR    HAN    LIT    TED
FW                                                                     88.6   88.9   78.8   87.5
FW + char-trigrams + POS-trigrams                                      91.1   86.2   78.2   90.9
FW + POS-trigrams + contextual FW                                      95.8   89.8   72.3   86.3
FW + char-trigrams + POS-trigrams + contextual FW + cohesive markers   94.1   91.0   79.2   88.6


Unsupervised Classification: Sensitivity Analysis

SENSITIVITY ANALYSIS


Unsupervised Classification: Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


method \ corpora                  EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
KMeans: accuracy by domain        93.7      99.5      99.8          92.2
XMeans: # of generated clusters   2         2         3             3
XMeans: accuracy by domain        93.6      99.5      —             92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach


method \ corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT
Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5


Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation: Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


Standard notation: translating a foreign sentence F = f1 ... fm into an English sentence E = e1 ... el. The best translation:

E* = argmax_E P(E | F)
   = argmax_E P(F | E) × P(E) / P(F)
   = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

E* = argmax_{E ∈ English} P(F | E) × P(E),

where P(F | E) is the translation model and P(E) is the language model.
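A toy illustration of the argmax above over a fixed candidate list (the probabilities and names are invented; a real decoder searches a huge hypothesis space rather than enumerating candidates):

```python
import math

def best_translation(source, candidates, tm, lm, floor=1e-12):
    # Noisy-channel choice: argmax_E P(F | E) * P(E), computed in log
    # space; unseen events are floored to avoid log(0).
    def score(e):
        return math.log(tm.get((source, e), floor)) + math.log(lm.get(e, floor))
    return max(candidates, key=score)
```

Even in this toy form, the language model can overrule the translation model when the translation probabilities are close, which is the behavior the decomposition is designed to produce.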


A language model, to estimate P(E) (estimated from a monolingual E corpus)

A translation model, to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)


Applications for Machine Translation: Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts.

Test the fitness of language models compiled from texts translated from other languages.

Test whether language models compiled from translated texts are better for MT than LMs compiled from original texts.


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

PP(LM; w1 w2 ... wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 ... wi−1) )^(1/N)

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
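For a unigram LM the perplexity formula collapses to the inverse geometric mean of the word probabilities; a minimal sketch with invented toy probabilities (real toolkits use higher-order, smoothed models):

```python
import math

def perplexity(unigram_lm, words):
    # PP(LM; w1..wN) = ( prod_i 1 / P(wi | history) ) ** (1/N);
    # with a unigram LM the history is ignored. Computed in log space
    # for numerical stability.
    n = len(words)
    log_prob = sum(math.log2(unigram_lm[w]) for w in words)
    return 2 ** (-log_prob / n)
```

A uniform model over four words assigns every word probability 0.25, so its perplexity on any text over that vocabulary is exactly 4: the model is as uncertain as a fair four-sided die.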


PERPLEXITY RESULTS

German-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          451.50   93.00    69.36    66.47
O-EN         468.09   103.74   79.57    76.79
T-DE         443.14   88.48    64.99    62.07
T-FR         460.98   99.90    76.23    73.38
T-IT         465.89   102.31   78.50    75.67
T-NL         457.02   97.34    73.54    70.56


French-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          472.05   99.04    75.60    72.68
O-EN         500.56   115.48   91.14    88.31
T-DE         486.78   108.50   84.39    81.41
T-FR         463.58   94.59    71.24    68.37
T-IT         476.05   102.69   79.23    76.36
T-NL         490.09   110.67   86.61    83.55


Italian-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          395.99   88.46    67.35    64.40
O-EN         415.47   99.92    79.27    76.34
T-DE         404.64   95.22    73.73    70.85
T-FR         395.99   89.44    68.38    65.54
T-IT         384.55   81.90    60.85    57.91
T-NL         411.58   98.78    76.98    73.94


Dutch-to-English translations
Orig. Lang   1-gram   2-gram   3-gram   4-gram
Mix          434.89   90.73    69.05    66.08
O-EN         448.11   100.17   78.23    75.46
T-DE         437.68   93.67    71.54    68.57
T-FR         445.00   97.32    75.59    72.55
T-IT         448.11   100.19   78.06    75.19
T-NL         423.13   83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words only parts of speech
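Each abstraction variant above is a token-rewriting step over POS-tagged text; in this sketch the tag names, the <NE> placeholder, and the mode flags are our assumptions, not the original implementation:

```python
def abstract_tokens(tagged, mode):
    # Rewrite a tagged token stream according to the abstraction mode:
    # drop punctuation, mask named entities or nouns, or keep POS tags only.
    out = []
    for word, pos in tagged:
        if mode == "no_punct" and pos == "PUNCT":
            continue  # drop the token entirely
        elif mode == "no_ne" and pos == "NNP":
            out.append("<NE>")  # mask named entities
        elif mode == "no_nouns" and pos.startswith("NN"):
            out.append(pos)  # replace nouns by their tag
        elif mode == "pos_only":
            out.append(pos)
        else:
            out.append(word)
    return out
```

Training LMs on the rewritten streams is what allows the perplexity comparisons below to control for content words, names, and punctuation.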


ABSTRACTION RESULTS

No Punctuation
Orig. Lang   Perplexity   Improvement (%)
MIX          105.91       19.73
O-EN         131.94       —
T-DE         122.50       7.16
T-FR          99.52       24.58
T-IT         112.71       14.58
T-NL         126.44       4.17


Named Entity Abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX           93.88       18.51
O-EN         115.20       —
T-DE         107.48       6.70
T-FR          88.96       22.77
T-IT          99.17       13.91
T-NL         110.72       3.89


Noun Abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX          36.02        11.34
O-EN         40.62        —
T-DE         38.67        4.81
T-FR         34.75        14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


Part-of-speech Abstraction
Orig. Lang   Perplexity   Improvement (%)
MIX          7.99         2.66
O-EN         8.20         —
T-DE         8.08         1.47
T-FR         7.89         3.77
T-IT         8.00         2.47
T-NL         8.11         1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER?


TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr Olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report Mrs Sudre its emphasis quite rightly on the need to act in the long term

T: Your report Mrs Sudre places the emphasis quite rightly on the need to act in the long term


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers particularly in Britain which were the first responsible

T: We do not say anything either on the responsibility of the manufacturers particularly in great Britain who were the first responsible


Applications for Machine Translation · Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the opposite direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task   S→T    T→S
FR-EN  33.64  30.88
EN-FR  32.11  30.35
DE-EN  26.53  23.67
EN-DE  16.96  16.17
IT-EN  28.70  26.84
EN-IT  23.81  21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English

Corpus subset  S→T    T→S
250K           34.35  31.33
500K           35.21  32.38
750K           36.12  32.90
1M             35.73  33.07
1.25M          36.24  33.23
1.5M           36.43  33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French

Corpus subset  S→T    T→S
250K           27.74  26.58
500K           29.15  27.19
750K           29.43  27.63
1M             29.94  27.88
1.25M          30.63  27.84
1.5M           29.89  27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:

Union: simple concatenation of the corpora

Two phrase tables: train a phrase table for each subset and pass both to MOSES

Phrase-table interpolation using perplexity minimization (Sennrich, 2012)

Add a feature in the phrase table indicating the direction of translation
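The perplexity-minimization idea can be illustrated on toy data: mix two phrase-pair distributions (standing in here for the S→T and T→S phrase tables, which is an assumption for illustration only) and grid-search the interpolation weight that minimizes cross-entropy on held-out phrase pairs:

```python
import math
from collections import defaultdict

def interpolate(pt_fwd, pt_rev, lam):
    """Linear interpolation of two phrase-pair distributions:
    p(pair) = lam * p_fwd(pair) + (1 - lam) * p_rev(pair)."""
    merged = defaultdict(float)
    for table, weight in ((pt_fwd, lam), (pt_rev, 1.0 - lam)):
        for pair, p in table.items():
            merged[pair] += weight * p
    return merged

def cross_entropy(pt, dev_pairs):
    """Average negative log2-probability of held-out phrase pairs;
    perplexity is 2 ** cross_entropy, so minimizing one minimizes the other."""
    eps = 1e-9  # floor for unseen pairs
    return -sum(math.log2(pt.get(pair, eps)) for pair in dev_pairs) / len(dev_pairs)

def best_lambda(pt_fwd, pt_rev, dev_pairs, grid=21):
    """Grid-search the interpolation weight minimizing dev cross-entropy."""
    return min((i / (grid - 1) for i in range(grid)),
               key=lambda lam: cross_entropy(interpolate(pt_fwd, pt_rev, lam),
                                             dev_pairs))
```

Sennrich (2012) solves this minimization analytically per phrase-table feature; the grid search above is just the cheapest way to show the objective.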


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English

System    MIX    MIX-EO  MIX-FO
S→T       35.21  35.21   35.73
UNION     35.27  35.36   35.94
PPLMIN-1  35.46  35.59   36.26
PPLMIN-2  35.75  35.65   36.20
CrEnt     35.54  35.45   36.75
PplRatio  35.59  35.78   36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French

System    MIX    MIX-FO  MIX-EO
S→T       29.15  29.15   29.94
UNION     29.27  29.44   30.01
PPLMIN-1  29.64  29.94   29.65
PPLMIN-2  29.50  30.45   30.12
CrEnt     29.47  29.45   30.44
PplRatio  29.65  29.62   30.34


The Power of Interference


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference · Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2.

(Koppel and Ordan, 2011)
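The protocol can be sketched as follows, using a deliberately simple nearest-centroid classifier over normalized POS-trigram frequencies (the published experiments use stronger learners such as SVMs, and the tag sequences in the usage below are invented):

```python
from collections import Counter

def trigram_freqs(tags):
    """Normalized POS-trigram frequencies of one text chunk."""
    grams = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}

def centroid(vectors):
    """Mean of a list of sparse frequency vectors."""
    keys = {k for v in vectors for k in v}
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}

def squared_dist(u, v):
    return sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v))

def classify(tags, centroid_o, centroid_t):
    """Label an unseen chunk (e.g. translated from L2) with centroids
    trained on O vs. T-translated-from-L1."""
    v = trigram_freqs(tags)
    return "O" if squared_dist(v, centroid_o) <= squared_dist(v, centroid_t) else "T"
```

The point of the experiment is that `classify` is applied unchanged to translations from a language never seen in training.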

EXAMPLE (TRAIN ON L1, TEST ON L2)

      IT  FR  ES  DE  FI
IT    99  93  91  75  63
FR    90  98  90  83  70
ES    85  90  98  82  69
DE    73  74  74  98  70
FI    58  70  64  81  99


The Power of Interference · Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                     Dist            DistW
Feature              EN T   FR T    EN T   FR T
POS-trigrams         1.02   1.16    1.07   1.34
positional tokens    1.10   1.15    1.30   1.37
function words       1.32   1.82    1.43   1.90
cohesive markers     1.95   1.95    2.15   2.03
feature combination  1.00   1.46    1.13   1.58
random tree          2.00           1.95


The Power of Interference · Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS

Each text is annotated with the prompt, the proficiency level, and the native language.

The task is to identify the native language

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

     ARA CHI FRE GER HIN ITA JPN KOR SPA TEL TUR   P (%)  R (%)  F1
ARA   80   0   2   1   3   4   1   0   4   2   3   80.8   80.0   80.4
CHI    3  80   0   1   1   0   6   7   1   0   1   88.9   80.0   84.2
FRE    2   2  81   5   1   2   1   0   3   0   3   86.2   81.0   83.5
GER    1   1   1  93   0   0   0   1   1   0   2   87.7   93.0   90.3
HIN    2   0   0   1  77   1   0   1   5   9   4   74.8   77.0   75.9
ITA    2   0   3   1   1  87   1   0   3   0   2   82.1   87.0   84.5
JPN    2   1   1   2   0   1  87   5   0   0   1   78.4   87.0   82.5
KOR    1   5   2   0   1   0   9  81   1   0   0   80.2   81.0   80.6
SPA    2   0   2   0   1   8   2   1  78   1   5   77.2   78.0   77.6
TEL    0   1   0   0  18   1   2   1   1  73   3   85.9   73.0   78.9
TUR    4   0   2   2   0   2   2   4   4   0  80   76.9   80.0   78.4


Conclusion


CONCLUSION

Machines can accurately identify translated texts

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016).

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the Features of Translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation Universals: Do They Exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010. URL http://dx.doi.org/10.1007/978-3-642-12116-6.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006. URL http://dx.doi.org/10.1561/1500000005.

Dorothy Kenny. Lexis and Creativity in Translation: A Corpus-Based Study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-Based Translation Studies: Theory, Findings, Applications. Approaches to Translation Studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991.

Linn Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch Novels Translated into English: The Transformation of a 'Minority' Literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Supervised Classification · Normalization

NORMALIZATION

EXAMPLE (TRANSLATED LANGUAGE IS LESS PREDICTABLE)


Supervised Classification · Interference

RESULTS: INTERFERENCE

Category: Interference

Feature                     Accuracy (%)
POS unigrams                90
POS bigrams                 97
POS trigrams                98
Character unigrams          85
Character bigrams           98
Character trigrams          100
Prefixes and suffixes       80
Contextual function words   100
Positional token frequency  97


ANALYSIS: INTERFERENCE

The interference-based classifiers are the best-performing ones.

Interference is the most robust phenomenon typifying translations

The character n-gram findings are consistent with Popescu (2011): they catch on both affixes and function words.

For example, typical trigrams in O are -ion and all, whereas typical of T are -ble and the.

Part-of-speech trigrams constitute an extremely cheap and efficient classifier.

For example, MD+VB+VBN: e.g., must be taken, should be given, can be used.


INTERFERENCE

EXAMPLE (THE AVERAGE NUMBER OF THE POS TRIGRAM MODAL + VERB BASE FORM + PARTICIPLE IN O AND TEN Ts)


ANALYSIS: INTERFERENCE

Positional token frequency yields 97% accuracy.

The second most prominent feature typifying O is sentences opening with the word 'But': there are 2.25 times more cases of such sentences in O.

A long prescriptive tradition in English forbids writers to open a sentence with 'But', and translators are conservative in their lexical choices (Kenny, 2001).

This is in fact standardization rather than interference
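Positional token frequency takes only a few lines to compute; a sketch, simplified to the first, second, and last tokens of each sentence (the exact set of positions used in the experiments is an assumption here):

```python
from collections import Counter

def positional_tokens(sentences):
    """Count which token occupies the first, second, and last position
    of each sentence (a simplified positional-token feature extractor)."""
    feats = Counter()
    for sent in sentences:
        toks = sent.split()
        if len(toks) < 3:
            continue  # too short to have three distinct positions
        feats[("first", toks[0])] += 1
        feats[("second", toks[1])] += 1
        feats[("last", toks[-1])] += 1
    return feats
```

Features such as ("first", "But") then directly expose the sentence-initial 'But' asymmetry discussed above.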


ANALYSIS: INTERFERENCE

Some interference features are coarse and may reflect corpus-dependent characteristics.

For example, some of the top character n-gram features in O include sequences that are 'illegal' in English and obviously stem from foreign names: Haarder and Maat, or Gazpron.

To offset this problem, we use only the top 300 features in several of the classifiers, with a minor effect on the results.


RESULTS: REDUCED PARAMETER SPACE (300 MOST FREQUENT FEATURES)

Category: Interference

Feature                     Accuracy (%)
POS bigrams                 96
POS trigrams                96
Character bigrams           95
Character trigrams          96
Positional token frequency  93


Supervised Classification · Miscellaneous

RESULTS: MISCELLANEOUS

Category: Miscellaneous

Feature                              Accuracy (%)
Function words                       96
Punctuation (1)                      81
Punctuation (2)                      85
Punctuation (3)                      80
Pronouns                             77
Ratio of passive forms to all verbs  65


ANALYSIS: MISCELLANEOUS

The function words classifier replicates Koppel and Ordan (2011); despite the good performance (96% accuracy), it is not very meaningful theoretically.

Subject pronouns like I, he, and she are prominent indicators of O, whereas virtually all reflexive pronouns (such as itself, himself, yourself) typify T.

T has about 1.15 times more passive verbs, but this is highly dependent on the source language.

Punctuation marks are good predictors

The mark '.' (actually indicating sentence length) is a strong feature of O; ',' is a strong marker of T.

In fact, these two features alone yield 79% accuracy.

Parentheses are very typical of T, indicating explicitation.

Translations from German use many more exclamation marks: 2.76 times more than original English.


ANALYSIS: PUNCTUATION

EXAMPLE (PUNCTUATION MARKS ACROSS O AND T)

        Frequency
Mark    O       T      Ratio  LL  Weight
,       37.83   49.79  0.76   T   T1
(        0.42    0.72  0.58   T   T2
'        1.94    2.53  0.77   T   T3
)        0.40    0.72  0.56   T   T4
         0.30    0.30  1.00   —   —
[        0.01    0.02  0.45   T   —
]        0.01    0.02  0.44   T   —
"        0.33    0.22  1.46   O   O7
         0.22    0.17  1.25   O   O6
.       38.20   34.60  1.10   O   O5
         1.20    1.17  1.03   —   O4
         0.84    0.83  1.01   —   O3
         1.57    1.11  1.41   O   O2
-        2.68    2.25  1.19   O   O1


Unsupervised Classification


Unsupervised Classification · Motivation

SUPERVISED CLASSIFICATION

Inherently dependent on data annotated with the translation direction.

May not generalize to unseen (related or unrelated) domains, modality (written vs. spoken), register, genre, date, etc.


DATASETS

Europarl: the proceedings of the European Parliament between the years 2001-2006

The Canadian Hansard: transcripts of the Canadian Parliament, spanning the years 2001-2009

Literary classics, written (or translated) mainly in the 19th century

Transcripts of TED and TEDx talks


SUPERVISED CLASSIFICATION

feature \ corpus   EUR    HAN    LIT    TED
FW                 96.3   98.1   97.3   97.7
char-trigrams      98.8   97.1   99.5   100.0
POS-trigrams       98.5   97.2   98.7   92.0
contextual FW      95.2   96.8   94.1   86.3
cohesive markers   83.6   86.9   78.6   81.8


PAIRWISE CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR            —      60.8   56.2   96.3
HAN            59.7   —      58.7   98.1
LIT            64.3   61.5   —      97.3


LEAVE-ONE-OUT CROSS-DOMAIN CLASSIFICATION USING FUNCTION WORDS

train \ test   EUR    HAN    LIT    X-validation
EUR + HAN      —      —      63.8   94.0
EUR + LIT      —      64.1   —      92.9
HAN + LIT      59.8   —      —      96.0


CLUSTERING, ASSUMING GOLD LABELS

feature \ corpus   EUR    HAN    LIT    TED
FW                 88.6   88.9   78.8   87.5
char-trigrams      72.1   63.8   70.3   78.6
POS-trigrams       96.9   76.0   70.7   76.1
contextual FW      92.9   93.2   68.2   67.0
cohesive markers   63.1   81.2   67.1   63.0


Unsupervised Classification · Cluster Labeling

CLUSTER LABELING

Labeling determines which of the clusters is O and which is T

Clustering can divide observations into classes, but it cannot label those classes.

Let Om (O-markers) denote a set of function words that tend to be associated with O; Tm (T-markers) is a set of words typical of T.

Create unigram language models of O and T:

p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

The similarity between a class X (either O or T) and a cluster Ci is determined using the Jensen-Shannon divergence (JSD) (Lin, 1991):

DJS(X, Ci) = 2^√JSD(PX ‖ PCi)


CLUSTER LABELING

The assignment of the label X to the cluster C1 is then supported both by C1's proximity to the class X and by C2's proximity to the other class:

label(C1) = "O"  if DJS(O, C1) × DJS(T, C2) < DJS(O, C2) × DJS(T, C1)
            "T"  otherwise

C2 is then assigned the complementary label
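The labeling procedure can be sketched directly from these definitions. Marker sets, clusters, and the vocabulary below are toy data, and the sketch compares products of raw JSD values rather than the transformed distance on the slide:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=0.1):
    """Add-epsilon smoothed unigram LM: p(w) = (tf(w) + eps) / (N + eps * |V|)."""
    tf = Counter(tokens)
    denom = len(tokens) + eps * len(vocab)
    return {w: (tf[w] + eps) / denom for w in vocab}

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) of distributions over one vocab."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in a)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def label_clusters(o_markers, t_markers, c1, c2, vocab):
    """Assign 'O'/'T' to clusters c1, c2 (token lists) by comparing products
    of JSD distances to the O-marker and T-marker language models."""
    lm = {name: unigram_lm(toks, vocab)
          for name, toks in (("O", o_markers), ("T", t_markers),
                             ("c1", c1), ("c2", c2))}
    d = lambda x, c: jsd(lm[x], lm[c])
    if d("O", "c1") * d("T", "c2") < d("O", "c2") * d("T", "c1"):
        return "O", "T"
    return "T", "O"
```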

We select O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.

Labeling precision is 100% in all clustering experiments.

This facilitates majority voting of several feature sets


Unsupervised Classification · Clustering Consensus among Feature Sets

CLUSTERING CONSENSUS

method \ corpus                             EUR    HAN    LIT    TED
FW                                          88.6   88.9   78.8   87.5
FW, char-trigrams, POS-trigrams             91.1   86.2   78.2   90.9
FW, POS-trigrams, contextual FW             95.8   89.8   72.3   86.3
FW, char-trigrams, POS-trigrams,
  contextual FW, cohesive markers           94.1   91.0   79.2   88.6


Unsupervised Classification · Sensitivity Analysis

SENSITIVITY ANALYSIS

Unsupervised Classification · Mixed-domain Classification

MIXED-DOMAIN CLASSIFICATION

The in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains.

Hypothesis: domain-specific features overshadow the features of translationese.

We mix 800 chunks each from Europarl and Hansard, yielding 1600 chunks, half of them O and half T.

Running the clustering algorithm on this dataset yields perfect separation by domain (and chance-level identification of translationese).

Adding the literature corpus and clustering into three clusters yields the same results.


MIXED-DOMAIN CLASSIFICATION

method \ corpus                  EUR+HAN  EUR+LIT  EUR+HAN+LIT  HAN+LIT
KMeans, accuracy by domain       93.7     99.5     99.8         92.2
XMeans, generated # of clusters  2        2        3            3
XMeans, accuracy by domain       93.6     99.5     —            92.2


CLUSTERING IN A MIXED-DOMAIN SETUP

Given a set of text chunks that come from multiple domains, such that some chunks are O and some are T, the task is to classify the texts into O vs. T, independently of their domain.

A two-phase approach

A flat approach

copy Shuly Wintner (University of Haifa) Translationese 12 December 2016 61 112

CLUSTERING IN A MIXED-DOMAIN SETUP

method / corpora   EUR+HAN   EUR+LIT   EUR+HAN+LIT   HAN+LIT

Flat               92.5      60.7      77.5          66.8
Two-phase          91.3      79.4      85.3          67.5
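The flat-vs-two-phase contrast can be sketched on synthetic data: a pure-Python 2-means, with an invented "domain" feature made much stronger than the invented "translationese" feature, mirroring the overshadowing hypothesis above. All data, dimensions, and magnitudes are illustrative, not taken from the actual experiments.

```python
import random

def kmeans2(points, iters=50):
    """Minimal 2-means (Lloyd's algorithm) with deterministic initialization:
    the two centers are seeded from the two ends of the point list."""
    centers = [points[0], points[-1]]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            j = min((0, 1), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [min((0, 1), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points]

def agreement(a, b):
    m = sum(x == y for x, y in zip(a, b)) / len(a)
    return max(m, 1 - m)  # cluster IDs are arbitrary, so allow a label swap

# Synthetic chunks: feature 0 carries a strong domain signal, feature 1 a
# weaker original-vs-translated (O/T) signal -- all values are invented.
rng = random.Random(1)
data, domain, ot = [], [], []
for d in (0, 1):        # two "domains" (think Europarl vs. Hansard)
    for t in (0, 1):    # O vs. T
        for _ in range(50):
            data.append((10.0 * d + rng.gauss(0, 0.3), 1.0 * t + rng.gauss(0, 0.3)))
            domain.append(d)
            ot.append(t)

# Flat clustering locks onto the dominant domain signal.
flat = kmeans2(data)
flat_vs_domain = agreement(flat, domain)  # near-perfect separation by domain
flat_vs_ot = agreement(flat, ot)          # near chance for translationese

# Two-phase: split by domain first, then cluster O vs. T within each domain.
twophase_acc = []
for d in (0, 1):
    idx = [i for i, dd in enumerate(domain) if dd == d]
    sub = kmeans2([data[i] for i in idx])
    twophase_acc.append(agreement(sub, [ot[i] for i in idx]))
```

On this toy data, flat clustering recovers the domains but not O/T, while the two-phase pass recovers O/T within each domain, which is the pattern reported in the table above.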

Applications for Machine Translation

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


APPLICATIONS FOR MACHINE TRANSLATION

Fundamentals of statistical machine translation (SMT)

Language models

Translation models


Applications for Machine Translation Fundamentals of SMT

FUNDAMENTALS OF SMT

Motivation: "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I shall now proceed to decode.'"

Warren Weaver, 1955

The noisy channel model

The best translation T of a source sentence S is the target sentence T that maximizes some function combining the faithfulness of (T, S) and the fluency of T.


FUNDAMENTALS OF SMT

Standard notation: translating a foreign sentence F = f1 … fm into an English sentence E = e1 … el. The best translation:

  Ê = argmax_E P(E | F)
    = argmax_E P(F | E) × P(E) / P(F)
    = argmax_E P(F | E) × P(E)

The noisy channel thus requires two components, a translation model and a language model:

  Ê = argmax_{E ∈ English} P(F | E) × P(E)
                           (translation model × language model)
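The decomposition can be made concrete with a toy decoder: score each candidate E by the product of a translation-model probability and a language-model probability, in log space, and take the argmax. Every number below is invented for illustration, not the output of a real TM or LM.

```python
import math

# Toy noisy-channel decoding for a French input F.
F = "maison bleue"
candidates = {
    # E: (P(F | E) from the translation model, P(E) from the language model)
    "blue house": (0.40, 0.010),
    "house blue": (0.45, 0.001),  # faithful word order, but disfluent English
    "blue home":  (0.20, 0.008),
}

def score(e):
    tm, lm = candidates[e]
    return math.log(tm) + math.log(lm)  # log of P(F | E) * P(E)

best = max(candidates, key=score)
# "house blue" has the highest TM score, but the LM penalty for its
# disfluency lets "blue house" win the combined argmax.
```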


FUNDAMENTALS OF SMT

A language model to estimate P(E) (estimated from a monolingual E corpus)

A translation model to estimate P(F | E) (estimated from a bilingual parallel corpus)

A decoder that, given F, can produce the most probable E

Evaluation: BLEU scores (Papineni et al., 2002)
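The mechanics of BLEU can be sketched in a few lines: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. This is a toy sentence-level version with a single reference; the real metric of Papineni et al. (2002) is corpus-level and usually smoothed.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU with one reference: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        r_counts = Counter(ngrams(ref, n))
        # Clipping: a candidate n-gram is credited at most as often as it
        # appears in the reference.
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, len(cand) - n + 1)
        if clipped == 0:
            return 0.0  # unsmoothed: one empty precision zeroes the score
        log_prec += math.log(clipped / total) / max_n
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_prec)
```

A candidate identical to the reference scores 1.0, a candidate sharing no unigrams scores 0.0, and partial overlap lands strictly in between.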


Applications for Machine Translation Language Models

LANGUAGE MODELS: RESEARCH QUESTIONS

Test the fitness of language models compiled from translated texts vs. the fitness of LMs compiled from original texts

Test the fitness of language models compiled from texts translated from other languages

Test if language models compiled from translated texts are better for MT than LMs compiled from original texts


LANGUAGE MODELS: METHODOLOGY

The fitness of a language model to a reference corpus is evaluated using perplexity:

  PP(LM; w1 w2 … wN) = ( ∏_{i=1}^{N} 1 / P_LM(wi | w1 … w_{i-1}) )^{1/N}

Train SMT systems (Koehn et al., 2007) using different LMs and evaluate their quality on a reference set.

Quality is measured in terms of BLEU scores (Papineni et al., 2002).
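The perplexity formula above reduces to a short log-space computation; here is a minimal sketch, where `probs` is a hypothetical list of per-word conditional probabilities rather than the output of any real LM toolkit.

```python
import math

def perplexity(probs):
    """Perplexity from the per-word conditional probabilities
    P_LM(w_i | w_1 ... w_{i-1}) assigned by a language model to a text of
    N words: the N-th root of the inverse probability of the whole text,
    computed in log space for numerical stability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns probability 1/4 to every word has perplexity 4;
# a lower perplexity means the model fits the reference corpus better.
uniform4 = perplexity([0.25] * 10)
```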


PERPLEXITY RESULTS

German-to-English translations:

Orig. lang.  1-gram   2-gram   3-gram   4-gram
Mix          451.50    93.00    69.36    66.47
O-EN         468.09   103.74    79.57    76.79
T-DE         443.14    88.48    64.99    62.07
T-FR         460.98    99.90    76.23    73.38
T-IT         465.89   102.31    78.50    75.67
T-NL         457.02    97.34    73.54    70.56


PERPLEXITY RESULTS

French-to-English translations:

Orig. lang.  1-gram   2-gram   3-gram   4-gram
Mix          472.05    99.04    75.60    72.68
O-EN         500.56   115.48    91.14    88.31
T-DE         486.78   108.50    84.39    81.41
T-FR         463.58    94.59    71.24    68.37
T-IT         476.05   102.69    79.23    76.36
T-NL         490.09   110.67    86.61    83.55


PERPLEXITY RESULTS

Italian-to-English translations:

Orig. lang.  1-gram   2-gram   3-gram   4-gram
Mix          395.99    88.46    67.35    64.40
O-EN         415.47    99.92    79.27    76.34
T-DE         404.64    95.22    73.73    70.85
T-FR         395.99    89.44    68.38    65.54
T-IT         384.55    81.90    60.85    57.91
T-NL         411.58    98.78    76.98    73.94


PERPLEXITY RESULTS

Dutch-to-English translations:

Orig. lang.  1-gram   2-gram   3-gram   4-gram
Mix          434.89    90.73    69.05    66.08
O-EN         448.11   100.17    78.23    75.46
T-DE         437.68    93.67    71.54    68.57
T-FR         445.00    97.32    75.59    72.55
T-IT         448.11   100.19    78.06    75.19
T-NL         423.13    83.99    62.17    59.09


LANGUAGE MODELS: ABSTRACTION

No punctuation

No named entities

No nouns

No words, only parts of speech


ABSTRACTION RESULTS

No punctuation:

Orig. lang.  Perplexity  Improvement (%)
MIX          105.91      19.73
O-EN         131.94      –
T-DE         122.50       7.16
T-FR          99.52      24.58
T-IT         112.71      14.58
T-NL         126.44       4.17


ABSTRACTION RESULTS

Named-entity abstraction:

Orig. lang.  Perplexity  Improvement (%)
MIX           93.88      18.51
O-EN         115.20      –
T-DE         107.48       6.70
T-FR          88.96      22.77
T-IT          99.17      13.91
T-NL         110.72       3.89


ABSTRACTION RESULTS

Noun abstraction:

Orig. lang.  Perplexity  Improvement (%)
MIX          36.02       11.34
O-EN         40.62       –
T-DE         38.67        4.81
T-FR         34.75       14.46
T-IT         36.85        9.30
T-NL         39.44        2.91


ABSTRACTION RESULTS

Part-of-speech abstraction:

Orig. lang.  Perplexity  Improvement (%)
MIX          7.99        2.66
O-EN         8.20        –
T-DE         8.08        1.47
T-FR         7.89        3.77
T-IT         8.00        2.47
T-NL         8.11        1.11


MACHINE TRANSLATION RESULTS

DE to EN
LM     BLEU
MIX    21.43
O-EN   21.10
T-DE   21.90
T-FR   21.16
T-IT   21.29
T-NL   21.20

FR to EN
LM     BLEU
MIX    28.67
O-EN   27.98
T-DE   28.01
T-FR   29.14
T-IT   28.75
T-NL   28.11

IT to EN
LM     BLEU
MIX    25.41
O-EN   24.69
T-DE   24.62
T-FR   25.37
T-IT   25.96
T-NL   24.77

NL to EN
LM     BLEU
MIX    24.20
O-EN   23.40
T-DE   24.26
T-FR   23.56
T-IT   23.87
T-NL   24.52


DOES SIZE MATTER?

[two figure-only slides; the learning-curve plots are not recoverable from the extracted text]

TRANSLATION EXAMPLES: COHESIVE MARKERS

SOURCE: Enfin, ce qui est grave dans le rapport de M. Olivier, c'est qu'il propose une constitution tripotage.

O: Finally, which is serious in the report of Mr Olivier is that it proposes a constitution tripotage.

T: Finally, and this is serious in the report by Mr olivier, it is because it proposes a constitution tripotage.

SOURCE: C'est quand même quelque chose de précieux qui a été souligné par tous les membres du conseil européen.

O: Even when it is something of valuable which has been pointed out by all the members of the European Council.

T: It is nevertheless something of a valuable which has been pointed out by all the members of the European Council.


TRANSLATION EXAMPLES: VERBS

SOURCE: Une telle Europe serait un gage de paix et marquerait le refus de tout nationalisme ethnique.

O: Such a Europe would be a show of peace and would the rejection of any ethnic nationalism.

T: Such a Europe would be a show of peace and would mark the refusal of all ethnic nationalism.

SOURCE: Votre rapport, madame Sudre, met l'accent à juste titre sur la nécessité d'agir dans la durée.

O: Your report, Mrs Sudre, its emphasis quite rightly on the need to act in the long term.

T: Your report, Mrs Sudre, places the emphasis quite rightly on the need to act in the long term.


TRANSLATION EXAMPLES: INTERFERENCE

SOURCE: On ne dit rien non plus sur la responsabilité des fabricants, notamment en Grande-Bretagne, qui ont été les premiers responsables.

O: We do not say nothing more on the responsibility of the manufacturers, particularly in Britain, which were the first responsible.

T: We do not say anything either on the responsibility of the manufacturers, particularly in great Britain, who were the first responsible.


Applications for Machine Translation Translation Models

TRANSLATION MODELS: RESEARCH QUESTIONS

Are parallel corpora (manually) translated in the same direction as the MT task better than ones translated in the other direction?

If corpora consisting of texts (manually) translated in both directions are available, how can we build a translation model adapted to the unique properties of the translated text?


TRANSLATION MODELS: DIRECTION MATTERS

Task    S→T     T→S
FR-EN   33.64   30.88
EN-FR   32.11   30.35
DE-EN   26.53   23.67
EN-DE   16.96   16.17
IT-EN   28.70   26.84
EN-IT   23.81   21.28


TRANSLATION MODELS: DIRECTION MATTERS

Task: French-to-English
Corpus subset   S→T     T→S
250K            34.35   31.33
500K            35.21   32.38
750K            36.12   32.90
1M              35.73   33.07
1.25M           36.24   33.23
1.5M            36.43   33.73


TRANSLATION MODELS: DIRECTION MATTERS

Task: English-to-French
Corpus subset   S→T     T→S
250K            27.74   26.58
500K            29.15   27.19
750K            29.43   27.63
1M              29.94   27.88
1.25M           30.63   27.84
1.5M            29.89   27.83


TRANSLATION MODELS: DOMAIN ADAPTATION

GOAL: Given any bi-text consisting of both S→T and T→S subsets, improve translation quality by taking advantage of information pertaining to the direction of translation.

TECHNIQUES:
Union: simple concatenation of the corpora
Two phrase tables: train a phrase table for each subset and pass both to Moses
Phrase-table interpolation using perplexity minimization (Sennrich, 2012)
Add a feature in the phrase table indicating the direction of translation
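The perplexity-minimization idea behind the interpolation technique can be sketched in miniature: pick the interpolation weight between the two sub-models that minimizes perplexity on held-out data. The phrase pairs, probabilities, and grid search below are all invented toys; the real method (Sennrich, 2012) optimizes the weights over full phrase tables.

```python
import math

# Hypothetical phrase probabilities p(f | e) from two sub-models: one trained
# on the S->T-translated half of the bitext, one on the T->S half.
model_st = {("maison", "house"): 0.7, ("bleue", "blue"): 0.6, ("voiture", "car"): 0.2}
model_ts = {("maison", "house"): 0.4, ("bleue", "blue"): 0.3, ("voiture", "car"): 0.6}

# Held-out phrase pairs representative of the translation task (invented).
dev = [("maison", "house"), ("bleue", "blue"), ("voiture", "car")]

def dev_perplexity(lam):
    """Perplexity on the held-out pairs of the model that interpolates the
    two phrase tables with weight lam on the S->T half."""
    logp = sum(math.log(lam * model_st[p] + (1 - lam) * model_ts[p]) for p in dev)
    return math.exp(-logp / len(dev))

# Crude 1-D grid search standing in for proper perplexity minimization.
best_lam = min((l / 100 for l in range(101)), key=dev_perplexity)
```

Since two of the three dev pairs favor the S→T sub-model and one favors T→S, the optimum lands strictly between 0 and 1, weighting the S→T half more heavily.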


TRANSLATION MODELS: ADAPTATION RESULTS

Task: French-to-English
System     MIX     MIX-EO   MIX-FO
S→T        35.21   35.21    35.73
UNION      35.27   35.36    35.94
PPLMIN-1   35.46   35.59    36.26
PPLMIN-2   35.75   35.65    36.20
CrEnt      35.54   35.45    36.75
PplRatio   35.59   35.78    36.22


TRANSLATION MODELS: ADAPTATION RESULTS

Task: English-to-French
System     MIX     MIX-FO   MIX-EO
S→T        29.15   29.15    29.94
UNION      29.27   29.44    30.01
PPLMIN-1   29.64   29.94    29.65
PPLMIN-2   29.50   30.45    30.12
CrEnt      29.47   29.45    30.44
PplRatio   29.65   29.62    30.34


The Power of Interference

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


THE POWER OF INTERFERENCE

Translations to English from L1 differ from translations to English from L2.

Translations from two languages are more similar to each other when the two source languages are closer.

The native language of ESL speakers can be accurately identified by looking at their English texts.


The Power of Interference Cross-classification

CROSS-CLASSIFICATION

A classifier is trained to distinguish between O and T translated from L1; it is then used to distinguish O from T translated from L2 (Koppel and Ordan, 2011).

EXAMPLE (TRAIN ON L1, TEST ON L2)

Train \ Test   IT   FR   ES   DE   FI
IT             99   93   91   75   63
FR             90   98   90   83   70
ES             85   90   98   82   69
DE             73   74   74   98   70
FI             58   70   64   81   99
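The train-on-L1, test-on-L2 setup can be sketched end-to-end on synthetic feature vectors: a nearest-centroid classifier stands in for the real classifiers, and the two features, a translationese signal shared across source languages plus a source-specific interference signal, are invented for illustration.

```python
import random

rng = random.Random(7)

def chunk(is_translated, src_shift):
    """A text chunk as two invented features: feature 0 is a translationese
    signal shared across source languages; feature 1 is source-language-
    specific interference (present only in translated chunks)."""
    return (rng.gauss(2.0 if is_translated else 0.0, 0.5),
            rng.gauss(src_shift if is_translated else 0.0, 0.5))

train = [(chunk(t, +0.5), t) for t in (0, 1) for _ in range(100)]  # O vs. T-from-L1
test  = [(chunk(t, -0.5), t) for t in (0, 1) for _ in range(100)]  # O vs. T-from-L2

def centroid(points):
    return tuple(sum(x) / len(x) for x in zip(*points))

c0 = centroid([x for x, y in train if y == 0])
c1 = centroid([x for x, y in train if y == 1])

def predict(x):
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 < d1 else 1

# The shared translationese signal carries over to the new source language
# even though the interference feature points the other way.
cross_acc = sum(predict(x) == y for x, y in test) / len(test)
```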


The Power of Interference Clustering of Translations

CLUSTERING OF TRANSLATIONS FROM SEVERAL LANGUAGES

First, translations to English from 14 European languages were used for cross-classification, using POS trigrams as features.

The accuracy of classification was 76% (compare to a baseline of 100/14 ≈ 7%).

Then, an unsupervised hierarchical clustering algorithm grouped together English (and French) texts translated from 16 European languages.

Texts were represented using feature vectors, similarly to the supervised experiments.


CROSS-CLASSIFICATION

EXAMPLE (CONFUSION MATRIX OF 14-WAY CLASSIFICATION OF ENGLISH (LEFT) AND FRENCH (RIGHT) TRANSLATIONESE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE ("GOLD" TREE)


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (TREES OBTAINED WITH POS-TRIGRAMS: ENGLISH (LEFT) AND FRENCH (RIGHT))


THE EUROPEAN TREE OF LANGUAGES

EXAMPLE (EVALUATION OF TYPOLOGY TREES AGAINST GOLD TREE)

                      Dist             Dist_w
Feature               EN T    FR T     EN T    FR T
POS-trigrams          1.02    1.16     1.07    1.34
positional tokens     1.10    1.15     1.30    1.37
function words        1.32    1.82     1.43    1.90
cohesive markers      1.95    1.95     2.15    2.03
feature combination   1.00    1.46     1.13    1.58
random tree           2.00             1.95


The Power of Interference Native Language Identification

NATIVE LANGUAGE IDENTIFICATION

Given a set of essays composed by ESL students, identify the authors' native language (out of 11 languages).

The dataset consists of TOEFL essays contributed by the ETS.

Each text is annotated by the prompt, the proficiency level, and the native language.

The task is to identify the native language.

Features

Results (Tsvetkov et al., 2013)


NATIVE LANGUAGE IDENTIFICATION

        ARA  CHI  FRE  GER  HIN  ITA  JPN  KOR  SPA  TEL  TUR    P (%)  R (%)    F1
ARA      80    0    2    1    3    4    1    0    4    2    3     80.8   80.0   80.4
CHI       3   80    0    1    1    0    6    7    1    0    1     88.9   80.0   84.2
FRE       2    2   81    5    1    2    1    0    3    0    3     86.2   81.0   83.5
GER       1    1    1   93    0    0    0    1    1    0    2     87.7   93.0   90.3
HIN       2    0    0    1   77    1    0    1    5    9    4     74.8   77.0   75.9
ITA       2    0    3    1    1   87    1    0    3    0    2     82.1   87.0   84.5
JPN       2    1    1    2    0    1   87    5    0    0    1     78.4   87.0   82.5
KOR       1    5    2    0    1    0    9   81    1    0    0     80.2   81.0   80.6
SPA       2    0    2    0    1    8    2    1   78    1    5     77.2   78.0   77.6
TEL       0    1    0    0   18    1    2    1    1   73    3     85.9   73.0   78.9
TUR       4    0    2    2    0    2    2    4    4    0   80     76.9   80.0   78.4


Conclusion

OUTLINE

1 INTRODUCTION

2 IDENTIFICATION OF TRANSLATIONESE

3 TRANSLATION STUDIES HYPOTHESES

4 SUPERVISED CLASSIFICATION

5 UNSUPERVISED CLASSIFICATION

6 APPLICATIONS FOR MACHINE TRANSLATION

7 THE POWER OF INTERFERENCE

8 CONCLUSION


CONCLUSION

Machines can accurately identify translated texts.

Translation "universals" should be reconsidered: not only are they dependent on genre and register, they also vary greatly across different pairs of languages.

The best-performing features are those that attest to the 'fingerprints' of the source on the target.

Interference, a pair-specific phenomenon, dominates other manifestations of translationese.

Translationese features are overshadowed by more salient features of the text, including genre, register, domain, etc.


CONCLUSION

Language models compiled from translated texts are better for SMT than ones compiled from original texts.

Translation models translated in the same direction as that of the SMT task are better than ones translated in the reverse direction.

Translation models can be adapted to translationese, thereby improving the quality of SMT.


FUTURE DIRECTIONS

The features of Hebrew translationese: morphological markers (Avner et al., 2016)

The features of machine translationese

Robust identification of machine translation output

More generally, the similarities and differences among various interference phenomena:

Translation
Non-native speakers
Learners' productions


ACKNOWLEDGEMENTS

Gennadi Lembersky, Noam Ordan, Ella Rabinovich, Vered Volansky

Israel Ministry of Science and Technology


MAIN PUBLICATIONS

Ella Rabinovich and Shuly Wintner. Unsupervised Identification of Translationese. Transactions of the Association for Computational Linguistics, 3:419–432, 2015.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving Statistical Machine Translation by Adapting Translation Models to Translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012.


BIBLIOGRAPHY I

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 289–295, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-2048.

Omar S. Al-Shabab. Interpretation and the language of translation: creativity and conventions in translation. Janus, Edinburgh, 1996.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, 31(1):30–54, April 2016. URL http://dx.doi.org/10.1093/llc/fqu047.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233–252. John Benjamins, Amsterdam, 1993.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Nitza Ben-Ari. The ambivalent case of repetitions in literary translation. Avoiding repetitions: A "universal" of translation? Meta, 43(1):68–78, 1998.

Shoshana Blum-Kulka. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and intercultural communication: Discourse and cognition in translation and second language acquisition studies, volume 35, pages 17–35. Gunter Narr Verlag, 1986.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. Language Learning, 28(2):399–416, December 1978.

Shoshana Blum-Kulka and Eddie A. Levenston. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication, pages 119–139. Longman, 1983.


BIBLIOGRAPHY II

Andrew Chesterman. Beyond the particular. In Anna Mauranen and Pekka Kujamäki, editors, Translation universals: Do they exist?, pages 33–50. John Benjamins, 2004.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

Sauleh Eetemadi and Kristina Toutanova. Asymmetric features of human generated translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 159–164. Association for Computational Linguistics, October 2014. URL http://www.aclweb.org/anthology/D14-1018.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. URL http://dx.doi.org/10.1145/1656274.1656278.

Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441–447, 2012.

Iustina Ilisei. A Machine Learning Approach to the Identification of Translational Language: An Inquiry into Translationese Learning Models. PhD thesis, University of Wolverhampton, Wolverhampton, UK, February 2013. URL http://clg.wlv.ac.uk/papers/ilisei-thesis.pdf.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.


BIBLIOGRAPHY III

Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Dorothy Kenny. Lexis and creativity in translation: a corpus-based study. St. Jerome, 2001.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation Summit, pages 79–86. AAMT, 2005. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 14(3), 2003.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Sara Laviosa. Core patterns of lexical use in a comparable corpus of English lexical prose. Meta, 43(4):557–570, December 1998.

Sara Laviosa. Corpus-based translation studies: theory, findings, applications. Approaches to translation studies. Rodopi, 2002.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.


BIBLIOGRAPHY IV

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1026.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b. URL http://dx.doi.org/10.1162/COLI_a_00111.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013. URL http://dx.doi.org/10.1162/COLI_a_00159.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991.

Lin Øverås. In search of the third code: An investigation of norms in literary translation. Meta, 43(4):557–570, 1998.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1055.

Sanjun Sun and Gregory M. Shreve. Reconfiguring translation studies. Unpublished manuscript, 2013. URL http://sanjun.org/ReconfiguringTS.html.


BIBLIOGRAPHY V

Elke Teich. Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Mouton de Gruyter, 2003.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, June 2013.

Gideon Toury. Interlanguage and its manifestations in translation. Meta, 24(2):223–231, 1979.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam/Philadelphia, 1995.

Yulia Tsvetkov, Naama Twitto, Nathan Schneider, Noam Ordan, Manaal Faruqui, Victor Chahuneau, Shuly Wintner, and Chris Dyer. Identifying the L1 of non-native writers: the CMU-Haifa system. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 279–287. Association for Computational Linguistics, June 2013. URL http://www.aclweb.org/anthology/W13-1736.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008: 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008. URL http://www.aclweb.org/anthology/C08-1118.

Ria Vanderauwera. Dutch novels translated into English: the transformation of a 'minority' literature. Rodopi, Amsterdam, 1985.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.


Page 106: Translationese - Between Human and Machine Translation
Page 107: Translationese - Between Human and Machine Translation
Page 108: Translationese - Between Human and Machine Translation
Page 109: Translationese - Between Human and Machine Translation
Page 110: Translationese - Between Human and Machine Translation
Page 111: Translationese - Between Human and Machine Translation
Page 112: Translationese - Between Human and Machine Translation