
LING 388: Language and Computers

Sandiway Fong

Lecture 26: 11/29

Administrivia

• Homework #5
  – due today

Homework 5: Question 1

• what do (specific senses of) the following nouns have in common?
  – Umbrella
  – Saucepan
  – Baseball bat
  – Carpet beater

• which they do not share with:
  – Giraffe
  – Pretzel
  – Homework


• compound nouns are present in wnconnect
  – baseball bat is entered as baseball_bat (‘_’ from the Prolog representation)
  – carpet beater is entered as carpet_beater


• wnconnect is designed to look for links between two concepts
  – homework can be done this way
  – but perhaps better to explore with a WordNet browser, e.g. from Princeton

Last Time

• Internet search and language
  – information retrieval
    • precision: what is the proportion of hits returned that are relevant?
    • recall: what proportion of the true relevant answers are returned?
  – stemming
    • pre-processing stage: find root forms of words
    • to expand search
    • can increase recall (perhaps at the expense of precision)
    • compromise: selective use of stemming from Google
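The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the document identifiers and the function name `precision_recall` are made up for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned hits.

    precision = relevant hits returned / all hits returned
    recall    = relevant hits returned / all relevant answers
    """
    returned, relevant = set(returned), set(relevant)
    true_hits = returned & relevant
    precision = len(true_hits) / len(returned) if returned else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 4 hits returned, 3 documents truly relevant.
returned_hits = ["doc1", "doc2", "doc3", "doc4"]
relevant_docs = ["doc2", "doc4", "doc5"]

precision, recall = precision_recall(returned_hits, relevant_docs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.50, recall = 0.67
```

Returning more hits (for instance by stemming the query) can only keep recall the same or raise it, but precision may drop if the extra hits are irrelevant.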

Last Time

• Internet search and language
  – compounding
    • stemming interacts with compounds: operating systems
    • compound identification is important for information retrieval
    • semantics is more difficult

– can have compositional semantics: tea leaf, teabag, teabreak

– can be idiomatic: bootleg, marshmallow

• structural ambiguity: [computer furniture] design, computer [furniture design]
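As a rough illustration of how stemming interacts with compounds, the sketch below stems words one at a time and optionally protects known compounds. Everything here is an assumption made for the example: `naive_stem` is a crude suffix stripper (not the stemmer of any real search engine), the `COMPOUNDS` table is hand-written, and the sample document is invented.

```python
def naive_stem(word):
    """Crude suffix stripper, for illustration only."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical table of known compounds, stored Prolog-style with '_'.
COMPOUNDS = {("operating", "systems"): "operating_system"}


def index_terms(text, protect_compounds):
    """Turn text into index terms, optionally keeping compounds whole."""
    tokens = text.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if protect_compounds and pair in COMPOUNDS:
            terms.append(COMPOUNDS[pair])
            i += 2
        else:
            terms.append(naive_stem(tokens[i]))
            i += 1
    return terms


query = "operating systems"
document = "the surgeon is operating on nervous systems"  # irrelevant to the query

# Word-by-word stemming: every query term also occurs in the document.
print(set(index_terms(query, False)) <= set(index_terms(document, False)))  # True

# With compound identification the query becomes one term the document lacks.
print(index_terms(query, True)[0] in index_terms(document, True))  # False
```

The first match is a false hit that lowers precision; identifying the compound before stemming avoids it.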

Today’s Topic

• Statistical Machine Translation (SMT)

Beginnings

c. 1950 (just after WWII)
  – electronic computers invented for
    • numerical analysis
    • code breaking

• Book (Collection of Papers)
  • Readings in Machine Translation, Eds. Nirenburg, S. et al. MIT Press 2003. (Part 1: Historical Perspective)
    – Weaver, Reifler, Yngve, and Bar-Hillel …

Killer Apps: Language comprehension tasks and Machine Translation (MT)

Basis in Cryptanalysis?

• Success with computational methods and code-breaking

[Translation. Weaver, W.]
• citing Shannon’s work, Weaver asks:

• “If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?”

Statistical Basis

• Popular in the early days and has undergone a modern revival

The Present Status of Automatic Translation of Languages (Bar-Hillel, 1951)
  – “I believe this overestimation is a remnant of the time, seven or eight years ago, when many people thought that the statistical theory of communication would solve many, if not all, of the problems of communication”

• Bar-Hillel’s criticisms include
  – much valuable time spent on gathering statistics
  – no longer a bottleneck today

Statistical Basis

• Popular in the early days and has undergone a modern revival

Statistical Methods and Linguistics (Abney, 1996)

– Chomsky vs. Shannon
  • Statistics and low (zero) frequency items
  • Colorless green ideas sleep furiously vs. furiously sleep ideas green colorless
  • (lecture 22)
  • Modern answer: smoothing

• No relation between order of approximation and grammaticality
  – n-th order approximation reflecting degree of grammaticality as n increases

• Parameter estimation problem is intractable (for humans)
  – statistical models involve learning or estimating a very large number of parameters
  – “we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds”
  – see IBM translation reference later (17 million parameters)
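The "smoothing" answer to zero-frequency items mentioned above can be sketched with add-one (Laplace) smoothing, one of the simplest schemes. This is a minimal illustration, not the method of any paper cited here; the toy corpus and the function name `add_one_bigram_prob` are assumptions made for the example.

```python
from collections import Counter

# Invented toy corpus; a real model would be estimated from millions of words.
tokens = "the green ideas sleep and the ideas spread".split()

vocab = set(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])  # how often each word starts a bigram


def add_one_bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing over the toy vocabulary."""
    return (bigram_counts[(w1, w2)] + 1) / (history_counts[w1] + len(vocab))


seen = add_one_bigram_prob("green", "ideas")    # bigram occurs in the corpus
unseen = add_one_bigram_prob("sleep", "ideas")  # bigram never occurs

print(seen > unseen > 0)  # True: unseen sequences are unlikely, not impossible
```

Without smoothing, any sentence containing a single unseen bigram would receive probability zero, which is the low-frequency objection raised above.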

Early Misplaced Optimism

(Bar-Hillel, 1951)
• Reifler (University of Washington)
  – Unbelievably optimistic claims
  – Compounding:
    • “found moreover that only three matching procedures and four matching steps are necessary to deal effectively with any of these ten types of compounds of any language in which they occur”
    • (compounding: problems, see lecture 25)
    • [i.e. we have heuristics that we think work]
  – “it will not be very long before the remaining linguistic problems in machine translation will be solved for a number of important languages”

Early Misplaced Optimism

• [Wiener]
  – “Basic English is the reverse of mechanical and throws upon such words as get a burden which is much greater than most words carry”

• [Weaver]
  – Multiple meanings on get: yes
  – but a limited number of two-word combinations: get up, get over, get back
  – 2000 words => 4 million two-word combinations
  – not formidable to a “modern” (1947) computer

get is very polysemous: WordNet (Miller, 1981) lists 36 senses

Re-emergence of the Statistical Basis

• Conditions are different now
  – Computers 10^5 times faster
  – There has been a data revolution
    • Gigabytes of storage really cheap
    • Large, machine-readable corpora readily available for parameter estimation

Statistical MT

• Avoid the explicit construction of linguistically sophisticated models of grammar

– Not the only way: e.g. Example-based MT (EBMT)

• Pioneered by IBM researchers (Brown et al., 1990)
  – Language Model
    • Pr(S) estimated by n-grams
  – Translation Model
    • Pr(T|S) estimated through alignment models
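The two models above are usually combined through Bayes' rule: given an observed target sentence T, pick the source sentence S with the highest posterior probability. The derivation below is a sketch of that standard noisy-channel formulation; the symbol \hat{S} for the chosen sentence is introduced here for illustration.

```latex
\hat{S} \;=\; \arg\max_{S} \Pr(S \mid T)
        \;=\; \arg\max_{S} \frac{\Pr(S)\,\Pr(T \mid S)}{\Pr(T)}
        \;=\; \arg\max_{S} \Pr(S)\,\Pr(T \mid S)
```

The denominator Pr(T) drops out because it does not depend on S, leaving exactly the language model Pr(S) and the translation model Pr(T|S).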

N-grams

– we’ll talk about this more next time...
• idea:
  – collect statistics on co-occurrence of adjacent words
• Brown corpus (1 million words):

    word w     frequency(w)   probability(w)
    the        69,971         0.070
    rabbit     11             0.000011

• example:
  – Just then, the white
  – expectation is p(white rabbit) > p(white the)
  – but p(the) > p(rabbit)
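The figures above can be reproduced with a short Python sketch. The unigram counts are the Brown corpus numbers from the slide; the ten-word toy text used for the bigram estimate is invented for illustration, so only the direction of the comparison matters, not the values.

```python
from collections import Counter

BROWN_SIZE = 1_000_000  # approximate corpus size given above
brown_counts = {"the": 69_971, "rabbit": 11}


def unigram_probability(word):
    """Relative frequency of a single word in the Brown corpus."""
    return brown_counts[word] / BROWN_SIZE


def bigram_probabilities(tokens):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1) from adjacent words."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    return {(w1, w2): n / firsts[w1] for (w1, w2), n in bigrams.items()}


# On its own, "the" is far more likely than "rabbit".
print(unigram_probability("the") > unigram_probability("rabbit"))  # True

# After "white", the picture reverses (toy text, made up for this example).
toy_text = "just then the white rabbit saw the white rabbit run".split()
p = bigram_probabilities(toy_text)
print(p.get(("white", "rabbit"), 0.0) > p.get(("white", "the"), 0.0))  # True
```

The contrast is the point of the slide: a model that looks only at single-word frequencies would predict "the" after "white", while adjacent-word statistics prefer "rabbit".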

Statistical MT

• Parameter estimation by crunching large-scale corpora

• Hansard French/English parallel corpus
  – The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches.

• (IBM’s experiment: 100 million words, est. 17 million parameters)
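To give a feel for what estimating translation parameters from a parallel corpus involves, here is a minimal sketch in the style of the simplest word-alignment model (often called IBM Model 1), trained with expectation-maximization. It is not IBM's actual system: the three sentence pairs are invented, there is no NULL word, and the function name `train_word_translation` is hypothetical.

```python
from collections import defaultdict


def train_word_translation(pairs, iterations=20):
    """Estimate t[(f, e)], an approximation of Pr(f | e), by EM.

    pairs: list of (english_tokens, french_tokens) sentence pairs.
    """
    french_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for es, fs in pairs:
            for f in fs:
                # E-step: share one unit of f among the English words present.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / z
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Invented toy parallel corpus (English, French).
toy_pairs = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]

t = train_word_translation(toy_pairs)
for f in ("la", "maison", "fleur"):
    print(f"Pr({f} | the) ~ {t[(f, 'the')]:.3f}")
```

Each distinct (French word, English word) pair seen together is one parameter, which is why a 100-million-word corpus leads to parameter counts in the millions.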

The State of the Art

www.languageweaver.com
Statistical MT System [Spinoff from USC/ISI work]
• “Language Weaver’s SMTS system is a significant advancement in the state of the art for machine translation… and [we] are confident that Language Weaver has produced the most commercially viable Arabic translation system available today.”

• Metrics: performance determined by competition
  – common test and training data

[Timeline figure: 1960s (Russian), 1970s, 1980s (Japanese), present day (Arabic); also labeled: W. European languages]

http://www.languageweaver.com/

Real Progress or Not?

• (2003) MT Summit IX.
  – Proceedings available online
    • http://www.amtaweb.org/summit/MTSummit/
    • Interesting paper by J. Hutchins: Has machine translation improved? Some historical comparisons.

“… overall there has been definite progress since the mid 1960s and very probably since the early 1970s. What is more uncertain is whether and where there have been improvements since the early 1980s.”

– Compared modern day systems against systems from the 1960s, 1970s (e.g. SYSTRAN) and 1980s

  • Difficult: first systems are lost to us
  • Languages
    – Russian to English
    – French to English
    – German to English



• http://babelfish.altavista.com/


[Hutchins, pp.7-8]
• “The impediments to the improvement of translation quality are the same now that they have been from the beginning:
  – failures of disambiguation
  – incorrect selection of target language words
  – problems with anaphora
    • pronouns (it vs. she/he)
    • definite articles (e.g. when translating from Russian and French)
  – inappropriate retention of source language structures
    • e.g. verb-initial constructions (from Russian)
    • verb-final placements (from German)
    • non-English pre-nominal participle constructions (e.g. with interest to be read materials from both Russian and German)
  – problems of coordination
  – numerous and varied difficulties with prepositions
  – in general always problems with any multi-clause sentence”

Roughly echoes what Bar-Hillel said about 50 years earlier

Statistical vs. Traditional

• Which ones are commercially deployed?

  – internet translators: traditional
  – new languages: statistical
