What’s New in Statistical Machine Translation
Kevin Knight USC/Information Sciences Institute
USC/Computer Science Department
Machine Translation
美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Thousands of Languages Are Spoken
MANDARIN 885,000,000
SPANISH 332,000,000
ENGLISH 322,000,000
BENGALI 189,000,000
HINDI 182,000,000
PORTUGUESE 170,000,000
RUSSIAN 170,000,000
JAPANESE 125,000,000
GERMAN 98,000,000
WU (China) 77,175,000
JAVANESE 75,500,800
KOREAN 75,000,000
FRENCH 72,000,000
VIETNAMESE 67,662,000
TELUGU 66,350,000
YUE (China) 66,000,000
MARATHI 64,783,000
TAMIL 63,075,000
TURKISH 59,000,000
URDU 58,000,000
MIN NAN (China) 49,000,000
JINYU (China) 45,000,000
GUJARATI 44,000,000
POLISH 44,000,000
ARABIC 42,500,000
UKRAINIAN 41,000,000
ITALIAN 37,000,000
XIANG (China) 36,015,000
MALAYALAM 34,022,000
HAKKA (China) 34,000,000
KANNADA 33,663,000
ORIYA 31,000,000
PANJABI 30,000,000
SUNDA 27,000,000
Source: Ethnologue
Recent Progress in Statistical MT
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
2002 2003
news broadcast
foreign language speech recognition
English translation
searchable archive
2005
Warren Weaver (1947)
ingcmpnqsnwf cv fpn owoktvcv
hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...
Filling in guesses over the cipher text, one step at a time:
1. n = e
2. fpn = the
3. p = h and f = t wherever they occur
4. cv = of? This puts "fof" at the end of owoktvcv, so back up.
5. cv = is. This puts "sis" at the end of owoktvcv.
Result:
decipherment is the analysis of documents written in ancient languages ...
Warren Weaver (1947)
The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish …
Collected mechanically from a Turkish body of text, or corpus
Can this be computerized?
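A minimal sketch of how such a letter-pair table could be collected mechanically from a corpus. The function name and the sample string are assumptions made for illustration; any body of text in the language of interest would do.

```python
from collections import Counter


def letter_pair_frequencies(corpus: str) -> dict:
    """Relative frequency of each adjacent letter pair in the corpus.

    Only letters and spaces are kept, so punctuation does not create pairs.
    """
    letters = [ch for ch in corpus.lower() if ch.isalpha() or ch == " "]
    pairs = Counter(zip(letters, letters[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


# Hypothetical sample text; a real table would come from a large corpus.
table = letter_pair_frequencies("bir dil bir insan iki dil iki insan")
for pair, freq in sorted(table.items(), key=lambda kv: -kv[1])[:5]:
    print("".join(pair), round(freq, 3))
```

The same counts, collected over a much larger text, are the kind of table a decipherer can lean on when ranking candidate letter substitutions.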
“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
- Warren Weaver, March 1947
“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful.”
- Norbert Wiener, April 1947
Spanish/English corpus
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Translate: Clients do not sell pharmaceuticals in Europe.
Centauri/Arcturan [Knight 97]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
???
process of elimination
cognate?
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
zero fertility
“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
- Warren Weaver, March 1947
The required statistical tables have millions of entries…?
Too much for the computers of Weaver’s day.
Not enough RAM!
IBM Candide Project (1988-1994)
• How to get quantities of human translation in computer readable form?
– parallel corpus
Canadian bureaucrat
IBM’s John Cocke, inventor of CKY parsing & RISC processors
IBM Candide Project [Brown et al 93]
French → Broken English → English
French/English Bilingual Text → Statistical Analysis (French → Broken English)
English Text → Statistical Analysis (Broken English → English)
French: J’ ai si faim
Broken English: What hunger have I, Hungry I am so, I am so hungry, Have me that hunger …
English: I am so hungry
Mathematical Formulation
J’ ai si faim → I am so hungry
Translation Model P(f | e)
Language Model P(e)
Decoding algorithm: argmax_e P(e) · P(f | e)
Given source sentence f:
argmax_e P(e | f)
= argmax_e P(f | e) · P(e) / P(f)   (by Bayes Rule)
= argmax_e P(f | e) · P(e)   (P(f) same for all e)
French → Broken English → English
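A small sketch of the argmax above, scoring the candidate strings from the "J’ ai si faim" example. All probability values are hypothetical numbers invented for illustration; a real system would read them from trained language and translation models.

```python
import math

# Hypothetical P(e): how good each candidate is as English.
LM = {
    "what hunger have i": 1e-9,
    "hungry i am so": 1e-10,
    "i am so hungry": 1e-6,
    "have me that hunger": 1e-11,
}

# Hypothetical P(f | e) for f = "j' ai si faim".
TM = {
    "what hunger have i": 1e-4,
    "hungry i am so": 1e-4,
    "i am so hungry": 1e-4,
    "have me that hunger": 2e-4,
}


def noisy_channel_best(candidates, lm, tm):
    """Return argmax_e P(e) * P(f | e), computed in log space."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))


best = noisy_channel_best(list(LM), LM, TM)
print(best)
```

Dividing every score by the same P(f) would not change which candidate wins, which is why the denominator can be dropped.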
Language Modeling
Goal of a language model for MT:
He is on the soccer field / He is in the soccer field
Is table the on cup the / The cup is on the table
American shrine / American company
Need to make these decisions, because the translation model may not have a lot of context information!
The Classic Language Model: Word Bigrams
Process model of English: Generate each word based only on the previous word.
P(I saw water on the table) =
P(I | START) · P(saw | I) · P(water | saw) · P(on | water) · P(the | on) · P(table | the) · P(END | table)
Probabilities can be tabulated from an online English corpus … just like Weaver’s Turkish case.
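A minimal sketch of tabulating and applying the bigram model described above. It uses plain relative-frequency estimates with no smoothing, and the three-sentence corpus is a toy assumption; the function names are illustrative only.

```python
from collections import Counter

START, END = "<s>", "</s>"


def train_bigram(sentences):
    """Count bigrams and their left-hand contexts in a list of sentences."""
    bigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def sentence_prob(sentence, bigrams, contexts):
    """Product of P(word | previous word), including START and END."""
    words = [START] + sentence.split() + [END]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


# Toy corpus, assumed for illustration.
corpus = [
    "I saw water on the table",
    "I saw the cup on the table",
    "the cup is on the table",
]
bigrams, contexts = train_bigram(corpus)
print(sentence_prob("I saw water on the table", bigrams, contexts))
print(sentence_prob("is table the on cup the", bigrams, contexts))
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, so a practical model would reserve some probability mass for unseen events.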
Trigram Language Model
Bag of words: to the said royal purchase plan trustco part operations of its its is international expand banking
Candidate ordering: the banking trustco is said to expand its purchase part of its royal international plan operations
Candidate ordering: royal trustco said the purchase is part of its plan to expand its international banking operations
N-grams have a lot of semantics in them!
Bag of words: with the stressed relationship part own longstanding its its for chinese boeing , ,
Candidate ordering: for its part, stressed the longstanding relationship with its own, chinese boeing
Candidate ordering: boeing, for its part, stressed its own longstanding relationship with the chinese
[Soricut & Marcu, 05]
Translation Model?
Process model of translation:
Mary did not slap the green witch
Source-language morphological analysis
Source parse tree
Semantic representation
Generate target structure
Maria no dió una bofetada a la bruja verde
What are all the possible moves and what probability tables control those moves?
The Classic Translation Model: Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:
Mary did not slap the green witch
  n(3|slap), 50k entries
Mary not slap slap slap the green witch
  P-Null, 1 entry
Mary not slap slap slap NULL the green witch
  t(la|the), 25m entries
Maria no dió una bofetada a la verde bruja
  d(j|i), 2500 entries
Maria no dió una bofetada a la bruja verde
Trainable
Given only the sentence pair (intermediate steps unknown: ?):
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
Still trainable!
Classic Formula for P(f | e)
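$$P(f \mid e) = \sum_{a} \binom{m - \Phi_0}{\Phi_0} \cdot \text{P-Null}^{\,m - 2\Phi_0} \cdot (1 - \text{P-Null})^{\Phi_0} \cdot \prod_{i=0}^{l} \Phi_i! \cdot \frac{1}{\Phi_0!} \cdot \prod_{i=1}^{l} n(\Phi_i \mid e_i) \cdot \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \cdot \prod_{j : a_j \neq 0} d(j \mid a_j, l, m)$$
Labels: sum over alignment possibilities (the sum over a); NULL stuff (the P-Null factors); fertility (n); word translation (t); re-ordering (d)
Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.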
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
All P(french-word | english-word) equally likely
“la” and “the” observed to co-occur frequently,so P(la | the) is increased.
“maison” co-occurs with both “the” and “house”, but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of “la” (pigeonhole principle)
settling down after another iteration
Inherent hidden structure revealed by EM training!
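A minimal sketch of the EM loop on the three sentence pairs above. It estimates only word-translation probabilities t(f | e), in the style of IBM Model 1 with no NULL word, fertility, or re-ordering; that simplification, and the function name, are assumptions made to keep the example short.

```python
from collections import defaultdict


def train_word_translation(pairs, iterations=10):
    """EM estimate of t[(f, e)] = P(f | e) from sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # all P(f | e) start equally likely
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: split each French word's count over the English words
        # in the same sentence, in proportion to the current t values.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the fractional counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_word_translation(corpus, iterations=30)
for f, e in [("la", "the"), ("maison", "house"), ("maison", "the")]:
    print(f, e, round(t[(f, e)], 3))
```

Nothing in the code is told which words align; the preference for la/the and maison/house comes only from repeated re-estimation over co-occurrence counts.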
• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
Translation Model
Sample Translation Probabilities [Brown et al 93]
e          f              P(f | e)
national   nationale      0.47
           national       0.42
           nationaux      0.05
           nationales     0.03
the        le             0.50
           la             0.21
           les            0.16
           l’             0.09
           ce             0.02
           cette          0.01
farmers    agriculteurs   0.44
           les            0.42
           cultivateurs   0.05
           producteurs    0.02
new French sentence f, potential translation e
Translation Model (table above) gives P(f | e)
Language Model gives P(e)
w1     w2        P(w2 | w1)
of     the       0.13
       a         0.09
       another   0.01
       some      0.01
hong   kong      0.98
       said      0.01
       stated    0.01
P(f | e) · P(e) = score for e
Search for Best Translation
voulez – vous vous taire !
Successive hypotheses:
you – you you quiet !
you – you quiet !
quiet you – you you !
shut you – you you !
you shut !
you shut up !
Classic Decoding Algorithm
Given f, find the English string e that maximizes P(e) · P(f | e)
NP-Complete [Knight 99].
Brown et al 93: “In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”
Beam Search Decoding [Brown et al US Patent #5,477,451]
start → 1st English word → 2nd English word → 3rd English word → 4th English word → end (all source words covered)
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
best predecessor link
[Jelinek 69; Och, Ueffing, and Ney, 01]
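A simplified sketch of a beam-search decoder in the spirit of the slide. It is not the patented algorithm: it assumes one English word per source word, a bigram rather than trigram language model (so only the last English word is stored), no hypothesis recombination, and toy probability tables invented for illustration.

```python
import math
from collections import namedtuple

# score: log-probability so far; coverage: set of covered source positions;
# last_word: last English word chosen; back: best predecessor link.
Hyp = namedtuple("Hyp", "score coverage last_word back")

UNSEEN = 1e-6  # assumed floor probability for bigrams missing from the table


def beam_decode(source, t_table, lm, beam_size=5):
    """t_table[f] maps e -> P(f | e); lm[(prev, cur)] is P(cur | prev)."""
    n = len(source)
    stacks = [[] for _ in range(n + 1)]  # stacks[k]: k source words covered
    stacks[0].append(Hyp(0.0, frozenset(), "<s>", None))
    for k in range(n):
        stacks[k].sort(key=lambda h: h.score, reverse=True)
        for hyp in stacks[k][:beam_size]:  # prune to the beam
            for j in range(n):
                if j in hyp.coverage:
                    continue
                for e, p_fe in t_table[source[j]].items():
                    p_lm = lm.get((hyp.last_word, e), UNSEEN)
                    score = hyp.score + math.log(p_fe) + math.log(p_lm)
                    stacks[k + 1].append(Hyp(score, hyp.coverage | {j}, e, hyp))
    # Close each complete hypothesis with the end-of-sentence bigram.
    finals = [
        Hyp(h.score + math.log(lm.get((h.last_word, "</s>"), UNSEEN)),
            h.coverage, "</s>", h)
        for h in stacks[n]
    ]
    best = max(finals, key=lambda h: h.score)
    words = []
    h = best.back
    while h.back is not None:  # follow predecessor links back to the start
        words.append(h.last_word)
        h = h.back
    return " ".join(reversed(words))


# Toy tables, assumed for illustration.
t_table = {
    "la": {"the": 0.7},
    "maison": {"house": 0.8, "home": 0.2},
    "bleue": {"blue": 0.9},
}
lm = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.3, ("the", "house"): 0.3,
    ("blue", "house"): 0.4, ("blue", "home"): 0.1, ("house", "blue"): 0.01,
    ("house", "</s>"): 0.5, ("home", "</s>"): 0.3, ("blue", "</s>"): 0.05,
}
result = beam_decode(["la", "maison", "bleue"], t_table, lm)
print(result)
```

Grouping hypotheses by the number of covered source words keeps the comparison inside each stack fair, since every hypothesis in a stack has paid for the same number of words.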
Classic Results
• nous avons signé le protocole . (Foreign Original)
• we did sign the memorandum of agreement . (Human Translation)
• we have signed the protocol . (MT)
• où était le plan solide ? (Foreign Original)
• but where was the solid plan ? (Human Translation)
• where was the economic base ? (MT)
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
very slow = one page per day
Okay!
I know, so far, this talk should be called …
What’s Old in Statistical Machine Translation!!
Further Developments
• Follow-on projects
– Hong Kong
– Aachen
– Behavior Design Corporation
• JHU Summer Workshop 1999
– Build & distribute statistical MT tools
– Create standard training & testing data
– Disseminate tutorial material
– “MT in a Day”
– Ask new questions
How Much Data Do We Need?
[Chart: quality of automatically trained machine translation system, plotted against amount of bilingual training data]
Advances in Statistical MT2000-2004
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side), 0 to 180, by year, 1994 to 2004, for Chinese/English, Arabic/English, and French/English]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
+ European parliament data [Koehn 05]
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU Evaluation Metric (Papineni et al 02)
• N-gram precision (score is between 0 & 1)
What percentage of machine n-grams can be found in the reference translation?
Gross measure over 1000 test sentences.
Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
Brevity penalty: can’t just type out single word “the” (and get precision 1.0)
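A compact sketch of the two ideas above: clipped n-gram precision and the brevity penalty. It assumes one reference per sentence and whitespace tokenization, which is a simplification of the full metric; the function names are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(words, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with a single reference per candidate."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            c_counts, r_counts = ngram_counts(c, n), ngram_counts(r, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matched[n - 1] += sum(min(k, r_counts[g]) for g, k in c_counts.items())
            total[n - 1] += sum(c_counts.values())
    if min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: candidates shorter than the reference are penalised.
    penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return penalty * math.exp(log_precision)


reference = "the gunman was shot to death by the police ."
print(bleu(["the gunman was shot dead by the police ."], [reference]))
print(bleu(["the the the the the"], [reference], max_n=1))
print(bleu(["the"], [reference], max_n=1))
```

With clipping, repeating "the" five times is credited only twice, and the single word "the" is cut down by the brevity penalty even though its unigram precision is 1.0.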
BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
the gunman was police kill . #1
wounded police jaya of #2
the gunman was shot dead by the police . #3
the gunman arrested by police kill . #4
the gunmen were killed . #5
the gunman was shot to death by the police . #6
gunmen were killed by police ?SUB>0 ?SUB>0 #7
al by the police . #8
the ringer is killed by the police . #9
police killed the gunman . #10
green = 4-gram match (good!)
red = word not matched (bad!)
BLEU Tends to Predict Human Judgments
[Scatter plot: BLEU Score against Human Judgments, both axes from -2.5 to 2.5, with linear fits for Adequacy and Fluency; R2 values shown: 88.0% and 90.2%]
slide from G. Doddington (NIST)
Experiment-Driven Progress
[Chart: BLEU, 15 to 35, from Mar 1 to May 1, 2005; ISI Syntax-Based MT, Chinese/English, NIST 2002 Test Set]
Evaluate new MT research ideas every day! (and be alerted about bugs…)
Draw Learning Curves
[Chart: BLEU score, 0 to 0.35, against # of sentence pairs used in training, 10k to 320k, for Swedish/English, French/English, German/English, and Finnish/English]
Experiments by Philipp Koehn
Flaws of Word-Based MT
• Can’t translate multiple English words to one French word
• Can’t translate phrases
– “real estate”, “note that”, “interest in”
• Isn’t sensitive to syntax
– Adjectives/nouns should swap order
– Verb comes at the beginning in Arabic
• Doesn’t understand the meaning (?)
The MT Triangle
SOURCE TARGET
words words
syntax syntax
logical form
interlingua
logical form
The MT Swimming Pool
syntax syntax
interlingua
words words
logical form logical form
SOURCE TARGET
Commercial Rule-Based Systems
words words
syntax syntax
logical form
interlingua
logical form
Knight et al 95 - meaning-based translation - composition rules
Language Model
SOURCE TARGET
words words
syntax syntax
logical form
interlingua
logical form
Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text
Yamada/Knight (01,02) - tree/string model - used existing target language parser
Well, these all seem like good ideas.
Which one had the most dramatic effect on MT quality?
None of them!
Phrases
How do you translate “real estate” into French?
real estate, real number, dance number, dance card, memory card, memory stick, …
[Diagram: the MT triangle with a “phrases” level added on both the SOURCE and TARGET sides, alongside words, syntax, logical form, and interlingua.]
Phrase-Based Statistical MT
• Foreign input segmented into phrases
– “phrase” just means “word sequence”
• Each phrase is probabilistically translated into English
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.
Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada
How to Learn the Phrase Translation Table?
• One method: “alignment templates” [Och et al 99]
• Start with word alignment
• Collect all phrase pairs that are consistent with the word alignment
[Figure: word-alignment grid. Rows: Mary did not slap the green witch. Columns: Maria no dió una bofetada a la bruja verde.]
Word Alignment Induced Phrases
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
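Collecting the phrase pairs that are consistent with a word alignment can be sketched as below. This is a brute-force illustration, not the alignment-template implementation cited on the slides. The alignment links are an assumption reconstructed from the phrase pairs listed above ("did" and "not" both linked to "no", "a" left unaligned), and the function name is made up.

```python
def extract_phrase_pairs(f_words, e_words, alignment, max_len=9):
    """Collect all phrase pairs consistent with a word alignment.

    alignment is a set of (f_index, e_index) links. A pair of spans is
    consistent when it contains at least one link and no link connects a
    word inside one span to a word outside the other.
    """
    pairs = []
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            for e1 in range(len(e_words)):
                for e2 in range(e1, min(e1 + max_len, len(e_words))):
                    # Links that touch either span.
                    links = [(f, e) for f, e in alignment
                             if f1 <= f <= f2 or e1 <= e <= e2]
                    # Links that lie entirely inside both spans.
                    inside = [(f, e) for f, e in links
                              if f1 <= f <= f2 and e1 <= e <= e2]
                    if inside and len(inside) == len(links):
                        pairs.append((" ".join(f_words[f1:f2 + 1]),
                                      " ".join(e_words[e1:e2 + 1])))
    return pairs

spanish = "Maria no dió una bofetada a la bruja verde".split()
english = "Mary did not slap the green witch".split()
# Assumed links (Spanish index, English index); "a" is unaligned.
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
         (6, 4), (7, 6), (8, 5)}

phrase_pairs = extract_phrase_pairs(spanish, english, links)
for pair in phrase_pairs:
    print(pair)
```

Because "a" has no link, it can attach to a neighboring phrase on either side, which is why both (a la, the) and (dió una bofetada a, slap) come out. A pair such as (no, did) is rejected because "no" is also linked to "not", which lies outside the English span.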
Phrase Pair Probabilities
• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
• No EM training
• Just relative frequency:
$$P(\text{f-f-f} \mid \text{e-e-e}) = \frac{\mathrm{count}(\text{f-f-f},\ \text{e-e-e})}{\mathrm{count}(\text{e-e-e})}$$
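The relative-frequency estimate is simple enough to sketch directly. The extraction counts below are invented for illustration, and `phrase_table` is a hypothetical helper name, not part of any toolkit mentioned on the slides.

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Estimate P(f | e) = count(f, e) / count(e) from extracted pairs."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical extraction counts, invented for illustration only.
extracted = ([("zur Konferenz", "to the conference")] * 3
             + [("zu der Konferenz", "to the conference")] * 1
             + [("zur Konferenz", "into the meeting")] * 1)

table = phrase_table(extracted)
for (f, e), prob in sorted(table.items()):
    print(f"P({f} | {e}) = {prob:.2f}")
```

No iterative training is involved: one pass over the extracted pairs yields the whole table.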
Phrase-Based MT
• This is currently the best way to do Statistical MT!
• What took so long to move from words to phrases?
– Missing RAM
• 25m parameters → billions of parameters
• Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
• Now solved in commercial deployments
– Missing computing power
– Many competing ideas to shake out
• Koehn 03 summarizes several variations
– Empirical effectiveness even better than intuition would predict
• This is not building a ladder to the moon!
– If you can’t translate “real estate” into French, you are sunk
Advanced Training Methods
$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_e P(e) \times P(f \mid e)$$
$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \quad \text{… works better!}$$
$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1}$$
Rewards longer hypotheses, since these are unfairly punished by P(e)
$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1} \times \mathrm{FEAT}^{3.7} \cdots$$
Lots of features vote on every potential translation.
Exponential model.
Problem: How to set the exponent weights?
IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system
[Figure: plot with WTM fixed at 1.0; annotated points at 20.64% BLEU and 17.96% BLEU. Plot by Emil Ettelaie.]
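The exponential model, in which weighted features vote on each candidate translation, can be sketched in log space. The weights 2.4, 1.0, and 1.1 echo the exponents on the slides; the candidate probabilities are invented, and the function and feature names are assumptions made for this sketch.

```python
import math

def loglinear_score(features, weights):
    """Log of the product of feature values raised to their weights."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

# Exponents as on the slides: language model 2.4, translation model 1.0,
# length 1.1.
weights = {"lm": 2.4, "tm": 1.0, "length": 1.1}

# Hypothetical candidates with made-up model probabilities.
candidates = {
    "the gunman was killed by police .": {"lm": 1e-8, "tm": 1e-6, "length": 7},
    "gunman killed police .": {"lm": 1e-8, "tm": 1e-8, "length": 4},
}

best = max(candidates, key=lambda e: loglinear_score(candidates[e], weights))
print(best)
```

Working with sums of weighted logs avoids underflow from multiplying many small probabilities, and it makes adding another feature a one-line change to the two dictionaries.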
Maximum BLEU Training
• Novel algorithm developed by [Och 03]
• Opened gates to “feature hacking”
– Word-based feature to smooth phrase pair counts (“Model1 Inverse”)
– Phrase-specific propensities to re-order
• Currently limited to ~25 features
Advances in Statistical MT, 2005
Google’s Language Model
• Previously, largest language model was trained on 1b words of English
• 20b words of news → significant impact on news translation
• 200b words of web → helpful
Maryland’s Hiero system [Chiang 05]
• Previously:
– ne mange pas → does not eat
• New phrase pairs with variables and reordering
– ne X pas → does not X
– le X1 du X2 → X2 's X1
• Nesting
– “does not X” itself becomes an X
• CKY decoder [photo: John Cocke]
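How phrase pairs with variables nest can be shown with a toy rule matcher. This is not Hiero's CKY decoder and carries no probabilities: it is a greedy recursive matcher over a hypothetical three-rule set, in which the "il X" and "mange" rules are invented to complete the example.

```python
import re

# Hypothetical toy rule set: (source pattern, target pattern); X marks a gap.
RULES = [
    ("ne X pas", "does not X"),
    ("il X", "he X"),
    ("mange", "eat"),
]

def translate(source):
    """Translate a source string by recursively applying the first matching rule."""
    for src, tgt in RULES:
        # Turn the rule's source side into a regex, with X as a capture group.
        pattern = "^" + re.escape(src).replace("X", "(.+)") + "$"
        match = re.match(pattern, source)
        if match:
            if match.groups():
                # Translate the gap's contents, then plug them into the target.
                return tgt.replace("X", translate(match.group(1)))
            return tgt
    return source  # no rule applies: pass the words through unchanged

print(translate("ne mange pas"))
print(translate("il ne mange pas"))
```

In the second call, the output of the "ne X pas" rule fills the gap of the "il X" rule, which is the nesting described above. A real decoder would consider all rule combinations with dynamic programming rather than commit to the first match.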
ISI’s Syntax-Based MT System
• First strong showing for an SMT system that knows what nouns and verbs are!
• Why syntax? “Frequent high-tech exports are bright spots for foreign trade growth of Guangdong has made important contributions.”
– Need much more grammatical output
– Need accurate control over re-ordering
– Need accurate insertion of function words
String Output
Input: 枪手 被 警方 击毙 .
Output: The gunman killed by police .
Tree Output
Input: 枪手 被 警方 击毙 .
(1) The gunman killed by police . (tags: DT NN VBD IN NN)
(2) Gunman by police shot . (tags: NN IN NN VBD)
(3) The gunman was killed by police . (tags: DT NN AUX VBN IN NN)
[Diagram: each output is drawn as a parse tree over its tags, built from NPB, PP, NP-C, and VP nodes under an S root.]
Sample Rules Learned from Data
English tree fragment: VP(VB(said), SBAR(IN(that), x0:S))
"说" "," x0 0.57
"说" x0 0.09
"他" "说" "," x0 0.02
"指出" "," x0 0.02
x0 0.02
English tree fragment: NP(x0:NP, PP(IN(from), x1:NP))
x1 x0 0.27
"来自" x1 x0 0.15
x1 "的" x0 0.06
"从" x1 x0 0.06
"来自" x1 "的" x0 0.06
Sample Rules Learned from Data
English tree fragment: S(x0:NP, VP(x1:VB, x2:NP))
(Chinese/English)
x0 x1 x2 0.82
x0 x1 "," x2 0.02
(Arabic/English)
x0 x1 x2 0.54
x1 x0 x2 0.44 (subject-verb inversion)
Format is Expressive
Multilevel Re-Ordering: S(x0:NP, VP(x1:VB, x2:NP2)) → x1, x0, x2
Non-constituent Phrases: S(PRO(there), VP(VB(are), x0:NP)) → hay, x0
Lexicalized Re-Ordering: NP(x0:NP, PP(P(of), x1:NP)) → x1, , x0
Phrasal Translation: VP(VBZ(is), VBG(singing)) → está, cantando
Non-contiguous Phrases: VP(VB(put), x0:NP, PRT(on)) → poner, x0
Context-Sensitive Word Insertion: NPB(DT(the), x0:NNS) → x0
[Knight & Graehl, 2005]
Story Gets More Interesting…
Linguistic Theory: Transformational Grammar (Chomsky 57)
Automata Theory: Tree Transducers (Rounds 70)
Algorithms: Efficient Transducer Algorithms, Generic Tree Toolkits
Applications: MT (05), Compression (01), QA (03), Generation (00)
Summary
• Making good progress
• Algorithms + Data + Evaluation + Computers
• Interdisciplinary work
– Natural language processing
– Machine learning
– Linguistics
– Automata theory
• Hope that more people will join!
Thank you
Syntax-Based vs Phrase-Based
[Figure: BLEU (y-axis, 15 to 35) plotted over time from Mar 1 to May 1, 2005, with the phrase-based system marked for comparison. Chinese/English, NIST 2002 Test Set.]
Future PhD Theses?
“Syntax-based Language Models for Improving Statistical MT”
“Discriminative Training of Millions of Features for MT”
“Semantic Representations Induced from Multilingual EU and UN Data”
“What Makes One Language Pair More Difficult to Translate Than Another”
“A State-of-the-Art MT System Based on Syntactic Transformations”
“New Training Methods for High-Quality Word Alignment”
+ many unpredictable ones…
Summary
• Phrase-based models are state-of-the-art
– Word alignments
– Phrase pair extraction & probabilities
– N-gram language models
– Beam search decoding
– Feature functions & learning weights
• But the output is not English
– Fluency must be improved
– Better translation of person names, organizations, locations
– More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
– Need good accuracy across a variety of domains/languages
Available Resources
• Bilingual corpora
– 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
– Lots of French/English, Spanish/French/English, LDC
– European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
– 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
– Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
– Xiaoyi Ma, LDC (Champollion)
• Word alignment
– GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
– GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
– Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
– Shared task, NAACL-HLT’03 workshop
• Decoding
– ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
– ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)
Some Papers Referenced on Slides
• ACL
– [Och, Tillmann, & Ney, 1999]
– [Och & Ney, 2000]
– [Germann et al, 2001]
– [Yamada & Knight, 2001, 2002]
– [Papineni et al, 2002]
– [Alshawi et al, 1998]
– [Collins, 1997]
– [Koehn & Knight, 2003]
– [Al-Onaizan & Knight, 2002]
– [Och & Ney, 2002]
– [Och, 2003]
– [Koehn et al, 2003]
• EMNLP
– [Marcu & Wong, 2002]
– [Fox, 2002]
– [Munteanu & Marcu, 2002]
• AI Magazine
– [Knight, 1997]
• www.isi.edu/~knight
– [MT Tutorial Workbook]
• AMTA
– [Soricut et al, 2002]
– [Al-Onaizan & Knight, 1998]
• EACL
– [Cmejrek et al, 2003]
• Computational Linguistics
– [Brown et al, 1993]
– [Knight, 1999]
– [Wu, 1997]
• AAAI
– [Koehn & Knight, 2000]
• IWNLG
– [Habash, 2002]
• MT Summit
– [Charniak, Knight, Yamada, 2003]
• NAACL
– [Koehn, Marcu, Och, 2003]
– [Germann, 2003]
– [Graehl & Knight, 2004]
– [Galley, Hopkins, Knight, Marcu, 2004]
Ready-to-Use Online Bilingual Data
[Figure: chart of millions of words (English side) of ready-to-use bilingual data by year, 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)
+ 1m-20m words for many language pairs
One Billion?
???
From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
– Suppose one billion words of parallel data were sufficient
– At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
– De-formatting
– Remove strange characters
– Character code conversion
– Document alignment
– Sentence alignment
– Tokenization (also called Segmentation)
Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
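The last step, turning an alignment into training pairs, can be sketched as follows. The sketch assumes the alignment has already been found (that is the job of the sentence-alignment tools) and is given as a list of "beads", each pairing a group of source sentence indices with a group of target indices. The bead format and the function name are assumptions made for this sketch.

```python
def build_sentence_pairs(src_sents, tgt_sents, beads):
    """Turn alignment beads into sentence pairs.

    Each bead is (src_indices, tgt_indices). Beads with an empty side are
    unaligned sentences and are dropped; n-to-m beads are merged by
    joining their sentences with a space.
    """
    pairs = []
    for src_idx, tgt_idx in beads:
        if not src_idx or not tgt_idx:
            continue  # unaligned sentence: throw it out
        pairs.append((" ".join(src_sents[i] for i in src_idx),
                      " ".join(tgt_sents[j] for j in tgt_idx)))
    return pairs

english = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The fish are jumping.",
           "The sharks await."]
spanish = ["El viejo está feliz porque ha pescado muchas veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]
# Alignment from the example: a 2-to-1 bead, a 1-to-1 bead,
# one unaligned English sentence, and a final 1-to-1 bead.
beads = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]

sentence_pairs = build_sentence_pairs(english, spanish, beads)
for en, es in sentence_pairs:
    print(en, "|||", es)
```

Five English and three Spanish sentences become three sentence pairs; "The fish are jumping." has no Spanish counterpart and disappears from the training data.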
Tokenization (or Segmentation)
• English
– Input (some byte stream): "There," said Bob.
– Output (7 “tokens” or “words”): " There , " said Bob .
• Chinese
– Input (byte stream): 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。
– Output: 美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯 富 商拉登 等发 出 的 电子邮件。
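For English, the tokenization step can be approximated with a single regular expression. This is a minimal sketch, not the tokenizer used for the systems on these slides. It would not work for Chinese, where words are not separated by spaces and a dedicated segmenter is needed.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize('"There," said Bob.')
print(len(tokens), " ".join(tokens))
```

The pattern takes either a run of word characters or one non-space, non-word character, so each quotation mark, comma, and period becomes its own token.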
Lower-Casing
• English
– Input (7 words): " There , " said Bob .
– Output (7 words): " there , " said bob .
Idea of tokenizing and lower-casing:
The, the, “The, “the → the
Smaller vocabulary size. More robust counting and learning.
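The effect on vocabulary size and counts can be seen on a small example. The sample text is invented for illustration, and `count_tokens` is a hypothetical helper that reuses a simple regular-expression tokenizer.

```python
import re
from collections import Counter

def count_tokens(text, lowercase):
    """Tokenize text and count each token, optionally lower-casing first."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return Counter(tokens)

# Invented sample text.
text = 'The dog saw the cat. "The cat ran," said the dog.'
raw = count_tokens(text, lowercase=False)
lowered = count_tokens(text, lowercase=True)

print(len(raw), raw["The"], raw["the"])
print(len(lowered), lowered["the"])
```

Without lower-casing, the evidence for "the" is split across two vocabulary entries. After lower-casing, one entry carries all four occurrences.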
Recent Progress in Statistical MT
• Why is that?
– Better algorithms that learn patterns from data
– More data
– Faster, cheaper computers with more RAM
– Community-wide test sets
– Novel automated evaluation methods
– Shared software tools
Three Problems for Statistical MT
• Translation model
– Given a pair of strings <f,e>, assigns P(f | e) by formula
– <f,e> look like translations → high P(f | e)
– <f,e> don’t look like translations → low P(f | e)
• Language model
– Given an English string e, assigns P(e) by formula
– good English string → high P(e)
– random word sequence → low P(e)
• Decoding algorithm
– Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) · P(f | e)
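How the three pieces fit together can be sketched with lookup tables standing in for the two models. All probabilities below are invented, and the decoder simply enumerates a handful of candidates. A real decoder has to search an enormous space of possible translations, for example with beam search.

```python
def decode(f, candidates, lm, tm):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

f = "Maria no dió una bofetada a la bruja verde"

# Hypothetical language model P(e): fluent English scores higher.
lm = {"Mary did not slap the green witch": 1e-7,
      "Mary not slap the witch green": 1e-10}

# Hypothetical translation model P(f | e): the word-for-word rendering
# looks slightly more like a translation of f.
tm = {(f, "Mary did not slap the green witch"): 1e-4,
      (f, "Mary not slap the witch green"): 3e-4}

best = decode(f, list(lm), lm, tm)
print(best)
```

The translation model alone would prefer the word-for-word candidate, but the language model's preference for fluent English outweighs it in the product.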
Web Language Models
French input → ?
– She has a lot of nerve. [20]
– It has a lot of nerve. [3]
Used by Google in 2005 to increase performance of their research MT system!
[Soricut, Knight, Marcu, 02]