what’s new in statistical machine translation

140
What’s New in Statistical Machine Translation Kevin Knight USC/Information Sciences Institute USC/Computer Science Department

Upload: miller

Post on 13-Jan-2016

70 views

Category:

Documents


0 download

DESCRIPTION

What’s New in Statistical Machine Translation. USC/Information Sciences Institute USC/Computer Science Department. Kevin Knight. Machine Translation. 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: What’s New in  Statistical Machine Translation

What’s New in Statistical Machine Translation

Kevin Knight USC/Information Sciences Institute

USC/Computer Science Department

Page 2: What’s New in  Statistical Machine Translation

Machine Translation

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。

The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

Page 3: What’s New in  Statistical Machine Translation

Thousands of Languages Are SpokenMANDARIN 885,000,000SPANISH 332,000,000ENGLISH 322,000,000BENGALI 189,000,000

HINDI 182,000,000PORTUGUESE 170,000,000RUSSIAN 170,000,000JAPANESE 125,000,000GERMAN 98,000,000

WU (China) 77,175,000JAVANESE 75,500,800KOREAN 75,000,000FRENCH 72,000,000VIETNAMESE 67,662,000

TELUGU 66,350,000YUE (China) 66,000,000MARATHI 64,783,000TAMIL 63,075,000

TURKISH 59,000,000URDU 58,000,000MIN NAN (China) 49,000,000JINYU (China) 45,000,000

GUJARATI 44,000,000POLISH 44,000,000ARABIC 42,500,000UKRAINIAN 41,000,000

ITALIAN 37,000,000XIANG (China) 36,015,000MALAYALAM 34,022,000HAKKA (China) 34,000,000

KANNADA 33,663,000ORIYA 31,000,000PANJABI 30,000,000SUNDA 27,000,000

Source: Ethnologue

Page 4: What’s New in  Statistical Machine Translation

Recent Progress in Statistical MT

insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

20022002 20032003

Page 5: What’s New in  Statistical Machine Translation

newsbroadcast

foreignlanguagespeechrecognition

Englishtranslation

searchablearchive

20052005

Page 6: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

ingcmpnqsnwf cv fpn owoktvcv

hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...

Page 7: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e e e e ingcmpnqsnwf cv fpn owoktvcv e e ehu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Page 8: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e e e the ingcmpnqsnwf cv fpn owoktvcv e e ehu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Page 9: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e he e the ingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Page 10: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e he e of the ingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Page 11: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e he e of the fofingcmpnqsnwf cv fpn owoktvcv e f o e o oe thu ihgzsnwfv rqcffnw cw owgcnwf efkowazoanv ...

Page 12: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e he e of theingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Page 13: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

e he e is the sisingcmpnqsnwf cv fpn owoktvcv e s i e i ie thu ihgzsnwfv rqcffnw cw owgcnwf eskowazoanv ...

Page 14: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

decipherment is the analysisingcmpnqsnwf cv fpn owoktvcvof documents written in ancienthu ihgzsnwfv rqcffnw cw owgcnwflanguages ...kowazoanv ...

Page 15: What’s New in  Statistical Machine Translation

Warren Weaver (1947)

The non-Turkish guy next to me is even deciphering Turkish! All he needs is a

statistical table of letter-pair frequencies in Turkish …

Collected mechanically from aTurkish body of text, or corpus

Can this becomputerized?

Page 16: What’s New in  Statistical Machine Translation

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

Page 17: What’s New in  Statistical Machine Translation

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful.”

- Norbert Wiener, April 1947

Page 18: What’s New in  Statistical Machine Translation

Spanish/English corpus

1a. Garcia and associates .1b. Garcia y asociados .

7a. the clients and the associates are enemies .7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates .2b. Carlos Garcia tiene tres asociados .

8a. the company has three groups .8b. la empresa tiene tres grupos .

3a. his associates are not strong .3b. sus asociados no son fuertes .

9a. its groups are in Europe .9b. sus grupos estan en Europa .

4a. Garcia has a company also .4b. Garcia tambien tiene una empresa .

10a. the modern groups sell strong pharmaceuticals .10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry .5b. sus clientes estan enfadados .

11a. the groups do not sell zenzanine .11b. los grupos no venden zanzanina .

6a. the associates are also angry .6b. los asociados tambien estan enfadados .

12a. the small groups are not modern .12b. los grupos pequenos no son modernos . 

Page 19: What’s New in  Statistical Machine Translation

Spanish/English corpus

1a. Garcia and associates .1b. Garcia y asociados .

7a. the clients and the associates are enemies .7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates .2b. Carlos Garcia tiene tres asociados .

8a. the company has three groups .8b. la empresa tiene tres grupos .

3a. his associates are not strong .3b. sus asociados no son fuertes .

9a. its groups are in Europe .9b. sus grupos estan en Europa .

4a. Garcia has a company also .4b. Garcia tambien tiene una empresa .

10a. the modern groups sell strong pharmaceuticals .10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry .5b. sus clientes estan enfadados .

11a. the groups do not sell zenzanine .11b. los grupos no venden zanzanina .

6a. the associates are also angry .6b. los asociados tambien estan enfadados .

12a. the small groups are not modern .12b. los grupos pequenos no son modernos . 

Translate: Clients do not sell pharmaceuticals in Europe.

Page 20: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 21: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 22: What’s New in  Statistical Machine Translation

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Page 23: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Page 24: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 25: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 26: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 27: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Page 28: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 29: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

process ofelimination

Page 30: What’s New in  Statistical Machine Translation

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

cognate?

Page 31: What’s New in  Statistical Machine Translation

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

zerofertility

Page 32: What’s New in  Statistical Machine Translation

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

The required statistical tables have millions of entries…?

Too much for the computers of Weaver’s day.

Not enough RAM!

Page 33: What’s New in  Statistical Machine Translation

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

Canadianbureaucrat

IBM’s John Cocke,inventor of CKY parsing & RISC processors

Page 34: What’s New in  Statistical Machine Translation

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

Canadianbureaucrat

IBM’s John Cocke,inventor of CKY parsing & RISC processors

Page 35: What’s New in  Statistical Machine Translation

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

IBM’s John Cocke,inventor of CKY parsing & RISC processors

Page 36: What’s New in  Statistical Machine Translation

IBM Candide Project[Brown et al 93]

French BrokenEnglish

English

French/EnglishBilingual Text

EnglishText

Statistical Analysis Statistical Analysis

J’ ai si faim

What hunger have I,Hungry I am so,I am so hungry,Have me that hunger …

I am so hungry

Page 37: What’s New in  Statistical Machine Translation

Mathematical Formulation

J’ ai si faim I am so hungry

TranslationModel P(f | e)

LanguageModel P(e)

Decoding algorithm

argmaxe P(e) · P(f | e)

Given source sentence f:

argmaxe P(e | f) =

argmaxe P(f | e) · P(e) / P(f) = by Bayes Rule

argmaxe P(f | e) · P(e) P(f) same for all e

French BrokenEnglish

English

Page 38: What’s New in  Statistical Machine Translation

Language Modeling

Goal of a language model for MT:

He is on the soccer fieldHe is in the soccer field

Is table the on cup theThe cup is on the table

American shrineAmerican company

Need to make thesedecisions, becausetranslation model maynot have a lot of context information!

Page 39: What’s New in  Statistical Machine Translation

The Classic Language ModelWord Bigrams

Process model of English: Generate each word based only on the previous word.

P(I saw water on the table) =

P(I | START) ·P(saw | I) ·P(water | saw) ·P(on | water) ·P(the | on) ·P(table | the) ·P(END | table)

Probabilities can be tabulatedfrom an online English corpus …just like Weaver’s Turkish case.

Page 40: What’s New in  Statistical Machine Translation

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking

[Soricut & Marcu, 05]

Page 41: What’s New in  Statistical Machine Translation

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking

the banking trustco is said to expand its purchase part of its royal international plan operations

[Soricut & Marcu, 05]

Page 42: What’s New in  Statistical Machine Translation

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking royal trustco said the purchase is part of its plan to expand its international banking operations

the banking trustco is said to expand its purchase part of its royal international plan operations

[Soricut & Marcu, 05]

N-grams have a lot of semantics in them!

Page 43: What’s New in  Statistical Machine Translation

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking royal trustco said the purchase is part of its plan to expand its international banking operations

[Soricut & Marcu, 05]

the banking trustco is said to expand its purchase part of its royal international plan operations

with the stressed relationship part

own longstanding its its for chinese boeing , ,

for its part, stressed the longstanding relationship with its own, chinese boeing

boeing, for its part, stressed its own longstanding relationship with the chinese

Page 44: What’s New in  Statistical Machine Translation

Translation Model?

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Source-language morphological analysis

Source parse tree

Semantic representation

Generate target structure

Process model of translation:

Page 45: What’s New in  Statistical Machine Translation

Translation Model?

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Source-language morphological analysis

Source parse tree

Semantic representation

Generate target structure

What are allthe possiblemoves andwhat probabilitytables controlthose moves?

Process model of translation:

Page 46: What’s New in  Statistical Machine Translation

The Classic Translation ModelWord Substitution/Permutation [Brown et al., 1993]

Mary did not slap the green witch

Mary not slap slap slap the green witch

Maria no dió una bofetada a la bruja verde

Mary not slap slap slap NULL the green witch

Maria no dió una bofetada a la verde bruja

Trainable

Process model of translation:

n(3|slap) 50k entries

d(j|i) 2500 entries

P-Null 1 entry

t(la|the) 25m entries

Page 47: What’s New in  Statistical Machine Translation

The Classic Translation ModelWord Substitution/Permutation [Brown et al., 1993]

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Still trainable!

Process model of translation:

n(3|slap) 50k entries

d(j|i) 2500 entries

P-Null 1 entry

t(la|the) 25m entries

?

Page 48: What’s New in  Statistical Machine Translation

Classic Formula for P(f | e)

P(f | e) =

m – Φ0 lΣ ( ) · P-Null m – 2Φ0 · (1-P-Null) Φ0 · Φi! · (1 /

Φ0!) ·a Φ0 i=0

l m m n(Φi | ei) · t(fj | eaj) · d(j | aj, l, m) i=1 j=1 j:a j <> 0

fertility word translation re-ordering

NULL stuff

Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.

sum overalignment

possibilities

Page 49: What’s New in  Statistical Machine Translation

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All P(french-word | english-word) equally likely

Page 50: What’s New in  Statistical Machine Translation

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“la” and “the” observed to co-occur frequently,so P(la | the) is increased.

Page 51: What’s New in  Statistical Machine Translation

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“maison” co-occurs with both “the” and “house”, butP(maison | house) can be raised without limit, to 1.0,

while P(maison | the) is limited because of “la”

(pigeonhole principle)

Page 52: What’s New in  Statistical Machine Translation

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

settling down after another iteration

Page 53: What’s New in  Statistical Machine Translation

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

Inherent hidden structure revealed by EM training!

• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.

• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)

• Software: GIZA++

Page 54: What’s New in  Statistical Machine Translation

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

Translation Model

Sample Translation Probabilities

[Brown et al 93]

Page 55: What’s New in  Statistical Machine Translation

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

Translation Model

potentialtranslation e

P(f | e)

Page 56: What’s New in  Statistical Machine Translation

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

Translation Model

Language Model

w1 w2 P(w2 | w1)

of

the 0.13

a 0.09

another 0.01

some 0.01

hong

kong 0.98

said 0.01

stated 0.01

potentialtranslation e

P(f | e) P(e)

Page 57: What’s New in  Statistical Machine Translation

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

P(f | e) · P(e) score for e

Translation Model

Language Model

w1 w2 P(w2 | w1)

of

the 0.13

a 0.09

another 0.01

some 0.01

hong

kong 0.98

said 0.01

stated 0.01

potentialtranslation e

P(f | e) P(e)

Page 58: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

Page 59: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

you – you you quiet !

Page 60: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

you – you quiet !

Page 61: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

quiet you – you you !

Page 62: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

shut you – you you !

Page 63: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

you shut !

Page 64: What’s New in  Statistical Machine Translation

Search for Best Translation

voulez – vous vous taire !

you shut up !

Page 65: What’s New in  Statistical Machine Translation

Classic Decoding Algorithm

Given f, find the English string e that maximizes P(e) · P(f | e)

NP-Complete [Knight 99].

Brown et al 93:“In this paper, we focus on the

translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”

Page 66: What’s New in  Statistical Machine Translation

Beam Search Decoding[Brown et al US Patent #5,477,451]

1st Englishword

2nd Englishword

3rd Englishword

4th Englishword

start end

Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far)

all sourcewords

covered

[Jelinek 69; Och, Ueffing, and Ney, 01]

Page 67: What’s New in  Statistical Machine Translation

1st Englishword

2nd Englishword

3rd Englishword

4th Englishword

start end

Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far)

all sourcewords

covered

best predecessorlink

[Jelinek 69; Och, Ueffing, and Ney, 01]

Beam Search Decoding[Brown et al US Patent #5,477,451]

Page 68: What’s New in  Statistical Machine Translation

Classic Results• nous avons signé le protocole . (Foreign Original)• we did sign the memorandum of agreement . (Human Translation)• we have signed the protocol . (MT)

• où était le plan solide ? (Foreign Original)• but where was the solid plan ? (Human Translation)• where was the economic base ? (MT)

the Ministry of Foreign Trade and Economic Cooperation, including foreigndirect investment 40 billion US dollars today provide data includethat year to November china actually using foreign 46.959 billion US dollars and

very slow = one page per day

Page 69: What’s New in  Statistical Machine Translation

Okay!

I know, so far, this talk should be called …

What’s Old in Statistical

Machine Translation!!

Page 70: What’s New in  Statistical Machine Translation

Further Developments

• Follow-on projects– Hong Kong– Aachen– Behavior Design Corporation

• JHU Summer Workshop 1999– Build & distribute statistical MT tools– Create standard training & testing data– Disseminate tutorial material– “MT in a Day”– Ask new questions

Page 71: What’s New in  Statistical Machine Translation

How Much Data Do We Need?

Amount of bilingual training data

Quality of automatically trainedmachine translationsystem

Page 72: What’s New in  Statistical Machine Translation

Advances in Statistical MT2000-2004

Page 73: What’s New in  Statistical Machine Translation

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

Page 74: What’s New in  Statistical Machine Translation

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

+ European parliament data [Koehn 05]

Page 75: What’s New in  Statistical Machine Translation

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

BLEU Evaluation Metric(Papineni et al 02)

• N-gram precision (score is between 0 & 1)

What percentage of machine n-grams can be found in the reference translation?

Gross measure over 1000 test sentences.Not allowed to use same portion of reference

translation twice (can’t cheat by typing out “the the the the the”)

Brevity penalty: can’t just type out single word “the” (and get precision 1.0)

Page 76: What’s New in  Statistical Machine Translation

BLEU in Action

枪手被警方击毙。 (Foreign Original)

the gunman was shot to death by the police . (Reference Translation)

the gunman was police kill . #1wounded police jaya of #2the gunman was shot dead by the police . #3the gunman arrested by police kill . #4the gunmen were killed . #5the gunman was shot to death by the police . #6gunmen were killed by police ?SUB>0 ?SUB>0 #7al by the police . #8the ringer is killed by the police . #9police killed the gunman . #10

Page 77: What’s New in  Statistical Machine Translation

BLEU in Action

枪手被警方击毙。 (Foreign Original)

the gunman was shot to death by the police . (Reference Translation)

the gunman was police kill . #1wounded police jaya of #2the gunman was shot dead by the police . #3the gunman arrested by police kill . #4the gunmen were killed . #5the gunman was shot to death by the police . #6gunmen were killed by police ?SUB>0 ?SUB>0 #7al by the police . #8the ringer is killed by the police . #9police killed the gunman . #10

green = 4-gram match (good!)red = word not matched (bad!)

Page 78: What’s New in  Statistical Machine Translation

BLEU Tends to Predict Human Judgments

R2 = 88.0%

R2 = 90.2%

-2.5

-2.0

-1.5

-1.0

-0.5

0.0

0.5

1.0

1.5

2.0

2.5

-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5

Human Judgments

BL

EU

Sc

ore

Adequacy

Fluency

Linear(Adequacy)Linear(Fluency)

slide from G. Doddington (NIST)

Page 79: What’s New in  Statistical Machine Translation

Experiment-Driven Progress

Mar1

Apr1

15

20

25

30

35

May1

BLEU

2005

ISI Syntax-Based MTChinese/EnglishNIST 2002 Test Set

Evaluate new MT research ideas every day!(and be alerted about bugs…)

Page 80: What’s New in  Statistical Machine Translation

Draw Learning Curves

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

10k 20k 40k 80k 160k 320k

Swedish/EnglishFrench/EnglishGerman/EnglishFinnish/English

# of sentence pairs used in training

BLEUscore

Experiments byPhilipp Koehn

Page 81: What’s New in  Statistical Machine Translation

Flaws of Word-Based MT

• Can’t translate multiple English words to one French word

• Can’t translate phrases– “real estate”, “note that”, “interest in”

• Isn’t sensitive to syntax– Adjectives/nouns should swap order– Verb comes at the beginning in Arabic

• Doesn’t understand the meaning (?)

Page 82: What’s New in  Statistical Machine Translation

The MT Triangle

SOURCE TARGET

words words

syntax syntax

logical form

interlingua

logical form

Page 83: What’s New in  Statistical Machine Translation

The MT Swimming Pool

syntax syntax

interlingua

words words

logical form logical form

Page 84: What’s New in  Statistical Machine Translation

SOURCE TARGET

Commercial Rule-Based Systems

words words

syntax syntax

logical form

interlingua

logical form

Page 85: What’s New in  Statistical Machine Translation

Knight et al 95 - meaning-based translation - composition rules

Language Model

SOURCE TARGET

words words

syntax syntax

logical form

interlingua

logical form

Page 86: What’s New in  Statistical Machine Translation

Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text

SOURCE TARGET

Language Modelwords words

syntax syntax

logical form

interlingua

logical form

Page 87: What’s New in  Statistical Machine Translation

Yamada/Knight (01,02) - tree/string model - used existing target language parser

SOURCE TARGET

Language Modelwords words

syntax syntax

logical form

interlingua

logical form

Page 88: What’s New in  Statistical Machine Translation

Well, these all seem like good ideas.

Which one had the most dramatic effect on MT quality?

None of them!

Page 89: What’s New in  Statistical Machine Translation

Phrases

SOURCE TARGET

phrases phrases

How do you translate“real estate” into French?

real estate real number dance number dance card memory card memory stick …

words words

syntax syntax

logical form

interlingua

logical form

Page 90: What’s New in  Statistical Machine Translation

Phrase-Based Statistical MT

• Foreign input segmented into phrases– “phrase” just means “word sequence”

• Each phrase is probabilistically translated into English– P(to the conference | zur Konferenz)– P(into the meeting | zur Konferenz)

• Phrases are probabilistically re-ordered

See [Koehn et al, 2003] for an overview.

Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference In Canada

Page 91: What’s New in  Statistical Machine Translation

How to Learn the Phrase Translation Table?

• One method: “alignment templates” [Och et al 99]

• Start with word alignment

• Collect all phrase pairs that are consistent with the word alignment

Page 92: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

Page 93: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

Page 94: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)

Page 95: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

Page 96: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap)

(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)

(a la bruja verde, the green witch) …

Page 97: What’s New in  Statistical Machine Translation

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap)

(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)

(a la bruja verde, the green witch) …

(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)

Page 98: What’s New in  Statistical Machine Translation

Phrase Pair Probabilities

• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.

• No EM training

• Just relative frequency:

count(f-f-f, e-e-e) P(f-f-f | e-e-e) = ----------------------- count(e-e-e)

Page 99: What’s New in  Statistical Machine Translation

Phrase-Based MT

• This is currently the best way to do Statistical MT!

• What took so long to move from words to phrases?– Missing RAM

• 25m parameters billions of parameters• Trick idea: build test-corpus-specific phrase table (takes 5 hours!)• Now solved in commercial deployments

– Missing computing power– Many competing ideas to shake out

• Koehn 03 summarizes several variations– Empirical effectiveness even better than intuition would predict

• This is not building a ladder to the moon!– If you can’t translate “real estate” into French, you are sunk

Page 100: What’s New in  Statistical Machine Translation

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e) x P(f | e) e

Page 101: What’s New in  Statistical Machine Translation

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e)2.4 x P(f | e) … works better! e

Page 102: What’s New in  Statistical Machine Translation

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e)2.4 x P(f | e) x length(e)1.1

eRewards longer hypotheses, since these are unfairly punished by P(e)

Page 103: What’s New in  Statistical Machine Translation

Advanced Training Methods

argmax P(e)2.4 x P(f | e) x length(e)1.1 x FEAT 3.7 …

e

Lots of features vote on every potential translation.

Exponential model.

Problem: How to set the exponent weights?

IDEA 1: maximize probability of the dataIDEA 2: maximize BLEU score of MT system

Page 104: What’s New in  Statistical Machine Translation

20.64% BLEU

WTM fixed at 1.0

17.96% BLEU

plot by Emil Ettelaie

Page 105: What’s New in  Statistical Machine Translation

Maximum BLEU Training

• Novel algorithm developed by [Och 03]

• Opened gates to “feature hacking”– Word-based feature to smooth phrase pair

counts (“Model1 Inverse”)– Phrase-specific propensities to re-order

• Currently limited to ~25 features

Page 106: What’s New in  Statistical Machine Translation

Advances in Statistical MT2005

Page 107: What’s New in  Statistical Machine Translation

Google’s Language Model

• Previously, largest language model was trained on 1b words of English

• 20b words of news significant impact on news translation

• 200b words of web helpful

Page 108: What’s New in  Statistical Machine Translation

Maryland’s Hiero system [Chiang 05]

• Previously:– ne mange pas does not eat

• New phrase pairs with variables and reordering– ne X pas does not X

– le X1 du X2 X2 's X1

• Nesting– “does not X” itself becomes an X

• CKY decoderJohnCocke

Page 109: What’s New in  Statistical Machine Translation

ISI’s Syntax-Based MT System

• First strong showing for an SMT system that knows what nouns and verbs are!

• Why syntax?“Frequent high-tech exports are bright spots for

foreign trade growth of Guangdong has made important contributions.”

– Need much more grammatical output– Need accurate control over re-ordering– Need accurate insertion of function words

Page 110: What’s New in  Statistical Machine Translation

The gunman killed by police .

String Output

. 击毙警方被枪手

Page 111: What’s New in  Statistical Machine Translation

The gunman killed by police .

DT NN VBD IN NN

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

Page 112: What’s New in  Statistical Machine Translation

Gunman by police shot .

NN IN NN VBD

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

Page 113: What’s New in  Statistical Machine Translation

The gunman was killed by police .

DT NN AUX VBN IN NN

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

Page 114: What’s New in  Statistical Machine Translation

Sample Rules Learned from Data

"说 " "," x0 0.57

"说 " x0 0.09

"他 " "说 " "," x0 0.02

"指出 " "," x0 0.02

x0 0.02

VP

VB SBAR

IN x0:S

that

said

NP

x0:NP PP

IN x1:NP

from

x1 x0 0.27

"来自 " x1 x0 0.15

x1 "的 " x0 0.06

"从 " x1 x0 0.06

"来自 " x1 "的 " x0 0.06

Page 115: What’s New in  Statistical Machine Translation

Sample Rules Learned from Data

S

x0:NP VP

x1:VB x2:NP

x0 x1 x2 0.82

x0 x1 "," x2 0.02

x0 x1 x2 0.54 x1 x0 x2 0.44

S

x0:NP VP

x1:VB x2:NP

(Chinese/English)

(Arabic/ English)

subject-verb inversion

Page 116: What’s New in  Statistical Machine Translation

Format is Expressive

S

x0:NP VP

x1:VB x2:NP2

x1, x0, x2

S

PRO VP

VB x0:NPthere

are

hay, x0

NP

x0:NP PP

of

P x1:NP

x1, , x0

Multilevel Re-Ordering

Non-constituent Phrases

Lexicalized Re-Ordering

VP

VBZ VBG

is

está, cantando

Phrasal Translation

singing

VP

VB x0:NP PRT

put

poner, x0

Non-contiguous Phrases

on

NPB

DT x0:NNS

the

x0

Context-SensitiveWord Insertion

[Knight & Graehl, 2005]

Page 117: What’s New in  Statistical Machine Translation

Story Gets More Interesting…

Applications

MT

Automata Theory

TreeTransducers(Rounds 70)

Page 118: What’s New in  Statistical Machine Translation

Story Gets More Interesting…

Applications

MT

Automata Theory

TreeTransducers(Rounds 70)

Linguistic Theory

TransformationalGrammar

(Chomsky 57)

Page 119: What’s New in  Statistical Machine Translation

Story Gets More Interesting…

Applications

MT (05)

Automata Theory

TreeTransducers(Rounds 70)

Compression (01)

QA (03)

Generation (00)

Linguistic Theory

TransformationalGrammar

(Chomsky 57)

Page 120: What’s New in  Statistical Machine Translation

Story Gets More Interesting…

Applications

Algorithms

Linguistic Theory

Automata Theory

TransformationalGrammar

(Chomsky 57)

TreeTransducers(Rounds 70)

Efficient TransducerAlgorithms

Generic TreeToolkits

MT (05)

Compression (01)

QA (03)

Generation (00)

Page 121: What’s New in  Statistical Machine Translation

Summary

• Making good progress

• Algorithms + Data + Evaluation + Computers

• Interdisciplinary work– Natural language processing– Machine learning– Linguistics– Automata theory

• Hope that more people will join!

Page 122: What’s New in  Statistical Machine Translation

Thank you

Page 123: What’s New in  Statistical Machine Translation

Syntax-Based vs Phrase-Based

Mar1

Apr1

15

20

25

30

35

May1

BLEUphrase-based system

2005

Chinese/EnglishNIST 2002 Test Set

Page 124: What’s New in  Statistical Machine Translation

Future PhD Theses?

“Syntax-based Language Models for Improving Statistical MT”

“Discriminative Training of Millions of Features for MT”

“Semantic Representations Induced from Multilingual EU and UN Data”

“What Makes One Language Pair More Difficult to Translate Than Another”

“A State-of-the-Art MT System Based on Syntactic Transformations”

“New Training Methods for High-Quality Word Alignment”

+ many unpredictable ones…

Page 125: What’s New in  Statistical Machine Translation

Summary

• Phrase-based models are state-of-the-art– Word alignments– Phrase pair extraction & probabilities– N-gram language models– Beam search decoding– Feature functions & learning weights

• But the output is not English– Fluency must be improved– Better translation of person names, organizations, locations– More automatic acquisition of parallel data, exploitation of

monolingual data across a variety of domains/languages– Need good accuracy across a variety of domains/languages

Page 126: What’s New in  Statistical Machine Translation

Available Resources• Bilingual corpora

– 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)– Lots of French/English, Spanish/French/English, LDC– European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI

• (www.isi.edu/~koehn/publications/europarl)– 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI

• (www.isi.edu/natural-language/download/hansard/)

• Sentence alignment– Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)– Xiaoyi Ma, LDC (Champollion)

• Word alignment– GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)– GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)– Manually word-aligned test corpus (500 French/English sentence pairs), RWTH

Aachen– Shared task, NAACL-HLT’03 workshop

• Decoding– ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)– ISI Pharoah phrase-based decoder

• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)

• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

Page 127: What’s New in  Statistical Machine Translation

Some Papers Referenced on Slides• ACL

– [Och, Tillmann, & Ney, 1999]– [Och & Ney, 2000]– [Germann et al, 2001]– [Yamada & Knight, 2001, 2002]– [Papineni et al, 2002]– [Alshawi et al, 1998]– [Collins, 1997]– [Koehn & Knight, 2003]– [Al-Onaizan & Knight, 2002]– [Och & Ney, 2002]– [Och, 2003]– [Koehn et al, 2003]

• EMNLP– [Marcu & Wong, 2002]– [Fox, 2002]– [Munteanu & Marcu, 2002]

• AI Magazine– [Knight, 1997]

• www.isi.edu/~knight– [MT Tutorial Workbook]

• AMTA– [Soricut et al, 2002]– [Al-Onaizan & Knight, 1998]

• EACL– [Cmejrek et al, 2003]

• Computational Linguistics– [Brown et al, 1993]– [Knight, 1999]– [Wu, 1997]

• AAAI– [Koehn & Knight, 2000]

• IWNLG– [Habash, 2002]

• MT Summit– [Charniak, Knight, Yamada, 2003]

• NAACL– [Koehn, Marcu, Och, 2003]– [Germann, 2003]– [Graehl & Knight, 2004]– [Galley, Hopkins, Knight, Marcu, 2004]

Page 128: What’s New in  Statistical Machine Translation

Ready-to-Use Online Bilingual Data

0

20

40

60

80

100

120

140

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

Page 129: What’s New in  Statistical Machine Translation

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

+ 1m-20m words formany language pairs

Page 130: What’s New in  Statistical Machine Translation

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

Millions of words(English side)

One Billion?

???

Page 131: What’s New in  Statistical Machine Translation

From No Data to Sentence Pairs

• Easy way: Linguistic Data Consortium (LDC)• Really hard way: pay $$$

– Suppose one billion words of parallel data were sufficient– At 20 cents/word, that’s $200 million

• Pretty hard way: Find it, and then earn it!– De-formatting– Remove strange characters– Character code conversion– Document alignment– Sentence alignment– Tokenization (also called Segmentation)

Page 132: What’s New in  Statistical Machine Translation

Sentence Alignment

The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.

Page 133: What’s New in  Statistical Machine Translation

Sentence Alignment

1. The old man is happy.

2. He has fished many times.

3. His wife talks to him.

4. The fish are jumping.

5. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Page 134: What’s New in  Statistical Machine Translation

Sentence Alignment

1. The old man is happy.

2. He has fished many times.

3. His wife talks to him.

4. The fish are jumping.

5. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Page 135: What’s New in  Statistical Machine Translation

Sentence Alignment

1. The old man is happy. He has fished many times.

2. His wife talks to him.

3. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Note that unaligned sentences are thrown out, andsentences are merged in n-to-m alignments (n, m > 0).

Page 136: What’s New in  Statistical Machine Translation

Tokenization (or Segmentation)

• English– Input (some byte stream):

"There," said Bob.– Output (7 “tokens” or “words”): " There , " said Bob .

• Chinese– Input (byte stream):

– Output:

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。

美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯

富 商拉登 等发 出 的 电子邮件。

Page 137: What’s New in  Statistical Machine Translation

Lower-Casing

• English– Input (7 words):

" There , " said Bob .

– Output (7 words):

" there , " said bob .

Thethe“The“the

theSmaller vocabulary size.More robust counting and learning.

Idea of tokenizing and lower-casing:

Page 138: What’s New in  Statistical Machine Translation

Recent Progress in Statistical MT

• Why is that?– Better algorithms that learn patterns from data– More data– Faster, cheaper computers with more RAM– Community-wide test sets– Novel automated evaluation methods– Shared software tools

Page 139: What’s New in  Statistical Machine Translation

Three Problems for Statistical MT

• Translation model– Given a pair of strings <f,e>, assigns P(f | e) by formula– <f,e> look like translations high P(f | e)– <f,e> don’t look like translations low P(f | e)

• Language model– Given an English string e, assigns P(e) by formula– good English string high P(e)– random word sequence low P(e)

• Decoding algorithm– Given a language model, a translation model, and a new

sentence f … find translation e maximizing P(e) · P(f | e)

Page 140: What’s New in  Statistical Machine Translation

Web Language Models

She has a lot of nerve. [20]French input It has a lot of nerve. [3]

?

Used by Google in 2005 to increase performance of their research MT system!

[Soricut, Knight, Marcu, 02]