Arabic Language Challenges. Walid Magdy, 29 Sep 2010


Page 1: Arabic Language Challenges

Arabic Language Challenges

Walid Magdy

29 Sep 2010

Page 2: Arabic Language Challenges

This presentation is not

About my PhD Work

About Arabic language technologies

Description of the state-of-the-art

Highly technical

A duplicate of other presentations (I hope)

Boring (I promise)

Page 3: Arabic Language Challenges

This presentation is about

Arabic language

Arabic orthographic nature

Arabic morphological nature

Arabic phonetic nature

Challenges stem from this nature

Page 4: Arabic Language Challenges

This sentence is written in Arabic Language

Page 5: Arabic Language Challenges

Arabic Language

Arabic is the largest living member of the Semitic language family

It is classified as a macro-language with 27 sub-languages

It is spoken by over 280 million people in 28 countries (Middle East)

The language of the Quran (over 1.6 billion Muslims)

Page 6: Arabic Language Challenges

Arabic Language (Internet)

[Figure: two bar charts by language (English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean, Rest of the Languages). Chart 1: Internet users by language (2010), axis from 0 to 6E+08 users. Chart 2: Growth in Internet (2000-2010), axis from 0% to 3000%.]

Page 7: Arabic Language Challenges

Arabic Language (Types)

Modern Standard Arabic (current written Arabic)
- Unified across all Arab countries (news, political speeches)
- Easy to understand by all Arabs
- Not spoken by people!

Spoken Arabic (dialectal Arabic)
- Differs across Arab countries (regions)
- Only semi-understandable between speakers of different dialects
- Not for formal use

Classical Arabic (language of the Quran)
- Contains ancient Arabic words
- Mostly understandable by Arabic speakers
- Previously written in a different version of the Arabic script

Page 8: Arabic Language Challenges

Arabic Language Nature

Orthographical nature: the way Arabic letters are written (affects OCR)

Morphological nature: the way Arabic sentences are constructed (affects NLP, IR, MT)

Phonetic nature: the way Arabic letters and words are pronounced (affects ASR, T2S, S2S)

Page 9: Arabic Language Challenges

Orthographical Nature

Written from right to left (letters only; numbers run left to right)

15 of the 28 letters contain dots

Characters are connected or semi-connected

Character shape depends on position

Printed text may include ligatures and kashida

Optional diacritics may be present

Page 10: Arabic Language Challenges

15 of the 28 letters contain dots

Page 11: Arabic Language Challenges

Character shape depends on position

middle, begin, end, isolated
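The four positional shapes can be inspected through Unicode's Arabic Presentation Forms-B block, which gives each contextual shape its own code point. The following is a minimal Python sketch, not something from the slides: the letter beh is an arbitrary choice, and the names `BEH_FORMS`, `positional_forms`, and `base_letter` are made up for illustration.

```python
import unicodedata

# Presentation-form code points for the letter beh (U+0628).
# The letter is an arbitrary example; any connecting letter would do.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "end": "\uFE90",
    "begin": "\uFE91",
    "middle": "\uFE92",
}


def positional_forms(forms):
    """Map each position label to the Unicode name of its shape."""
    return {pos: unicodedata.name(ch) for pos, ch in forms.items()}


def base_letter(ch):
    """Fold a presentation form back to the underlying letter (NFKC)."""
    return unicodedata.normalize("NFKC", ch)


for pos, name in positional_forms(BEH_FORMS).items():
    print(pos, "->", name)

# For text processing, all four shapes are the same letter.
print({base_letter(ch) for ch in BEH_FORMS.values()} == {"\u0628"})
```

Ordinary Arabic text stores only the base letter and leaves shape selection to the renderer, which is one reason an OCR system has to recognise several shapes per letter.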

Page 12: Arabic Language Challenges

Printed text may include kashida and ligatures

Page 13: Arabic Language Challenges

Optional diacritics may be present

Page 14: Arabic Language Challenges

It was very ambiguous

Page 15: Arabic Language Challenges

What about Arabic OCR?

Word Error Rates (WER) are considerably high

Good Arabic OCR: 30-40% WER on average

Trained on similar font: <10% WER

Ambiguous fonts: >70% WER

Omni fonts: 40% WER
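The slides quote WER figures without defining the metric. Under the usual definition (word-level substitutions, deletions, and insertions divided by the number of reference words), it can be sketched in Python as below. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration, not the evaluation setup behind the numbers above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words seen so far in ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because a single misread dot or ligature makes the whole word wrong, word-level error rates for Arabic OCR run much higher than the corresponding character-level rates.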

Page 16: Arabic Language Challenges

Morphological Nature

The language is built from about 10k roots

Short vowels are not written (diacritics)

Words contain prefixes, infixes, and suffixes
- Pronouns and other particles (the, and, his, her, their, it, him, them, will …) are attached to the main word

Word spelling can change according to grammatical position

No rule for plural words

60 billion possible surface forms

Page 17: Arabic Language Challenges

Short vowels are not written

In the Arabic text we do not write its short vowels and the pronouns are attached to the words

In th Arbc txt w do nt writ its short vwls and th pronuns ar attachd to th words

In thArbc txt w do nt writ itsshort vwls andthpronuns ar attachd to thwords

كتب (kataba) write
كتب (kotub) books
كتب (kattaba) let someone write
كتب (kuttiba) forced to write
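The ambiguity is easy to reproduce: once the short-vowel marks are dropped, differently vowelled words collapse into the same letter string. Here is a minimal Python sketch, assuming the diacritics are encoded as Unicode combining marks (category `Mn`). The helper name `strip_diacritics` is hypothetical, and the Arabic strings are written as escapes to avoid right-to-left rendering problems in code.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (category Mn), which include Arabic short vowels."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# kaf + fatha, teh + fatha, beh + fatha: "kataba"
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf + damma, teh + damma, beh: "kutub"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"
# The bare three-letter string that normally appears in print.
BARE = "\u0643\u062A\u0628"

print(strip_diacritics(KATABA) == BARE)
print(strip_diacritics(KUTUB) == BARE)
```

The reverse direction, restoring the vowels, cannot be done by a character rule; it needs context, which is what makes the undiacritized text ambiguous.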

Page 18: Arabic Language Challenges

Words contain prefix, infix, and suffix

وسيكتبونها = wasaya + ktub + unahaa

and will + write + they it = and they will write it
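The decomposition above can be mimicked with a toy affix peeler. This Python sketch is only an illustration: the affix lists are hypothetical and contain just the pieces needed for this one word, and the function name `peel_affixes` is made up. A real analyser needs a lexicon, because the same letters are often part of the stem.

```python
# Hypothetical, deliberately tiny affix lists (escapes avoid RTL rendering issues).
PREFIXES = ["\u0648", "\u0633", "\u064A"]    # wa (and), sa (will), ya (verb marker)
SUFFIXES = ["\u0647\u0627", "\u0648\u0646"]  # haa (it), uuna (they)


def peel_affixes(word, min_stem=3):
    """Peel known prefixes, then known suffixes, keeping at least min_stem letters."""
    prefixes, suffixes = [], []
    for prefix in PREFIXES:  # tried once each, in this fixed order
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            prefixes.append(prefix)
            word = word[len(prefix):]
    for suffix in SUFFIXES:  # outermost suffix first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            suffixes.insert(0, suffix)  # keep suffixes in written order
            word = word[:-len(suffix)]
    return prefixes, word, suffixes


# "and they will write it", as on the slide.
WORD = "\u0648\u0633\u064A\u0643\u062A\u0628\u0648\u0646\u0647\u0627"
prefixes, stem, suffixes = peel_affixes(WORD)
print(len(prefixes), stem == "\u0643\u062A\u0628", len(suffixes))
```

The `min_stem` guard stops the peeler from eating a short word entirely, but it does not prevent false splits on words whose stem happens to begin or end with an affix-like letter.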

They are Peter’s children
The children behaved well
Her children are cute
My children are funny
We have to save our children
He loves his children
His children love him

كتب (kataba) write
كاتب (kateb) writer
كتاب (ketab) book
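The three words above share the root k-t-b and differ only in where a long vowel is inserted. A small Python sketch of this root-and-pattern idea follows. The digit-based pattern notation and the names `apply_pattern` and `PATTERNS` are illustrative assumptions, not a standard formalism.

```python
def apply_pattern(root, pattern):
    """Fill the slots 1, 2, 3 of a pattern with the three root consonants."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-letter roots")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    return "".join(slots.get(ch, ch) for ch in pattern)


ROOT = "\u0643\u062A\u0628"  # kaf, teh, beh (k-t-b)
ALEF = "\u0627"              # the long vowel used as an infix

# Hypothetical pattern table covering only the slide's three words.
PATTERNS = {
    "write": "123",
    "writer": "1" + ALEF + "23",
    "book": "12" + ALEF + "3",
}

for gloss, pattern in PATTERNS.items():
    print(gloss, apply_pattern(ROOT, pattern))
```

Only the long vowel shows up in writing, so the patterns differ by a single infixed letter even though the spoken forms differ in their short vowels as well.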

Page 19: Arabic Language Challenges

No rule for plural

Singular | Plural
رجل man | رجال men
كاتب writer | كتاب writers
مكتب office | مكاتب offices
مكتبة library | مكتبات libraries
هاتف telephone | هواتف telephones
مصلي prayer | مصلين prayers
إمام leader | أئمة leaders

Page 20: Arabic Language Challenges

What about Arabic IR?

Some characters are normalized

Diacritics (short vowels) are removed

Later approaches for search
- Search with words
- Apply light stemming for words
- Apply morphological stemming for words
- Simple character n-grams representation

Character n-grams achieve the best results
example: exa xam amp mpl ple
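The preprocessing and n-gram steps on this slide can be sketched in Python as follows. The slides do not say which characters are normalized, so the folding table here is an assumption based on common practice (alef variants, alef maksura, teh marbuta). The names `normalize` and `char_ngrams` are likewise hypothetical.

```python
import re

# Assumed folding table; the slides do not list the normalized characters.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})

# Short-vowel and related marks (U+064B to U+0652) plus kashida (U+0640).
DIACRITICS_AND_KASHIDA = re.compile("[\u064B-\u0652\u0640]")


def normalize(text):
    """Remove diacritics and kashida, then fold the assumed letter variants."""
    return DIACRITICS_AND_KASHIDA.sub("", text).translate(NORMALIZE)


def char_ngrams(word, n=3):
    """Overlapping character n-grams; words shorter than n are kept whole."""
    if not word:
        return []
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("example"))
```

One reason n-grams cope well with Arabic is that the grams covering the stem survive whatever prefixes and suffixes are attached, so differently inflected forms of a word still share index terms without any explicit stemming.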

Page 21: Arabic Language Challenges

Phonetic Nature

Some Arabic phonemes do not exist in other languages (‘ein, ghain, ha, kha, Dad, Sad, Ta, Hamza)

Examples:

Mohamed (ha)

Attia (‘ein, Ta)

Khalid (kha)

Ghada (ghain)

Asmaa (Hamza)

Baraa (Hamza)

Diaa (Dad, Hamza)

Page 22: Arabic Language Challenges

What about Arabic ASR?

Needs special training and decoding

Requires a huge amount of training data

State-of-the-art is not bad

MASTOR by IBM

Page 23: Arabic Language Challenges

Conclusion

Arabic language is full of challenges

Research is in its early stages

A huge amount of work is still needed

Some initiatives are trying to help
ALTEC: Arabic Language TEchnology Center

Page 24: Arabic Language Challenges

شكرًا

Thank you