

Page 1: Word Segmentation in Urdu - Informatics Homepages Serverhomepages.inf.ed.ac.uk/dnadir/NAACL10-Talk.pdf · Lack of definition of word in Urdu Single Word Confusion Two Words Reduplication

Word Segmentation in Urdu

Nadir Durrani

Institute of Natural Language Processing

University of Stuttgart

Sarmad Hussain

Center for Research in Urdu Language Processing

National University of Computer and Emerging Sciences


Road Map

• Urdu Word Segmentation

– Space Omission Problem

• Non-Joiners and Urdu Orthography

• Joiners and definition of Word

– Space Insertion Problem

• Affixation

• Compounding

• Proper Nouns

• Foreign Words

• Abbreviations


Contd..

• Model

• Algorithm

• Results


Why do we need words?

• Tokenization is a foremost task in all NLP applications.

– Syntactic and Semantic Analysis in Machine Translation is based on words and neighboring ones

– Spell Checker requires word boundary information for an erroneous word in order to suggest a list of possible corrections

– A POS tagger should know word boundaries to tag them properly


Segmentingwordsshouldn’tbehard

• For Latin-based languages like English, French and Dutch, space and punctuation marks are a good approximation of word boundaries

• In some Asian languages white space is never used to determine word boundaries. Text is written in a continuum.


Word Segmentation Problem in Asian Languages

• Chinese

• Thai

• Khmer

• Burmese

• Dzongkha

• Lao

Word segmentation in these languages is a “Space Omission Problem”


Word Segmentation Problem in Urdu

• Urdu, A Unique Case

• Like some Asian languages multiple words can be written in continuum without inserting any space

• Unlike these Asian languages, space is a frequently used character. However, its presence does not necessarily imply a word boundary

• So Urdu is also a “Space Insertion Problem” along with “Space Omission” problem

[Urdu example sentence: text not recoverable from the extracted slides]


Urdu Orthography

• Urdu is cursive in nature

• Characters acquire different shapes as they join with neighboring characters

• Urdu has two types of characters

– Joiners can acquire 4 different shapes, namely initial, medial, final and isolated

• Arabic letter Meem (م) can take initial, medial, final and isolated shapes

– Non-joiners can only acquire final and isolated shapes

• Arabic letter Dal (د) can only take final and isolated shapes


Orthographic Rules for Urdu

Word position | Joiners | Example | Non-Joiners | Example

Start | Initial shape | سجدم | Isolated | جالد

Somewhere in between | Medial after J; Initial after NJ | رهمنبامد | Final after J; Isolated after NJ | ردبنردنا

End | Final after J; Isolated after NJ | مجع, مکا | Final after J; Isolated after NJ | دبن, در

J = Joiners , NJ = Non-Joiners

Red = Shape in Consideration , Blue = Context
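The J/NJ rules above reduce to two questions per letter: does it connect to the previous letter, and does it connect to the next one? The following is a minimal Python sketch of that idea, not code from the talk. The `NON_JOINERS` set is an assumption (the common Urdu letters that Unicode classifies as right-joining), and the names `is_joiner`, `shape` and `shapes` are hypothetical.

```python
# Assumed set of non-joiners: letters that never connect to the following letter.
NON_JOINERS = {
    "\u0627",  # alef
    "\u0622",  # alef with madda
    "\u062F",  # dal
    "\u0688",  # ddal
    "\u0630",  # thal
    "\u0631",  # reh
    "\u0691",  # rreh
    "\u0632",  # zain
    "\u0698",  # jeh
    "\u0648",  # waw
    "\u06D2",  # yeh barree
}


def is_joiner(ch):
    # Simplification: every character outside the set is treated as a joiner,
    # so the input should be a single word made of Urdu letters only.
    return ch not in NON_JOINERS


def shape(word, i):
    """Contextual shape of word[i]: initial, medial, final or isolated."""
    joins_prev = i > 0 and is_joiner(word[i - 1])
    joins_next = i < len(word) - 1 and is_joiner(word[i])
    if joins_prev and joins_next:
        return "medial"
    if joins_next:
        return "initial"
    if joins_prev:
        return "final"
    return "isolated"


def shapes(word):
    return [shape(word, i) for i in range(len(word))]


# meem, seen, jeem, dal (the word for "mosque")
print(shapes("\u0645\u0633\u062C\u062F"))
```

A non-joiner can never come out as initial or medial here, because `joins_next` is false for it by construction.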


Notion of Space in Urdu

• Notion of space is completely alien in Urdu hand-writing

• Children are never taught to leave space when starting a new word

• Following sample clearly shows that space is not used in hand-written Urdu


How space became part of Urdu?

• Space has become part of Urdu text because computers cannot handle it without spaces

• If a word ends with a joiner character, the next word must be started by putting a space character, otherwise the two words would join and the text would look visually inappropriate

– بادشاہی مسجد (Badshahi Mosque) vs. بادشاہیمسجد

• However space does not always mean word boundary


Space Omission Errors

Non-Joiner Word Ending

• Putting a space is no longer an obligation if a word ends with a non-joiner

– [Urdu example written without spaces: text not recoverable from the extracted slides] (Troop leader Ahmed Sher Doger said)

• Each word ends with a non-joiner, so it is up to the user whether or not to put a space
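Since a space is optional only after a non-joiner, every non-joiner inside an unbroken string is a place where a boundary may be hidden. The following is a minimal Python sketch of that observation, not the authors' code. The `NON_JOINERS` set and both function names are assumptions, and the Urdu spelling used for "Ahmed Sher" is an illustrative guess.

```python
# Assumed non-joiner set: alef, alef madda, dal, ddal, thal, reh, rreh,
# zain, jeh, waw, yeh barree.
NON_JOINERS = set("\u0627\u0622\u062F\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06D2")


def space_is_optional_after(word):
    """True if the word ends with a non-joiner, so the next word cannot fuse with it."""
    return bool(word) and word[-1] in NON_JOINERS


def candidate_boundaries(ow):
    """Indices i where a word boundary may be hidden between ow[i-1] and ow[i]."""
    return [i for i in range(1, len(ow)) if ow[i - 1] in NON_JOINERS]


# alef, hah, meem, dal + sheen, yeh, reh written as one string (assumed spelling)
ahmed_sher = "\u0627\u062D\u0645\u062F\u0634\u06CC\u0631"
print(candidate_boundaries(ahmed_sher))  # candidate split points, as indices
```

The list over-generates on purpose: it only narrows down where to look, and a lexicon has to decide which candidates are real boundaries. It also covers only the non-joiner case, not words that were joined after a joiner.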


Space Omission Problem

Joiner Word Ending

• Two words are written without a space even when the first word ends with a joiner

• Triggered by disagreement on definition of word

Category | Examples

Oblique pronouns followed by case marker | آپکا vs. آپ کا (yours)

Some abstract nouns preceded by singular demonstratives | اسوقت vs. اس وقت (at that time)

Some postpositions combine with their genitive case markers | کيطرف vs. کی طرف (towards)

Sometimes helping verbs are written with root verbs without any space | کریگی vs. کرے گی (will do)

Sometimes there is no other reason but something introduced as stylistic variation and now lexicalized | کيليے vs. کے ليے (for)


Space Insertion Problem

• Space is often used as a word boundary, but not always

• In some cases a space occurs inside a word that has multiple morphemes

• The problem is to delete the spaces that do not mean word boundaries


Cases for Space Insertion

Cases | Description | Examples

Affixation | Derivational affixes are written with space when first morpheme ends with a joiner | غير شادی شده (un + married)

Affixation | Sometimes joined and separated versions have different spellings | مزیدار vs. مزے دار (delicious)

Compounds | Compounds are written with space when first morpheme ends with a joiner | نشاط ثانيہ (Renaissance)

Both | Often compounding is used in combination with affixation to create more problems | بےیارومددگار (helpless): بے + یار + و + مدد + گار, i.e. (Without (Friend) (and) ((Help)er))


Contd..

Cases | Description | Examples

Reduplication | Reduplication is used in Urdu to put emphasis or express multiplicity or variety | کبھی کبھی (sometimes)

Reduplication | Sometimes the first morpheme repeats itself, or occurs in a format X-(some character)-X, or sometimes a vowel of X changes to /a/ | خود بخود (automatically), ٹھيک ٹھاک (alright)

Proper Names | Some proper nouns are written with space in between | اسلام آباد (Islamabad)

Foreign Words | Lexicalized foreign words with multiple morphemes are written with space between | ٹيلی فون (telephone)

Abbreviations | Abbreviations when transliterated in Urdu are written with spaces in between | پی ایچ ڈی (PhD)
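All three reduplication patterns listed (exact repeat, X-(some character)-X, one vowel changed) leave the two halves at most one character edit apart. The following is a minimal Python sketch of a check built on that observation. It is an illustration under that assumption, not the authors' handler, and `within_one_edit` and `merge_reduplication` are hypothetical names.

```python
def within_one_edit(a, b):
    """True if a == b, or one insertion, deletion or substitution turns a into b."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]  # single extra character in b at position i


def merge_reduplication(tokens):
    """Join adjacent tokens that look like a reduplicated pair into one word."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and within_one_edit(tokens[i], tokens[i + 1]):
            out.append(tokens[i] + " " + tokens[i + 1])  # keep the inner space
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(merge_reduplication(["وہ", "کبھی", "کبھی", "آتا"]))
```

Used on its own, this check would also merge unrelated short words that happen to differ by one letter, so in practice it would need a lexicon or statistics as a filter.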


Lack of definition of word in Urdu

Type | Single Word (one word) | Confusion (not sure) | Two Words

Reduplication | برابر (equal) | دھڑا دھڑ (one after another) | آہستہ آہستہ (slowly)

Compounds | نظم و ضبط (discipline) | تباه و برباد (destroy) | اسلم و عمران (Aslam and Imran)

Category | Level

Single Morpheme Words | 1

Affixation | 2

Abbreviations, Compounds, Reduplication | 3


Model


Algorithm

• [Urdu example sentence: text not recoverable from the extracted slides]

• Remove the diacritics and tokenize into orthographic words (OW)

• [Resulting list of orthographic words: text not recoverable from the extracted slides]
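The preprocessing step (strip diacritics, then split into orthographic words) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The range of code points treated as diacritics is an assumption, and punctuation handling is left out.

```python
import re

# Assumed diacritic set: U+064B..U+0652 (tanween, zabar, pesh, zer, shadda,
# sukun) plus U+0670 (superscript alef).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def orthographic_words(text):
    """Drop diacritics, then split on whitespace into orthographic words (OW)."""
    return DIACRITICS.sub("", text).split()


# kaf + zer + teh + alef + beh, followed by a second short word
sample = "\u06A9\u0650\u062A\u0627\u0628 \u06A9\u0627"
print(orthographic_words(sample))
```

Each orthographic word produced this way may still hide several words (space omission) or be only part of one (space insertion).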


Space Omission Problem

• For each OW we do a lexical lookup for spelling variations and break them into words

– کيليے will get fixed into کے ليے

• Maximum matching algorithm (dynamic programming algorithm)

• 10 best segmentations for that OW based on minimum word heuristic

• These segmentations are merged with segmentations from other OW’s
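One way to read the steps above is as a dictionary-based dynamic program that keeps the k segmentations with the fewest words for every prefix of an OW. The following is a minimal Python sketch under that reading, not the authors' algorithm. The function name, the `variants` table and the toy lexicon are all hypothetical.

```python
def k_best_segmentations(ow, lexicon, k=10, variants=None):
    """Return up to k segmentations of ow into lexicon words, fewest words first."""
    variants = variants or {}
    if ow in variants:  # known spelling variation: use the stored split
        return [variants[ow]]
    max_len = max(map(len, lexicon), default=0)
    # best[i] holds up to k segmentations of the prefix ow[:i]
    best = [[] for _ in range(len(ow) + 1)]
    best[0] = [[]]
    for end in range(1, len(ow) + 1):
        cands = []
        for start in range(max(0, end - max_len), end):
            piece = ow[start:end]
            if piece in lexicon:
                cands.extend(seg + [piece] for seg in best[start])
        cands.sort(key=len)  # minimum word heuristic
        best[end] = cands[:k]
    # Fall back to the unsegmented string when the lexicon cannot cover it.
    return best[len(ow)] or [[ow]]


toy_lexicon = {"a", "ab", "abc", "b", "bc", "c"}  # toy data for illustration
print(k_best_segmentations("abc", toy_lexicon))
print(k_best_segmentations("کيليے", set(), variants={"کيليے": ["کے", "ليے"]}))
```

Pruning to k entries per prefix is safe for this ranking, because a segmentation of the whole string can only be among the k shortest if its prefix is among the k shortest for that prefix.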


Contd..

• After running the space omission module we have segmented our sentence into morphemes

• [Morpheme-segmented example: text not recoverable from the extracted slides]


Space Insertion Module

• All segmentations are then passed to

– Affixation Handler (morpheme statistics + POS)

– Reduplication Handler (single edit distance)

– Abbreviation Handler (finite state automata)

– Compounder Handler (simple lexical look-ups)

[Example segmentation after the handlers: text not recoverable from the extracted slides]

• Best segmentation is selected based on three different heuristics

– Min word heuristic

– Unigram Statistics

– Bigram Statistics
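The three selection heuristics can be compared side by side on a list of candidate segmentations. The following is a minimal Python sketch with add-one smoothing. The smoothing choice, the function names and the toy training data are assumptions, not details from the talk.

```python
import math
from collections import Counter


def train(sentences):
    """Collect unigram and bigram counts from already segmented sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + list(sent)  # sentence-start marker
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi


def unigram_score(seg, uni):
    total, vocab = sum(uni.values()), len(uni)
    return sum(math.log((uni[w] + 1) / (total + vocab)) for w in seg)


def bigram_score(seg, uni, bi):
    vocab = len(uni)
    padded = ["<s>"] + list(seg)
    return sum(math.log((bi[(p, w)] + 1) / (uni[p] + vocab))
               for p, w in zip(padded, padded[1:]))


def select(cands, uni, bi):
    """Pick one candidate per heuristic."""
    return {
        "min_word": min(cands, key=len),
        "unigram": max(cands, key=lambda s: unigram_score(s, uni)),
        "bigram": max(cands, key=lambda s: bigram_score(s, uni, bi)),
    }


# Toy, made-up training data and candidates
uni, bi = train([["ab", "c"], ["ab", "c"], ["a", "bc"]])
print(select([["a", "bc"], ["ab", "c"], ["a", "b", "c"]], uni, bi))
```

The minimum word heuristic cannot separate candidates of equal length, which is where the n-gram statistics add information.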


Annotation


Train & Test

• Training was done on a morpheme segmented corpus of 70k words

• Testing was done on a very small corpus of 2367 words

– 404 Segmentation Errors

• 221 Space Omission Errors

• 183 Space Insertion Errors

– Affixation (66), Compounding (63), Abbreviation (32), Reduplication (22)
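The counts above are internally consistent, which a short Python check makes explicit. The numbers are taken from the slide. The `share` helper is a hypothetical name, and the percentages are plain ratios of the reported counts, not accuracy figures.

```python
test_words = 2367
total_errors = 404
by_type = {"space omission": 221, "space insertion": 183}
insertion_breakdown = {
    "affixation": 66,
    "compounding": 63,
    "abbreviation": 32,
    "reduplication": 22,
}


def share(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two error types add up to the total, and the four
# categories add up to the space insertion count.
assert sum(by_type.values()) == total_errors
assert sum(insertion_breakdown.values()) == by_type["space insertion"]

for name, count in by_type.items():
    print(name, share(count, total_errors))
print("errors per 100 test words", share(total_errors, test_words))
```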


Results


Questions?

• Thank you for listening…

• Acknowledgement:

– Special thanks to NAACL executive for funding my travel