Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions
![Page 1: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/1.jpg)
Computational Processing of Arabic Dialects: Challenges, Advances & Future Directions
Keynote, The 2nd Workshop on Arabic Corpora and Processing Tools
LREC, May 24, 2016
Nizar Habash, New York University Abu Dhabi
CAMeL Lab
![Page 2: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/2.jpg)
Roadmap
• Introduction
• Orthographic processing
• Morphological processing
• Working on a new dialect?
• Future directions
![Page 3: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/3.jpg)
Introduction
• Forms of Arabic
  – Classical Arabic (CA)
    • Classical historical texts
    • Liturgical texts
  – Modern Standard Arabic (MSA)
    • News media & formal speeches and settings
    • Only written standard
  – Dialectal Arabic (DA)
    • Predominantly spoken vernaculars
    • No written standards
• Dialect vs. Language
![Page 4: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/4.jpg)
Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
  – No one’s native language
• What is a ‘dialect’?
  – Political and religious factors
• Regional dialects
  – Egyptian Arabic (EGY)
  – Levantine Arabic (LEV)
  – Gulf Arabic (GLF)
  – North African Arabic (NOR): Moroccan, Algerian, Tunisian
  – Iraqi, Yemenite, Sudanese, Maltese?
• Social dialects
  – City, Rural, Bedouin
  – Gender, Religious variants
![Page 5: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/5.jpg)
Introduction
• Arabic Diglossia
  – Diglossia is where two forms of the language exist side by side
  – MSA is the formal public language
    • Perceived as “language of the mind”
  – Dialectal Arabic is the informal private language
    • Perceived as “language of the heart”
• General Arab perception: dialects are a deteriorated form of Classical Arabic
• Continuum of dialects
![Page 6: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/6.jpg)
Code Switching
الأنامابعتقدألنهعمليةالليعمبيعارضوااليومتمديدللرئيسلحودهمالليطالبوابالتمديدللرئيسالهراويوبالتاليموضوعمنهموضوعمبدئيعلىاألرضأنابحترمأنهيكونفينظرةديمقراطيةلألموروأنهيكونفياحترامللعبةالديمقراطيةوأنيكونفيممارسةديمقراطيةوبعتقدإنهالكلفي
علىموضوعإنجازاتبسبدييرجعلحظةأكثريةساحقةفيلبنانتريدهذااملوضوع،لبنانأوفيلبنانمنالنظامرئاسينظامفيلبنانالنظامعنإنجازاتالعهدلكنهليعنينعمنحكيالعهد
عمليابيدالحكومةمجتمعةوالرئيسلحودأثبتهيرئاسيوبالتاليالسلطةنظامبعدالطائفليسشخصمسؤولفيمنصبمعنيوأناعشتهذااملوضوعبأنهملابيكونفياألخيرةممارستهخالل
صالحةضمنخطابومبادئخطابملابياخدمواقفشخصيابممارستيفيموضوعاالتصاالتالسلطةالتنفيذيةألنهمنهرئيسجمهوريةهويكونرئيسمشمطلوبمنإنماهوإلىجانبهالقسم
عليهالتوجيهعليهإبداءاملالحظاتعليهبقىفيلبنانمابعدإتفاقالطائفرئيسالسلطةالتنفيذيةالوطنيةالشاملةكييظلفيمصالحةوطنيةكييظلالقولماهوخطأوماهوصحعليهتثميرجهود
باتجاهيروحتوافقمابنياملسلمواملسيحيفيلبنانيحتضنأبناءهذاالبلدمايتركاملسارفيوآمنوافيهاالليمشيوامعهالخطأنعمإنماخطابالقسمكانموضوعمبادئطرحتهوملتزمفيها
التزموافيهاأناأثبتخاللاألربعسنواتباملمارسةالحكوميةأنيالتزمتفيهاوملاالتزمنابهذاأنابتفهمتمامااملوضوعكانالرئيسلحودإلىجنبنافيهذااملوضوع،أمااملوضوعالديمقراطي
فتحإعادةانتخابهذاهالوجهةالنظربسماممكننقولإنهالدستورأوتعديلههوأوإمكانيةمسحهيئةفيجمهوريةبواليةثانيةهوديمقراطيضمناملجلسوالتصويتإلىماهنالكلرئيس
قناعتيفيهذااملوضوع.يعنيجوهرالديمقراطيةهذاباألقل
MSA and dialect mixing in speech
• Phonology, morphology and syntax
Source: Aljazeera transcript, http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm
Legend: MSA, LEV
![Page 7: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/7.jpg)
Why is Arabic processing hard?

| | Arabic | English |
|---|---|---|
| Orthographic ambiguity | More | Less |
| Orthographic inconsistency | More | Less |
| Morphological inflections | More | Less |
| Morpho-syntactic complexity | More | Less |
| Word order freedom | More | Less |
| Dialectal variation | More | Less |
![Page 8: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/8.jpg)
Computational Processing of Standard Arabic
• There has been a large and growing amount of work on Standard Arabic processing:
  – Multiple morphological analyzers and taggers
    • BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, etc.
  – Multiple treebanks and parsers
    • Penn ATB, Prague DTB, CATiB, Quran Corpus
  – Large collections of monolingual text
    • Gigaword, news collections, QALB, and others
  – Large collections of bilingual/multilingual text
    • UN corpus, news collections, etc.
  – Sentiment resources
    • ArSenL, SLSA, SAMAR, etc.
  – Not to mention the traditional resources on lexicography, morphology and syntax!
• Much more to do still!
• Resources and work on dialects are very limited in comparison.
![Page 9: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/9.jpg)
Why Work on Arabic Dialects?
• Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
  – Speech recognition and dialogue systems must model dialects
• Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
  – Text analytics of Arabic must include dialectal modeling
• Substantial Dialect-MSA differences impede direct application of MSA NLP tools
![Page 10: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/10.jpg)
Computational Challenges
• Enormous variety
  – Many dialects and sub-dialects, code switching
• Orthographic ambiguity
  – Under-specification and inconsistency
• Morphological complexity
  – More clitics and fewer morphological features than MSA
• Overall annotated resource poverty
  – There is a lot of monolingual raw data
  – Limited lexicons
  – Limited treebanks, propbanks, etc.
![Page 11: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/11.jpg)
Computational Solutions
• Treat Arabic dialects as different languages
  – Build resources and tools from scratch
    • Morphological analyzers, annotated treebanks, parallel data…
  – Pro: model different genres
  – Con: expensive, effort duplication
• Exploit similarity between dialects and MSA and among dialects
  – Convert (or relate) dialectal resources to MSA or vice versa to adapt
  – Pro: less duplication, exploits relationships
  – Con: there is a limit to how well this will work
• Hybrid approach
• Community standards
  – Orthography, morphological analysis, POS tagsets, treebanks, etc.
![Page 12: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/12.jpg)
Roadmap
• Introduction
• Orthographic processing
• Morphological processing
• Working on a new dialect?
• Future directions
![Page 13: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/13.jpg)
Dialectal Phonological Variations
• Major variants

| Letter | MSA | Dialects |
|---|---|---|
| ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/ |
| ث | /θ/ | /θ/, /t/, /s/ |
| ذ | /δ/ | /δ/, /d/, /z/ |
| ج | /ʤ/ | /ʤ/, /g/, /j/ |

• Some of many limited variants
  – /l/ → /n/: MSA /burtuqāl/ → LEV /burtʔān/ ‘orange’
  – /ʕ/ → /ħ/: MSA /kaʕk/ → EGY /kaħk/ ‘cookie’
  – Emphasis add/delete: MSA /fustān/ → LEV /fusṭān/ ‘dress’
![Page 14: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/14.jpg)
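Arabic Script Orthographic Variants

| Sound | IRQ | LEV | EGY | TUN | MOR |
|---|---|---|---|---|---|
| /ʤ/ | ج | ج | چ | ج | ج |
| /g/ | گ | چ | ج | ڨ | ڭ |
| /tʃ/ | چ | تش | تش | تش | تش |
| /p/ | پ | پ | پ | پ | پ |
| /v/ | ڤ | ڤ | ڤ | ڥ | ڥ |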
![Page 15: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/15.jpg)
Latin Script for Arabic?
• Several proposals to the Arabic Language Academy in the 1940s
• Said Akl Experiment (1961)
• Web Arabic (Arabizi, Arabish, Franco-arabe)
  – No standard, but common conventions

| عربي | IPA | Latin | عربي | IPA | Latin |
|---|---|---|---|---|---|
| أ إ آ ء ؤ ئ | /ʔ/ | ‘ 2 Ø | ث | /θ/ | th |
| ة | /a/, /t/ | a t | ط | /tˤ/ | t T 6 |
| ح | /ħ/ | H h 7 | ع | /ʕ/ | ‘ 3 Ø |
| خ | /x/ | kh 7’ x 8 | غ | /ʁ/ | g gh 3’ |
| ذ | /δ/ | th | ق | /q/ | q |
| ش | /ʃ/ | sh ch | ي | /y/ /ay/ /ī/ /ē/ | y, i, e, ai, ei, … |

[Image: Akl 1961]
![Page 16: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/16.jpg)
Lack of Orthographic Standards
• Orthographic inconsistency
• Egyptian /mabinʔulhalakʃ/
  – mAbinquwlhAlak$ مابنقولهالكش
  – mAbin&ulhAlak$ مابنؤلهالكش
  – mAbin}ulhAlak$ مابنئلهالكش
  – mAbinqulhAlak$ مابنقلهالكش
  – …
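The Latin strings in this list use the Buckwalter transliteration, where each ASCII symbol stands for one Arabic character. A minimal sketch of converting such strings to Arabic script follows; the function name `bw_to_arabic` and the choice to drop short-vowel diacritics by default are assumptions made for illustration, and the table covers only the basic letters and diacritics.

```python
# Partial Buckwalter-to-Arabic table: basic letters plus diacritics.
BW2AR = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062A",
    "v": "\u062B", "j": "\u062C", "H": "\u062D", "x": "\u062E", "d": "\u062F",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063A", "f": "\u0641", "q": "\u0642", "k": "\u0643", "l": "\u0644",
    "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics: tanween, short vowels, shadda, sukun.
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E", "u": "\u064F",
    "i": "\u0650", "~": "\u0651", "o": "\u0652",
}
DIACRITICS = set("FNKaui~o")


def bw_to_arabic(text, keep_diacritics=False):
    """Map Buckwalter symbols to Arabic; unknown symbols pass through unchanged."""
    out = []
    for ch in text:
        if ch in DIACRITICS and not keep_diacritics:
            continue  # undiacritized output, as text is usually written
        out.append(BW2AR.get(ch, ch))
    return "".join(out)


# The spelling variants of the same Egyptian word from the slide.
for variant in ["mAbinquwlhAlak$", "mAbin&ulhAlak$", "mAbin}ulhAlak$", "mAbinqulhAlak$"]:
    print(variant, bw_to_arabic(variant))
```

Printing the variants side by side makes the inconsistency concrete: the same spoken word surfaces as several distinct strings that differ only in how /ʔ/ and the long vowel are spelled.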
![Page 17: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/17.jpg)
Spelling Inconsistency
• Social media spelling variations
  – +ak
  – +aaaaak
  – +k
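One of these variations, letter elongation, can be reduced with a simple rule. The sketch below collapses any run of three or more identical characters to a single character; the threshold of three is an assumption chosen so that legitimate doubled letters (as in "illi") survive. It does not relate "+k" to "+ak", which needs morphological knowledge rather than string cleanup.

```python
import re

# Runs of 3+ identical characters are treated as speech-effect elongation.
_ELONGATION = re.compile(r"(.)\1{2,}")


def collapse_elongation(token):
    """Collapse runs of three or more identical characters to one character."""
    return _ELONGATION.sub(r"\1", token)


for token in ["+ak", "+aaaaak", "+k", "illi"]:
    print(token, "->", collapse_elongation(token))
```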
![Page 18: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/18.jpg)
Arabic Lexical Variation
• Arabic dialects vary widely lexically
• Arabic orthography allows consolidating some variations

| English | Table | Cat | Of | I want | There is | There isn’t |
|---|---|---|---|---|---|---|
| MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | ‘uridu اريد | yūjadu يوجد | lā yujadu لا يوجد |
| Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | kāyn كاين | mā kāynš ماكاينش |
| Egyptian | Tarabēza طربيزة | ‘oTTa قطة | bitāς بتاع | ςāwez عاوز | fī في | mafīš مفيش |
| Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | fī في | mā fi مافي |
| Iraqi | mēz ميز | bazzūna بزونة | māl مال | ‘arīd اريد | aku اكو | māku ماكو |
![Page 19: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/19.jpg)
CODA: A Conventional Orthography for Dialectal Arabic
• Developed by CADIM for computational processing
• Objectives
  – CODA covers all DAs, minimizing differences in choices
  – CODA is easy to learn and produce consistently
  – CODA is intuitive to readers unfamiliar with it
  – CODA uses Arabic script
• Inspired by previous efforts from the LDC and linguistic studies
![Page 20: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/20.jpg)
CODA Examples

| CODA | ماشفتش | صحابي | الفترة | اللي | قبل | الامتحانات |
|---|---|---|---|---|---|---|
| Gloss | I did not see | my friends | the period | which | before | the exams |
Spelling variants
متحاناتإلا بلأ ـىاللـ هالفتر ـىصحابـ شفتشماـمتحاناتلـا بلا لليإ ةرطـالفـ حابيوصـ شفتشمـ
ناتـحـاالمتـ abl ـىللـإ هرطـالفـ ـىحابـوصـ فتشوماشـناتـحـمتـإلا qbl ـيلـا il�ra Su7abi فتشوشـماناتـحــمتـلـا qabl لىا sohaby فتشوشـمـ
ilimti7anat ـيإلـ mashoftish
limtihanaat إلىilli
![Page 21: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/21.jpg)
CODA Examples

| Phenomenon | Original | CODA |
|---|---|---|
| Spelling errors | الاجابه | الإجابة |
| Typos | شبب | سبب |
| Speech effects | كبييييييييير | كبير |
| Merges | اليومبريستيج | اليوم بريستيج |
| Splits | المع روف | المعروف |
| MSA root cognate | آلب، كلب | قلب |
| Dialectal clitic guidelines | عهلبيت، مشفناش | عهالبيت، ماشافناش |
| Unique dialect words | بردو، برضو | برضه |
![Page 22: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/22.jpg)
CODAfication: Raw Orthography to CODA Conversion
• What:
  – Converts from raw DA orthography to CODA
  – Corrects typos and various speech effects
• Approach
  – Eskander et al. (2012) (CODAFY)
    • Models specific phenomena: hamza, plural wA suffix, etc.
    • Supervised learning
    • Classification problem
  – Farra et al. (2014)
    • Generalized character replacement model
  – Best results: integrated in morphological analysis (MADA-ARZ)
• Example:
  – Input: مشفتش صحابى الفتره الى فاتت (m$ft$ SHAbY Alftrh AlY fAtt)
  – Output: ما شفتش صحابي الفترة اللي فاتت (mA $ft$ SHAby Alftrp Ally fAtt)
• Evaluation: Egyptian Arabic

| CODAfication | Accuracy (tokens) | A/Y Norm. Accuracy (tokens) |
|---|---|---|
| Baseline (doing nothing) | 76.8% | 90.5% |
| CODAFY v0.4 | 91.5% | 95.2% |
| MADA-ARZ | 92.9% | 95.5% |
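The two accuracy columns differ only in whether Alif and Ya variants are normalized before tokens are compared. The sketch below shows one way to compute both figures; the exact normalization map (hamzated and madda Alif forms to bare Alif, Alif Maqsura to Ya) is an assumption about what "A/Y Norm." covers, and the token lists are taken from the slide's example with the first, differently tokenized word left out so that the sequences align one-to-one.

```python
# Assumed Alif/Ya normalization: collapse Alif variants and Alif Maqsura.
AY_MAP = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0622": "\u0627",  # آ -> ا
    "\u0649": "\u064A",  # ى -> ي
})


def ay_normalize(token):
    """Apply the assumed Alif/Ya normalization to one token."""
    return token.translate(AY_MAP)


def token_accuracy(predicted, gold, normalize=False):
    """Fraction of aligned tokens that match, optionally after A/Y normalization."""
    if len(predicted) != len(gold):
        raise ValueError("token sequences must be aligned one-to-one")
    if normalize:
        predicted = [ay_normalize(t) for t in predicted]
        gold = [ay_normalize(t) for t in gold]
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


raw = ["صحابى", "الفتره", "الى", "فاتت"]    # raw input ("doing nothing" baseline)
coda = ["صحابي", "الفترة", "اللي", "فاتت"]  # gold CODA
print(token_accuracy(raw, coda))
print(token_accuracy(raw, coda, normalize=True))
```

Normalization forgives the ى/ي difference in the first token but not the ه/ة or missing-letter differences, which is why the normalized baseline is higher yet still short of the CODAfication systems.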
![Page 23: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/23.jpg)
3Arrib: Arabizi-to-Arabic Conversion
• A system for automatic mapping of Arabizi to Arabic script in CODA
• Evaluation
  – Transliteration correct for 83.6% of Arabic words and names
• Examples
  – Arabizi: ana msh 3aref a2ra elly enta katbo
    Buckwalter: AnA m$ EArf AqrA Ally Ant kAtbh
    Arabic: انا مش عارف اقرا اللي انت كاتبه
  – Arabizi: wfelaa5er tele3 fshenk w mab2raash arabic
    Buckwalter: wflAxr TlE f$nk w mab2raash ArAbyk
    Arabic: و+فال+اخر طلع فشنك و mab2raash ارابيك
(Al-Badrashiny et al., CONLL 2014; Eskander et al., EMNLP Code Switch Workshop 2014)
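To see why Arabizi conversion is ambiguous, it helps to enumerate the Arabic spellings a single Arabizi word could stand for. The sketch below is a naive character-level candidate generator, not the method of the cited system: the mapping table is a small illustrative subset, the `5` → خ entry is inferred from the example "a5er", and vowels are allowed to map to nothing because short vowels are usually unwritten in Arabic script. A real system would rank such candidates with a language model or morphological analyzer.

```python
from itertools import product

# Illustrative, incomplete Arabizi-to-Arabic candidate table (an assumption).
CANDIDATES = {
    "2": ["ء"], "3": ["ع"], "5": ["خ"], "6": ["ط"], "7": ["ح"],
    "sh": ["ش"], "kh": ["خ"], "gh": ["غ"], "th": ["ث", "ذ"],
    "a": ["", "ا"], "e": ["", "ي"], "i": ["", "ي"], "o": ["", "و"], "u": ["", "و"],
    "b": ["ب"], "d": ["د", "ض"], "f": ["ف"], "h": ["ه", "ح"], "k": ["ك"],
    "l": ["ل"], "m": ["م"], "n": ["ن"], "q": ["ق"], "r": ["ر"],
    "s": ["س", "ص"], "t": ["ت", "ط"], "w": ["و"], "y": ["ي"],
}


def arabizi_candidates(word):
    """Return every Arabic spelling reachable through the candidate table."""
    options = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2].lower()
        if len(pair) == 2 and pair in CANDIDATES:  # prefer digraphs such as "sh"
            options.append(CANDIDATES[pair])
            i += 2
        else:
            options.append(CANDIDATES.get(word[i].lower(), [word[i]]))
            i += 1
    return {"".join(choice) for choice in product(*options)}


print(sorted(arabizi_candidates("3aref")))
```

Even a five-character word yields several candidates, and the intended spelling عارف is only one of them.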
![Page 24: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/24.jpg)
3Arrib: http://nlp.ldeo.columbia.edu/arrib/
![Page 25: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/25.jpg)
Roadmap
• Introduction
• Orthographic processing
• Morphological processing
• Working on a new dialect?
• Future directions
![Page 26: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/26.jpg)
Dialectal Arabic Morphological Variation
• Nouns
  – No case marking
    • Word order implications
  – Paradigm reduction
    • Consolidating masculine & feminine plural
• Verbs
  – Paradigm reduction
    • Loss of dual forms
    • Consolidating masculine & feminine plural (2nd, 3rd person)
    • Loss of morphological moods
      – Subjunctive/jussive form dominates in some dialects
      – Indicative form dominates in others
• Other aspects increase in complexity
![Page 27: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/27.jpg)
DA Morphological Variation: Verb Morphology
Morpheme labels in the figure: conj, verb, object, subj, tense, IOBJ, neg
• MSA: ولم تكتبوها له
  /walam taktubūhā lahu/
  /wa+lam taktubū+hā la+hu/
  and+not_past write_you+it for+him
• EGY: وماكتبتوهالوش
  /wimakatabtuhalūʃ/
  /wi+ma+katab+tu+ha+lū+ʃ/
  and+not+wrote+you+it+for_him+not
‘And you didn’t write it for him’
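The segmentation and gloss lines above are parallel, so they can be aligned mechanically. The sketch below pairs each morpheme with its gloss and counts orthographic words; the helper names are hypothetical, and the strings are copied from the slide.

```python
import re


def align_morphemes(segmentation, gloss):
    """Pair each morpheme with its gloss; both are split on '+' and whitespace."""
    morphs = [m for m in re.split(r"[+\s]+", segmentation.strip("/")) if m]
    glosses = [g for g in re.split(r"[+\s]+", gloss) if g]
    if len(morphs) != len(glosses):
        raise ValueError("segmentation and gloss have different lengths")
    return list(zip(morphs, glosses))


def word_count(segmentation):
    """Number of whitespace-separated orthographic words."""
    return len(segmentation.strip("/").split())


msa_seg = "/wa+lam taktubū+hā la+hu/"
egy_seg = "/wi+ma+katab+tu+ha+lū+ʃ/"
msa = align_morphemes(msa_seg, "and+not_past write_you+it for+him")
egy = align_morphemes(egy_seg, "and+not+wrote+you+it+for_him+not")

print(word_count(msa_seg), "words,", len(msa), "morphemes:", msa)
print(word_count(egy_seg), "word,", len(egy), "morphemes:", egy)
```

The Egyptian form packs all of its morphemes, including the split negation ma+...+ʃ, into a single orthographic word, while MSA spreads its morphemes over three words.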
![Page 28: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/28.jpg)
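DA Morphological Variation

| | Perfect: Past | Imperfect: Subjunctive | Imperfect: Present habitual | Imperfect: Present progressive | Imperfect: Future |
|---|---|---|---|---|---|
| MSA | كتب /kataba/ | يكتب /jaktuba/ | يكتب /jaktubu/ | يكتب /jaktubu/ | سيكتب /sajaktubu/ |
| LEV | كتب /katab/ | يكتب /jiktob/ | بيكتب /bjoktob/ | عم بيكتب /ʕam bjoktob/ | حيكتب /ħajiktob/ |
| EGY | كتب /katab/ | يكتب /jiktib/ | بيكتب /bjiktib/ | بيكتب /bjiktib/ | هيكتب /hajiktib/ |
| IRQ | كتب /kitab/ | يكتب /jiktib/ | يكتب /jiktib/ | ديكتب /dajiktib/ | رح يكتب /raħ jiktib/ |
| MOR | كتب /kteb/ | يكتب /jekteb/ | كيكتب /kjekteb/ | كيكتب /kjekteb/ | غيكتب /ʁajekteb/ |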
![Page 29: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/29.jpg)
DA Morphological Variation: Verb conjugation

| | Perfect 1S | Perfect 2S♂ | Perfect 2S♀ | Imperfect 1S | Imperfect 1P | Imperfect 2S♀ |
|---|---|---|---|---|---|---|
| MSA | كتبت /katabtu/ | كتبت /katabta/ | كتبت /katabti/ | اكتب /aktubu/ | نكتب /naktubu/ | تكتبين /taktubīna/, تكتبي /taktubī/ |
| LEV | كتبت /katabt/ | كتبت /katabt/ | كتبتي /katabti/ | اكتب /aktob/ | نكتب /noktob/ | تكتبي /toktobi/ |
| IRQ | كتبت /kitabt/ | كتبت /kitabt/ | كتبتي /kitabti/ | اكتب /aktib/ | نكتب /niktib/ | تكتبين /tikitbīn/ |
| MOR | كتبت /ktebt/ | كتبتي /ktebti/ | كتبتي /ktebti/ | نكتب /nekteb/ | نكتبوا /nektebu/ | تكتبي /tektebi/ |
![Page 30: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/30.jpg)
Morphological Ambiguity
• Morphological richness
  – Token Arabic/English = 80%
  – Type Arabic/English = 200%
• Morphological ambiguity
  – Each word: 12.3 analyses and 2.7 lemmas
• Derivational ambiguity
  – العين: the eye, the water spring, Al-Ain city, the notable
![Page 31: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/31.jpg)
Analysis vs. Disambiguation
هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:

| POS | Diacritized form | Gloss |
|---|---|---|
| PV+PVSUFF_SUBJ:3MS | bay~an+a | He demonstrated |
| PV+PVSUFF_SUBJ:3FP | bay~an+~a | They demonstrated (f.p) |
| NOUN_PROP | biyn | Ben |
| ADJ | bay~in | Clear |
| PREP | bayn | Between, among |

• Morphological Analysis is out-of-context
• Morphological Disambiguation is in-context
![Page 32: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/32.jpg)
![Page 33: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/33.jpg)
[Figure: words W-4 … W4 pass through a morphological analyzer, morphological classifiers and a ranker, which orders the candidate analyses (1st … 5th)]
MORPHOLOGICAL ANALYZER
• Rule-based
• Human-created
MORPHOLOGICAL CLASSIFIERS
• Multiple independent classifiers
• Corpus-trained
RANKER
• Heuristic or corpus-trained
MADA (Habash & Rambow 2005; Roth et al. 2008); MADAMIRA (Pasha et al., 2014)
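The analyzer, classifier and ranker stages can be sketched in a few lines. This is a simplified illustration rather than the actual MADA implementation: the analyses mirror the بين example from the earlier slide, the feature names, classifier predictions and weights are hypothetical, and the ranker is the heuristic variant that scores each analysis by its weighted agreement with the predictions.

```python
# Out-of-context analyses, as a rule-based analyzer might return them.
ANALYSES = [
    {"gloss": "he demonstrated", "pos": "verb", "gen": "m", "num": "s"},
    {"gloss": "they demonstrated (f.p)", "pos": "verb", "gen": "f", "num": "p"},
    {"gloss": "Ben", "pos": "noun_prop", "gen": "m", "num": "s"},
    {"gloss": "clear", "pos": "adj", "gen": "m", "num": "s"},
    {"gloss": "between, among", "pos": "prep", "gen": "na", "num": "na"},
]


def rank_analyses(analyses, predictions, weights=None):
    """Order analyses by weighted agreement with per-feature classifier predictions."""
    weights = weights or {}

    def score(analysis):
        return sum(
            weights.get(feature, 1.0)
            for feature, value in predictions.items()
            if analysis.get(feature) == value
        )

    # sorted() is stable, so ties keep the analyzer's original order.
    return sorted(analyses, key=score, reverse=True)


# Hypothetical in-context predictions from independent classifiers.
predictions = {"pos": "noun_prop", "gen": "m", "num": "s"}
weights = {"pos": 2.0}  # hypothetical: trust the POS classifier more

for analysis in rank_analyses(ANALYSES, predictions, weights):
    print(analysis["gloss"])
```

The design point is that the classifiers never produce an analysis themselves; they only vote on features, and the ranker selects among the analyses the analyzer licensed.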
![Page 34: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/34.jpg)
MADAMIRA
• Newest tool from the CADIM group (Pasha et al., 2014)
• Combines MADA (Habash & Rambow, 2005) and AMIRA (Diab et al., 2004)
  – Morphological disambiguation
  – Tokenization
  – Base phrase chunking
  – Named entity recognition
• MSA and Egyptian Arabic modes
• Server mode with XML interface
• Online demo
  – http://nlp.ldeo.columbia.edu/madamira/
  – http://camel.abudhabi.nyu.edu/madamira/
[Figure: Input Arabic Text, Morphological Disambiguation, Tokenization, Base Phrase Chunking, Named Entity Recognition, User NLP Applications]
![Page 35: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/35.jpg)
Morphological Disambiguation
Systems: MDMRA-MSA, MADA-ARZ

| Training data | MSA | MSA | ARZ | MSA+ARZ |
|---|---|---|---|---|
| Test set | MSA | EGY | EGY | EGY |
| All | 84.3% | 27.0% | 75.4% | 64.7% |
| POS + Features | 85.4% | 35.7% | 84.5% | 75.5% |
| Full diacritization | 86.4% | 32.2% | 83.2% | 72.2% |
| Lemmatization | 96.1% | 67.1% | 86.3% | 82.8% |
| Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4% |
| ATB segmentation | 99.1% | 90.5% | 97.4% | 97.5% |

Example: wkAtb وكاتب ‘and (the) writer of’
  Analysis: wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0
  Tokenization: w+ kAtb
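The analysis line in this example is a diacritized form, a lemma, and a list of `feature:value` pairs, so it is easy to load into a dictionary. The sketch below parses the line exactly as it is laid out on the slide; that layout, and the function name `parse_analysis`, are assumptions, and the tool's real output files may be organized differently.

```python
def parse_analysis(line):
    """Split 'diac lemma feat:value ...' into its parts (slide layout assumed)."""
    fields = line.split()
    diac, lemma, pairs = fields[0], fields[1], fields[2:]
    # Split on the first ':' only, in case a value itself contains a colon.
    features = dict(pair.split(":", 1) for pair in pairs)
    return {"diac": diac, "lemma": lemma, "features": features}


line = ("wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 "
        "asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0")
analysis = parse_analysis(line)

print(analysis["lemma"], analysis["features"]["pos"])
# Proclitic slots that are filled (value other than "0") drive tokenization.
print({k: v for k, v in analysis["features"].items() if k.startswith("prc") and v != "0"})
```

In this example only `prc2` is filled (the conjunction wa), which corresponds to the `w+` token split off in the tokenization line.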
![Page 36: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/36.jpg)
CALIMA-Egyptian v0.5
• CALIMA is the Columbia Arabic Language Morphological Analyzer
• CALIMA-EGY
  – Extends the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002) and the Standard Arabic Morphological Analyzer (SAMA) (Graff et al., 2009).
  – Follows the part-of-speech (POS) guidelines used by the LDC for Egyptian Arabic (Maamouri et al., 2012b).
  – Accepts multiple orthographic variants and normalizes them to CODA (Habash et al., 2012).
  – Incorporates annotations by the LDC for Egyptian Arabic (~250K words).
![Page 37: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/37.jpg)
CALIMA-ARZ Example
Input: mktbtlhA$ مكتبتلهاش
• Analysis 1
  – Lemma: katab_1
  – CODA: mA_katabt_lahA$
  – POS: mA/NEG_PART+katab/PV+t/PVSUFF_SUBJ:2MS++li/PREP+hA/PRON_3FS+$/NEG_PART
  – Gloss: not + write + you + to/for + it/them/her + not
• Analysis 2
  – Lemma: katab_1
  – CODA: mA_katabit_lahA$
  – POS: mA/NEG_PART+katab/PV+it/PVSUFF_SUBJ:3FS+li/PREP+hA/PRON_3FS+$/NEG_PART
  – Gloss: not + write + she/it/they + to/for + it/them/her + not
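The POS strings in this example are `morpheme/TAG` units joined by `+`, so the two analyses of the same input can be compared unit by unit. The sketch below is a small parser written for this slide's notation only (the function name is hypothetical); it skips the empty unit produced by the doubled `++` in the first analysis.

```python
def parse_pos(pos_string):
    """Split 'morph/TAG+morph/TAG...' into (morpheme, tag) pairs."""
    pairs = []
    for unit in pos_string.split("+"):
        if not unit:  # a doubled '++' leaves an empty unit
            continue
        morph, tag = unit.split("/", 1)
        pairs.append((morph, tag))
    return pairs


first = parse_pos("mA/NEG_PART+katab/PV+t/PVSUFF_SUBJ:2MS++li/PREP+hA/PRON_3FS+$/NEG_PART")
second = parse_pos("mA/NEG_PART+katab/PV+it/PVSUFF_SUBJ:3FS+li/PREP+hA/PRON_3FS+$/NEG_PART")

# Report the positions where the two analyses disagree.
differences = [(a, b) for a, b in zip(first, second) if a != b]
print(differences)
```

The only disagreement is the subject suffix (second person masculine singular versus third person feminine singular), which the undiacritized spelling cannot distinguish.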
![Page 38: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/38.jpg)
CALIMA-Egyptian v0.5
• Incorporates LDC ARZ annotations (p1-p6)
  – 251K tokens, 52K types
  – Annotation cleanup needed
  – Extends SAMA (Standard Arabic Morph Analyser)

| System | Token Recall | Type Recall |
|---|---|---|
| SAMA v3.1 (Standard Arabic) | 67.7% | 59.7% |
| CALIMA-EGY v0.5 (Egyptian core) | 88.7% | 75.8% |
| CALIMA-EGY v0.5 (++ SAMA dialect extensions) | 92.6% | 81.5% |
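Token recall counts every running word, while type recall counts each distinct word once. The sketch below computes both for a toy corpus; the `covered` predicate stands in for "the analyzer handles this word", which is an assumption about how recall is defined here, and the Buckwalter-style words and lexicon are made up for illustration.

```python
from collections import Counter


def token_and_type_recall(tokens, covered):
    """Return (token recall, type recall) for a coverage predicate."""
    counts = Counter(tokens)
    token_hits = sum(count for word, count in counts.items() if covered(word))
    type_hits = sum(1 for word in counts if covered(word))
    return token_hits / len(tokens), type_hits / len(counts)


# Toy corpus: one frequent covered word, one rare covered word, one uncovered word.
tokens = ["fy", "fy", "fy", "m$", "mfy$"]
lexicon = {"fy", "m$"}

token_recall, type_recall = token_and_type_recall(tokens, lambda w: w in lexicon)
print(f"token recall {token_recall:.1%}, type recall {type_recall:.1%}")
```

Frequent words are the ones most likely to be covered, so token recall is typically higher than type recall, the same pattern as in all three rows of the table.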
![Page 39: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/39.jpg)
![Page 40: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/40.jpg)
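Morphological Disambiguation
Systems: MDMRA-MSA, MDMRA-EGY

| Training data | MSA | MSA | EGY | MSA+EGY |
|---|---|---|---|---|
| Test set | MSA | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) |
| All | 84.3% | 27.0% | 75.4% | 64.7% |
| POS + Features | 85.4% | 35.7% | 84.5% | 75.5% |
| Full diacritization | 86.4% | 32.2% | 83.2% | 72.2% |
| Lemmatization | 96.1% | 67.1% | 86.3% | 82.8% |
| Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4% |
| ATB segmentation | 99.1% | 90.5% | 97.4% | 97.5% |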
![Page 41: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/41.jpg)
MADAMIRA: http://camel.abudhabi.nyu.edu/madamira/
![Page 42: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/42.jpg)
![Page 43: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/43.jpg)
![Page 44: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/44.jpg)
Roadmap
• Introduction
• Orthographic processing
• Morphological processing
• Working on a new dialect?
• Future directions
![Page 45: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/45.jpg)
Towards Morphological Tagging of a New Dialect?
• Review the literature
  – Hidden gems from previous efforts
• Data Collection
• Data Annotation
  – Guidelines: CODA, POS tags, etc.
  – Noisy automatic processing: Egyptian MADAMIRA?
  – Training annotators, quality control
  – This is necessary, at least for benchmarking
• Building the Morphological Analyzer
  – Eskander et al. (2013)’s technique for paradigm completion
  – Salloum and Habash’s (2011) ADAM method for extending MSA
• Building the Morphological Tagger
  – MADAMIRA framework, e.g. Egyptian Arabic (Habash et al. 2012)
  – Other tagging techniques
![Page 46: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/46.jpg)
Towards Morphological Tagging of a New Dialect? (continued)
Examples:
• Curras Corpus (Jarrar et al., 2014)
• Gumar Corpus (Khalifa et al., 2016)
![Page 47: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/47.jpg)
The Gumar Corpus: A Morphologically Annotated Corpus of Gulf Arabic
• ~100 million words
• Mainly long conversational novels published anonymously online (روايات النت ‘Internet novels’).
• Writers of the novels remain anonymous under pen names. Although there is no claim of copyright, it is conventional to credit the writer when the material is copied or transferred, as per the writer's request.
![Page 48: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/48.jpg)
السالم علیكم
القصه هاذي قطریه روعه أتمنى انها تعجبكم طبعا أهي قصه منقوله من منتدى ثاني
وطبعا مصرحه الكاتبه نقل القصه مع ذكر اسمها وهي الكاتبهتحفه فنیه )) القطریه ((
نبدأ .....
الكاتبة تحفة فنیة
الفصل االول :-
وضحة والتوتر بدا یظهر علیها : الجازي ماتدرین عمي متى بیجي ؟ الجازي : واهللا یختس مدري بس ماهو باطي ،اله انت وشعندس الیوم على ابوي
؟ اخبرس ما تحبین مقعاد معاه ؟توترها : سالمتس بس بغیت اسلم علیه قبل ما وضحة وهي تحاول السیطرة على یجي حمد و نروح البیت ، قدلي كم مرة اجي وال القاه عد مهب عدله من زمان
ماوجهته . الجازي وهي تغمز عینها : ماوجهتي ابوي وال تنطرین ناس ؟
خجل على طول صار وجه وضحة احمر مثل الطماطم ، والجازي اعتبرت انه وتمت تضحك على وضحة ما تدري ان سبب احمرار وضحة هو القهر وجرح
الكرامة الى تحس به من بدت تلمح عن راشد و تقول في نفسها ماتدرین یالجازي، وفي هذه اللحظة انزلت علیهم ام راشد مرت عم ان اتمنه العمى وال اشوفه
وضحة جایه من غرفتها وفي ایدها كیسه كبیره ومدته على وضحة وهي تقول :خلها توزعه كلن وضحة یمس هذي صوغتن لكم من عند راشد عطیها امس
تعطیه حقه .
An example of raw text (Qatari) from a novel
![Page 49: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/49.jpg)
Gumar Corpus Statistics
| Unit | Count |
|---|---|
| Words | 112,410,688 |
| Sentences | 9,335,224 |
| Documents | 1,236 |
• Words are whitespace tokenized and the counts include punctuation.
• Number of sentences represents the number of lines. • Each document generally represents a single novel
![Page 50: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/50.jpg)
Gumar Corpus Dialect Distribution (Document level)

| Dialect | Percentage |
|---|---|
| SA | 60.52 |
| AE | 13.35 |
| KW | 5.91 |
| OM | 1.13 |
| QA | 0.65 |
| BH | 0.94 |
| GA (other) | 10.03 |
| Arabic (other) | 7.93 |

• 92% of the corpus is written in GA, with SA being the most dominant.
• GA (other) covers novels that combine several GA dialects, as well as cases of dialect ambiguity (esp. between OM, QA and AE).
• The rest of the corpus (7.93%) is mostly MSA (original text or translation attempts of existing non-Arabic text) and other DA such as Egyptian, Iraqi, Levantine, etc.
![Page 51: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/51.jpg)
Morphological Analysis Evaluation
• A preliminary investigation into GA annotation was performed.
• 4000 words of text were annotated manually for:
  – Orthography (CODA)
  – Morphology (tokenization)
  – Part-of-speech
  – Lemma
• The same text was given to MADAMIRA (MSA & EGY)
  – Outputs were then evaluated against the gold standard.
![Page 52: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/52.jpg)
Gulf CODA
• CODA: Conventional Orthography for Dialectal Arabic (Habash et al. 2012).
• CODA guidelines exist for both EGY and PAL (Palestinian Arabic).
• CODA guidelines for different dialects share general rules that apply to all.
• Exceptional cases differ from one dialect to another.
![Page 53: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/53.jpg)
Gulf CODA
• One main feature that differs among dialects is the root consonant mapping rules.

| MSA/CODA | Variants | CODA compliant | CODA non-compliant |
|---|---|---|---|
| قدام | /q/ or /ɡ/ or /ʤ/ | ق | جدام |
| كبد، كذب | /k/ or /ʧ/ or /ts/ | ك | جبد، تسذب |
| جلس | /ʤ/ or /j/ | ج | يلس |
| شاي | /ʃ/ or /ʧ/ | ش | چاي |

• General rules: spelling Al, Ta Marbuta, clitic attachment
• Other examples of specific spelling: سيدا، مب، مانيب، +ج\+ك
![Page 54: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/54.jpg)
CODAfied text examples
Example 1
  Raw: ياويلتس منتس هالحتسي اسمع
  CODA: ياويلج منج هالحكي اسمع
  English:
Example 2
  Raw: جاهز؟ الغدى عسى
  CODA: جاهز؟ الغدا عسى
  English:
Example 3
  Raw: الجامعهفياللحنياناصغيررونهمنيبساره
  CODA: الجامعةفيالحنياناصغيرونةمانيبسارة
  English:
![Page 55: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/55.jpg)
An Annotation Example
![Page 56: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/56.jpg)
Morphological Analysis Evaluation
• A preliminary investigation into GA (Gulf Arabic) annotation was performed.
• 4,000 words of text were annotated manually for:
– Orthography (CODA)
– Morphology (tokenization)
– Part-of-speech
– Lemma
• The same text was given to MADAMIRA (MSA & EGY modes)
– The outputs were then evaluated against the gold standard.
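The evaluation described above can be sketched as a per-feature accuracy computation. This is a minimal sketch under stated assumptions: it assumes the gold and automatic analyses are aligned word by word and that a feature counts as correct only on exact match. The slides do not specify the scoring details, and `FEATURES` and `feature_accuracy` are hypothetical names.

```python
# Hypothetical feature keys mirroring the four annotated layers.
FEATURES = ("ortho", "morph", "pos", "lemma")


def feature_accuracy(gold, predicted):
    """Percentage of words whose predicted value equals gold, per feature.

    `gold` and `predicted` are equal-length lists of dicts, one per word,
    each mapping a feature name to its value (assumed exact-match scoring).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned word by word")
    if not gold:
        raise ValueError("cannot score an empty sample")
    scores = {}
    for feat in FEATURES:
        correct = sum(g[feat] == p[feat] for g, p in zip(gold, predicted))
        scores[feat] = 100.0 * correct / len(gold)
    return scores
```

Running the same function once on the MSA-mode output and once on the EGY-mode output would give two comparable columns of scores.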
![Page 57: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/57.jpg)
Morphological Analysis Evaluation
• Accuracy of the automatic output of MADAMIRA in two modes (MSA and EGY), measured against the manually annotated features
• MADAMIRA-EGY outperforms MADAMIRA-MSA on all four features, confirming that it is the better baseline for manual annotation.
• Similar conclusions were reported by Jarrar et al. (2014)
| Feature | MADAMIRA-MSA | MADAMIRA-EGY |
|---|---|---|
| Ortho | 83.81 | 88.34 |
| Morph | 76.16 | 83.62 |
| POS | 72.37 | 80.39 |
| Lemma | 64.03 | 81.51 |
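To make the size of the gap concrete, the reported accuracies can be turned into absolute gains and relative error reductions. The numbers below are copied from the table; the error-reduction measure and the names `RESULTS` and `gains` are additions for illustration and do not appear in the talk.

```python
# Accuracies (%) from the table: feature -> (MADAMIRA-MSA, MADAMIRA-EGY).
RESULTS = {
    "Ortho": (83.81, 88.34),
    "Morph": (76.16, 83.62),
    "POS": (72.37, 80.39),
    "Lemma": (64.03, 81.51),
}


def gains(results):
    """Return {feature: (absolute gain in points, relative error reduction in %)}."""
    out = {}
    for feat, (msa, egy) in results.items():
        absolute = egy - msa
        # Share of the MSA-mode errors that the EGY mode removes.
        error_reduction = 100.0 * absolute / (100.0 - msa)
        out[feat] = (round(absolute, 2), round(error_reduction, 1))
    return out


if __name__ == "__main__":
    for feat, (abs_gain, err_red) in gains(RESULTS).items():
        print(f"{feat}: +{abs_gain:.2f} points, {err_red:.1f}% error reduction")
```

The largest absolute gain is on Lemma (17.48 points), the smallest on Ortho (4.53 points).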
![Page 58: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/58.jpg)
Summary & Future Directions
• Arabic dialects pose many challenges to NLP
– No orthographic standards
– Limited resources
– Large number of differences from MSA
• A combination of solutions works best
– Exploit similarities between dialects and MSA
– Exploit similarities among dialects
– Address differences through resource building
• Our goal is to bring basic support for MSA and the dialects to the level of English
– So we can focus more on higher-level applications!
![Page 59: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/59.jpg)
Summary & Future Directions
Although dialect processing may seem daunting, just remember
• Breathe! There are rules in the dialects. Just not the same rules as the ones in MSA.
• All these challenges are amazing opportunities to advance NLP
– Not just for Arabic but for all languages.
• For Arabic native speakers, working with dialects is an eye opener (and can be a lot of fun!)
![Page 60: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/60.jpg)
Announcements
• Project MADAR
– Multi-Arabic Dialect Applications and Resources
– QNRF funded project
– Collaboration among CMUQ, NYUAD and Columbia
– Modeling 25 Arabic city dialects
  • Lexical resources, parallel data, dialect id, dialect MT
– Looking for linguists and postdocs!
• WARDAT 2016
– First Workshop on Arabic Dialect Technologies
– Discuss future of collaborations on Arabic Dialect Technologies
– Funded by the NYUAD Institute; to be held at NYU Abu Dhabi
– By invitation. Limited slots. Contact me if interested.
• CAMeL Lab
– Hiring postdocs!
– Funded NYU PhD Program in Computer Science.
– Contact me if interested.
![Page 63: Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions](https://reader033.vdocuments.mx/reader033/viewer/2022050914/588863501a28abad0d8b55f7/html5/thumbnails/63.jpg)
• http://nyuad.nyu.edu/en/
Thank You! Questions?