unicode - arabic script tutorial

ARABIC SCRIPT TUTORIAL Thomas Milo - DecoType

1 INTRODUCTION

Like all multi-lingual computing, Arabic computing is now firmly in the domain of Unicode. Unicode is an industrial protocol with the status of international agreement. It is designed to encode the elements of all known script systems in such a way that they become interchangeable between programs and operating systems. Its implementation is well underway. Unicode eliminates the need to tamper with fonts to get special characters, but it is not a font. For legible text on screen and paper, Unicode depends on compatible fonts with the required characters, where necessary with additional dedicated font technology.


29th Internationalization and Unicode Conference, March 2006, San Francisco, CA

2 THE ARABIC ALPHABET

a. the primary character inventory

The Arabic alphabet is related to the Latin alphabet, as can be seen from its historical sorting order A/ALEF, B/BEH, C/JEEM, D/DAL:

Its modern sorting order is on the basis of similarity of the letters:

The modern morphological order can be broken down as follows:

Historical initial letter ALEF

Similar letters b t ṯ

Similar letters ǧ ḥ ḫ

Similar letters d ḏ r z

Similar letters s š ṣ ḍ ṭ ẓ

Similar letters ʿ ġ f q

Historical group k l m n

Rest h w y

b. derived primary characters

There is a number of letters, mostly skeleton-cum-mark combinations, that do not have independent status in orthography or sorting order:

The hamza diacritic and its various supporting letters

Morphophonologic use of YEH

Morphophonologic use of HEH



c. the secondary character inventory

Arabic spelling is not fully alphabetic: only short consonants and long vowels are written with the primary character set. For elaborate spelling or casual disambiguation, a set of secondary characters exists. They are written above or below a primary character, e.g.:

U+064E FATHA to mark the vowel /a/ ◌َ

U+064F DAMMA to mark the vowel /u/ ◌ُ

U+0650 KASRA to mark the vowel /i/ ◌ِ

e.g.: kitābi

Traditionally, a repetition of the vowel marks is used at the end of a word to indicate that the indefinite article /-n/ is attached to the vowel:

FATHA – FATHA to indicate /a-n/ ◌

DAMMA - DAMMA to indicate /u-n/ ◌

KASRA - KASRA to indicate /i-n/ ◌

e.g.: kitābi-n

Unicode deals with repeated vowel markers as if they are separate characters. This is a legacy from the metal typesetting era, when it was impossible to compose such minute superscript or subscript groups:

FATHATAN to mark the vowel /a/ + n ◌ً

DAMMATAN to mark the vowel /u/ + n ◌ٌ

KASRATAN to mark the vowel /i/ + n ◌ٍ

NOTA BENE: the ending –TAN, added to the original name, means “twice”.
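The separate encoding of the doubled vowel markers can be checked with a short Python sketch. It uses only the standard unicodedata module; the names PAIRS and is_equivalent_to_doubled are illustrative and do not come from the tutorial. The sketch asks whether a vowel mark typed twice is canonically equivalent to its -TAN counterpart, which it is not expected to be.

```python
import unicodedata

# Single vowel marks and their "doubled" counterparts as encoded in Unicode.
PAIRS = [
    ("\u064E", "\u064B"),  # FATHA / FATHATAN
    ("\u064F", "\u064C"),  # DAMMA / DAMMATAN
    ("\u0650", "\u064D"),  # KASRA / KASRATAN
]

def is_equivalent_to_doubled(single, doubled):
    """True if the single mark typed twice is canonically equivalent to the doubled mark."""
    typed_twice = unicodedata.normalize("NFC", single * 2)
    return typed_twice == unicodedata.normalize("NFC", doubled)

for single, doubled in PAIRS:
    print(
        f"U+{ord(single):04X} {unicodedata.name(single)} x2 vs "
        f"U+{ord(doubled):04X} {unicodedata.name(doubled)}: "
        f"equivalent={is_equivalent_to_doubled(single, doubled)}"
    )
```

In other words, a text that spells the ending with two FATHA code points will not match a search for FATHATAN unless the software adds its own folding rule.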



d. direction of writing

M H M D

Arabic script runs from RIGHT to LEFT:

D M H M
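For readers who work with encoded text, the following Python sketch may help. It relies on a fact that this section does not spell out: Unicode stores Arabic text in logical order (the order of reading and typing), and the right-to-left arrangement is produced at display time. The variable names are illustrative only.

```python
import unicodedata

# MHMD in logical (typing and reading) order: MEEM, HAH, MEEM, DAL.
word = "\u0645\u062D\u0645\u062F"

for index, char in enumerate(word):
    # "AL" is the bidirectional class "Arabic Letter" (right-to-left).
    print(index, unicodedata.name(char), unicodedata.bidirectional(char))

first_letter = unicodedata.name(word[0])      # the letter that is read first
latin_class = unicodedata.bidirectional("M")  # class of a Latin letter, for comparison
print(first_letter, latin_class)
```

The first code point in memory is MEEM, the letter that appears rightmost on the line.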

e. Letter group formation

Efficient, streamlined connections assimilate letters into continuous groups to form words. Assimilation frequently takes the form of mergers. The merger of some letter groups can be so strong that letters lose their individual characteristics and instead contribute a distinctive feature to a kind of ideograph. In other words, the writing system becomes almost synthetic in nature, although it evolved from an analytic alphabetical structure:

MHMD (pronounced: muḥammad)

For technical and pedagogical reasons, there is a strong tendency to eliminate or simplify the connectivity of Arabic script; still, even the simplest fonts maintain a minimal degree of connection between letters. This approach removes from Arabic script its synthetic, ideographic quality and turns it back into the analytic alphabet from which it evolved:

MHMD



3 CONVENTIONAL ANALYSIS OF ARABIC SCRIPT

Most Arabic letters consist of a skeleton, e.g. a curve, and a marker:

Markers have a distinctly graphemic function. They combine with various skeletons to form other letters, e.g. the dot-above is used by eight Arabic letters:

In the conventional analysis, some skeletons have no independent meaning, e.g.:

Other unmarked skeletons by themselves are already meaningful letters that differ from the ones characterized by a marker, e.g.:

Pros and cons of the conventional analysis

Pro: Considering the combination of skeleton and marker a single letter has the advantage that:

- IT MEETS THE EXPECTATION OF USERS;

- IT CONFORMS TO CONVENTIONAL AND LEGACY ENCODING.

Con: For scholarly work, the merger of skeleton and marker denies the evolutionary stages of the script, where the use of markers was casual, in a way similar to the use of vowels. Therefore, modern industrial encoding as inherited by Unicode has the disadvantage that:

- IT MISREPRESENTS HISTORICAL USAGE;

- IT DISRUPTS INTERNET SEARCHES BY MISMATCHING IDENTICAL GRAPHEMES.

In manuscripts and even in older prints, markers are often incomplete or unreliable:

- because markers were secondary, often redundant elements;

- because markers were added later to interpret or eliminate ambiguities;

- because double markers sometimes co-exist to maintain original ambivalence.



4 ARCHIGRAPHEMES

A complete and unambiguous element of script is called a grapheme. Without markers, most skeletons become multi-interpretable, e.g. all these words share the same skeleton elements:

transcription and meaning Shape

ʿabdu “servant”

ʿīd “feast”

ʿinda “by, near” (preposition)

ġayad “female tenderness”

In historical texts any one of them can look like this:

Transliteration Shape

EBD (capitals are used to represent indeterminate graphemes)

In this kind of spelling the skeletons are not “defective” graphemes, but valid archigraphemes. An archigrapheme is the common element(s) between two or more graphemes, minus the marker(s) that disambiguate them. The majority of historic texts are written with archigraphemes. Unicode does not – yet – have the data structure to deal with archigraphemes and discrete markers as meaningful text elements.
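The idea can be sketched in Python as a fold from marked letters to skeleton classes. This is only an illustration under stated assumptions: the SKELETON table is hypothetical, covers just the letters needed for the four sample words, and labels each class with the capitals used above (E, B, D). It is not an encoding proposal and not part of any Unicode data file.

```python
import unicodedata

# Hypothetical skeleton classes, labelled with the capitals used in the tutorial.
# NOON and YEH share the BEH skeleton only in non-final position,
# which is the only position in which they occur in the sample words.
SKELETON = {
    "\u0639": "E",  # AIN
    "\u063A": "E",  # GHAIN
    "\u0628": "B",  # BEH
    "\u062A": "B",  # TEH
    "\u062B": "B",  # THEH
    "\u0646": "B",  # NOON (non-final)
    "\u064A": "B",  # YEH (non-final)
    "\u062F": "D",  # DAL
    "\u0630": "D",  # THAL
}

def archigraphemes(word):
    """Drop combining marks, then replace each letter by its skeleton class."""
    letters = [ch for ch in word if unicodedata.combining(ch) == 0]
    return "".join(SKELETON.get(ch, "?") for ch in letters)

WORDS = {
    "ʿabdu": "\u0639\u0628\u062F",
    "ʿīd": "\u0639\u064A\u062F",
    "ʿinda": "\u0639\u0646\u062F",
    "ġayad": "\u063A\u064A\u062F",
}

for transcription, spelling in WORDS.items():
    print(transcription, archigraphemes(spelling))
```

All four words fold to the same sequence EBD. The fold is one-way: nothing in the result tells which of the four words was meant.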



5 GRAPHEMES

A grapheme is the smallest unambiguous unit in a writing system. Ideally graphemes correspond to the plain text units of Unicode. In Arabic most of the accepted graphemes correspond with a phoneme (the smallest unambiguous sound unit in speech):

a b t ṯ ǧ ḥ ḫ d ḏ r z s š ṣ ḍ ṭ ẓ ʿ ġ f q k l m n h w y

However, in a few cases this correspondence is not stable:

a. There can be more than one way to encode a single grapheme, e.g. the Arabic grapheme YEH WITH HAMZA ABOVE can have multiple encodings, which causes inconsistent usage:

U+0626 YEH WITH HAMZA ABOVE

U+0649 ALEF MAKSURA + U+0654 HAMZA ABOVE

U+06CC FARSI YEH + U+0654 HAMZA ABOVE

b. More than one grapheme for a code, e.g. U+06CC FARSI YEH

shares non-final dots with U+064A YEH

shares final forms without dots with U+0649 ALEF MAKSURA

This inconsistency is not a feature of the Arabic writing system, but a consequence of the legacy approach adopted by Unicode. Accepting all graphemic markers as independent secondary characters with their own code points would make these cases unambiguous. The template for this solution already exists: in the latest version of the Unicode Standard, the combination of composition elements ALEF and HAMZA ABOVE has been declared canonically equivalent to the legacy pre-composed grapheme ALEF WITH HAMZA ABOVE:

U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE

U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
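Both observations can be checked with Python's standard unicodedata module. The sketch below is illustrative only (the helper nfc and the constant names are not from the tutorial). It compares the two ALEF spellings after canonical normalization, and then counts how many of the three encodings of YEH WITH HAMZA ABOVE listed under (a) remain distinct.

```python
import unicodedata

ALEF_PLUS_HAMZA = "\u0627\u0654"  # ALEF + HAMZA ABOVE
ALEF_WITH_HAMZA = "\u0623"        # pre-composed ALEF WITH HAMZA ABOVE

def nfc(text):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", text)

# Canonical equivalence: both spellings normalize to the same string.
alef_equivalent = nfc(ALEF_PLUS_HAMZA) == nfc(ALEF_WITH_HAMZA)

# The three encodings of YEH WITH HAMZA ABOVE listed under (a).
YEH_HAMZA_ENCODINGS = [
    "\u0626",        # YEH WITH HAMZA ABOVE
    "\u0649\u0654",  # ALEF MAKSURA + HAMZA ABOVE
    "\u06CC\u0654",  # FARSI YEH + HAMZA ABOVE
]
distinct_after_nfc = {nfc(s) for s in YEH_HAMZA_ENCODINGS}

print("ALEF spellings equivalent:", alef_equivalent)
print("distinct YEH-HAMZA encodings after NFC:", len(distinct_after_nfc))
```

Normalization unifies the ALEF case but leaves the three YEH encodings apart, so software that compares or searches such text has to fold them by its own rules.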



ALLOGRAPHS AND LIGATURES

a. simplified support for graphic assimilation

In Arabic the abstract, nominal graphemes are represented by context-dependent allographs. Simplified support for Arabic handles contextual allographs according to two patterns, discontinuous and continuous assimilation:

Pattern         initial   medial   final connected   final unconnected   allographs
DISCONTINUOUS   د         ـد       ـد                د                   2
CONTINUOUS      بـ        ـبـ      ـب                ب                   4
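The two patterns are also visible in Unicode's compatibility block Arabic Presentation Forms-B, which records the simplified positional shapes of each letter. The Python sketch below is an illustration, not a rendering algorithm; the function name positional_forms is hypothetical. It reads the decomposition tag of every code point in the block and groups the tags by base letter. Unicode calls the unconnected final shape "isolated".

```python
import unicodedata
from collections import defaultdict

def positional_forms(first=0xFE70, last=0xFEFF):
    """Map each base letter to the positional tags found in Arabic Presentation Forms-B."""
    forms = defaultdict(set)
    for code in range(first, last + 1):
        parts = unicodedata.decomposition(chr(code)).split()
        # Keep only entries such as "<initial> 0628": one tag and one base letter.
        # Ligatures and spacing marks decompose to several code points and are skipped.
        if len(parts) == 2 and parts[0].startswith("<"):
            tag = parts[0].strip("<>")
            base = chr(int(parts[1], 16))
            forms[base].add(tag)
    return forms

forms = positional_forms()
dal_forms = forms["\u062F"]
beh_forms = forms["\u0628"]
print("DAL:", sorted(dal_forms))
print("BEH:", sorted(beh_forms))
```

DAL comes out with two shapes and BEH with four, matching the counts in the table.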

b. full support for graphic assimilation

Graphic assimilation of Arabic letters is a sophisticated art – and the foundation of Islamic calligraphy – which produces well-designed and pleasantly legible script images. Without a thorough understanding it cannot be supported. E.g. in initial position, BEH coverage can get quite elaborate in naskh:

In metal-based typography and nostalgic computer fonts, only an inconsistent number of random ligatures remain of the original system:



6 WRITING ARABIC

Here are two additional aspects of Arabic script that have consequences for rendering systems:

a. horizontal and vertical connections

The traditional connection is still reflected in a number of ligatures.

traditional assimilation modified assimilation

ححح

b. unstable spelling caused by changing font technology

Spelling and font technology have mutually influenced each other since the fast emergence of computer technology for Arabic script. The fast development of font technology has the unintentional result that different fonts may require different spellings for the same printed image. For instance, most fonts cannot deal with al-lāhu, “God”:

ALEF-FATHA LAM-LAM-SHADDA-FATHA-HEH-DAMMA: correct data structure, wrong image

ALEF-FATHA LAM-LAM-HEH-DAMMA: wrong data structure, wrong vowel image

اهللا الله

For comparison, the correct image representing the above data structures:

complete vowels

incomplete vowels
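The two data structures can be written out as code point sequences. The Python sketch below is illustrative (spell and letters_only are hypothetical helper names). It builds both spellings from their Unicode character names and shows that they differ as encoded text, although they share the same primary letters.

```python
import unicodedata

def spell(*names):
    """Build a string from Unicode character names, in logical order."""
    return "".join(unicodedata.lookup(name) for name in names)

ALEF, LAM, HEH = "ARABIC LETTER ALEF", "ARABIC LETTER LAM", "ARABIC LETTER HEH"
FATHA, DAMMA, SHADDA = "ARABIC FATHA", "ARABIC DAMMA", "ARABIC SHADDA"

# Complete spelling and the reduced spelling that some fonts require.
full = spell(ALEF, FATHA, LAM, LAM, SHADDA, FATHA, HEH, DAMMA)
reduced = spell(ALEF, FATHA, LAM, LAM, HEH, DAMMA)

def letters_only(text):
    """Remove combining marks, leaving the primary letters."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

print("same encoded text:", full == reduced)
print("same primary letters:", letters_only(full) == letters_only(reduced))
```

A document written with the reduced spelling therefore does not match a search for the complete one, even where both are meant to produce the same printed image.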

A related phenomenon occurs when older font technology cannot handle the combination of ligatures and vowels, forcing the users into systematically misspelling words, e.g., the word al-islāmu “Islam”:

correct data structure, wrong image

wrong data structure, approximate image

اإلسالم الإسلام

For comparison, the correct image representing the above data structures:

complete vowels

incomplete and misplaced vowels



7 RENDERING ARABIC SCRIPT

a. font technology

A font is an industrial product designed to enable handling Arabic with technology that is not designed for Arabic. In the design process, Arabic is an object that can be adapted at will: corners can be cut and rules can be broken. The resulting script can be seen as an “innovation”.

بتثب یتین بتـــثب یتین ـــــــ

b. script analysis and synthesis

The term script synthesis describes the effort to analyze and synthesize traditional calligraphic styles or high quality typesetting systems. In this approach Arabic is the subject whose integrity needs to be preserved when it is reproduced in digital form. Here the underlying technology is the innovation.



8 ENCODING ARABIC SCRIPT FOR THE ARABIC LANGUAGE

a. what to encode

Unicode uses a model resulting from earlier conferences about Middle Eastern computing: contextual shapes of one and the same letter are all attributed to a single nominal text code. This is the graphemic model:

GRAPHEME (character code)   ALLOGRAPH initial   ALLOGRAPH medial   ALLOGRAPH final connected   ALLOGRAPH final unconnected
U+062F                      د                   ـد                 ـد                          د
U+0628                      بـ                  ـبـ                ـب                          ب

There is a single logical representation, regardless of the visual complexity of the assimilations, mergers or ligatures.
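The graphemic model can be observed in Unicode's own data: the positional shapes survive as compatibility characters in Arabic Presentation Forms-B, and compatibility normalization folds each of them back to the nominal code. The following Python sketch is illustrative only (nominal is a hypothetical helper name).

```python
import unicodedata

# Positional allographs from the compatibility block Arabic Presentation Forms-B.
BEH_ALLOGRAPHS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]  # isolated, final, initial, medial
DAL_ALLOGRAPHS = ["\uFEA9", "\uFEAA"]                      # isolated, final

def nominal(char):
    """Fold a presentation form to its nominal character via NFKC normalization."""
    return unicodedata.normalize("NFKC", char)

beh_codes = {f"U+{ord(nominal(c)):04X}" for c in BEH_ALLOGRAPHS}
dal_codes = {f"U+{ord(nominal(c)):04X}" for c in DAL_ALLOGRAPHS}
print("BEH allographs fold to:", beh_codes)
print("DAL allographs fold to:", dal_codes)
```

Each set contains exactly one code, U+0628 for BEH and U+062F for DAL, as in the table above.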

b. code page legacy

The original encoded Arabic character sets had external and internal limitations: external in the sense that only a small number of characters could be accommodated, and internal in the sense that only simplified modern orthography for office use was supported. Today there is no limitation to the number of characters that can be handled simultaneously by a computer system, while the original purely synchronic, limited scope has changed into a diachronic and comprehensive ambition. Unicode is being extended with additional characters to handle literary orthography, archaic orthography, as well as contemporary Qur’anic orthography. Historical Qur’anic orthography is fully archigraphemic and therefore not supported by the Unicode graphemic model. This serious defect is curiously matched in Arabic studies by the absence of an authoritative critical text edition documenting the transmission through the ages of this key historic text.



9 ENCODING ARABIC SCRIPT FOR OTHER LANGUAGES

a. extra characters

The Arabic character set has been expanded over time to cover speech sounds not used in the Arabic language. Practically always the existing archigrapheme-cum-marker template is used, e.g.:

ڀ ٿ پ ٽ ټ ٻ ٺ ٹ ٮ

ڇ ڿ چ څ ڃ ڄ ڂ ځ ح

b. regional calligraphic and typographic preferences

Various user communities of the Arabic script have specific calligraphic traditions that result in preferences for certain fonts or script styles. For instance, the preferred way to write Urdu is a special form of nastaliq script¹:

(sample of the couplet in nastaliq script)

The same text in simplified naskh would not be acceptable:

اگر اس طره پر پیچ وخم کا پیچ وخم نکلے بھرم کھل جاۓ ظالم تیرے قامت کی دراز ی کا

c. calligraphic preferences sometimes cause incompatible encoding

There are instances where one and the same Arabic letter received a different encoding because a regional calligraphic style shaped it differently than the ubiquitous naskh. A case in point is the Arabic letter KAF, which in nastaliq has an extra swash in the final forms. Unicode now has an extra code U+06A9 KEHEH, causing identical letters to be encoded with language-dependent codes. As a result, two out of the three letters of the place name MECCA are not interchangeable between various Arabic-scripted languages:

U+0645 MEEM U+0643 KAF U+0629 TEH MARBUTA مكة

U+0645 MEEM U+06A9 KEHEH U+0647 HEH مکه

U+0645 MEEM U+06A9 KEHEH U+06D5 AE مکە

U+0645 MEEM U+06A9 KEHEH U+06C1 HEH GOAL مکہ

U+0645 MEEM U+06A9 KEHEH U+06C3 TEH MARBUTA GOAL مکۃ

(the GOAL variants of HEH and TEH MARBUTAH are also calligraphy-based mismatches)
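The mismatch can be counted directly. The Python sketch below is illustrative (the names VARIANTS and mismatches are not from the tutorial). It compares the Arabic spelling with each of the other spellings listed above, code point by code point.

```python
# Spellings of the place name as listed above, in logical order.
ARABIC = "\u0645\u0643\u0629"  # MEEM, KAF, TEH MARBUTA
VARIANTS = {
    "KEHEH + HEH": "\u0645\u06A9\u0647",
    "KEHEH + AE": "\u0645\u06A9\u06D5",
    "KEHEH + HEH GOAL": "\u0645\u06A9\u06C1",
    "KEHEH + TEH MARBUTA GOAL": "\u0645\u06A9\u06C3",
}

def mismatches(a, b):
    """Count positions at which two equally long spellings use different code points."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

for label, spelling in VARIANTS.items():
    print(f"{label}: {mismatches(ARABIC, spelling)} of {len(ARABIC)} code points differ")
```

Only the initial MEEM is shared by all five spellings, so a plain string comparison across languages fails for this name.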

1 bharam khul ǧāʾē ẓālim tērē qāmat kī darāzi kā - agar us tura ē pur pēč ū ḫam kā pēč u ḫam niklē “O tyrant, the mistake about the tallness of your figure will be rectified - if the curls and twists of your hair full of curls and twists are straightened out” (Ġālib, quoted in Finn Thiessen, A manual of Classical Persian Prosody with chapters on Urdu, Karakhanidic and Ottoman prosody, Wiesbaden 1982, p.188)



10 BASIC LAY-OUT

There exist three distinct line-breaking patterns in Arabic-scripted languages:

a. Graphic: equidistant and equivalent spaces follow final forms and discontinuous letters²:

b. Graphemic: Only word-separating spaces and final forms are valid line breaking points:

c. Orthographic: in addition to word-separating spaces and final forms, hyphenation is used for line-breaking, just like in Latin-based orthographies:

a: Historic Arabic: early archigraphemic Arabic

b: Arabic, Persian, Urdu, etc.: semi-alphabetic modern Arabic

c: Modern, non-Arabic: fully alphabetic Uyghur Turkic

NOTA BENE: so far only pattern b is documented and supported by Unicode.

2 The sample (repeated in the text columns) illustrates the spelling evolution in Arabic, as well as the complete phonologic, lexical and orthographic integration of Arabic words in Uyghur (spoken in China): Arabic: muḥammad ʿabdu l-lāh nadīm ʿarab miṣrī; Turkic: muhämmäd abdullah nadim äräb mısırlıq (Mohammed, Abdallah, Nadeem [personal names], and “Arab”, “Egyptian”, from Arabic miṣr, “Egypt”)



11 LANGUAGES

Languages written with the Arabic script [millions of speakers]³

Arabic [221m]

Qurʾānic Arabic
Classical Arabic
Modern Standard Arabic
Colloquial Arabic dialects:

Algerian [22m]
Baharna (Bahrain, Oman)
Chadian
Dhofari (Oman)
Egyptian [46m]
Hadrami (East Yemen, Oman)
Hassaniyya [2.6m] (Mauritania)
Hijazi (KSA)
Judeo-Iraqi (Israel)
Judeo-Moroccan
Judeo-Tripolitanian (Lebanon)
Judeo-Tunisian
Judeo-Yemeni (Yemen, Israel)
Libyan
Mesopotamian [14m] (Iraq, Iran, Syria)
Moroccan / Maghrebi [19.5m]
Najdi [10m] (Saudi Arabia, Iraq, Jordan, Syria)
North Levantine [15m] (Lebanon, Syria)
North Mesopotamian
Omani
Saidi [19m] (Egypt)
Sanaani (North Yemeni)
Shihhi (UAE)
South Levantine
Sudanese [19m] (Sudan)
Ta'izzi-Adeni (South Yemeni)
Tunisian

Indo-Aryan

Kurdish / Kurmanji / Northern Kurdish [26m]

Several of the Kurdish-specific letters in Unicode have no corresponding positional forms in the PRESENTATION blocks
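This remark can be checked against the Unicode character database with a short Python sketch. Two assumptions are made here and are not taken from the tutorial: "PRESENTATION blocks" is read as Arabic Presentation Forms-A and -B, and the three letters in KURDISH_SAMPLE, which are used in Kurdish orthography, stand in for the "Kurdish-specific letters".

```python
import unicodedata

def letters_with_presentation_forms():
    """Collect every code point that occurs in the decomposition of a presentation form."""
    covered = set()
    ranges = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]  # Arabic Presentation Forms-A and -B
    for first, last in ranges:
        for code in range(first, last + 1):
            parts = unicodedata.decomposition(chr(code)).split()
            # Skip the tag (e.g. "<initial>") and keep the hexadecimal code points.
            covered.update(int(p, 16) for p in parts if not p.startswith("<"))
    return covered

covered = letters_with_presentation_forms()

# Assumption: sample letters used in Kurdish orthography.
KURDISH_SAMPLE = [0x0695, 0x06B5, 0x06CE]
for code in KURDISH_SAMPLE:
    print(f"U+{code:04X} {unicodedata.name(chr(code))}: has presentation forms = {code in covered}")
```

Such letters depend entirely on the rendering system to derive their contextual shapes from the nominal code.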

3 This is a rough compilation that does not distinguish between current and historical use of the Arabic script; numbers of speakers have not been verified. Sources: http://en.wikipedia.org; http://www.omniglot.com; http://www.travelphrases.info/fonts.html



Persian

Persian / Western Farsi (Persian of Iran) [70m]
Dari / Eastern Farsi (Persian of Afghanistan) [7m]
Tajiki (Persian of Tajikistan and Afghanistan) [4.4m]
Pashto / Afghan [27m], alias: Pathan, Pushto, Pashtoe, Pashtu, and Pukhto
Western Balochi / Baluchi (Balochistan: Pakistan, Iran, and Afghanistan; Turkmenistan, the Arab countries of the Gulf, and Kenya)
Urdu [104m]
Kalami (Pakistan)
Punjabi, Lahnda (Pakistan)
Sindhi [9m] (Pakistan, Sind province, India)
Parkari (Pakistan)
Kashmiri / kashur [4.5m] (India, Pakistan, China, UK)
Saraiki / Multani / Derawali / Western Punjabi (Pakistan)
Pathwari (Pakistan)
Rajasthani (India)

Turkic

Uyghur [7.6m] (China)
Turkmen [6.4m] (Turkmenistan, Afghanistan, Germany, Iran, Iraq, Kazakhstan, Kyrgyzstan, Pakistan, Russia, Tajikistan, Turkey, USA and Uzbekistan)
Kazak [8m] (Kazakstan, Russia and China)
Kyrghyz [1.5m] (Kyrghyzstan, China)
Turkish / Osmanli
Chagatai
Tatar [7m] (Russian Republic of Tatarstan, and also in Afghanistan, Azerbaijan, Belarus, China, Estonia, Finland, Georgia, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Moldova, Tajikistan, Turkey (Europe), Turkmenistan, Ukraine, USA and Uzbekistan)

African

Hausa / Ajami [39m]
Swahili / Kiswahili (Zanzibar, Tanzania - official, Kenya - official, Malawi, Mozambique, E. Congo, Uganda, Rwanda, Burundi, Somalia, S. Ethiopia)
Mandinka [1.2m] (Senegal, Gambia - main language, Guinea-Bissau)
Wolof [6.7m] (Senegal - main language, Gambia, Mauritania)
Comorian (Comoros Islands)
Maba [0.25m] (Africa)

SE Asia

Malay / Jawi [18m] (Brunei - co-official script, Malaysia, Indonesia, Singapore, Thailand)

Malay written in Arabic is called Jawi.



Caucasian

Dargwa [2.5m] (Russian Republic of Dagestan)

European

Morisco (Spanish)
Bosnian (Serbian)
Ukrainian

13 COUNTRIES AND AREAS WHERE ARABIC SCRIPT IS USED

Afghanistan, Algeria, Bahrain, Chad, China, Cyprus, Djibouti, Egypt, Eritrea, Iran, India, Iraq, Israel, Jordan, Kenya, Kuwait, Lebanon, Libya, Mali, Mauritania, Morocco, Niger, Oman, Palestinian West Bank & Gaza, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tajikistan, Tanzania, Tunisia, Turkey, UAE, Uzbekistan and Yemen.
