Post on 19-Dec-2015


Internationalization Localization & Unicode

Karunesh Arora

Vijay Gugnani

C-DAC Noida


“Everyone has the right... to seek, receive and impart information and ideas through any media regardless of frontiers” -- Universal Declaration of Human Rights


Internationalization

Internationalization, often abbreviated as i18n, is the practice of designing and developing an application, product or document so that it can be easily localized for target audiences that vary in culture, region, or language.


Why Internationalization?

• To remove barriers to local and international access

• Adaptation to local, regional, linguistic or cultural needs.

• To provide global reach

• ROI, Revenue generation


Internationalization Vs. Localization

Localization is the actual adaptation of a product to meet the language, cultural, and other requirements of a specific target audience.

While internationalization gives us the technology and tools to target a given audience, it’s the act of localization that makes it accessible.


What goes with localization?

• Localization is much more than translation.

Specifically, localization refers to adaptation to another language, which involves appropriate:

– Language translation
– Locale transformation and cultural aspects


Language Translation: Languages and Countries

• Most languages are used in many countries, not just those where they are dominant or “official”

• People migrate and take languages with them

• Over enough time, most languages evolve differently in different locations


Language Translation: Scripts and Languages

• A “script” may be defined as a collection of related characters
– It is common for several languages to share most, but not all, characters from a given script
– Scripts are often given the same name as one of the languages that uses them
• Arabic script, but Arabic, Farsi, Urdu, … languages
– Scripts are also given a common name for a group of languages
• Devanagari script for Hindi, Marathi, Nepali, Konkani etc.


Language Translation

Some points to consider:

• Identify ‘translatable’ and ‘non-translatable’ strings
• Gender and number agreement, ordering of segments in a sentence

e.g. Page number ->

e.g. Number of pages ->

• Many languages can take at least 30% more space, e.g. Tool - उपकरण (HI) and customer (EN) - ग्राहक
– The design should be compatible, or else the UI may have to be redesigned
– Narrow columns often cannot accommodate long target-language equivalent words
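The point about ordering of segments can be sketched in code. This is a minimal illustration, assuming a hypothetical in-memory message catalog (`MESSAGES`) and a hypothetical helper `render`; the Hindi template is an illustrative translation, not taken from the slides. Named placeholders let each translation put the values in whatever order its grammar needs, while the placeholder names themselves stay non-translatable.

```python
# Hypothetical message catalog: keys and translations are illustrative only.
MESSAGES = {
    "en": {"page_of": "Page {page} of {total}"},
    "hi": {"page_of": "{total} में से पृष्ठ {page}"},
}


def render(locale_code: str, key: str, **values) -> str:
    """Look up the template for a locale and fill in its named placeholders."""
    template = MESSAGES[locale_code][key]
    return template.format(**values)


if __name__ == "__main__":
    # The same values appear in a different order in each language.
    print(render("en", "page_of", page=3, total=10))
    print(render("hi", "page_of", page=3, total=10))
```

Building the sentence by concatenating fragments such as "Page " + number would fix the English word order in code, which is what the slide warns against.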


Language Translation

Some points to consider (contd.):

• Avoid ambiguous phrases
• ‘Display options’
– Options of the display: read as noun + noun
– Show the options (all of them): read as verb + noun
• Proverbs and metaphors may not have equivalents in the target language
• Keep Web pages and paragraphs short.
• Avoid text in graphics.
• Use simple grammatical structures.
• Use everyday language.
• Provide clues.


Language Translation

Some points to consider (contd.):

• Follow source language conventions.

• Avoid acronyms.

• Abbreviations may have to be expanded when translated

• Check spelling and grammar.

• The more compact the source writing, the longer the translation

• Brief translators about the purpose and target audience

• All items in a menu or set of check boxes should have the same grammatical structure


Locale

• Set of parameters that define the user’s language, country and cultural preferences


Different aspects of locale

• Names & Titles
• Calendars
• Numeric, Date and Time formats, Addresses
• Currencies, Paper size, Weights & measures
• Input Mechanism
• Language Selection
• Oral Pronunciation


Titles and Names

• In India, it is required to specify titles (… etc.)
– these titles do not necessarily translate

• Family name is not always last (e.g. in the southern and western parts of the country)

• Sorting can be based on last name or first

• Salutations in letters (e.g. Dear) are different in different locales e.g.


Titles and Names

Source: Delhi Press Prakashan


Calendars

• The Gregorian calendar should not always be assumed
– Proper localization of some software requires the use (at least as an option) of calendars distinct to a culture
• E.g. Vikram Samvat / Saka / Hijri calendar in India
• Calendars of various religions where year 0 was not 2006 years ago
– Fiscal-year based calendars vary widely
• Some have 13 months (364/28) or 53 weeks


Date formats

• Date separators depend on locale: ‘/’, ‘-’, ‘.’
• ‘am’ and ‘pm’ are not used universally (many cultures use a 24-hour clock)
– ISO standard dates are unambiguous: yyyy-mm-dd hh:mm:ss
– A non-ISO date such as 01-03-02 means different things in different locales.
– If not using ISO, display dates in the locale of the user.
– Preferably use a ‘long’ form with the month spelled out (in the correct language)
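The ambiguity of a short numeric date can be shown with a small sketch using Python's standard `datetime` module. The three conventions listed in `CONVENTIONS` and the helper name `interpretations` are assumptions chosen for illustration; real locales should come from a locale database rather than a hand-written table.

```python
from datetime import datetime

# Hypothetical set of reading conventions for the same digits.
CONVENTIONS = {
    "day-month-year": "%d-%m-%y",
    "month-day-year": "%m-%d-%y",
    "year-month-day": "%y-%m-%d",
}


def interpretations(text: str) -> dict:
    """Parse one short date under each convention and return ISO dates."""
    return {
        name: datetime.strptime(text, fmt).date().isoformat()
        for name, fmt in CONVENTIONS.items()
    }


if __name__ == "__main__":
    # One string, three different calendar days.
    for name, iso in interpretations("01-03-02").items():
        print(f"{name:>15}: {iso}")
```

Storing and exchanging the ISO form, and converting to a local display form only at the edge, avoids having to guess which convention produced a given string.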


Formatting Numbers

• Formatting depends on the locale, not on the language of the application
• Group separation
– Number of digits in a group
• In English and ISO it is 3, while for Indic languages it is different: 1,23,456, i.e. ##,##,##,###
– Group separator
• In English ‘,’, but ISO uses a space, and some locales use ‘.’ or none
• Decimal separator: ‘.’ or ‘,’
• Negative symbol: ‘-’, ‘~’, ‘(…)’
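The Indic grouping pattern ##,##,##,### can be sketched as a small function. This is only an illustration with a hypothetical helper name (`group_indian`); it handles integers and hard-codes ‘,’ as the group separator, whereas a production system would normally take both from locale data.

```python
def group_indian(n: int) -> str:
    """Group digits as 3 on the right, then 2 at a time (e.g. 1,23,456)."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel off two digits at a time from the right of the remaining head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


if __name__ == "__main__":
    value = 123456
    print(f"{value:,}")         # groups of three
    print(group_indian(value))  # Indic grouping
```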


Currency

• Use the currency symbol of the data
– i.e. INR doesn’t automatically translate to £ or $ when the locale changes
• Format depends on the user’s locale, not the currency
– Differences in formats:
• Symbol
• Position (before or after the amount)
• Blanks separating the symbol from the data
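A minimal sketch of the two rules above: the currency travels with the data, and only the layout follows the user's locale. The `Money` class, the `LAYOUTS` table and its two entries are hypothetical and purely illustrative; real layout rules should come from locale data.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str  # currency code stored with the data, e.g. "INR"


# Hypothetical per-locale layout rules (illustrative values only).
LAYOUTS = {
    "en-IN": {"symbol_first": True, "space": True, "decimal": "."},
    "de-DE": {"symbol_first": False, "space": True, "decimal": ","},
}


def format_money(money: Money, locale_code: str) -> str:
    """Lay out the amount per locale; never swap the currency itself."""
    layout = LAYOUTS[locale_code]
    number = f"{money.amount:.2f}".replace(".", layout["decimal"])
    gap = " " if layout["space"] else ""
    if layout["symbol_first"]:
        return f"{money.currency}{gap}{number}"
    return f"{number}{gap}{money.currency}"


if __name__ == "__main__":
    price = Money(Decimal("1000"), "INR")
    print(format_money(price, "en-IN"))
    print(format_money(price, "de-DE"))
```

Changing the locale changes the separator and the position of the code, but the value remains 1000 INR in both outputs.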


Currency contd…

Different ways of expressing Rs. 1000

Rs.1000, Rs. 1000/-, Rs.1,000/-, Rs. 1000.00, INR 1000, 1000 Rupees, 1000 रुपये

Strong currencies like the Indian rupee need decimal precision (e.g. 2 digits after the decimal point for paisa)


Language selection

• Avoid using national flags to choose the preferred language
– Multiple countries use the same language
• In what order should the languages be listed?
• In which language should the language names be displayed?
– In the language itself, or with a translation in the default language of the operating system


Pronunciation

• Important for Speech based systems

– Higher recognition accuracy can be obtained by tailoring voice input to regional dialects

– Voice output in the wrong dialect can make an application sound ‘foreign’

– Applications supported with regional dialects have better impact


Culture

• Culture is a complex collection of experiences which condition daily life;

• It includes:
– history
– social structure
– geographical effects
– religion
– traditional customs and everyday usage


Cultural issues

• Icons, symbols and images

• Colors, myths, beliefs and feelings

• Humour

• Geographical & environmental effects

• Customs & traditions

• Social Security Numbers


Icons & Symbols

• Icons that are a play on words do not translate, e.g.
– A dust bin for dumping files
– A rocket for launching an application
– Scissors for cutting in an edit operation
– “B”, “I”, “U”
• Some concepts have been found extremely hard to represent as an icon
– E.g. sorting (‘A->Z’ is not universal)
• Images of people or body parts such as hands
– Considered inappropriate in some cultures
– What skin color do you use?
– Images of people need to be localized for each country


Colors & Humour

• The color white may represent purity and green prosperity in the Indian context, but it may not be the same in another culture.

• Humour generally does not get translated

• People are sensitive to different things in different cultures

• Jokes/cartoons can be offensive


Customs & Traditions

• In Indian culture, people show respect to their elders and renowned personalities by addressing them in the plural.

e.g. Dr. Manmohan Singh is the prime minister of India.

डॉ. मनमोहन सिंह भारत के प्रधानमंत्री हैं।

Similarly, in social relationships, there are several words for a single relation,

e.g. for ‘uncle’ - चाचा, ताऊ, मामा


Unicode?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Source: http://unicode.org


Universal Character Encoding

• Unique number for every character
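The “unique number” is the character's code point. A small sketch using Python's built-in `ord` and the standard `unicodedata` module shows it for one Latin and three Indic characters; the helper name `describe` is just an illustrative choice.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation plus the official name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # The same lookup works for every script, with no mode switching.
    for ch in ("A", "\u0915", "\u0B95", "\u0C94"):
        print(ch, describe(ch))
```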


Unifies all Languages

• 96 thousand characters, so far

• All characters accessible at the same time, in the same document:

क, க, ಔ,…


Widespread Support

• Developed & supported by industry leaders:
– Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, …
• Supported in standards:
– XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc.
• Implemented in:
– All modern operating systems, browsers, and other products


IDN

– http://भाषा.in


Information about Unicode

• www.unicode.org

– Online Standard

– Technical Reports

– FAQs

– General Information

– Discussion Forums, Conferences


Resources Availability

• System APIs:

– Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, …

• Languages

– Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, …

• Cross-platform libraries:

– ICU, Rosette, …


Indic Support in Unicode

• ISCII is the basis for characters and allocation

• DIT is a member of the Unicode Consortium

• Reports have been submitted on missing characters, clarifications or corrections of usage


ISCII : Similarities

• Within script, layout and contents nearly identical

• Independent + dependent vowels

• Halant model for representing conjuncts

– conjuncts / half-forms not directly encoded

– represented by sequences instead

• Phonetic sequence – order in syllables
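The halant model can be seen directly in the code points. The sketch below assumes the conjunct क्ष as an example (it is not taken from this slide): the conjunct has no code point of its own and is stored as consonant + halant (virama) + consonant, with the renderer forming the joined shape.

```python
import unicodedata

# क + halant + ष is displayed as the conjunct क्ष.
CONJUNCT = "\u0915\u094D\u0937"


def code_point_names(text: str) -> list:
    """List the official Unicode name of every code point in the text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    print(CONJUNCT, "is stored as", len(CONJUNCT), "code points:")
    for name in code_point_names(CONJUNCT):
        print("  ", name)
```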


ISCII : Differences

• Unicode is stateless:

– No shifting to get different scripts

– Each character has a unique number

• Unicode is uniform:

– No extension bytes necessary

– All characters coded in the same space


Advantages

• Accessible Information across the globe

• Seamless multilingual documents

• Opens up software export market, beyond English

• Connects India to the world


The Future

• The world is moving rapidly to Unicode

• Unicode makes India open to the world
– The world comes to you, and
– You go to the world


Multiple Forms

• UTF-8: maximal compatibility with 8-bit systems

• UTF-16: good storage, interoperability with Windows/Java

• UTF-32: simplest processing

• Fast, lossless conversion
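A short sketch of the three encoding forms, using only Python's built-in codecs. The helper name `encoded_sizes` is an illustrative choice, and the little-endian variants are used so that no byte-order mark is counted. It shows the storage trade-off for a Devanagari letter and checks that converting between forms loses nothing.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding form."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


def round_trips(text: str) -> bool:
    """Check that encoding and decoding in every form returns the same text."""
    return all(text.encode(enc).decode(enc) == text for enc in ENCODINGS)


if __name__ == "__main__":
    for sample in ("A", "\u0915"):  # Latin A and Devanagari KA
        print(sample, encoded_sizes(sample), round_trips(sample))
```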


W3C Internationalization Activity


Some Issues under discussion in IL

• Presentation / Styling issues
– Styling of first character

If a styling feature is to be applied to the starting character, it must be decided whether it applies to a single character, a conjunct character, a syllable or a grapheme cluster.

e.g.

स्थिति (Position)

प्रस्थान (Departure)

स्वर (Vowel)

कोश (Dictionary)

हिंदी (Hindi)

हिन्दी (Hindi)

क्षेत्रीय (Regional)
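One possible answer to the “first character” question can be sketched as follows. This is a deliberately simplified rule for Devanagari, not the Unicode text segmentation algorithm and not a W3C recommendation: the hypothetical helper `first_cluster` keeps the first consonant, any combining marks that follow it, and any consonant joined to it through a halant.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari halant


def first_cluster(word: str) -> str:
    """Return the leading consonant cluster with its attached marks."""
    if not word:
        return ""
    end = 1
    while end < len(word):
        ch = word[end]
        if unicodedata.category(ch).startswith("M"):
            end += 1  # vowel signs, anusvara, halant attach to the cluster
        elif word[end - 1] == VIRAMA:
            end += 1  # a consonant right after a halant joins the conjunct
        else:
            break
    return word[:end]


if __name__ == "__main__":
    for word in ("\u0938\u094D\u0925\u093F\u0924\u093F",  # sthiti
                 "\u0939\u093F\u0902\u0926\u0940",        # hindi
                 "\u0915\u094B\u0936"):                    # kosh
        print(word, "->", first_cluster(word))
```

Styling only `word[0]` would split the conjunct or detach the vowel sign, which is the rendering problem the slide raises.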


• Presentation / Styling issues– Styling of first character

Some Issues under discussion in IL


Some Issues under discussion in IL

• Presentation / Styling issues
– In cursive text like Arabic and Urdu the styling is applied to the whole word

Urdu: Saabiq -> Former

Source: Rashtriya Sahara


Some Issues under discussion in IL

• Presentation / Styling issues
– Vertical arrangement of characters

If a string is written in vertical mode, writing each character on a new line may not be suitable.

http://www.w3.org/International/notes/firstletter.html


• Presentation / Styling issues

– Horizontal spacing

e.g.

Some Issues under discussion in IL


Some Issues under discussion in IL

• Presentation / Styling issues
– Bullets and numbers

Numbering schemes of Indian languages need to be supported as well.
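As a small illustration of one such numbering scheme, the sketch below maps ASCII digits to Devanagari digits (U+0966 to U+096F) with `str.translate`. The helper name `to_devanagari_digits` is hypothetical, and other Indic scripts would need their own digit ranges.

```python
# Map '0'..'9' to the ten Devanagari digits starting at U+0966.
DEVANAGARI_DIGITS = {ord(str(d)): chr(0x0966 + d) for d in range(10)}


def to_devanagari_digits(text: str) -> str:
    """Replace ASCII digits with Devanagari digits, leaving other text alone."""
    return text.translate(DEVANAGARI_DIGITS)


if __name__ == "__main__":
    # List numbering rendered with native digits.
    for number in range(1, 4):
        print(to_devanagari_digits(f"{number}."), "item")
```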


Some Issues under discussion in IL

• Presentation / Styling issues
– Collation

A means to search and order data in a way that makes sense in the user's particular culture

Myths:
– One collation is good enough
– Unicode enabled, so sorting is already covered
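The second myth can be illustrated with a short sketch. Sorting Unicode strings by code point is not collation: capitals sort before lowercase letters and accented letters land at the end. The `naive_key` function below is a crude approximation written for this example (strip accents, ignore case); it is an assumption for illustration and not a substitute for locale-specific collation rules.

```python
import unicodedata


def naive_key(text: str) -> str:
    """Crude sort key: drop combining accents and ignore letter case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


WORDS = ["Zebra", "apple", "\u00C9clair"]  # the last word starts with É

if __name__ == "__main__":
    print("code point order:", sorted(WORDS))
    print("naive key order: ", sorted(WORDS, key=naive_key))
```

Even the improved order is only right for some audiences, which is the first myth: different languages sharing a script can expect different orders for the same strings.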


• Presentation / Styling issues

Some Issues in Indian Languages


Some Issues under discussion in IL

• Presentation issues
– Underlining of the characters

अन्य भाषाओं में भी अनुवाद


• Searching issues

– Searching is a problem when languages share the same script and some words are written identically but differ in meaning

Some Issues


Issues on presentation on other devices

• Addressing input mechanisms and predictive input for vernacular languages

• Handling display issues on handheld devices with smaller screens when text is translated

• Standardizing the encodings used in communication, taking into account the cost of bandwidth (ISCII / Unicode / Compressed Unicode), connectivity and on-the-fly conversion of encodings


References and acknowledgements

• http://www.w3.org/international

• Articles by Richard Ishida, Felix Sasaki, W3C

• http://macchiato.com/slides/UnicodeAndIndia.ppt , Presentation by Mark Davis

• www.site.uottawa.ca/ftppub/courses/Winter/csi5122/coursenotes/5122Internationalization.ppt


Thank you