Internationalization Using Locales
Achim Ruopp

Agenda
Working with multilingual data
Language and locale identifiers
Locale Data
Frameworks for locale support
Ideas/discussion: how this could be used in compling

Not about character encoding
Read Jeremy’s slides from last quarter
• http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf
Use Unicode wherever possible

Internationalization: More than Encoding Text
Where are the word breaks?
• คลิกปุ่มเมาส์ขวา
Your balance is $1234.56... I think.
How do I sort these words in French?
• cote (dimension)
• côte (coast)
• coté (with dimensions)
• côté (side)
How do I uppercase this word in Turkish?
• istiyorum - İstiyorum
How do I transcribe this text into Latin characters?
• 인수문제를 - in'su'mun'je'reul'
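
The Turkish question above has a concrete cause: Turkish treats dotted i and dotless ı as separate letters, so the uppercase of i is İ (U+0130), not I. Python's built-in str.upper() is locale-independent and applies the default Unicode mapping. The sketch below shows the difference. The helpers turkish_upper and turkish_capitalize are hypothetical names made up for illustration, not library functions, and a real application would probably rely on a locale-aware library instead.

```python
def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i rule (illustrative helper)."""
    # Map dotted i to dotted capital I (U+0130) first. str.upper() then maps
    # dotless i (U+0131) to plain I and leaves U+0130 unchanged.
    return text.replace("i", "\u0130").upper()


def turkish_capitalize(text: str) -> str:
    """Uppercase only the first letter, Turkish style (illustrative helper)."""
    return turkish_upper(text[:1]) + text[1:]


if __name__ == "__main__":
    word = "istiyorum"
    print(word.upper())              # default Unicode mapping, no dot on the I
    print(turkish_upper(word))       # Turkish mapping, dotted capital I
    print(turkish_capitalize(word))  # the form shown on the slide
```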

Cultural Conventions
What does this date stand for?
• 3/8/2006
What is the currency symbol for Hungary?
… linguistic characteristics of languages and cultural conventions - a locale
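
To make the date question concrete, here is a minimal sketch that parses the same string under two conventions. The convention labels and the helper name interpretations are assumptions chosen for illustration; the two strptime patterns stand in for what a locale would normally supply.

```python
from datetime import datetime

# strptime patterns for two common conventions (labels are illustrative).
CONVENTIONS = {
    "month/day/year (e.g. United States)": "%m/%d/%Y",
    "day/month/year (e.g. United Kingdom)": "%d/%m/%Y",
}


def interpretations(text: str) -> dict:
    """Return the ISO 8601 date for each convention that can parse the text."""
    results = {}
    for label, pattern in CONVENTIONS.items():
        try:
            results[label] = datetime.strptime(text, pattern).date().isoformat()
        except ValueError:
            # The text is not a valid date under this convention.
            pass
    return results


if __name__ == "__main__":
    for label, iso_date in interpretations("3/8/2006").items():
        print(f"{label}: {iso_date}")
```

Without knowing the locale of the data, both readings (8 March and 3 August) are equally plausible.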

Internet Language Tags
Used today: RFC 3066 (successor to RFC 1766)
• Generative: ISO 639-1/2 language tag[-ISO 3166 country tag], e.g. fr, en-US, ale-CA
• Registered with IANA, e.g. no-nyn, zh-Hant
• Exceptions: x-…
Several problems
• Dependency on ISO standards
• No generative options for dialects etc.
• RFC3066bis should solve this
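
The tag structure above can be sketched as a small classifier. This is a simplified reading of the RFC 3066 syntax (a primary subtag of 1-8 letters, then hyphen-separated subtags of 1-8 letters or digits). It does not look anything up in the ISO or IANA lists, and the function name describe_tag and the "kind" labels are hypothetical.

```python
import re

# RFC 3066 syntax: primary subtag of 1-8 letters, followed by zero or more
# subtags of 1-8 letters or digits, separated by hyphens.
_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")


def describe_tag(tag: str) -> dict:
    """Split a language tag and label its general shape (no registry lookup)."""
    if not _TAG.match(tag):
        raise ValueError(f"not a well-formed RFC 3066 tag: {tag!r}")
    primary, *rest = tag.lower().split("-")
    info = {"primary": primary, "country": None, "kind": "language"}
    if primary == "x":
        info["kind"] = "private use"
    elif primary == "i":
        info["kind"] = "IANA-registered"
    elif rest and len(rest[0]) == 2:
        # A two-letter second subtag is read as an ISO 3166 country code.
        info["country"] = rest[0].upper()
        info["kind"] = "language-country"
    elif rest and len(rest[0]) >= 3:
        # Second subtags of 3-8 characters have to be registered with IANA.
        info["kind"] = "registered"
    elif rest:
        raise ValueError(f"single-character second subtag in {tag!r}")
    return info


if __name__ == "__main__":
    for example in ("fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"):
        print(example, describe_tag(example))
```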

SIL Ethnologue
Cataloging all of the world’s 6,912 known living languages
http://www.ethnologue.com/
Uses ISO/DIS 639-3 3-letter codes
E.g. Swabian dialect: x-sil-swg
Hope for consolidation with RFC 3066 or its successor once 639-3 becomes a full standard
Not so well supported in programming frameworks

Types of Locale Data
Date/time formats
Number/currency formats
Collation specification
• For sorting and comparison
Translated names for languages, regions, scripts, time zones, currencies, …
Script and characters used by a language
Measurement system
Paper sizes
…
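
As an illustration of why a collation specification is locale data, the sketch below sorts the four French words from the earlier slide. It is a deliberately simplified two-level key: base letters first, then accents compared from the end of the word, which is the traditional French dictionary convention. It is not the Unicode Collation Algorithm, and french_sort_key is a hypothetical helper name.

```python
import unicodedata


def french_sort_key(word: str):
    """Two-level key: base letters, then accents compared from the END of the word."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = []     # letters with accents stripped
    accents = []  # combining marks attached to each base letter
    for ch in decomposed:
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += ch
        else:
            base.append(ch)
            accents.append("")
    # Reversing the accent list makes the last accent difference decide first.
    return ("".join(base), accents[::-1])


if __name__ == "__main__":
    words = [unicodedata.normalize("NFC", w) for w in ["côté", "coté", "côte", "cote"]]
    print(sorted(words))                       # plain code point order
    print(sorted(words, key=french_sort_key))  # order used on the slide
```

Plain code point order puts coté before côte; the key above reproduces the order cote, côte, coté, côté.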

Common Locale Data Repository
“The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.”
http://www.unicode.org/cldr/

Common Locale Data Repository
Collection/vetting process
• Contributors add/modify data
• Reviewed by committee
Accessible over the web
• Locale Data Markup Language (LDML) XML format
• E.g. http://unicode.org/cldr/data/common/main/fr.xml
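
Because LDML is plain XML, locale data can be read with any XML parser. The sketch below uses Python's standard library on a small hand-written fragment in the shape of an LDML file. The fragment is an illustrative assumption, not copied from the CLDR file above, and display_names is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of an LDML file (illustrative only).
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
    <territories>
      <territory type="HU">Hongrie</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def display_names(ldml_text: str, kind: str) -> dict:
    """Map codes to translated names for kind 'language' or 'territory'."""
    root = ET.fromstring(ldml_text)
    plural = {"language": "languages", "territory": "territories"}[kind]
    path = f"localeDisplayNames/{plural}/{kind}"
    return {el.get("type"): el.text for el in root.findall(path)}


if __name__ == "__main__":
    print(display_names(SAMPLE_LDML, "language"))
    print(display_names(SAMPLE_LDML, "territory"))
```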

Frameworks: POSIX Locale
Standard C/C++ library
• LC_COLLATE - sorting/comparison
• LC_CTYPE - behavior of character handling
• LC_MONETARY - monetary formatting
• LC_NUMERIC - numeric formatting
• LC_TIME - date/time formatting
Used in Un*x systems for command-line functions too
Results can be platform-dependent
Stable, but feature set stuck in the 1980s
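
Python's locale module is a thin wrapper over these C library categories, so it gives a quick way to try them out. The sketch below is hedged accordingly: which locale names are installed is platform-dependent, so the hypothetical helper format_amount falls back to the portable "C" locale when the requested one is missing. Note that setlocale changes process-wide state and is not thread-safe.

```python
import locale


def format_amount(value: float, locale_name: str = "C") -> str:
    """Format a number under the given LC_NUMERIC locale, falling back to "C"."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_NUMERIC, locale_name)
        except locale.Error:
            # Locale not installed on this platform: use the portable "C" locale.
            locale.setlocale(locale.LC_NUMERIC, "C")
        return locale.format_string("%.2f", value, grouping=True)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore the previous setting


if __name__ == "__main__":
    print(format_amount(1234.56))                 # "C" locale: no grouping
    print(format_amount(1234.56, "de_DE.UTF-8"))  # only differs if installed
```

On a system where a German locale happens to be installed under the name de_DE.UTF-8 (an assumption; names vary by platform), the second call would use a comma as the decimal separator.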

Frameworks: ICU Library
IBM open source project
Developed originally for the Taligent OS project in the late 80s/early 90s
Java and C++ APIs
Extensive locale data and APIs to use it
• http://www.icu-project.org/cgi-bin/locexp
Also includes localization support
Everybody (Mac OS X, Java, DB2, Mathworks …) is using it
But …

Frameworks: Microsoft
Windows NLS API
Microsoft .NET Framework System.Globalization namespace
Similar set of data to ICU
• Vetted by subsidiaries
APIs accessible from all MS programming languages
Localization support in a different API

Microsoft demos
Culture Explorer
Microsoft Transliteration Utility

Extensibility
What if I don’t find the locale I need?
What if I need to modify some of the data?
ICU
• Can create new locales
Microsoft
• .NET Framework v2.0: custom cultures
• Windows Vista: custom locales
LDML can be the interchange format

Usages for Computational Linguistics
Up to the imagination
• Transliteration use in MT
• Named Entity Recognition
• …
• Suggestions?
Most importantly: do not reinvent the wheel!
• Check if the API or data you need is available
If possible, write code in a language/locale-independent fashion
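
As a small instance of reusing data that already exists, the sketch below gets a rough Latin rendering of Hangul from the Unicode character names shipped with Python (for example, HANGUL SYLLABLE IN). This is a naive trick, not an official romanization system and not what any of the frameworks above necessarily do. The helper name naive_hangul_romanize and the apostrophe separator are assumptions for illustration.

```python
import unicodedata

_PREFIX = "HANGUL SYLLABLE "


def naive_hangul_romanize(text: str, sep: str = "'") -> str:
    """Replace each Hangul syllable by the short name from its Unicode character name."""
    out = []
    # NFC composes conjoining jamo into precomposed syllables where possible.
    for ch in unicodedata.normalize("NFC", text):
        name = unicodedata.name(ch, "")
        if name.startswith(_PREFIX):
            out.append(name[len(_PREFIX):].lower() + sep)
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


if __name__ == "__main__":
    print(naive_hangul_romanize("인수문제를"))
```

For production use, a vetted transliteration service or rule set from an existing framework would be the safer choice.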

References
RFC3066bis
• http://www.inter-locale.com/ID/why-rfc3066bis.html
Ethnologue
• http://www.ethnologue.com/
Common Locale Data Repository
• http://www.unicode.org/cldr/
POSIX Locale
• http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
ICU
• http://icu.sourceforge.net/
Microsoft
• http://www.microsoft.com/globaldev/
UNGEGN Working Group on Romanization Systems
• http://www.eki.ee/wgrs/