Unicode in

2008Q3
Mark Davis, Vladimir Weinstein, Andy Heninger

Standard SW Globalization

• Data Handling
– Date, Time, Number Formatting
– Collation
– Locales/Languages
– Timezones & Calendars,…
• General Internationalization
– Using character properties instead of hard-coded lists
– Separation of code from localizable data (≈resource bundles)
– Avoiding string concatenation, dealing with truncation …

Where was the problem? (pause)

[Diagram: View, Upload, Server, Server, Data, Index, DB, dump]

More places than you might think

1. Ensure Client App is Unicode
   – Windows, don’t use ANSI
2. Prevent Encoding Mismatches
   – charset before web form params
3. Allow full Unicode identifiers
   – File names,…
4. Ensure Uniform Segmentation
   – Word ≠ [0-9a-zA-Z]+
5. Watch for hidden assumptions
   – Cp1252 corrupting bytes
   – Title requirement of 3+ chars is OK for English, but not Chinese (狗)
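The segmentation and title-length items can be illustrated with a minimal Python sketch. The function names are illustrative and not from the slides. Note that `\w+` is only a stopgap: it also matches underscores, and for text written without spaces (Chinese, Thai) it returns one long run rather than words.

```python
import re

ASCII_WORD = re.compile(r"[0-9a-zA-Z]+")  # the pattern the slide warns against
UNICODE_WORD = re.compile(r"\w+")         # Unicode-aware for str patterns in Python 3


def ascii_words(text):
    """Split on anything that is not an ASCII letter or digit."""
    return ASCII_WORD.findall(text)


def unicode_words(text):
    """Split using Unicode letter/digit properties instead of a hard-coded list."""
    return UNICODE_WORD.findall(text)


def title_long_enough(title, minimum=3):
    """A naive length rule (hypothetical) that rejects valid one-character titles."""
    return len(title) >= minimum
```

With the input `"na\u00efve caf\u00e9"`, the ASCII pattern breaks both words at the accented letters, while the Unicode-aware pattern keeps them whole. The length rule accepts "dog" but rejects "狗", which is a complete word.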

[Diagram repeated with problem locations marked: View, Upload, Server, Server, Data, Index, DB, dump; markers ❶ to ❺ tie the numbered items to stages]

Just a few extra challenges…

Massive amounts of data

Much web cruft to deal with

Very short release cycles

Many product × language/locale pairs (next slide)


Locale × Product Versions

http://googleblog.blogspot.com/2008/07/hitting-40-languages.html


Translation

• Professional Vendors, Contractors, Volunteers

Translation Strategies

• Normal Translation Memory
• Multiple, very short release cycles
– Weeks, not months
• Product Alternatives for new features
– A. Delay release until completely translated
– B. Disable new features until translated
– C. Accept some English strings in new features

Int’l Strategy: Unicode Zone

[Diagram labels: Converters, Non-Unicode, Unicode, Unicode Zone, Validation]

Both forms of Unicode

• UTF-8: C++, python
– Mixture of char*, STL string, new robust class
– UTF-8 is particularly good storage for the web (more later)
• UTF-16: Java, Windows, Javascript, Mac

Libraries / Data

• ICU, Joda Time, Internal libraries
• Unicode Character Database, Unicode Locales (CLDR)
• TZDB, ISO 4217 (currencies) – time sensitive
• Update to new versions (eg Unicode 5.1) asap
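As a small illustration of why code must not assume one storage unit per character in either form, the following Python sketch counts code points, UTF-8 bytes, and UTF-16 code units for a string. The function name is an illustrative assumption.

```python
def code_units(text):
    """Return the size of `text` in code points, UTF-8 bytes, and UTF-16 units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # utf-16-le avoids the byte-order mark; each code unit is 2 bytes
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }
```

An ASCII letter is one unit in both forms; "狗" takes three UTF-8 bytes but one UTF-16 unit; a supplementary character such as U+1F600 takes four UTF-8 bytes and two UTF-16 units (a surrogate pair).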


Stable Identifiers

• Unicode identifiers
– Language/Locale, Script, Region, Currency, Timezone
– based on BCP47, ISO 4217, TZDB
• Required: unique, stable
– CS = Czechoslovakia? Serbia & Montenegro?
– Serbia = CS? = RS?
– London is in UK? GB?
• Google Valid:
– Canonical: US, iw
– Noncanonical: SU, he (deprecated / not preferred)
• Google Disallowed*:
– Private Use: XA
– Unassigned: BB
– Ill-Formed: B1
– Variants: i-tao, en-SCOUSE

User’s Locale / Language

• Needed to improve quality
• Locale = Language + (possibly) other info
• Known if user is Signed In
• Heuristics where not Signed In.
– IP Address
– Accept-Language
– Country from Accept-Language
– Domain,…
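One of the heuristics above, reading the Accept-Language header, can be sketched as follows. This is a simplified parser written for illustration (the function name and the lenient error handling are assumptions); it orders tags by their q-value, keeping header order for ties, and drops tags with q=0.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not wanted"
        if q > 0:
            prefs.append((-q, position, tag))
    return [tag for _, _, tag in sorted(prefs)]
```

For example, `parse_accept_language("en-GB,en;q=0.8,de;q=0.9")` yields `en-GB`, then `de`, then `en`. The region subtag of the first entry (`GB`) is one possible source for the "country from Accept-Language" signal.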


Normalizing Languages/Locales

• Based on Unicode locale data (CLDR)
– zh, und-CN, und-Hans,… ≃ zh
– zh-TW, zh-Hant,… ≃ zh-TW
– en, und-Latn, und-US,… ≃ en
– en-GB, en-Latn-GB,… ≃ en-GB
– he-IL, iw-IL, he-Hebr, he,… ≃ iw
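The equivalences above can be pictured as a lookup from many input tags to one canonical tag. The sketch below hard-codes only the examples from the slide in a hypothetical table; a real implementation would presumably derive the mapping from CLDR data rather than list it by hand.

```python
# Hypothetical table built only from the slide's examples (keys lowercased).
CANONICAL = {
    "zh": "zh", "und-cn": "zh", "und-hans": "zh",
    "zh-tw": "zh-TW", "zh-hant": "zh-TW",
    "en": "en", "und-latn": "en", "und-us": "en",
    "en-gb": "en-GB", "en-latn-gb": "en-GB",
    "iw": "iw", "he": "iw", "he-il": "iw", "iw-il": "iw", "he-hebr": "iw",
}


def normalize_locale(tag):
    """Map a language/locale tag to its canonical form, or None if unknown."""
    key = tag.strip().replace("_", "-").lower()
    return CANONICAL.get(key)
```

Unknown tags return `None` here so that the caller can decide on a fallback; that choice is an assumption of the sketch.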


Matching Languages/Locales

• Input: User’s requested languages, our supported languages
• Output: “best” supported language
• Need better match than truncation
• A “distance” metric on normalized languages
– Language, then script, then country
– Plus special information: hr vs bs, no vs nn, ro vs mo, tl vs fil
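A distance-based matcher along these lines might look like the sketch below. The weights, the threshold, and the simplified tag parser are all assumptions made for illustration; only the ordering (language, then script, then country) and the four special pairs come from the slide.

```python
# Language pairs the slide singles out as close to each other.
CLOSE_LANGUAGES = {
    frozenset(pair)
    for pair in [("hr", "bs"), ("no", "nn"), ("ro", "mo"), ("tl", "fil")]
}


def parse(tag):
    """Split a simple tag into (language, script, region); missing parts are None."""
    subtags = tag.replace("_", "-").split("-")
    language, script, region = subtags[0].lower(), None, None
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            script = sub.title()
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            region = sub.upper()
    return language, script, region


def distance(desired, supported):
    """Hypothetical weights: language matters most, then script, then region."""
    d_lang, d_script, d_region = parse(desired)
    s_lang, s_script, s_region = parse(supported)
    total = 0
    if d_lang != s_lang:
        total += 10 if frozenset((d_lang, s_lang)) in CLOSE_LANGUAGES else 100
    if d_script != s_script:
        total += 20
    if d_region != s_region:
        total += 5
    return total


def best_match(requested, supported, threshold=50):
    """Return the closest supported tag for the first request that has one."""
    if not supported:
        return None
    for want in requested:
        dist, have = min((distance(want, have), have) for have in supported)
        if dist < threshold:
            return have
    return None
```

Truncation cannot relate `nn` to `no`, but the special-pair table lets `best_match(["nn"], ["en", "no"])` return `no`.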

Web Cruft

• Problems
– Bad input: charset, language,…
– Inaccurate detection
– Difficulties in segmentation / morphology
• These are non-trivial
– Pages with conversion errors or unassigned (non-existent) characters: ≈4%
– Multiply that by billions and billions of pages…

You didn’t know there was going to be a test…

• How many pages are on the web?
• What’s the most frequent character? Script? (next slides) …

Most Web Data

Data in Different Scripts

Bad Source

• Original page has corrupted data
• Doubly-encoded UTF-8
• Random illegal control codes, unassigned chars
• Forms input data of unknown/wrong encoding
• Mixtures of different charsets, from
– Random pasting in non-Unicode enabled tools
– Page composition (eg server-side includes), mixing charsets
– Indic font encodings
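Doubly-encoded UTF-8 typically arises when UTF-8 bytes are misread as a single-byte charset and then encoded to UTF-8 again. The sketch below tries to undo one such round, assuming the intermediate charset was windows-1252; that assumption and the function name are illustrative. Some doubly-encoded text cannot be repaired this way, because a few bytes are undefined in windows-1252.

```python
def undo_double_utf8(text):
    """Reverse one round of UTF-8 bytes misread as cp1252, if that is possible."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8: leave untouched.
        return text
```

Text that was never doubly encoded is usually left alone, since a lone accented letter re-encoded as windows-1252 is not valid UTF-8. Pure ASCII passes through unchanged.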

Bad Server

• Server mis-identifies the type or encoding of the page in the HTTP protocol.
– Example: JPEGs served up as text
– Server overrides page with wrong charset
• If you don’t do special detection, you get random junk
– Interpreting a JPEG as windows-1252: not altogether productive…

Charset Tagging Trends

http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

Encoding Detection

• Pages are so often untagged & mis-tagged:
– Both at HTTP and HTML levels
– And what happens if they differ?
• We have to heuristically detect the “real” character encoding
• Need to do better than the browser
– In the browser, the user can adjust a bad guess
• UTF-8 source is the safest, but still must be verified
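The "verify, do not trust" point can be sketched in a few lines of Python. This is not the detector described in the talk, only a minimal stand-in: it accepts bytes as UTF-8 only if they decode strictly, then tries the declared label, then falls back to windows-1252. The function name and the fallback order are assumptions.

```python
def decode_page(raw, declared=None):
    """Return (text, encoding_used) for raw page bytes and an optional charset label."""
    try:
        # Strict decoding rejects ill-formed sequences instead of guessing.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        pass
    if declared:
        try:
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown label (e.g. "{charset}") or bytes that do not fit it
    # Last resort: a single-byte charset, replacing undefined bytes.
    return raw.decode("cp1252", errors="replace"), "cp1252"
```

A page labelled ISO-8859-1 whose bytes are valid UTF-8 is decoded as UTF-8, and a page labelled UTF-8 that contains a bare 0xE9 byte falls through to the single-byte fallback. A real detector would weigh statistical evidence rather than use a fixed order.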

Bad codes: {charset}, _charset, En, TW, …

Attacks!

• Cross-site scripting (XSS)
– Don’t treat ill-formed UTF-8 as space (or syntax)
  • <p id=abc�onMouseOver=evilDoers()…
– Don’t swallow valid characters after ill-formed
  • …q="�>onMouseOver=…
– Don’t allow UTF-7, UTF-16 as output encodings
  • Browsers often mis-detect, and allow XSS.
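The first two rules can be checked against a concrete decoder. The sketch below uses Python's built-in UTF-8 codec with `errors="replace"`, which substitutes U+FFFD for each ill-formed sequence; the wrapper name is an illustrative assumption.

```python
def sanitize_utf8(raw):
    """Decode bytes as UTF-8, turning each ill-formed sequence into U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

With this decoder, a truncated lead byte before `>` becomes U+FFFD and the `>` survives, so the quote is not left open. An ill-formed byte between attribute tokens does not become a space, so it cannot split `abc` and `onMouseOver` into separate attributes. The overlong form `C0 BE` never decodes to `>`.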

Spamming/Spoofing

• IDNA Spoofing: “paypal.com”
• Spamming: need to detect equivalences
– http://spamsource.cn
– http://spamsource．cn (fullwidth dot)
– http://bücher.de
– http://xn--bcher-kva.de
– http://b%C3%BCcher.de
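The host equivalences listed above can be collapsed to one key with a few standard-library calls. This sketch is an assumption about one workable approach, not the method used in the talk: it percent-decodes, applies NFKC (which maps the fullwidth dot to an ASCII dot), and converts to the ASCII form with Python's built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
import unicodedata
from urllib.parse import unquote


def canonical_host(host):
    """Reduce equivalent spellings of a host name to one ASCII (Punycode) form."""
    host = unquote(host)                              # b%C3%BCcher.de -> bücher.de
    host = unicodedata.normalize("NFKC", host).lower()  # fullwidth dot -> "."
    return host.encode("idna").decode("ascii")        # bücher.de -> xn--bcher-kva.de
```

All three spellings of the bücher.de host then compare equal, as do the two spellings of spamsource.cn. This does not address look-alike spoofing across scripts, which needs confusable detection rather than normalization.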

Language Detection

• Pages are so often untagged & mis-tagged:
– Both at HTTP and HTML levels
• So, we have to heuristically determine the “real” language
– Unfortunately, detecting language is more complicated than encoding
  • Mixtures of languages on same page
  • Need to detect short strings, out of context, without encoding
  • Needs to happen after entity expansion: &#xxx; → Y
– Fortunately, misdetecting language is way less problematic than encoding
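Why detection has to follow entity expansion can be shown with a deliberately crude signal. The sketch below guesses a dominant script from the first word of each letter's Unicode character name; that heuristic and the function name are assumptions for illustration, not a real language detector.

```python
import html
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return a rough script label for the letters in `text`, or None if no letters."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]  # e.g. "LATIN", "CJK", "CYRILLIC"
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None
```

Applied to the raw markup `&#x72D7;&#x72D7;`, the only letters are the `x` and `D` of the entity syntax, so the text looks Latin. After `html.unescape` the same input is two Han characters and the label changes to CJK.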

Bad codes: en-securid, English, xl, Chinese, zs, us, es, en-us., "en-us ", es-es-ts, undefined, espa�ol, utf-8

Non-English Languages

Language Tags & Detection

If Lang Tags Normalized…

Tagged vs Detected

Bad HTML

• It's easy to parse valid HTML correctly
• But invalid HTML is not uncommon
– We need to be as good at doing bad HTML as the browsers are
– That is, what the user sees in IE or Firefox is what needs to be indexed
• Illegal characters (controls) sneak in as character entities:

Segmentation Challenges

• Indexing & query: breaking text into words
– ユニコードとは何か → ユニコード · とは · 何か
• Problems if wrong:
– Source segmented as: |AB|C|
– User searches for “BC”: not found
– Can segment/query multiple ways
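One common way to "segment multiple ways" is to index overlapping character bigrams alongside, or instead of, dictionary words. The slides do not say which technique was used, so the sketch below is only an example of the idea, with illustrative function names.

```python
def bigrams(text):
    """Overlapping two-character windows; a single character is kept as-is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every bigram of the query."""
    grams = bigrams(query)
    if not grams:
        return set()
    result = None
    for gram in grams:
        hits = index.get(gram, set())
        result = set(hits) if result is None else result & hits
    return result
```

A word index over the segmentation |AB|C| holds only "AB" and "C", so a query for "BC" finds nothing, while the bigram index of "ABC" contains "BC". The cost is a larger index and, without position information, some false matches on longer queries.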

Thai Segmentation

• คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข
– Before segmentation (2007-03): 10 hits
– After segmentation: → 300,000+ hits!
• Spaces in query still make difference
– คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข acts as a complete phrase, equals:
– “คอมพิวเตอร์ จะ เกี่ยวข้อง กับ เรื่อง ของ ตัวเลข”

Morphology Challenges

• Varies by language
• Stopwords, phrases: a, the,…
• Diacriticals: sasa → saša, sasha
• Decompounding: Abiball → abiball OR abi ball
• “Forms” of a word: go → gone, went, …
• Synonyms: car shopping → auto shopping
• …

Correcting User Typing

• Users may be on keyboard without accents, or expect transliteration
– Types “Sasha” or “Sasa” or “Саша” for “Saša”
• Misspellings
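The accent-less keyboard case can be handled with a comparison key that strips combining marks. This sketch covers only that case; matching “Sasha” or “Саша” to “Saša” needs transliteration tables, which are not shown. The function name is an illustrative assumption.

```python
import unicodedata


def accent_key(text):
    """Case-insensitive key with combining marks (accents) removed."""
    decomposed = unicodedata.normalize("NFD", text)  # š -> s + combining caron
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()
```

Under this key “Saša” and “Sasa” compare equal, while the Cyrillic spelling still differs. Whether stripping accents is acceptable depends on the language, which is one reason such folding is often applied on the query side as an expansion rather than destructively in the index.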

Character folding

• Avoid spurious input differences
– “ﬁnancial” (fi lig., PDF)
• Normalize with:
– NFC + subset of NFKC + UCA + others
• Suppress display
– “➠”

#   Original Term   Index Term
1   ➠               omit
2   SHY             omit
3   ىلص             صلى
4   ￦               ₩
5   Can’t           can't
6   ﬁ               fi
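Several rows of the table can be reproduced with a short folding function. The slide says only a subset of NFKC is applied together with UCA-derived and other foldings; the sketch below uses full NFKC plus two small hypothetical tables, so it approximates the idea rather than the actual rule set.

```python
import unicodedata

OMIT = {"\u00ad", "\u27a0"}   # soft hyphen (SHY) and the arrow from rows 1-2
REPLACE = {"\u2019": "'"}     # curly apostrophe -> ASCII apostrophe (row 5)


def fold(term):
    """Fold a term to the form stored in the index."""
    text = unicodedata.normalize("NFKC", term)  # ligatures, fullwidth forms
    text = "".join(REPLACE.get(ch, ch) for ch in text if ch not in OMIT)
    return text.casefold()
```

With this function the fi ligature folds to "fi", the fullwidth won sign to ₩, and “Can’t” to "can't"; the soft hyphen and the arrow are dropped.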

SW Globalization at

Mark Davis

In Action

• Indexing stores canonicalized originals
– … Fishing … ro◌̂les → … fishing … rôles
• Query expanded to variants
– fish → fish|fishing
– rôle → role|rôle|roles|rôles
• Expansions may be language-dependent
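The split between index-time canonicalization and query-time expansion can be sketched as follows. The variant table is hypothetical and holds only the slide's two examples; in practice it would come from per-language morphology data. Canonicalization here is just NFC plus lowercasing, which is an assumption.

```python
import unicodedata

# Hypothetical expansion table containing only the slide's examples.
VARIANTS = {
    "fish": {"fish", "fishing"},
    "r\u00f4le": {"role", "r\u00f4le", "roles", "r\u00f4les"},
}


def canonicalize(text):
    """Index-time form: composed (NFC) and lowercased, accents preserved."""
    return unicodedata.normalize("NFC", text).lower()


def index_terms(document):
    """Whitespace-separated canonical terms of a document."""
    return set(canonicalize(document).split())


def expand(term):
    """Query-time expansion to known variants; unknown terms map to themselves."""
    key = canonicalize(term)
    return VARIANTS.get(key, {key})


def matches(query_term, document):
    """True if any expansion of the query term occurs in the document."""
    return bool(expand(query_term) & index_terms(document))
```

The document "Fishing ro\u0302les" is stored as "fishing rôles"; a query for "fish" or for "rôle" (composed or decomposed) matches it through expansion, while the index keeps the accented original.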

Freeform Parsing
