The HKIUG Unicode Project, Philip Wong (CityU) & Ho Yee Ip (CUHK), Fourth HKIUG Meeting, 8-9 Dec 2003
Slide 1
City University of Hong Kong Chinese University of Hong Kong
The HKIUG Unicode Project
Fourth Annual HKIUG Meeting, 8-9 Dec, 2003
Lingnan University, Hong Kong
Philip WONG, CityU Library
HO Yee Ip, CUHK Library
Slide 2
Overview
Part I
- Background
- Problems
- Objective & Methodology
- Procedures
- Deliverables and Actions
Part II
- Follow-up
- Are the problems solved
- Future work
Slide 3
City University of Hong Kong Library 香港城市大學圖書館
The HKIUG Unicode Project - Part I
by
Philip Wong
City University of Hong Kong Library
December 8, 2003
Slide 4
Background
- There are different character sets that support CJK.
- Big5 is common in HK and Taiwan, GB is used in Mainland.
- CCCII and EACC are mainly used in libraries. EACC is the LC standard.
- Unicode is widely supported in OS, applications and W3C.

Character set   No. of CJK char   Code space    Released      Support              Provide linking feature
BIG5            13,053            14,758        1984          Traditional          No
GB 18030        27,000            1.6 million   2000          Trad. & Simplified   No
CCCII           75,684            830,584       1980          Trad. & Simplified   Yes
EACC            15,728            830,584       1983          Trad. & Simplified   Yes
Unicode         82,270            1.1 million   2000 (v. 3)   Trad. & Simplified   No

Reference: KT Lam, “Overview of Chinese Character Encoding”, http://www.lib.cuhk.edu.hk/seminar/unicode/kt_lam_files/frame.htm
character sets
Slide 5
Background
Different character sets assign different code points to the same character (more precisely, the same glyph).

Character set    Code point for 余 (yu)
BIG5             A745
GB 18030-2000    5164
CCCII            213131 216076
EACC             276076
Unicode          4F59
code points
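The point of this slide can be reproduced for the encodings that Python's standard codecs cover. This is only a sketch: EACC and CCCII have no standard-library codec, so they are left out, and the helper name `code_points` is hypothetical. The GB value on the slide (5164) appears to be in row-cell notation, so the sketch also derives that form from the two GB2312 bytes; that reading is an assumption.

```python
def code_points(ch: str) -> dict[str, str]:
    """Return the code of one character in a few encodings, as hex strings."""
    gb = ch.encode("gb2312")  # two bytes in EUC form
    return {
        "unicode": f"{ord(ch):04X}",
        "big5": ch.encode("big5").hex().upper(),
        "gb2312_bytes": gb.hex().upper(),
        # Row-cell form: subtract 0xA0 from each byte and print as decimal.
        "gb2312_row_cell": f"{gb[0] - 0xA0:02d}{gb[1] - 0xA0:02d}",
    }


for name, value in code_points("余").items():
    print(f"{name:16} {value}")
```

The same character yields a different value in every column, which is why a mapping table is needed between any two of these encodings.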
Slide 6
Background
- Innovative supports CJK by storing the CJK internally in EACC and CCCII
- The internal code is not Unicode based

100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a 余光中 ,|d1928-
880 10 |6245-02/$1|a 余光中詩選

[edit mode ctrl-w]
100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a{213131}{213272}{213034},|d1928-
880 10 |6245-02/$1|a{213131}{213272}{213034}{21585c}{215c4f}
internal codes
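The relationship between the two displays of the same record can be sketched as a lookup over braced codes. The five table entries below are read off this slide by pairing the two displays; the names `SAMPLE_TABLE` and `render` are hypothetical, and this is not how Innovative implements the display.

```python
import re

# Internal code -> character, taken from the two record displays on this slide.
SAMPLE_TABLE = {
    "213131": "余",
    "213272": "光",
    "213034": "中",
    "21585c": "詩",
    "215c4f": "選",
}

# One braced internal code: six hex digits inside curly brackets.
BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def render(field: str, table: dict[str, str]) -> str:
    """Replace each {xxxxxx} with its character; unknown codes stay braced."""
    def repl(match: re.Match) -> str:
        return table.get(match.group(1).lower(), match.group(0))
    return BRACED.sub(repl, field)


print(render("{213131}{213272}{213034}{21585c}{215c4f}", SAMPLE_TABLE))
```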
Slide 7
Background
- Mapping table is required to convert internal codes to and from client encodings
- Once a good solution, but also created many problems. Many issues have been raised and discussed over the years
  - Seminar on Chinese Information Processing in Libraries, HKUST, Jan 1998
  - Good discussion list: LIB-CHINESE Listserv
mapping table

Interface                   Client encoding   Internal code
Telnet Big5, WebPAC Big5    Big5              EACC/CCCII
Millennium, WebPAC UTF-8    UTF-8             EACC/CCCII
Slide 8
Problems
Problem 1: Multiple mapping of internal codes to one client code
- The code searched for or input may not be the one desired
- Order of mappings may be different among local sites, thus inconsistent results in Z39.50 searching
- In III UTF-8 table, there are 1150 multiple mapping cases (2232 characters), including EACC and CCCII, some with high usage frequency, e.g. 台 (U+53F0), 漢 (U+6F22)

Multiple mapping of 台 (tai) in UTF-8
EACC/CCCII   Unicode   Meaning
283b7d       53F0      simplified form of the tai in “table” 檯
27605d       53F0      simplified form of the tai in “typhoon” 颱
213538       53F0      “tai” in its proper form
27542b       53F0      simplified form of the tai in “Taiwan” 臺
multiple mapping
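Finding the multiple mapping cases amounts to grouping table rows by their Unicode value and keeping the groups with more than one internal code. A minimal sketch, using the four 台 rows from this slide plus one single-mapped row for contrast; the row layout and the function name `multiple_mappings` are assumptions, not the format of the III table.

```python
from collections import defaultdict

# (internal code, Unicode code point) pairs; the last row is only for contrast.
ROWS = [
    ("283b7d", "53F0"),
    ("27605d", "53F0"),
    ("213538", "53F0"),
    ("27542b", "53F0"),
    ("213131", "4F59"),
]


def multiple_mappings(rows):
    """Group internal codes by Unicode value; keep groups with more than one code."""
    by_unicode = defaultdict(list)
    for internal, unicode_hex in rows:
        by_unicode[unicode_hex].append(internal)
    return {u: codes for u, codes in by_unicode.items() if len(codes) > 1}


for unicode_hex, codes in multiple_mappings(ROWS).items():
    print(f"U+{unicode_hex}: {len(codes)} internal codes: {', '.join(codes)}")
```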
Slide 9
Problems
Problem 2: In multi-mapping cases, there may be overlapping use of EACC and CCCII
- Overlapping introduces more multiple mappings
- Creates workload when exchanging records with international bibliographic services which only accept EACC

E.g. in UTF-8
214274 (CCCII)   U+65E6   旦 (dan)
27565A (EACC)    U+65E6   旦

E.g. in Big5
213131 (CCCII)   A745     余 (yu)
276076 (EACC)    A745     余
overlapping eacc & cccii
Slide 10
Problems
Problem 3: III mapping table contains other problems
In UTF-8 (Release 2002 Phase 3):
- errors: 27615F is mapped to U+53CB 友, it should be U+53D1 发
- missing cases: 212F30 for U+3007 〇 is missing
- wrong types: 213538 (U+53F0; 台) is typed as non-EACC, it should be EACC
errors & missing
Slide 11
Problems
Analysis done by local sites on UTF-8 mapping between April and June 2003
analysis of UTF-8

Total entries: 23,669 (Rel. 2002 Phase 3)
(* by UST, # by CityU)

Category                         According to III   Studied by local sites
EACC                             15,290 (65%)       15,665 (66%) * multi-mapping linked: 224; multi-mapping unlinked: 47
Non-EACC                         7,954 (34%)        8,004 (33%) * 954 have EACC equivalents
“may be invalid internal code”   425 (1%)           EACC 188 # Non-EACC 237

Questions:
- Can preferences be selected by local sites for multiple mappings?
- Can non-EACC codes be abandoned, those with EACC equivalents be converted to EACC in database?
- Can correct type of EACC/CCCII be re-assigned based on standard?
Slide 12
Problems
Problem 4: What triggered the HKIUG Unicode Project is the inconsistent software mapping between Big5 and UTF-8 in multiple mapping cases:
- Big5 client: mapped to the first entry
- UTF-8 client: mapped to the last entry
software inconsistency
Slide 13
Problems
Searching 才 (cai) in WebPAC Big5 (or Telnet Big5): mapped to the first

Internal   Big5
213f7b     A47E
28736d     A47E
software inconsistency (cont)
Slide 14
Problems
Searching 才 (cai) in WebPAC UTF-8 (or Millennium): mapped to the last

Internal   UTF-8
213f7b     624D
28736d     624D
software inconsistency (cont)
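The inconsistency on the last three slides can be modelled as two clients resolving the same many-to-one table with different tie-breaking rules. The sketch below assumes the table order shown on the slides; the function `to_internal` and its `pick` parameter are hypothetical and only imitate the observed behaviour.

```python
# (internal code, client code) rows for 才, in the order shown on the slides.
TABLE = [("213f7b", "624D"), ("28736d", "624D")]


def to_internal(client_code: str, table, pick: str):
    """Resolve a client code to one internal code, taking the first or last match."""
    matches = [internal for internal, client in table if client == client_code]
    if not matches:
        return None
    return matches[0] if pick == "first" else matches[-1]


# A Big5-style client takes the first entry, a UTF-8-style client the last one,
# so the same search term is turned into two different internal codes.
print(to_internal("624D", TABLE, pick="first"))
print(to_internal("624D", TABLE, pick="last"))
```

With only one matching entry both rules agree, which is why the problem shows up only in multiple mapping cases.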
Slide 15
Objective & Methodology
- A seminar was organized by CUHK in July 2003: http://www.lib.cuhk.edu.hk/seminar/unicode/
- A HKIUG Working Group on Unicode Project was formed. Members: CUHK, CityU, HKU, HKUST
Objective
- Solve software inconsistency between Big5 and UTF-8
- Decide on one-to-one mapping or many-to-one mapping
- Decide on pure EACC or EACC and CCCII
- Clean up errors, wrong types and missing cases
- Prepare to transfer to Unicode based database
Slide 16
Objective & Methodology (cont)
The working group further decided:
- Not to fix Big5 table (small character set, supports only traditional Chinese, more multiple mappings, etc.)
- Propose a new UTF-8 mapping table to Innovative
  - For EACC mapping, follow LC standard
  - Allow multiple mappings of EACC; for unlinked cases, decide on the preferences
  - For multiple mappings of EACC and CCCII, remove the CCCII
  - Convert CCCII in database to EACC equivalents
  - Avoid missing characters, include pure CCCII (though low percentage in database)
Slide 17
Procedures
created diac.utf8.hkiug

Source: LC EACC
- 15739 EACC merged
- Corrected LC mappings
- Remapped 287 PUA
- Selected preferences in multi-mapping linked and unlinked cases
- Subtracted 66 substitutes for missing (U+3013)
- Result: 15673 EACC

Source: diac.utf8
- 7999 CCCII extracted
- Subtracted 955 with EACC equivalent (prepared list for CCCII to EACC data conversion)
- Result: 7044 pure CCCII

diac.utf8.hkiug = 15673 EACC + 7044 pure CCCII = 22717 EACC/CCCII
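The counts in this diagram can be checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are the ones given on the slide.

```python
# Counts taken from the procedure diagram.
eacc_merged = 15739          # EACC entries merged from the LC tables
missing_substitutes = 66     # entries mapped to the substitute U+3013
cccii_extracted = 7999       # CCCII entries extracted from diac.utf8
cccii_with_eacc = 955        # CCCII entries that have EACC equivalents

eacc_final = eacc_merged - missing_substitutes
pure_cccii = cccii_extracted - cccii_with_eacc
total = eacc_final + pure_cccii

print(eacc_final, pure_cccii, total)
```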
Slide 18
Procedures
source from LC
Merged tables from LC's EACC to UCS/Unicode Mappings
http://www.loc.gov/marc/specifications/specchareacc.html
Slide 19
Procedures
Included pure CCCII from UTF-8 table (Rel 2002 Phase 3)

CCCII with no EACC equivalents (pure CCCII)
  e.g. 217455 坓, 22483E 洣
  7,044: added to new table

CCCII with EACC equivalents
  e.g. 213131 (CCCII) 余, 276076 (EACC) 余
  955: excluded from new table; sent to III for data conversion
source from diac.utf8
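The split described on this slide is a partition of the extracted CCCII codes by whether an EACC equivalent is known. A minimal sketch using only the three example codes from the slide; `partition_cccii` and the in-memory dictionary are assumptions, not the working group's actual tooling.

```python
# Known CCCII -> EACC equivalents (one example from the slide).
EACC_EQUIVALENTS = {"213131": "276076"}

# CCCII codes extracted from the old table (the three examples from the slide).
CCCII_CODES = ["217455", "22483E", "213131"]


def partition_cccii(cccii_codes, eacc_equivalents):
    """Split codes into pure CCCII (kept) and CCCII with an EACC equivalent (converted)."""
    pure = [c for c in cccii_codes if c not in eacc_equivalents]
    to_convert = {c: eacc_equivalents[c] for c in cccii_codes if c in eacc_equivalents}
    return pure, to_convert


pure, to_convert = partition_cccii(CCCII_CODES, EACC_EQUIVALENTS)
print("added to new table:", pure)
print("sent for data conversion:", to_convert)
```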
Slide 20
Procedures
re-mapped PUA
Re-mapped 297 Private Use Area (PUA) to suggested alternates
Slide 21
Procedures
Selected preference in multiple mapping EACC

Multiple mapping: Linked (same lower order bytes)
  Example: 4B3178 倩, 213178 倩
  # of cases: 160 (320 char)
  Enhanced indexing? Yes
  Labeled as: "multi-mapping linked"
  Preference: does not matter

Multiple mapping: Unlinked (different lower order bytes)
  Example: 283B7D 台, 27605D 台, 213538 台, 27542B 台
  # of cases: 49 (108 char)
  Enhanced indexing? No
  Labeled as: "multi-mapping unlinked"
  Preference: selected case by case (based on HKUST study on word frequency & meaning)
selected preference
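The linked/unlinked distinction can be tested mechanically by comparing the lower-order two bytes (the last four hex digits) of each internal code in a group. This sketch treats a group as linked only when every code shares the same lower-order bytes, which is a simplifying assumption; the function names are hypothetical.

```python
def lower_bytes(code: str) -> str:
    """Return the lower-order two bytes of a six-digit internal code."""
    return code.upper()[2:]


def is_linked(codes) -> bool:
    """True when all codes in a multiple-mapping group share the same lower-order bytes."""
    return len({lower_bytes(c) for c in codes}) == 1


print(is_linked(["4B3178", "213178"]))                      # the 倩 example
print(is_linked(["283B7D", "27605D", "213538", "27542B"]))  # the 台 example
```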
Slide 22
Procedures
Linked cases: HKIUG preference indicated
selected preference (cont)
Selected preference in EACC multiple mapping linked
Slide 23
Procedures
Unlinked cases: HKIUG preference indicated
selected preference (cont)
Selected preference in EACC multiple mapping unlinked
Slide 24
Procedures
Updated LC mappings
Referenced from other sources:
- Unihan
- OCLC
- USMARC Character Set for Chinese, Japanese, Korean (printed)
Examples:
- 273C67: LC mapped to U+E9D8; remapped to U+5E72 (干)
- 4B3C2b: LC mapped to U+E9C7; remapped to U+67C3 (柃)
updated LC mapping
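Both "LC mapped to" values in the examples are Private Use Area code points, which is what makes them candidates for remapping. A small sketch of how such entries could be spotted, using the Unicode general category `Co` (private use) from Python's `unicodedata` module; the helper name `is_private_use` is hypothetical.

```python
import unicodedata


def is_private_use(code_point_hex: str) -> bool:
    """True when the code point's general category is Co (private use)."""
    return unicodedata.category(chr(int(code_point_hex, 16))) == "Co"


# Code points from the two examples: LC mapping first, remapped value second.
for code_point in ("E9D8", "5E72", "E9C7", "67C3"):
    print(f"U+{code_point}: private use = {is_private_use(code_point)}")
```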
Slide 25
Procedures
CCCII with EACC Equivalents- for data conversion
CCCII EACC
list for conversion
Prepared list for data conversion
Slide 26
Deliverables and Actions
Deliverables to Innovative
1. diac.utf8.hkiug - HKIUG version of UTF-8 mapping table
   EACC         15,673
   Pure CCCII    7,044
   Total        22,717
2. hasEACC.txt - CCCII with EACC equivalents - 955
3. Final Report - Hong Kong Innovative Users Group (HKIUG) III-UTF8 Working Group Report
Actions for Innovative
1. Endorse and install diac.utf8.hkiug
2. Replace CCCII listed in hasEACC.txt with their EACC equivalents in the database
Note: local sites have the choice to implement the above actions or not (e.g. while adopting the new table, CUHK chose to run their own data conversion)
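The second action, replacing CCCII codes with their EACC equivalents, can be pictured as a substitution over braced codes in a field. The sketch below is illustrative only: the layout of hasEACC.txt is not given here, so the equivalents are held in a hypothetical dictionary with the one pair shown earlier (213131 and 276076), and `convert_field` is not a real Innovative routine.

```python
import re

# Hypothetical in-memory form of the conversion list: CCCII code -> EACC code.
HAS_EACC = {"213131": "276076"}

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def convert_field(field: str, equivalents: dict[str, str]) -> str:
    """Rewrite braced CCCII codes that have an EACC equivalent; leave the rest untouched."""
    def repl(match: re.Match) -> str:
        code = match.group(1).lower()
        if code in equivalents:
            return "{" + equivalents[code] + "}"
        return match.group(0)
    return BRACED.sub(repl, field)


print(convert_field("{213131}{213272}{213034}", HAS_EACC))
```

Codes without an entry in the list pass through unchanged, so pure CCCII characters are not affected by the conversion.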
Slide 27
The HKIUG Unicode Project - End of Part I
University Library System, CUHK
香港中文大學 大學圖書館系統
Slide 28
The HKIUG Unicode Project - Part II
by
Ho Yee Ip
CUHK University Library Systems
December 8, 2003
Slide 29
Are the problems solved
- Resolve Big5 and UTF8 software inconsistency? Yes (if abandon Big5 interfaces)
- Use the same preferred mappings among local sites? Yes (if all sites adopt the new table)
- Able to search the desired code in multiple mapping? Yes (if added entries are created)
- No overlapping of EACC and CCCII in multiple mapping? Yes
- Clear up all errors and missing cases? No (on-going job)
- Switch 100% to Millennium? No (unfortunately, 2002 Phase 3 created more problems …)
Slide 30
Are the problems solved
New problems in Release 2002 Phase 3
- In Millennium Edit, non preferred entries are implicitly converted to the preferred entry (may be an old problem in Phase 2)
- Worse, this “preferred” entry may not be the HKIUG preferred one. It is always mapped to the 2nd entry, which is wrong for multiple mappings > 2
Testing
1. in Millennium Cataloguing, input 台 in braced code {283B7D}
2. save record
3. check in telnet edit mode (Ctrl-W): still {283B7D}
4. re-save record in Millennium with no further editing
5. re-check in telnet: becomes {27542b}
Note: Global update or amending attached records will not invoke this conversion
Millennium not yet ready for CJK editing!
new problem
Slide 31
Are the problems solved
Report from sites who have installed the new UTF-8 mapping table and run the data conversion
- successful?
- failed?
- unexpected outcome?
installed sites
Slide 32
Follow-up
Continue to clean up and supplement the mapping table
- Recommend updates and changes of EACC mapping to LC and III
- There are 169 different mappings between III and LC. HKIUG followed LC
- Consider this case:
    III choice: 2D552E U+82FA 苺
    LC choice:  2D552E U+8393 莓
    Obviously different
- Consult: USMARC character set for Chinese, Japanese, Korean. Washington, D.C. : Library of Congress, 1986. The glyph of 2D552E is 苺 (the same as III)
- Is III right or LC right?
- Others: 232D42, 396B33, 23355C
mapping table
Slide 33
Follow-up
Other differences between LC and III

232D42
  III choice: U+8842 衂
  LC choice: U+4610 (2 dots), minor variation
  US MARC (printed): 衂 (same as III)

396B33
  III choice: U+524F 剏
  LC choice: U+5259 剙 (2 dots), minor variation
  US MARC (printed): 剏 (same as III)

23355C
  III choice: U+8C63 豣
  LC choice: U+86C3 蛃, obviously different
  US MARC (printed): 豣 (same as III)
mapping table (cont)
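Differences of this kind can be listed by comparing the two tables code by code. The sketch below loads only the four cases quoted on these two slides into hypothetical dictionaries `III_TABLE` and `LC_TABLE`; real tables would be read from the vendor and LC files, whose formats are not described here.

```python
# EACC code -> Unicode code point, for the four cases quoted on the slides.
III_TABLE = {"2D552E": 0x82FA, "232D42": 0x8842, "396B33": 0x524F, "23355C": 0x8C63}
LC_TABLE = {"2D552E": 0x8393, "232D42": 0x4610, "396B33": 0x5259, "23355C": 0x86C3}


def differences(table_a, table_b):
    """Return {code: (a_value, b_value)} for codes present in both tables with different values."""
    common = sorted(table_a.keys() & table_b.keys())
    return {c: (table_a[c], table_b[c]) for c in common if table_a[c] != table_b[c]}


for code, (iii_cp, lc_cp) in differences(III_TABLE, LC_TABLE).items():
    print(f"{code}: III U+{iii_cp:04X} {chr(iii_cp)}  vs  LC U+{lc_cp:04X} {chr(lc_cp)}")
```

One observation from the printed values: for 23355C the two code points, 8C63 and 86C3, differ only in the order of two hex digits.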
Slide 34
Follow-up
Continue to clean up and supplement the mapping table
- Supplement diac.utf8.hkiug with additional CCCII
  source: Unihan database file, latest data (e.g. ftp://ftp.unicode.org/Public/4.0-Update1/Unihan-4.0.1d3b.zip)
- Amend diac.utf8.hkiug when LC updates its code standard
  source: LC MARC 21 code standard (http://www.loc.gov/marc/specifications/specchareacc.html)
mapping table (cont)
Slide 35
Follow-up
Change of cataloguing practice
- Provide added entries for unlinked multi-mapping codes
  - Source data may not be the preferred code (by meaning)
  - Transcription should be faithful to the source
  - Added entries enhance retrieval
e.g. 历 U+5386
  历 {274349} <=> 曆 {214349}
  历 {27462A} preferred <=> 歷 {21462A}
Source: 万年历
added entries
Slide 36
Follow-up
Source: 万年历
  历 {274349} <=> 曆 {214349}
  历 {27462A} preferred <=> 歷 {21462A}
Action:
About 29 cases out of the 49 unlinked cases need attention

1. Data input: input the non preferred one in braced format: 万年{274349}
   Data stored: {274F22}{213C65}{274349}
   Retrieval by glyphs: 萬年曆 (i.e. by traditional glyphs: {214F22}{213C65}{214349})
   Hit? Yes
2. Data input: create the added entry by inputting the glyphs: 万年历
   Data stored: {274F22}{213C65}{27462A}
   Retrieval by glyphs: 万年历 (i.e. by simplified glyphs: {274F22}{213C65}{27462A})
   Hit? Yes
added entries (cont)
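Why the added entry is needed can be sketched by assuming that linked retrieval compares only the lower-order bytes of each stored code, following the "same lower order bytes" definition of linked codes given in Part I. That indexing rule is an assumption made for illustration, not a description of the INNOPAC index, and the function names are hypothetical.

```python
import re

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def index_key(field: str) -> tuple:
    """Drop the leading byte of each braced code, keeping the lower-order bytes."""
    return tuple(code.upper()[2:] for code in BRACED.findall(field))


def hits(query: str, stored_fields: list) -> list:
    """Return the stored fields whose index key equals the query's key."""
    return [f for f in stored_fields if index_key(f) == index_key(query)]


main_entry = "{274F22}{213C65}{274349}"    # transcribed with the non preferred code
added_entry = "{274F22}{213C65}{27462A}"   # added entry with the preferred code
by_traditional = "{214F22}{213C65}{214349}"
by_simplified = "{274F22}{213C65}{27462A}"

print(hits(by_traditional, [main_entry]))             # found through linking
print(hits(by_simplified, [main_entry]))              # not found without the added entry
print(hits(by_simplified, [main_entry, added_entry])) # found through the added entry
```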
Slide 37
Follow-up
staff mode
- Since Big5 mapping table is not fixed, cannot use Telnet Big5 mode any more; explore software: AnzioWin, putty
- In Telnet mode, INNOPAC UTF-8 port cannot support full screen editing, only line editing is feasible
CJK display corrupted in full screen editing
Slide 38
Follow-up
- For some local sites, e.g. CUHK, AnzioWin is used. When AnzioWin is set to CCCII mode, its mapping table CCCII.UNI can be used for Unicode mapping.
- Deficiency: CCCII.UNI is one-to-one, non preferred entries cannot be included, e.g.
    # 274349 53D1 # not preferred
    274C7B 53D1
- Better to use Innopac UTF-8 port when it is ready for editing
staff mode (cont)
Slide 39
Future
To migrate to pure Unicode environment…
- Abandoning EACC/CCCII will lose the linking of traditional, simplified and variant forms.
    历 U+5386
    曆 U+66C6
    歷 U+6B77
    how to link?
- Linking information is available from Unihan website. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5386
- Only if this linking is maintained by the vendor can migration be considered.
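As a rough illustration of how such linking data could be used, the sketch below parses lines in the tab-separated layout of the Unihan data files (code point, field name, value) and builds a variant lookup. The three sample lines are illustrative only: they are written from the three characters on this slide, not copied from a Unihan release, and `parse_variants` is a hypothetical helper.

```python
# Illustrative lines in Unihan's "U+XXXX<TAB>field<TAB>value" layout (not real release data).
SAMPLE = (
    "U+5386\tkTraditionalVariant\tU+66C6 U+6B77\n"
    "U+66C6\tkSimplifiedVariant\tU+5386\n"
    "U+6B77\tkSimplifiedVariant\tU+5386\n"
)

VARIANT_FIELDS = {"kTraditionalVariant", "kSimplifiedVariant"}


def parse_variants(text: str) -> dict:
    """Build {character: set of linked characters} from variant lines."""
    links = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        if field not in VARIANT_FIELDS:
            continue
        source = chr(int(code_point[2:], 16))
        targets = {chr(int(token[2:], 16)) for token in value.split()}
        links.setdefault(source, set()).update(targets)
    return links


links = parse_variants(SAMPLE)
print(sorted(links["历"]))
```

A table built this way would have to be maintained alongside the catalogue data, which is the role the slide assigns to the vendor.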