First written Oct. 5, 2000 for 14.2; revised in May 2001...

USER DOCUMENTATION

Application of Unicode

Ex Libris Ltd., 2002
Version 16
Last Update: August 18, 2003

Table of Contents

1 OVERVIEW

2 BASIC CHANGES FROM NON-UNICODE VERSIONS

3 INTRODUCING UNICODE AND UTF-8

4 PRE-UNICODE MULTI-SCRIPT

5 UNICODE MULTI-SCRIPT DISPLAY

5.1 WEB Browser

5.2 PC Client

6 COMPARISON OF PC SETTINGS (PRE-14.2 – 14.2)

7 USING UNICODE IN ALEPH - /ALEPHE/UNICODE TABLES

7.1 tab_character_conversion_line

8 MARC8_TO_UTF CONVERSION PROGRAM SPECIFICATIONS

9 BASIC UNICODE TABLES (/ALEPHE/UNICODE)

10 COMPARISON OF TABLES PRE-14 AND POST-14 (FROM 14.2) (/ALEPHE/CHAR_CONV AND /ALEPHE/UNICODE)

11 CATALOGING

12 FILING AND WORD BREAKING


1 Overview

Unicode was introduced to the ALEPH 500 system with release 14.1, in July 2000.

Release 14.1 was the first step in Unicode implementation. All bibliographic data was stored in UTF-8, but all administrative data (e.g., patron registration, vendors, item records, etc.) was stored in the local standard (such as ISO LATIN-1). From release 14.2 (December 2000) the entire system is in Unicode, and administrative data is also stored in UTF-8.

Within this new development, ALEPH 500 retains the “char_conv” principle familiar to users of previous versions. The tables enable translation of characters from two-byte to one-byte representation and vice-versa, for sorting and display purposes. ALEPH is retaining the “alpha” indicator in fields, although this may appear to be redundant in a Unicode environment. For the moment, it is used for detecting right-to-left fields (H, A). It might have additional functionality in the future.

2 Basic changes from non-Unicode versions

The /alephe/char_conv directory is no longer in use. It has been replaced by /alephe/unicode. All the character conversion tables are new and conform to UTF-8 standards.

Deciding which table is relevant for each instance has changed from program control to table control. The library installation is now able to assign the table to be used for a particular purpose (e.g., building sort keys for patron index, order index, etc.).

Another basic change is the revamping of word building and filing (sorting) procedures. This is not directly related to Unicode, but happened at the same time.

3 Introducing Unicode and UTF-8

To quote from http://www.unicode.org/unicode/standard/WhatIsUnicode.html:

“Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.



Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

Until the application of Unicode, characters were grouped in sets and limited to the code space of 256 characters. This does not suffice. For example, over 400 characters are required for implementing all the characters used by European languages that are based on the Latin character set. As a result, multiple national standards developed, adjusting the character repertoire of the specific language to the limited code space. The result has been multiple, inconsistent character code sets, no easy way to deal with multilingual data, and no transparent transfer of data between computer systems.

With Unicode, there is one standard encoding for all characters, theoretically using two bytes (16 bits) for each character. UTF-8 maps Unicode values, using one byte for “English” characters (i.e., A-Z, a-z, numbers, punctuation, etc.), two bytes for many other characters (accented characters, Hebrew, Arabic, etc.), and three bytes for CJK (Chinese, Japanese and Korean).
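The byte counts mentioned above can be checked with a short Python sketch. It uses only Python's built-in UTF-8 codec; the helper name utf8_length is introduced here for illustration and is not part of ALEPH.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to store a single character."""
    return len(ch.encode("utf-8"))


# One sample per group named in the text above.
samples = {
    "Latin capital A": "\u0041",
    "Latin e with acute": "\u00e9",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "CJK ideograph": "\u4e2d",
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

The plain Latin letter takes one byte, the accented, Hebrew and Arabic characters take two bytes each, and the CJK ideograph takes three.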

4 Pre-Unicode multi-script

Multi-script functionality of bibliographic data in non-Unicode versions of ALEPH was possible due to the presence of a script identifier (“alpha”) for the field. The “alpha” designation of a field or record, together with simultaneous activation of several code pages, meant that a large range of characters could be displayed in spite of the single-byte environment. A different font was assigned for each of the “alpha” sets (e.g., FontS01=18Courier New Cyr for “alpha S”). The mapping of special characters in the ANSEL and MAB character sets in the char_conv tables allowed for expansion of the character set, but always within the limit of the designated font.

5 Unicode multi-script display

5.1 WEB Browser

Netscape 6 and Microsoft Internet Explorer 5 support display of UTF-8. However, all data on the page that is displayed must be in the UTF-8 standard. A problem might arise if a person is not using a Unicode-enabled browser. Therefore, in order to display UTF-8 data in browsers that do not support UTF-8, ALEPH uses a “fallback” 256-character code page. The mapping from Unicode to the code page is defined in a table in the Unicode directory (e.g., unicode_to_8859_1). The default table is set in the /alephe/www_server_defaults file:

setenv server_default_charset "iso-8859-1"
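As a rough illustration of what falling back to a 256-character code page means, the sketch below uses Python's built-in iso-8859-1 codec rather than ALEPH's own mapping table, so the treatment of unmappable characters (a question mark) is Python's behaviour and only an assumption about how a fallback might look. The function name to_fallback is hypothetical.

```python
def to_fallback(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text into a single-byte code page.

    Characters that the code page cannot represent are replaced by "?".
    This is Python's codec behaviour, not the ALEPH mapping table.
    """
    return text.encode(charset, errors="replace")


print(to_fallback("caf\u00e9"))                 # accented Latin fits in ISO 8859-1
print(to_fallback("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew does not, so it is replaced
```

A character outside the fallback code page cannot be shown to such a browser, which is why the choice of default character set matters for each site.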

Note: WEB OPAC users should always set the MS Internet Explorer Encoding to Auto-Select. There are known problems with the Unicode implementation on Netscape 6, and therefore it is preferable to use MS Internet Explorer 5.

5.2 PC Client

You can set more than one font for the PC client system, assigning a different font for different ranges of the Unicode character set. Although one Unicode font can be set for the entire range of characters, you might want to use multiple fonts, for two reasons. First, a complete Unicode font is heavy on computer resources and for many sites might not be required. Secondly, most Unicode fonts (such as Cyberbit) do not support the full range of Unicode characters, and a different font may be required for special characters (e.g., old Cyrillic). There are two places in the client concerned with font setup:

\Alephcom\Tab\font.ini
\Catalog\Tab\catalog.ini

\Alephcom\Tab\font.ini

font.ini is used to define the font (and its attributes) to be used for a particular range of characters in the Unicode set, for each part of the GUI client window. The structure of the font.ini table is as follows:

Column 1:

- Window part:
  - EditorField: text in the cataloging draft window
  - EditorDescription: description of the tag in the cataloging draft window (taken from codes.<lng>)
  - EditorTag: tag in the cataloging draft window
  - ListBoxCaption: caption at the head of a column in a list box
  - ListBox##: text in a list box. Each column is identified by a number; ## can be used to signify all columns
  - UnicodeEdit: window for inputting text for Find, Scan, Jump
  - EditorForm: cataloging

Columns 2 and 3:

- Range of characters in Unicode set (from-to inclusive). The list is read top-down, and a general catchall range (0000-FFFF) can be defined as the last range in the list.

Column 4:

- Face name of font

Columns 5, 6, 7:

- Attributes (5=bold, 6=italic, 7=underline)


Column 8:

- Font size; note that the font size must be coordinated with the grid in which it is displayed. For example, this parameter should be coordinated with the line height parameter for the cataloging draft window in the catalog.ini.

Column 9:

- Opening mode; this defines the character set within the font, and is related to the fact that one font can contain many character sets. For example, “Courier New Cyr” contains both ISO-Latin-2 and Cyrillic characters, and you can define which character set will be used with a font. The default is DEFAULT_CHARSET, which is the Windows default for the PC. You can check which character sets a font contains through Programs -> Accessories -> Character Map on your PC.

Possible values are:

ANSI_CHARSET
DEFAULT_CHARSET
SYMBOL_CHARSET
SHIFTJIS_CHARSET
HANGEUL_CHARSET
GB2312_CHARSET
CHINESEBIG5_CHARSET
OEM_CHARSET
JOHAB_CHARSET
HEBREW_CHARSET
ARABIC_CHARSET
GREEK_CHARSET
TURKISH_CHARSET
THAI_CHARSET
EASTEUROPE_CHARSET
RUSSIAN_CHARSET
MAC_CHARSET
BALTIC_CHARSET


Example:

EditorTag 0000 FFFF Courier N N N 16 DEFAULT_CHARSET

EditorField 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET
EditorField 0401 045F Tahoma N N N 16 DEFAULT_CHARSET
EditorField 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET
EditorField 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET
EditorField 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

ListBoxCaption 0000 00FF Tahoma N N N 14 DEFAULT_CHARSET
ListBoxCaption 0401 045F Tahoma N N N 14 DEFAULT_CHARSET
ListBoxCaption 0384 03CE Tahoma N N N 14 DEFAULT_CHARSET
ListBoxCaption 05D0 05EA Tahoma N N N 14 DEFAULT_CHARSET
ListBoxCaption 0000 FFFF Bitstream Cyberbit N N N 14 DEFAULT_CHARSET

ListBox## 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET
ListBox## 0401 045F Tahoma N N N 16 DEFAULT_CHARSET
ListBox## 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET
ListBox## 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET
ListBox## 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

UnicodeEdit 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET
UnicodeEdit 0401 045F Tahoma N N N 16 DEFAULT_CHARSET
UnicodeEdit 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET
UnicodeEdit 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET
UnicodeEdit 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

EditorForm 0000 FFFF Courier N N N 16 DEFAULT_CHARSET

Note: The cataloging EditorForm remains in grid implementation, one character per grid square. Therefore, a proportional font is not suitable.
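The top-down reading of the range list can be sketched in Python as follows. This is an illustration of the rule described above, not the client's actual code: the names FontRule, parse_rule and pick_font are hypothetical, the assumption that "Y" switches an attribute on is not stated in this document (only "N" appears in the example), and the ListBox## wildcard is not handled.

```python
from typing import List, NamedTuple, Optional


class FontRule(NamedTuple):
    window_part: str
    start: int        # first code point of the range (inclusive)
    end: int          # last code point of the range (inclusive)
    face: str
    bold: bool
    italic: bool
    underline: bool
    size: int
    charset: str


def parse_rule(line: str) -> FontRule:
    """Parse one font.ini line; the face name may contain spaces."""
    tokens = line.split()
    part, start, end = tokens[0], tokens[1], tokens[2]
    bold, italic, underline, size, charset = tokens[-5:]
    face = " ".join(tokens[3:-5])
    # Assumption: "Y" means the attribute is on; the document shows only "N".
    return FontRule(part, int(start, 16), int(end, 16), face,
                    bold == "Y", italic == "Y", underline == "Y",
                    int(size), charset)


def pick_font(rules: List[FontRule], window_part: str, ch: str) -> Optional[FontRule]:
    """Return the first rule, reading top-down, whose range contains ch."""
    code_point = ord(ch)
    for rule in rules:
        if rule.window_part == window_part and rule.start <= code_point <= rule.end:
            return rule
    return None


SAMPLE = """\
EditorTag 0000 FFFF Courier N N N 16 DEFAULT_CHARSET
EditorField 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET
EditorField 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET
EditorField 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET
"""
RULES = [parse_rule(line) for line in SAMPLE.splitlines() if line.strip()]

print(pick_font(RULES, "EditorField", "\u05d0").face)  # Hebrew alef: specific range
print(pick_font(RULES, "EditorField", "\u4e2d").face)  # CJK: falls to the catchall
```

Because the first matching line wins, the catchall range 0000-FFFF only takes effect when it is listed last for its window part.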

\Catalog\Tab\catalog.ini

The font size definition in catalog.ini defines the character grid for tag, indicator and sub-field codes, all of which must be set in non-proportional font. It also sets the character grid for cataloging forms, as long as these forms remain non-proportional. (This will change in the future.)

FontSizeX=10
FontSizeY=16

6 Comparison of PC settings (pre-14.2 – 14.2)

- Font definitions have been moved from alephcom.ini and catalog.ini to alephcom\tab\font.ini.
- F9 in the client, for setting font and colors, is no longer in use. Colors are now set by Windows setup.
- Alephcom\tab\charset.dat is no longer in use; it is now part of font.ini.


7 Using Unicode in ALEPH - /alephe/unicode tables

Character conversion is required by various aspects of the system. The tab_character_conversion_line table defines which conversion table is to be used for each of these aspects.

7.1 tab_character_conversion_line

This table defines the procedure and table to be used in various instances when character conversion is needed. The character conversion procedure is system-set, but the character conversion table is determined by the library application. The system continues to use “alpha” for fields, and the table is set up taking this into account. Most of the lines in the table translate the data for communication with other systems.

The columns of the table are the following:

col. 1: instance (e.g., LOCATE - translation of data when creating string for locate)

col. 2: environment (e.g., PC, WWW or “any”)

col. 3: alpha code of the line or record, for further refinement of col.1 (e.g., H, L, R, S, A)

col. 4: name of the procedure to run

- line_sb2line_utf (translates line of data from single byte character to UTF-8)

- line_utf2line_sb (translates line of data from UTF-8 to single byte character)

col. 5: character conversion table

col. 6: backslash notation indicator. This is used for transposition from single byte to UTF and vice versa (sb_to_utf and utf_to_sb) for import/export, in order not to lose data. The notation is a backslash followed by the hexadecimal value of the character.
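The idea behind the backslash notation in col. 6 can be sketched in Python. The document states only that the notation is a backslash plus the hexadecimal value; the four-digit uppercase form used here, the use of Python's iso-8859-1 codec in place of a conversion table, and the function names are all assumptions made for illustration. Only characters up to U+FFFF are handled.

```python
import re

# Assumption: four hexadecimal digits after the backslash.
_BACKSLASH_HEX = re.compile(r"\\([0-9A-Fa-f]{4})")


def utf_to_sb(text: str, charset: str = "iso-8859-1") -> bytes:
    """Convert text to single-byte data, keeping unmappable characters
    as backslash + hexadecimal value so that no data is lost."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out += f"\\{ord(ch):04X}".encode("ascii")
    return bytes(out)


def sb_to_utf(data: bytes, charset: str = "iso-8859-1") -> str:
    """Reverse direction: decode single-byte data and expand backslash notation."""
    text = data.decode(charset)
    return _BACKSLASH_HEX.sub(lambda m: chr(int(m.group(1), 16)), text)


exported = utf_to_sb("caf\u00e9 \u05d0")
print(exported)
print(sb_to_utf(exported) == "caf\u00e9 \u05d0")
```

A real implementation would also need a rule for literal backslashes in the data, which this sketch does not attempt.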

The instances for column 1 are:

RLIN_TO_UTF: Translation of data imported from RLIN (UE_03) to UTF-8.

YBP_TO_UTF: Translation of data imported from YBP (p-file-96) to UTF-8.

UTF_TO_URL: Translation of the URL link in field 856 from UTF-8 to the standard required for URLs.

UTF_TO_WEB_MAIL: Translation of UTF-8 bibliographic data for the MAIL and PRINT options in the WEB OPAC.

LOCATE: Translation of data for the locate query; this data can be further translated according to the setup of the particular conf file in /alephe/gate.

FILING-KEY-nn: Translation for filing purposes. This is not system-set, but it must be coordinated with the char_conv line of the library’s /tab/tab_filing table. The filing table listed for FILING-KEY is created using UTIL P/3. (See further under Basic Unicode tables, unicode_to_filing_01_source.)

VENDOR_NAME_KEY: Translation for sorting the Vendor index by name.

COURSE_NAME_KEY: Translation for sorting the Course Reading index.

ADM_KEYWORD_KEY: Translation for keyword indexing in ADM clients, such as budget, vendor, etc.

BORROWER_NAME_KEY: Translation for sorting the Patron index by name.

ACQ_INDEX: Translation for the Acquisitions Order Index.

OCLC_To_UTF: Translation of data imported from OCLC to UTF-8.

MARC8_TO_UTF: Translation of MARC-8 data to UTF-8.

The following routines are defined for clients such as Z39.50, which work in a single character set environment.

8859_1_TO_UTF
UTF_TO_8859_1
8859_8_TO_UTF
UTF_TO_8859_8
8859_7_TO_UTF
UTF_TO_8859_7
8859_5_TO_UTF
UTF_TO_8859_5
UTF_TO_MARC8
UTF_TO_MAB


MARC8_TO_UTF conversion is different from the above procedures. For this procedure, col. 5 (the character conversion table) should be left blank, since the procedure is set to use the following tables:

marc8_ara_to_unicode
marc8_heb_to_unicode
marc8_eacc_to_unicode
marc8_lat_to_unicode
marc8_greek_to_unicode
marc8_rus_to_unicode

In addition, some of the conversion values are set in the program itself, and not in the tables.

8 MARC8_TO_UTF conversion program specifications

The program takes a string of up to 2000 characters in MARC-8 encoding.

Each record can contain sequences in more than one character set. The start of such a sequence is identified by an Escape character (X"1B") plus 1 or 2 additional characters that define a specific character set, as follows:

X"1B" + "(B"   Latin character set
X"1B" + "(2"   Hebrew character set
X"1B" + "(3"   Arabic character set
X"1B" + "(N"   Cyrillic character set
X"1B" + "(S"   Greek character set
X"1B" + "$1"   EACC character set
X"1B" + "s"    Latin character set
X"1B" + "g"    Greek symbol set
X"1B" + "b"    Subscript set
X"1B" + "p"    Superscript set

When there is no character set escape sequence, the default set is Latin.
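Splitting a MARC-8 string into runs by character set, as described above, might look like the following Python sketch. It is an illustration of the escape-sequence rule only, not the ALEPH program: the function name split_by_charset is hypothetical, and passing an unrecognised Escape through as ordinary data is an assumption, since the document does not say how such a case is handled.

```python
from typing import List, Tuple

ESC = 0x1B

# Escape designators listed in this section, mapped to character set names.
CHARSETS = {
    b"(B": "Latin", b"(2": "Hebrew", b"(3": "Arabic",
    b"(N": "Cyrillic", b"(S": "Greek", b"$1": "EACC",
    b"s": "Latin", b"g": "Greek symbols",
    b"b": "Subscript", b"p": "Superscript",
}


def split_by_charset(data: bytes) -> List[Tuple[str, bytes]]:
    """Split MARC-8 data into (character set, bytes) runs; default set is Latin."""
    segments: List[Tuple[str, bytes]] = []
    current = "Latin"
    buf = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for length in (2, 1):  # try two-character designators first
                designator = bytes(data[i + 1:i + 1 + length])
                if len(designator) == length and designator in CHARSETS:
                    if buf:
                        segments.append((current, bytes(buf)))
                        buf = bytearray()
                    current = CHARSETS[designator]
                    i += 1 + length
                    break
            else:
                # Assumption: an unrecognised Escape is kept as ordinary data.
                buf.append(data[i])
                i += 1
            continue
        buf.append(data[i])
        i += 1
    if buf:
        segments.append((current, bytes(buf)))
    return segments


print(split_by_charset(b"abc\x1b(2\x60\x61\x1b(Bxyz"))
```

Each run can then be handed to the conversion table for its character set.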

Each sequence is translated to Unicode and then to UTF, using a table specific to the character set. The tables are:

marc8_ara_to_unicode
marc8_lat_to_unicode
marc8_eacc_to_unicode
marc8_greek_to_unicode
marc8_rus_to_unicode
marc8_heb_to_unicode

In the Hebrew, Arabic, Cyrillic and Greek character sets, the Greek symbol set, and the Subscript and Superscript sets, each character is translated to one Unicode character (single-byte translation).

In the EACC character set, each group of three MARC-8 characters is translated to one Unicode character.

The MARC-8 (Latin) character set can contain combining characters (between X"E0" and X"FE" except X"FC" and X"FD"). For these characters, the translation is done on sequences up to the end of the field or subfield, whichever comes first.

In MARC-8, combining characters always precede the character with which they are combined. In Unicode, combining characters always come after the character with which they are combined. Some character sequences that contain a combining character can be translated to a single Unicode character (e.g. “a” with grave accent).

The marc8_lat_to_unicode table defines sequences of combining + base characters that translate to a single Unicode character. The left-hand column of the table is the Unicode character. The right-hand column can include up to 4 characters, which, taken together, are equivalent to one Unicode character. For example:

    01E3 e5b5

When the program finds a combining character, it examines the next character(s) until it finds a non-combining character, within the next 3 characters. This results in a pair, a triplet or a quadruplet of characters. The group is checked against the marc8_lat_to_unicode table, and translated accordingly.

If the pair, triplet or quadruplet is not found in the table, the combining characters are transposed after the non-combining character, and each of the characters is individually translated from MARC-8 to Unicode. For example, suppose the MARC-8 input is X"E0" + X"E1" + X"41". If no such combination is found in the table, the output will be X"41" + X"E0" + X"E1" (when each character is translated to Unicode, this becomes U+0041 U+0309 U+0300).

If no non-combining character is found within the 4-character string, the program continues to look for a non-combining character in the string (until end of field or subfield), and positions the combining characters after a non-combining character, as described, bypassing the marc8_lat_to_unicode table check.

If there is a sequence of combining characters with no following non-combining character, the characters are translated to UNICODE, and left in their original order.
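The handling of combining characters in the Latin set can be summarised in a Python sketch. It is a simplified reading of the rules above, not the ALEPH program: the two small dictionaries stand in for marc8_lat_to_unicode and contain only the values quoted in this section plus two MARC-8 (ANSEL) values added for the e5b5 example, the function names are hypothetical, field and subfield boundaries and Escape sequences are ignored, and bytes missing from the table become U+FFFD.

```python
# Combining characters: X"E0" to X"FE", except X"FC" and X"FD".
COMBINING = {b for b in range(0xE0, 0xFF) if b not in (0xFC, 0xFD)}

# Illustrative excerpt only; the real table is marc8_lat_to_unicode.
PRECOMPOSED = {bytes([0xE5, 0xB5]): 0x01E3}  # table line "01E3 e5b5"
SINGLE = {0xE0: 0x0309, 0xE1: 0x0300, 0xE5: 0x0304, 0xB5: 0x00E6}


def to_unicode(byte: int) -> str:
    """Translate one MARC-8 Latin byte; unknown values become U+FFFD."""
    if byte < 0x80:
        return chr(byte)
    return chr(SINGLE.get(byte, 0xFFFD))


def translate_latin(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        if data[i] not in COMBINING:
            out.append(to_unicode(data[i]))
            i += 1
            continue
        j = i
        while j < n and data[j] in COMBINING:  # collect the run of marks
            j += 1
        marks = data[i:j]
        if j == n:
            # No base character follows: keep the original order.
            out.extend(to_unicode(b) for b in marks)
            break
        group = data[i:j + 1]  # marks plus the base character
        if len(group) <= 4 and group in PRECOMPOSED:
            out.append(chr(PRECOMPOSED[group]))
        else:
            # Transpose: base character first, then the marks.
            out.append(to_unicode(data[j]))
            out.extend(to_unicode(b) for b in marks)
        i = j + 1
    return "".join(out)


result = translate_latin(bytes([0xE0, 0xE1, 0x41]))
print(" ".join(f"U+{ord(c):04X}" for c in result))
```

For the input X"E0" + X"E1" + X"41" the sketch prints U+0041 U+0309 U+0300, matching the worked example above, and X"E5" + X"B5" is translated to the single character U+01E3 through the table lookup.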

If combining characters appear before an Escape (denoting character set change):

- If the Escape is to-Latin, the group of combining characters is dealt with in relation to the first character after the Escape sequence; in other words, the Escape is ignored.

- If the Escape is to a non-Latin single-byte character set (Hebrew, Greek, etc.), then the first character after the Escape is translated to Unicode, and the combining characters listed before the Escape are transposed (placed after the character) and translated from MARC-8 to Unicode. There is no attempt to translate to a combined character (as is done with Latin, using the marc8_lat_to_unicode table).

- If the Escape is to the EACC character set, each of the combining characters that preceded this sequence is translated from MARC-8 to Unicode and the EACC sequence is translated to Unicode. In other words, the combining characters are not transposed.

The treatment of combining half marks (ligature and double tilde) is:

- input: X"EB" (left ligature) <character> X"EC" (right ligature) <character>
- output: <character> X"EB" <character> X"EC" (after which each character is translated to Unicode)

- input: X"FA" (left double tilde) <character> X"FB" (right double tilde) <character>
- output: <character> X"FA" <character> X"FB" (after which each character is translated to Unicode)

In other words, the combining half marks behave exactly the same as combining characters. Thus, there is no special handling for them.

Whenever single character translation to UNICODE fails as a result of a missing value in a table, this character is translated to X"FFFD" (replacement character, used to replace an incoming character whose value is unknown or unrepresentable in Unicode).

9 Basic Unicode tables (/alephe/unicode)

Some of the following tables are used directly by the system, without referring to the alephe/unicode/tab_character_conversion_line table. These tables must retain the name listed here. Other tables are defined in alephe/unicode/tab_character_conversion_line, which defines which table to use in each instance where conversion is required. For these tables, the name scheme listed here does not have to be adhered to. However, for easier maintenance of the system, it is recommended that libraries retain the Ex Libris naming conventions.

In all the tables, the Unicode value is in the left-hand column, in hexadecimal notation.

1. unicode_case
This three-column table lists the Unicode character mapping in the left-hand column, the corresponding uppercase character in the middle column, and the corresponding lowercase character in the right-hand column. This table is used by the utf_change_case procedure.

2. marc8_heb_to_unicode
This table maps non-UTF MARC-8 data in Hebrew to UTF-8, for data conversion and import.

3. marc8_ara_to_unicode
Arabic

4. marc8_greek_to_unicode
Greek

5. marc8_rus_to_unicode
Cyrillic

6. marc8_lat_to_unicode
This table maps non-UTF MARC-8 data in Latin to UTF-8, for data conversion and import.

7. marc8_eacc_to_unicode
This table maps MARC-8 EACC (CJK) data to UTF-8, for data conversion and import.

8. big_five2unicode
This table maps “big five” (CJK) data to UTF-8, for data conversion and import.

9. cp9362unicode
This table maps cp936 (CJK) data to UTF-8, for data conversion and import.

10. unicode_to_filing_01_source
This table is used for character conversion for filing. It must be processed using UTIL P/3 in order to create the unicode_to_filing_01 table. This latter table is the one actually used by the system. Multiple tables (…01, …02) are possible in order to allow multiple conversion tables for filing, in coordination with the library’s tab_filing table. The full “path” of the links to the table used by the system is:
- <lib>/tab/tab_filing char_conv line (e.g., FILING-KEY-01)
- /alephe/unicode/tab_character_conversion_line (e.g., FILING-KEY-01)

11. unicode_to_word_gen
This table is used by the system for character conversion for words in word indexing and for parsing a FIND request (together with procedure 90 in tab_word_breaking).

12. web_unicode_to_sb and web_sb_to_unicode
These tables are used for display in WEB browsers that cannot display UTF-8 (i.e., those before Netscape 6 and MS Internet Explorer 5).

13. adm_name_key and adm_name_key.eur
These tables can be used for the ADM data (vendor, course and borrower name key, ADM keyword and ACQ index). The “eur” table is set for the European standard for accented characters.

The library might also set up specific tables for character conversion for import and export of data, filing routines, and word breaking routines.


For documentation on how Ex Libris has set up unicode_to_filing_01, unicode_to_word_gen, tab_filing and tab_word_breaking tables, see “How to Understand Headings: Filing Routines, Character Conversion and Normalization”, Appendix 3.

10 Comparison of tables pre-14 and post-14 (from 14.2) (/alephe/char_conv and /alephe/unicode)

Note that before version 14.1, the table and the instance were together, whereas from 14.1 they are separate.

Pre-14.1 table   14 instance                      14 conversion table

char_conv.K      VENDOR_NAME_KEY                  adm_name_key
char_conv.K      BORROWER_NAME_KEY                adm_name_key
char_conv.N      ACQ_INDEX                        acq_index
char_conv.1      no parallel required
char_conv.2      no parallel required
char_conv.3      no parallel required
char_conv.4      no parallel required
char_conv.S      LOCATE                           unicode_to_locate
char_conv.A      see “Filing” section below       ----
char_conv.L      required for general purposes    unicode_case
char_conv.U      required for general purposes    unicode_case

Many of the ALEPH tables include a column for “Alpha” (i.e., character set). Now that ALEPH is Unicode compatible, this indicator is no longer meaningful and should be set to “L”. Possibly, there are instances where “C” (for Chinese) is required, but this will be in rare cases at only a few installations.

11 Cataloging

In order to input characters that are not included in the font for the defined character set, the cataloger can use “F11 + numeric value”. The numeric hexadecimal value, and not the numeric decimal value, must be entered.
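The difference between the hexadecimal and decimal readings of the same digits can be seen in a short Python sketch. It only illustrates the arithmetic; the helper name char_from_hex is hypothetical and has nothing to do with how the client processes F11 input.

```python
def char_from_hex(hex_value: str) -> str:
    """Return the character whose Unicode value is given in hexadecimal."""
    return chr(int(hex_value, 16))


print(char_from_hex("05D0"))  # Hebrew alef, U+05D0
print(char_from_hex("0041"))  # "A": hexadecimal 41 is decimal 65
print(chr(41))                # ")": what the digits 41 give if read as decimal
```

Entering the digits of a decimal value would therefore produce a different character from the one intended.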

12 Filing and Word Breaking

In previous versions, filing and word breaking routines were system-set and used char_conv.A. From version 14.1, filing and word breaking routines are made up of facets, which are defined in tables. One of the facets is the character conversion table that should be used for the routine. Therefore, there can be multiple character conversion tables for filing and word breaking.

Filing depends on definitions set in the library’s tab_filing table, and word breaking depends on definitions set in the library’s tab_word_breaking table.
