true scripts in library catalogs

13
ALCTS "Library Catalogs and Non-Roman Scripts" Program ALA Annual Conference Orlando www.ala.org/alcts 1 True Scripts in Library Catalogs – The Way Forward Joan M. Aliprand Senior Analyst, RLG © 2004 RLG Why the current limitation? Coordination! As complex as Format Integration!

Upload: vuongxuyen

Post on 21-Dec-2016

233 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 1

True Scripts in Library Catalogs – The Way Forward

Joan M. AliprandSenior Analyst, RLG

© 2004 RLG

Why the current limitation?

•Coordination!•As complex as

Format Integration!

Page 2: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 2

Script Capability and Data Exchange

Before Unicode

System A

JACKPHYcapable

System B

CJKcapable

CJK data (EACC)�

HAPY data (Hebrew, Arabic)

can only use theromanized data

• JACKPHY = East Asian scripts (CJK), Arabic, and Hebrew• CJK = scripts for Chinese, Japanese, Korean• HAPY = scripts for Hebrew, Arabic, Persian, Yiddish

� Latin data (ASCII, ANSEL)

Script Capability and Data Exchange

With unrestrained Unicode

System U+

Unicodecapable

System A

JACKPHYcapable

CJK data�

HAPY data

can only use the romanized data

• JACKPHY = East Asian scripts (CJK), Arabic, and Hebrew • CJK = scripts for Chinese, Japanese, Korean• HAPY = scripts for Hebrew, Arabic, Persian, Yiddish

� Latin data

?

?

?Other non-Roman data

Page 3: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 3

Script Capability and Data Exchange

With unrestrained Unicode

System U+

Unicodecapable

System A

JACKPHYcapable

> CJK data

> HAPY data

>…data – Outside the scope of the individual MARC 21 sets

���� – Can only use the romanized data in the record

� Latin data ?

Other non-Roman data

CHARACTERS OF THE MARC 21 RECORD

“MARC-8”• Latin character sets (default)

• Basic Latin “ASCII” (US standard ANSI X3.4)• Extended Latin “ANSEL” (ANSI/NISO Z39.47)• Euro sign € and Eszett ß for UKMARC alignment

• Supplementary character sets• Superscripts• Subscripts• Greek symbols � � �

• Non-Roman character sets• Arabic, Cyrillic, EACC (CJK), Greek, Hebrew

Page 4: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 4

CHARACTERS OF THE MARC 21 RECORD

Use of Unicode• UTF-8 encoding form• Designation of Unicode as encoding

– Leader position 09, Character coding scheme• Single character set, so no 066 field, Character Sets Present

• Initial limitation to MARC-8 character repertoire • MARBI Proposal 98-18

• Exception for Canadian Aboriginal Syllabics• MARBI Proposal 2002-11

– Encoded only as Unicode characters• Addition to MARC-8 sets not requested

Character Sets in the MARC 21 Record

At least one non-Roman character set required; ASCII

Alternate Graphic Representation (880)

Basic Latin, Extended Latin plus € and ß, Superscripts, Subscripts, Greek symbols

Data Fields (except 880)

ASCIIControl Fields (00X)

ASCII numeric charactersDirectory

ASCII graphic charactersLeader

Character Sets AllowedRecord Component

Page 5: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 5

Unicode in the MARC 21 Record

All

Basic Latin, Extended Latin plus € and ß, Superscripts, Subscripts, Greek symbols

ASCII

ASCII numeric characters

ASCII graphic characters

MARC-8 Character Sets

At least one Script other than Latin; ASCII characters

Latin | Common | Inherited

ASCII graphic characters

ASCII numeric characters

ASCII graphic characters

Unicode (UTF-8)

Alternate Graphic Representation (880)

Data Fields (except 880)

Control Fields (00X)

Directory

Leader

Record Component

Accented Letters?

• Two alternatives:– Strictly decomposed only?

• Base letter followed by combining characters

– Both composite characters and combined character sequences permitted

• Use Unicode-defined canonical equivalence and canonical ordering

Page 6: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 6

The Order of Data

• Sorting; Filing order; Collation• Not only for presentation of results

– Essential part of query matching

• Sorting by character code?– A primitive approach– DOES NOT WORK FOR UNICODE

��������������� ���������������������� ���������������������� ���������������������� �������

• Default order for all characters of each script• Four levels of differences

• Diacritics significant at Secondary level• Case significant at Tertiary level

• “Tailoring” methodology provided– For alphabetical order of a particular language– For library practices

• Does not do everything– Separate character folding needed in some cases

• Medial, final forms in Hebrew; CJK ideographic variants

Page 7: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 7

True Scripts in Authority Records

One of the factors that continues to delay the implementation of original scripts in NACO authority data is that not all places that support copies of NAF for input/update of NACO records can yet cope with all of the original scripts that might be included in these records. In order to maintain virtually identical copies of NAF, users are not allowed to create headings with original scripts that might be invisible or deleted in other copies of the file.

– Ed Glazier, RLG

Script Capability and Authority Records

LSP ParticipantAs of July 1987

OCLCRLGLCScript

���Latin

�Cyrillic

��CJK

Page 8: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 8

MARC 21 Format for Authority Records

• Data elements for non-Roman scripts– Same as in Format for Bibliographic Records

• Alternate graphic representation (880 field)• Character sets specified (066 field)• Subfield 6 for field linkage, directionality

– Assumption of 1:1 field equivalence– Not yet implemented in NAF, SAF

• Linkage between established headings– 7XX Heading Linking Entry fields

Excerpt from NAF Record

Arabic (ALA-LC romanization)410 20 ‡a Umam al-Muttahidah

Chinese (Wade-Giles romanization)410 20 ‡a Lien ho kuo

Chinese in pinyin (ALA-LC romanization)410 20 ‡a Lien he guo

Russian (ALA-LC romanization)410 20 ‡a Organizatsiia

Ob˝edinennykh Natsi�

Spanish410 20 ‡a Naciones Unidas

French410 20 ‡a Nations Unies

English110 20 ‡a United Nations

.

Page 9: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 9

110 20 ‡a United Nations410 20 ‡a Nations Unies 410 20 ‡a Naciones Unidas410 20 ‡a Organizatsiia Ob˝edinennykh Natsi�410 20 ‡a Lien he guo410 20 ‡a Lien ho kuo410 20 ‡a Umam al-Muttahidah

880 20 ‡6 410-00 ‡a ��������� �������������

880 20 ‡6 410-00 ‡a ���

880 20 ‡6 410-00/r ‡a����������

Excerpt with True Scripts

.

110 2_ ‡a

United Nations

110 2_ ‡a

Naciones Unidas

110 2_ ‡a

�������� � ���������

����

110 2_ ‡a

Nations Unies 110 2_ ‡a

���

110 2_ ‡a

���������

710

710

Page 10: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 10

Linkage within Authority Records

• Romanized/non-Roman field pairs?– Need to see all data in authority record – Only one source of name or subject authority

for 1XX – 1:many , many:1 relationships in authorities

Linkage within Authority Records

• Romanized/non-Roman field pairs?– Need to see all data in authority record – Only one source of name or subject authority

for 1XX – 1:many , many:1 relationships in authorities

� UNLINKED 880 FIELDS �

Page 11: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 11

Linkage within Authority Records

• Romanized/non-Roman field pairs?• Need to see all data in authority record • Only one source of name or subject authority for 1XX • 1:many , many:1 relationships in authorities

� UNLINKED 880 FIELDS �

• Any need to use 880 fields?– Substitution of data as for bibliographic display does

not apply– To simplify checking of field contents?– To simplify implementation?

Characters in Authority Records

Rule for Use in Authority RecordsCharacter

UseCurly brackets (open & close)

Use when availableInverted exclamation mark

Use when availableInverted question mark

Substitute number sign for nowSharp

Pass through; do not actively supplyCopyright mark

Pass through; do not actively supplyPhono copyright mark

Pass through; do not actively supplyLowercase script l

Substitute superscript 0 for nowDegree sign

Substitute %7E for nowSpacing tilde

Should not occurSpacing grave

Substitute %5F for nowSpacing underscore

Substitute %5E for nowSpacing circumflex

Rules for 13 Characters as of 01/06/2001

Page 12: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 12

Version 3.0 – 49,1942002

Version 2.0 - 38,8851996

Version 1.1 – 34,1691992

17,020

17,018

16,386

16,373

16,200

16,122

15,987

249

176

145

Cumulative Total

2

632

13

173

78

135

15,738

73

31

145

Graphic Chars.

Additional1968

CAS (UTF-8); Euro, Eszett

2002

Version 4.0 – 96,382-2003

Latin1994

Version 1.0 - 28,302Arabic1991

Hebrew1988

Cyrillic1986

EACC198x

Greek198x

Default1968

UnicodeGraphic Characters

MARC 21Character Sets

Date

�����������

“Unicode” is not a magic word• Shorthand for The Unicode Standard• Potential for multiscript support• Unicode is not software

– Underlying software comes from OEMs

• Unicode is the foundation, not the whole edifice– Many parts to a system

Page 13: True Scripts in Library Catalogs

ALCTS "Library Catalogs and Non-Roman Scripts" Program

ALA Annual Conference Orlando

www.ala.org/alcts 13

Corporate MembersAdobe Systems; Agence Intergouvernementale de la Francophonie; Apple Computer; Basis Technology; Hewlett-Packard; IBM; India (Ministry of Information Technology); Justsystem; Microsoft; Oracle; Pakistan (National Language Authority); PeopleSoft; RLG; SAP; Sun; Sybase

Associate Members33 Associate Members including:The Church of Jesus Christ of Latter-day Saints; Columbia University; Endeavor Information Systems; Ex Libris; Innovative Interfaces; The Library Corporation; OCLC; SIL International; SIRSI; VTLS