Page 1: LIS512  lecture  3 numbers and  characters

LIS512 lecture 3

numbers and characters

Thomas Krichel

2010-10-06

Page 2: LIS512  lecture  3 numbers and  characters

structure

• numbers
• numeric information
• character information
• the ASCII set
• Unicode
• encoding
• coda
  – ligatures
  – collations
  – transliterations

Page 3: LIS512  lecture  3 numbers and  characters

introduction

• We have seen that databases store records.
• Records contain fields, and fields have values.
• Here we talk about how, fundamentally, those values are composed.
  – Numeric values are easy.
  – Character values are harder.

Page 4: LIS512  lecture  3 numbers and  characters

literature

• The library textbooks are hopelessly short and confused about this topic.

• Most of what I present here comes from my own experience.

• I recommend Wikipedia; it has fascinating articles about these topics.

Page 5: LIS512  lecture  3 numbers and  characters

all gone to a number

• In all modern information systems, information is stored to be treated on a computer.
• A computer can only deal with numbers.
• As a consequence, all information has to be converted into numbers.
• It's a huge job.
• Let's start with the groundwork: numbers.

Page 6: LIS512  lecture  3 numbers and  characters

binary code

• All computers process code that is a sequence of zeros and ones.
• Such code is called binary.
• All digital information is somewhere written out as sequences of on/off binary signals.

Page 7: LIS512  lecture  3 numbers and  characters

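binary and decimal numbers

• Binary numbers can be converted to normal “decimal” numbers.
  – binary 0 is decimal 0
  – binary 1 is decimal 1
  – binary 10 is decimal 2
  – binary 11 is decimal 3
  – binary 100 is decimal 4
  – binary 101 is decimal 5
• etc.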
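
The conversion above can be sketched in a few lines of Python. The function name `binary_to_decimal` is made up for this illustration; the built-in `int` and `bin` do the same job and are shown for comparison.

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a string of 0s and 1s to an integer, digit by digit."""
    value = 0
    for bit in bits:
        # Moving one place to the left doubles the value read so far.
        value = value * 2 + int(bit)
    return value


print(binary_to_decimal("101"))  # 5
print(int("101", 2))             # 5, using the built-in conversion
print(bin(5))                    # 0b101
```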

Page 8: LIS512  lecture  3 numbers and  characters

a bit

• A bit is the elementary unit of information.
• It takes a binary value. We can label it true/false, black/white, +/-, etc.
• Every piece of information in all modern information storage systems has to be reduced to a sequence of bits.
• We will denote them 0/1 here.

Page 9: LIS512  lecture  3 numbers and  characters

example: 2 bits

• If we say that a piece of data is two bits long, we know it can hold 2 to the power 2, meaning 4, different numbers.
• In binary, they are 00, 01, 10, 11.
• In decimal, they are 0, 1, 2, 3.

Page 10: LIS512  lecture  3 numbers and  characters

byte

• A byte is a sequence of 8 bits, from '00000000' to '11111111'. There are 2 to the power 8, meaning 256, possibilities to write a byte.
• If the byte is required to start with 0, then we can only write '00000000' to '01111111'. This leaves us with 2 to the power 7, meaning 128, possibilities.
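
As a quick check of these counts, here is a small Python sketch. The helper name `possibilities` is an assumption made for this example.

```python
def possibilities(bits: int) -> int:
    """Number of different values that fit into the given number of bits."""
    return 2 ** bits


print(possibilities(2))  # 4
print(possibilities(8))  # 256
print(possibilities(7))  # 128

# Smallest and largest byte whose first bit is 0, written out as 8 bits.
print(format(0, "08b"), format(127, "08b"))  # 00000000 01111111
```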

Page 11: LIS512  lecture  3 numbers and  characters

hex numbers

• Hex numbers contain the usual digits 0 to 9, as well as A to F. A means 10, B means 11, and so on up to F, which means 15.
• One hex digit can represent 2 to the power 4, meaning 16, possibilities (0 to 15).
• Two hex digits can represent 2 to the power 8, meaning 256, possibilities.

Page 12: LIS512  lecture  3 numbers and  characters

one hex number

0 0000 | 1 0001
2 0010 | 3 0011
4 0100 | 5 0101
6 0110 | 7 0111
8 1000 | 9 1001
a 1010 | b 1011
c 1100 | d 1101
e 1110 | f 1111

Page 13: LIS512  lecture  3 numbers and  characters

same in decimal number

 0 0000 |  1 0001
 2 0010 |  3 0011
 4 0100 |  5 0101
 6 0110 |  7 0111
 8 1000 |  9 1001
10 1010 | 11 1011
12 1100 | 13 1101
14 1110 | 15 1111

Page 14: LIS512  lecture  3 numbers and  characters

bytes and hex numbers

• Since two hex digits convey the same number of possibilities as a byte, a byte is often represented as two hex digits.
• Thus, for example,
  – '00000000' in binary is '00' in hex,
  – '11111111' in binary is 'FF' in hex,
  – '01111111' in binary is '7F' in hex.
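
A minimal Python sketch of this correspondence follows. The function name `byte_to_hex` is hypothetical; it only relies on the built-in `int` and `format`.

```python
def byte_to_hex(bits: str) -> str:
    """Write an 8-bit binary string as two hex digits."""
    if len(bits) != 8:
        raise ValueError("a byte has exactly 8 bits")
    return format(int(bits, 2), "02X")


print(byte_to_hex("00000000"))  # 00
print(byte_to_hex("11111111"))  # FF
print(byte_to_hex("01111111"))  # 7F
```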

Page 15: LIS512  lecture  3 numbers and  characters

another way to see this…

• binary numbers are sometimes called base-2 numbers.

• decimal numbers are sometimes called base-10 numbers.

• so hexadecimal (hex) numbers are just base-16 numbers.
  – sometimes written using a 0x prefix.

Page 16: LIS512  lecture  3 numbers and  characters

converting information to numbers

• Many problems in converting information arise because one part of the information is encoded in one form and another part in some other form.
• Example: “15 Juillet 1923” vs “July 17, 1923”
• Often such inconsistencies require manual reformatting, which is very expensive.
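
To illustrate why mixed forms are costly, here is a hedged Python sketch that tries a short list of known date layouts and gives up on anything else. The list `KNOWN_FORMATS` and the function `normalize_date` are assumptions made for this example, and the month names are only recognised in English under Python's default locale, so a French date would still need manual work.

```python
from datetime import datetime

# Hypothetical list of layouts that this sketch knows about.
KNOWN_FORMATS = ["%d %B %Y", "%B %d, %Y"]


def normalize_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no layout fits."""
    for layout in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"unrecognised date format: {text!r}")


print(normalize_date("July 17, 1923"))
print(normalize_date("17 July 1923"))
```

Every new layout found in the data means another entry in the list, which is one way to see where the expense comes from.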

Page 17: LIS512  lecture  3 numbers and  characters

numerizing

• In the design of every information system, it is a good idea to convert information into something that is directly a number wherever possible.
• There are examples where it is possible to use a number directly, such as
  – colours
  – times and dates
  – locations.

Page 18: LIS512  lecture  3 numbers and  characters

example: colours on the web

• Colours on the world wide web follow the red/green/blue colour model.
• Each colour is given as a number #rrggbb, where rr is the amount of red, gg is the amount of green and bb is the amount of blue. All these numbers are two-digit hex numbers.
• Examples
  – #FFFFFF white
  – #00FFFF aqua
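
A short Python sketch shows how such a colour code breaks down into three numbers. The function name `parse_colour` is made up for this illustration.

```python
def parse_colour(code: str) -> tuple[int, int, int]:
    """Split a #rrggbb colour code into (red, green, blue), each 0 to 255."""
    if len(code) != 7 or not code.startswith("#"):
        raise ValueError("expected a code of the form #rrggbb")
    # Each pair of hex digits is one byte.
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))


print(parse_colour("#FFFFFF"))  # (255, 255, 255), white
print(parse_colour("#00FFFF"))  # (0, 255, 255), aqua
```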

Page 19: LIS512  lecture  3 numbers and  characters

example: times

• One way to “numerize” recent times is to take the number of seconds since the start of the first of January 1970 (UTC).
• This point in time is called the Unix epoch.
• Counting time like this has the advantage that it is straightforward to interpret and to convert into a representation of time that the user can understand.
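
In Python, for instance, the standard `datetime` module converts between seconds since the epoch and a readable time. The helper name `from_epoch` is an assumption of this sketch.

```python
from datetime import datetime, timezone


def from_epoch(seconds: int) -> datetime:
    """Turn seconds since the Unix epoch into a date and time in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(from_epoch(0).isoformat())           # 1970-01-01T00:00:00+00:00
print(from_epoch(1286323200).isoformat())  # 2010-10-06T00:00:00+00:00

# And back again: a date and time as a single number.
print(int(datetime(2010, 10, 6, tzinfo=timezone.utc).timestamp()))  # 1286323200
```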

Page 20: LIS512  lecture  3 numbers and  characters

example: location

• On earth, locations are best given by a longitude / latitude grid system.

• This makes it possible, for example, to calculate roughly how far apart two points are.

• It also allows us to refer to a location independently of its current name. Remember, names of locations change.
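
One common way to do such a rough calculation is the haversine formula, sketched below in Python. It is a choice made for this example, not something the slides prescribe. It treats the earth as a sphere with a mean radius of 6371 km, which is itself an approximation, and the name `distance_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical earth


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Rough great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
print(distance_km(0, 0, 0, 90))
```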

Page 21: LIS512  lecture  3 numbers and  characters

non-numerical information

• A lot of information is not numerical by its nature. For example
  – the name of a person
  – the title of an expression of a work
• Such information is a character string.
• To store character strings in an information system, each character has to be converted to a number.

Page 22: LIS512  lecture  3 numbers and  characters

character

• A character is an indivisible unit of textual information.

• Textual information is composed of characters, and nothing else.

Page 23: LIS512  lecture  3 numbers and  characters

characters and computers

• Computers cannot deal with characters directly. They can only deal with numbers.
• Therefore we need to associate a number with every character that we want to use in an information encoding system.
• A character set associates each character with a number.
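
Python exposes this association directly through the built-in functions `ord` (character to number) and `chr` (number to character). The variable names below are only for illustration.

```python
text = "LIS"

# Each character has a number in the character set that Python uses (Unicode).
numbers = [ord(character) for character in text]
print(numbers)  # [76, 73, 83]

# The numbers can be turned back into the same characters.
restored = "".join(chr(number) for number in numbers)
print(restored)  # LIS
```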

Page 24: LIS512  lecture  3 numbers and  characters

ASCII

• ASCII is an old character set developed in the United States. It is a seven-bit character set.
• In hex notation, it goes from '00' to '7F'.
• Because of Anglo-Saxon cultural imperialism, most other character sets either include or extend the ASCII character set.

Page 25: LIS512  lecture  3 numbers and  characters

notable characters in ASCII

decimal  hex  byte  code point  name
8        8    08    U+0008      backspace
9        9    09    U+0009      horizontal tab
10       A    0A    U+000A      line feed
13       D    0D    U+000D      carriage return
32       20   20    U+0020      space
127      7F   7F    U+007F      delete

Page 26: LIS512  lecture  3 numbers and  characters

UCS / Unicode

• UCS is a universal character set.
• It is maintained by the International Organization for Standardization (ISO).
• Unicode is an industry standard for characters. It is better documented than UCS.
• For what we discuss here, UCS and Unicode are the same.

Page 27: LIS512  lecture  3 numbers and  characters

Basic multilingual plane

• This is the name for the first 65536 characters in Unicode.
• Each of these characters fits into two bytes and is conveniently represented by four hex digits.
• Even for these characters, there are numerous complications.

Page 28: LIS512  lecture  3 numbers and  characters

wikipedia notation

• Wikipedia denotes every character in the BMP as U+hhhh where h is a hex digit 0-F.
• We will follow this notation here.
• This notation is also useful when you try to enter the characters on a computer.
• For example, in MS Windows, you
  – press and hold ALT
  – press + on the numeric keypad
  – enter the hex code
  – release ALT
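
The U+hhhh notation is easy to produce and to read back in Python. The two function names below are hypothetical helpers for this sketch.

```python
def u_notation(character: str) -> str:
    """Write a character's number as U+ followed by at least four hex digits."""
    return f"U+{ord(character):04X}"


def from_u_notation(code: str) -> str:
    """Turn a string such as 'U+00F6' back into the character."""
    return chr(int(code[2:], 16))


print(u_notation("A"))            # U+0041
print(u_notation("\u00f6"))       # U+00F6
print(from_u_notation("U+2013"))  # the en dash
```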

Page 29: LIS512  lecture  3 numbers and  characters

ascii and unicode

• The first 128 characters of UCS/Unicode are the same as the ones used by ASCII.

• So you can think of UCS/Unicode as an extension of ASCII.

Page 30: LIS512  lecture  3 numbers and  characters

dashes

• figure dash ‒ U+2012 to link digits that do not express a range

• en dash – U+2013 to link numbers that express a range

• em dash — U+2014 for interjections in a sentence

• minus sign − U+2212 for mathematics

Page 31: LIS512  lecture  3 numbers and  characters

“smart” quotes

• U+201C “ is the opening double quote
• U+201D ” is the closing double quote
• U+2019 ’ is the apostrophe
• The single quote of the ASCII character set is considered to be of mixed usage. It should be avoided when a more specific character can be used.
• Similarly, the double quote of the ASCII character set is imprecise.
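
To see how Unicode itself tells such look-alike characters apart, the following Python sketch prints the official name of the ASCII hyphen, the dashes, and the quotes. It uses the standard `unicodedata` module; the function name `describe` is an assumption of this example.

```python
import unicodedata


def describe(character: str) -> str:
    """Return the U+hhhh code and the official Unicode name of a character."""
    return f"U+{ord(character):04X} {unicodedata.name(character)}"


# ASCII hyphen, figure dash, en dash, em dash, minus sign,
# then the ASCII quotes and the typographic quotes.
for character in "-\u2012\u2013\u2014\u2212'\"\u2019\u201c\u201d":
    print(describe(character))
```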

Page 32: LIS512  lecture  3 numbers and  characters

spaces

• The non-breaking space, U+00A0, is used when you want to avoid a line break between the two spaced items. For example, in hyperlink text it is good practice to replace spaces with non-breaking spaces, so that the link does not break across lines and appear to be two links.
• In contexts that collapse whitespace, it can also be used to add extra space.

Page 33: LIS512  lecture  3 numbers and  characters

in foreign languages

• Everything becomes difficult.
• As an example consider the characters
  – o
  – ő
  – ö
• The latter two can be considered o with diacritics or as separate characters.
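
Unicode supports both views. The sketch below uses Python's standard `unicodedata.normalize` with the NFD form to split a character into a base letter plus combining marks. The function name `decompose` is made up for this illustration.

```python
import unicodedata


def decompose(character: str) -> list[str]:
    """List the code points of a character after canonical decomposition."""
    parts = unicodedata.normalize("NFD", character)
    return [f"U+{ord(part):04X}" for part in parts]


print(decompose("o"))       # ['U+006F']
print(decompose("\u00f6"))  # ['U+006F', 'U+0308'], o plus combining diaeresis
print(decompose("\u0151"))  # ['U+006F', 'U+030B'], o plus combining double acute
```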

Page 34: LIS512  lecture  3 numbers and  characters

most problematic: encoding

• One issue is how to map characters to numbers.
• This is complicated for languages other than English.
• But assume UCS/Unicode has solved this.
• Even so, this is not the main problem that we have when working with non-ASCII character data.

Page 35: LIS512  lecture  3 numbers and  characters

encoding

• The encoding determines how the numbers of each character should be put into bytes.

• If you have a character set that uses one byte for each character, you have no encoding issue.

• But then you are limited to 256 characters in your character set.

Page 36: LIS512  lecture  3 numbers and  characters

fixed-length encoding

• If you have a fixed length encoding, all characters take the same number of bytes.

• Say, for the Basic Multilingual Plane of Unicode, you need two bytes for each character, and then you are limited to that.

• If you are writing only ASCII, this appears wasteful, since every character takes two bytes where one would do.
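
The waste is easy to measure. This Python sketch uses the `utf-16-be` codec as a stand-in for a two-byte fixed-length encoding. That is an assumption of the example: UTF-16 only behaves as a fixed-length encoding for characters inside the Basic Multilingual Plane.

```python
text = "numbers and characters"

one_byte = text.encode("ascii")      # one byte per character
two_byte = text.encode("utf-16-be")  # two bytes per character in the BMP

print(len(text), len(one_byte), len(two_byte))  # 22 22 44

# For ASCII text, every other byte of the two-byte form is simply zero.
print(two_byte[:6])  # b'\x00n\x00u\x00m'
```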

Page 37: LIS512  lecture  3 numbers and  characters

variable length encoding

• The most widely used scheme to encode Unicode is a variable length scheme, called UTF-8.

• It is important to understand that the encoding needs to be known and correct.
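
Here is a small Python illustration of what happens when the encoding is assumed wrongly. The city name is just sample data, and Latin-1 is picked as an example of a wrong guess.

```python
original = "K\u00f6ln"            # Köln
data = original.encode("utf-8")   # the bytes as stored or transmitted

print(data)                    # b'K\xc3\xb6ln'
print(data.decode("utf-8"))    # Köln, read with the correct encoding
print(data.decode("latin-1"))  # KÃ¶ln, read with the wrong encoding
```

The bytes are identical in both cases; only the assumption about the encoding differs.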

Page 38: LIS512  lecture  3 numbers and  characters

basics of UTF-8

• Every ASCII character is represented as a single byte that starts with a zero bit.

• Characters that are beyond ASCII require two, three, or four bytes to be complete.

• The first byte will tell you how many bytes are coming to make the character complete.

Page 39: LIS512  lecture  3 numbers and  characters

byte shapes in UTF-8 encoding

• 0??????? ASCII
• 110????? first byte of two-byte character
• 1110???? first byte of three-byte character
• 11110??? first byte of four-byte character
• 10?????? byte that is not the first byte
• As you can see, some sequences of bytes are not valid, for example a byte of the shape 10?????? that does not follow a first byte.
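
The shapes above can be checked with a few bit shifts. The following Python sketch only looks at the leading bits of a single byte; it does not implement the further validity rules of UTF-8, and the function name `byte_shape` is hypothetical.

```python
def byte_shape(byte: int) -> str:
    """Classify one byte (0 to 255) by its leading bits."""
    if byte >> 7 == 0b0:
        return "ASCII"
    if byte >> 6 == 0b10:
        return "not the first byte"
    if byte >> 5 == 0b110:
        return "first byte of two-byte character"
    if byte >> 4 == 0b1110:
        return "first byte of three-byte character"
    if byte >> 3 == 0b11110:
        return "first byte of four-byte character"
    return "not a valid shape"


# The character U+00F6 is stored as two bytes in UTF-8.
for byte in "\u00f6".encode("utf-8"):
    print(format(byte, "08b"), byte_shape(byte))
```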

Page 40: LIS512  lecture  3 numbers and  characters

hex range to UTF-8

• 0000 to 007F   0???????
• 0080 to 07FF   110????? 10??????
• 0800 to FFFF   1110???? 10?????? 10??????
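
The table can be turned into code almost line by line. This is a minimal sketch for the three ranges shown, with a made-up function name `utf8_encode_bmp`; it leaves out the special cases that a real encoder handles, and Python's own `str.encode` is used only to compare results.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one code point from 0000 to FFFF following the table above."""
    if code_point < 0 or code_point > 0xFFFF:
        raise ValueError("outside the range covered by this sketch")
    if code_point <= 0x007F:
        return bytes([code_point])                           # 0???????
    if code_point <= 0x07FF:
        return bytes([0b11000000 | (code_point >> 6),        # 110?????
                      0b10000000 | (code_point & 0b111111)])  # 10??????
    return bytes([0b11100000 | (code_point >> 12),                  # 1110????
                  0b10000000 | ((code_point >> 6) & 0b111111),      # 10??????
                  0b10000000 | (code_point & 0b111111)])            # 10??????


for character in ["A", "\u00f6", "\u2013"]:
    encoded = utf8_encode_bmp(ord(character))
    print(f"U+{ord(character):04X}", encoded.hex(), encoded == character.encode("utf-8"))
```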

Page 41: LIS512  lecture  3 numbers and  characters

ligature

• In fine traditional typography, certain characters appear to be linked to each other.

• The most common examples in English usage are fi, ff, fl, ffi, ffl.

Page 42: LIS512  lecture  3 numbers and  characters

ligatures growing up

• In certain cases, ligatures have become so common that they have become characters of their own.
• A prominent example is the German sz ligature, the Eszett (ß). It looks a bit like a beta because it is derived from the Fraktur forms of the two letters.
• Another example, apparently, is &.

Page 43: LIS512  lecture  3 numbers and  characters

collations

• Collations are a topic related to characters.
• A collation is a sorting order of character strings.
• You may think this is trivial: just follow the alphabetic order.
• But in many languages, diacritics complicate matters.

Page 44: LIS512  lecture  3 numbers and  characters

example German

• Here are the extra letters of German: Ä/ä, Ö/ö, Ü/ü, ß
• In German, there are two collations.
  – DIN 5007-1 “dictionary collation” treats umlauted characters as if they had no umlaut, and ß as ss.
  – DIN 5007-2 “phonebook collation” treats an umlauted letter as the base letter followed by e (e.g. ä → ae), and ß as ss.
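
The difference between the two collations shows up as soon as you sort a few names. The Python sketch below is a simplification, not an implementation of the DIN standard: it only rewrites the extra letters before comparing, it maps ß to ss in both tables as an assumption, and all names in it are made up for the example.

```python
# Simplified replacement tables for the two German collations.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def dictionary_key(word: str) -> str:
    """Sort key that ignores umlauts."""
    return word.lower().translate(DICTIONARY)


def phonebook_key(word: str) -> str:
    """Sort key that expands an umlaut to the base letter plus e."""
    return word.lower().translate(PHONEBOOK)


names = ["Göthe", "Goldmann", "Goethe", "Götz"]
print(sorted(names, key=dictionary_key))  # ['Goethe', 'Goldmann', 'Göthe', 'Götz']
print(sorted(names, key=phonebook_key))   # ['Göthe', 'Goethe', 'Götz', 'Goldmann']
```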

Page 45: LIS512  lecture  3 numbers and  characters

transliterations

• When non-English characters are supposed to be entered in a system used by English-speaking people, a transliteration might be used.

• This can also be the case if the original script may not be commonly understood. An example is Japanese road signs.

• There are many ways to do this. Wikipedia lists 20 different ways to transliterate Russian, for example. The Library of Congress scheme is apparently the most widely used.

Page 46: LIS512  lecture  3 numbers and  characters

http://openlib.org/home/krichel

Thank you for your attention!

Please switch off machines b4 leaving!