lis508 lecture 1: bits, bytes and characters thomas krichel 2002-09-23


lis508 lecture 1: bits, bytes and characters

Thomas Krichel

2002-09-23


Structure

• Bits

• Bytes

• Character sets

– Coded character set

– Character encoding


Literature

• Norton “new inside the PC” chapter 4

• http://www.danbbs.dk/~erikoest/bb_terms.htm

• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html

• http://www.cl.cam.ac.uk/~mgk25/unicode.html


Information

• Information is best understood as “what it takes to answer a question”.

• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.

• The term was first used by John Tukey in 1946.

• It is a contraction of “binary digit”.
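
As a rough illustration of the "question" view of information, the sketch below counts how many yes/no answers are needed to single out one of n equally likely alternatives. The function name `bits_needed` is made up for this example.

```python
def bits_needed(alternatives: int) -> int:
    """Number of yes/no answers (bits) needed to pick one of
    `alternatives` equally likely possibilities."""
    if alternatives < 1:
        raise ValueError("need at least one alternative")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1,
    # and avoids floating-point rounding.
    return (alternatives - 1).bit_length()

print(bits_needed(2))   # 1: a single yes/no question
print(bits_needed(8))   # 3
print(bits_needed(26))  # 5: enough to pick one letter of the alphabet
```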


Usage of bits

• Computers are sometimes classified by

– The number of bits they can process at one time, i.e. the register size. Larger registers generally make a computer run faster, since more data is handled per operation.

– The number of bits they use to represent addresses, i.e. the address size. A larger address size allows larger programs to run.

• Graphics are also often described by the number of bits used to represent each dot.
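
What a given number of bits buys can be sketched as follows: n bits distinguish 2 power n values, whether those values are memory addresses or colours of a dot. The function name and the sample bit counts are illustrative choices, not taken from the slides.

```python
def distinct_values(bits: int) -> int:
    """How many different values can be represented with `bits` bits."""
    if bits < 0:
        raise ValueError("bit count must be non-negative")
    return 2 ** bits

# Graphics: bits per dot -> number of colours
print(distinct_values(8))   # 256
print(distinct_values(24))  # 16777216

# Addresses: address size -> number of addressable locations
print(distinct_values(32))  # 4294967296
```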


Many bits

• Early chips processed 8 bits at a time. It became customary to refer to a group of 8 bits as a byte.

• Larger units are

– Kilobyte is 2 power 10 bytes

– Megabyte is 2 power 20 bytes

– Gigabyte is 2 power 30 bytes

– Terabyte is 2 power 40 bytes

• The prefixes come from ancient Greek words for "thousand", "large", "giant", and "monster", respectively. Kilo- dates back to the French revolution; the other prefixes were adopted later.


More than a monster

• In 1975, the General Conference on Weights and Measures (CGPM), based at Sèvres near Paris, agreed to add peta- (P) and exa- (E).

• Petabyte is 2 power 50 bytes

• Exabyte is 2 power 60 bytes

• Nowadays they are followed by zettabyte (2 power 70) and yottabyte (2 power 80)
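
The units can be tabulated with a few lines of code. This is a minimal sketch that assumes the power-of-two convention used in this lecture and the standard order of the prefixes, in which zetta- comes before yotta-. The names `UNITS` and `unit_size` are invented for the example.

```python
# Prefixes in increasing order; each step multiplies by 2 power 10.
UNITS = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def unit_size(prefix: str) -> int:
    """Number of bytes in one <prefix>byte (power-of-two convention)."""
    exponent = 10 * (UNITS.index(prefix) + 1)
    return 2 ** exponent

for prefix in UNITS:
    exponent = 10 * (UNITS.index(prefix) + 1)
    print(f"{prefix}byte = 2 power {exponent} = {unit_size(prefix)} bytes")
```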


Hex numbers

• A byte is often represented by two hex digits.

• Each hex digit can encode 16 values

• Written 0 to 9, then A B C D E F. F is 15.

• Here, prefixed with 0x

• Use the Microsoft calculator in scientific mode to convert.
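
For those without a calculator at hand, the conversion can also be sketched in code. The function `byte_to_hex` is a hypothetical helper written for this example; it splits a byte into its high and low halves of 4 bits each and looks up one hex digit for each half.

```python
HEX_DIGITS = "0123456789ABCDEF"

def byte_to_hex(value: int) -> str:
    """Write a byte (0 to 255) as two hex digits, prefixed with 0x."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    high, low = divmod(value, 16)  # each half is in the range 0..15
    return "0x" + HEX_DIGITS[high] + HEX_DIGITS[low]

print(byte_to_hex(15))   # 0x0F
print(byte_to_hex(255))  # 0xFF

# Going the other way, Python's built-in int() accepts a base:
print(int("0xFF", 16))   # 255
```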


decimal/binary numbers

decimal   binary
0         0
1         1
2         10
3         11
4         100
5         101
6         110
7         111
8         1000
9         1001
10        1010
11        1011
12        1100
13        1101
14        1110
15        1111
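
The table can be reproduced by repeated division by 2, collecting the remainders. This is only a sketch of the usual textbook method; the function name `to_binary` is made up for the example.

```python
def to_binary(number: int) -> str:
    """Convert a non-negative integer to its binary digits."""
    if number < 0:
        raise ValueError("only non-negative numbers are handled here")
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        digits.append(str(number % 2))  # remainder is the next bit
        number //= 2
    # Remainders come out lowest bit first, so reverse them.
    return "".join(reversed(digits))

for n in range(16):
    print(n, to_binary(n))
```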


Characters

• Much of the information processed by computers is in the form of characters.

• A character is only meaningful to a human reader with a minimum of cultural background.

• A character is not a glyph.

– Example: ligatures, where one glyph represents several characters.


Representing characters

• Computers don't understand text; they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a coded character set.

• Important examples are

– ASCII

– ISO 8859-1

– cp1252
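
To make the idea of a correspondence concrete, here is a toy coded character set with four characters. `TOY_SET` and the two helper functions are hypothetical and exist only for this illustration; real sets such as ASCII work the same way with more entries.

```python
# A made-up coded character set: each character gets one number.
TOY_SET = {"a": 0, "b": 1, "c": 2, " ": 3}
TOY_REVERSE = {number: char for char, number in TOY_SET.items()}

def to_numbers(text: str) -> list:
    """Replace each character by its number in the toy set."""
    return [TOY_SET[char] for char in text]

def to_text(numbers: list) -> str:
    """Replace each number by its character in the toy set."""
    return "".join(TOY_REVERSE[number] for number in numbers)

print(to_numbers("a cab"))         # [0, 3, 2, 0, 1]
print(to_text([0, 3, 2, 0, 1]))    # a cab

# Python exposes its own built-in correspondence via ord() and chr():
print(ord("A"), chr(97))           # 65 a
```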


ASCII

• American Standard Code for Information Interchange

• 7-bit character set. There is no such thing as 8-bit ASCII

• 95 printable symbols

• 33 control characters (0-31, 127)

• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list.
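
The two counts can be checked with a short sketch that walks through all 128 positions of a 7-bit set. The variable names are illustrative.

```python
# 7 bits give 2 power 7 = 128 positions, numbered 0 to 127.
control = [code for code in range(128) if code < 32 or code == 127]
printable = [code for code in range(128) if 32 <= code <= 126]

print(len(control), len(printable))  # 33 95

# The printable symbols, starting with the space at position 32:
print("".join(chr(code) for code in printable))
```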


ASCII control codes

• ACK (6, ^F) is used to acknowledge receipt of a message, NAK (21, ^U) is used to signal non-receipt

• CR (13, ^M) is the carriage return

• LF (10, ^J) is the linefeed

• FF (12, ^L) is the form feed (new page)

• BS (8, ^H) is the backspace

• DEL (127) is delete

• ESC (27, ^[) is escape

Different programs use them in different ways, a big pain in the a…
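
The ^X notation and the numbers are related in a simple way: the control code is the ASCII code of the character after the caret with one bit (0x40, i.e. 64) flipped. The helper `caret_to_code` below is a hypothetical name used to sketch this.

```python
def caret_to_code(caret: str) -> int:
    """Convert caret notation such as '^M' to the ASCII control code."""
    if len(caret) != 2 or caret[0] != "^":
        raise ValueError("expected a caret followed by one character")
    # Flipping bit 0x40 maps '@'..'_' to 0..31 and '?' to 127.
    return ord(caret[1].upper()) ^ 0x40

for name, caret in [("ACK", "^F"), ("NAK", "^U"), ("CR", "^M"),
                    ("LF", "^J"), ("FF", "^L"), ("BS", "^H"),
                    ("ESC", "^["), ("DEL", "^?")]:
    print(name, caret, caret_to_code(caret))
```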


ISO-8859-1

• PCs work with bytes, so manufacturers were free to fill the other 128 positions.

• ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are used by the western European languages.

• It is the default character set of HTML.

• Positions 128 to 159 are not used for graphic characters.

• cp1252 fills most of these with graphic characters.
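
The difference between the two sets shows up when the same bytes are read under each of them. The sketch below uses Python's built-in "latin-1" and "cp1252" codecs; note, as an aside about Python rather than about the standard, that its "latin-1" codec turns positions 128 to 159 into invisible control characters instead of rejecting them.

```python
data = bytes([0xE9, 0x80])  # two bytes above the ASCII range

as_latin1 = data.decode("latin-1")
as_cp1252 = data.decode("cp1252")

# Position 0xE9 (233) is the same letter in both sets.
print(as_latin1[0] == as_cp1252[0])   # True
# Position 0x80 (128) differs: cp1252 puts the euro sign there.
print(as_cp1252[1] == "\u20ac")       # True

# Count how many of the positions 128..159 cp1252 actually fills.
filled = 0
for position in range(128, 160):
    try:
        bytes([position]).decode("cp1252")
        filled += 1
    except UnicodeDecodeError:
        pass  # this position is left undefined in cp1252
print(filled, "of 32 positions are filled")
```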


Three concepts for characters

• Abstract Character Repertoire: the set of characters to be encoded, e.g., some alphabet or symbol set

• Coded Character Set: a mapping from an abstract character repertoire to a set of non-negative integers

• Character Encoding Scheme: a mapping from a coded character set to a serialized sequence of bytes
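
The three concepts can be told apart in a small sketch. The repertoire of three characters is an arbitrary choice for illustration; the numbers come from Python's `ord()`, which uses the UCS numbering, and the byte sequences come from two of Python's built-in encodings.

```python
# 1. Abstract character repertoire: just a set of characters.
repertoire = {"e", "\u00e9", "\u20ac"}  # e, e with acute accent, euro sign

# 2. Coded character set: each character mapped to a non-negative integer.
coded = {char: ord(char) for char in sorted(repertoire)}

# 3. Character encoding scheme: each integer serialized as bytes.
#    The same number can give different bytes under different schemes.
for char, number in coded.items():
    print(number,
          char.encode("utf-8").hex(),
          char.encode("utf-16-be").hex())
```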


ISO 10646-1

• Defines the Universal Character Set (UCS)

• UCS contains the characters required to represent practically all known languages, even the likes of Gurmukhi, Oriya, Telugu, Bopomofo, Runic.

• There are proposals for more, like Hieroglyphs and Tengwar.

• Note that there are about 6800 known languages.



UCS organization

• ISO 10646 formally defines a 31-bit character set. Code positions are represented as 32 bits, i.e. 4 bytes, or 8 hex digits.

• The canonical form of ISO 10646 uses a four-dimensional coding space consisting of 128 groups. Each group consists of 256 planes with each plane containing 256 rows, each having 256 cells.
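
The four dimensions correspond to the four bytes of a code position. The sketch below splits a number into them; it assumes that a 31-bit value leaves 7 bits for the group, so groups run from 0x00 to 0x7F. The function name `split_ucs4` is made up for the example.

```python
def split_ucs4(code: int) -> tuple:
    """Split a 31-bit UCS code position into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("UCS code positions fit in 31 bits")
    group = (code >> 24) & 0x7F  # top byte, only 7 bits in use
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF           # lowest byte
    return group, plane, row, cell

print(split_ucs4(0x000000E9))  # (0, 0, 0, 233)
print(split_ucs4(0x00012345))  # (0, 1, 35, 69)
```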


UCS organization

• The first plane (Plane 0x00) of Group 0x00 is called the Basic Multilingual Plane (BMP). It has been fixed since first publication.

• The subsequent 223 planes (0x01 to 0xDF) of Group 0x00, as well as planes 0x00 to 0xFF in Groups 0x01 to 0x5F are reserved for further standardization.

• The last 32 planes (0xE0 to 0xFF) of Group 0x00, as well as all code positions of 32 groups (0x60 to 0x7F) are reserved for private use.
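
The layout described on this slide can be written down as a small lookup. This is a sketch of the ranges exactly as listed here, not of any later revision of the standard; `ucs_region` is a hypothetical name.

```python
def ucs_region(group: int, plane: int) -> str:
    """Classify a (group, plane) pair according to the ranges above."""
    if not (0x00 <= group <= 0x7F and 0x00 <= plane <= 0xFF):
        raise ValueError("group must be 0x00..0x7F, plane 0x00..0xFF")
    if group == 0x00:
        if plane == 0x00:
            return "BMP"
        if plane <= 0xDF:
            return "reserved for further standardization"
        return "private use"          # planes 0xE0 to 0xFF
    if group <= 0x5F:
        return "reserved for further standardization"
    return "private use"              # groups 0x60 to 0x7F

print(ucs_region(0x00, 0x00))  # BMP
print(ucs_region(0x00, 0xE0))  # private use
print(ucs_region(0x01, 0x00))  # reserved for further standardization
```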


Relationship with legacy sets

• Let U+(four hex digits) denote characters in the BMP.

• The UCS characters U+0000 to U+007F are identical to those in ASCII

• The range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1).
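
Both claims can be checked mechanically, since Python's `chr()` follows the UCS numbering and its "ascii" and "latin-1" codecs implement the two legacy sets. The helper `u_notation` is an illustrative name.

```python
def u_notation(char: str) -> str:
    """Write a character's code position as U+ and at least four hex digits."""
    return f"U+{ord(char):04X}"

print(u_notation("A"))        # U+0041
print(u_notation("\u00e9"))   # U+00E9

# Every ASCII position decodes to the UCS character with the same number.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(0x80))

# The same holds for all 256 positions of ISO 8859-1.
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(0x100))

print(ascii_matches, latin1_matches)  # True True
```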


Types of characters in UCS

• Letters

– Base characters

– Ideographic characters

– Combining characters

• Digits

• Extenders
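
One sample of each type can be inspected with Python's `unicodedata` module. This is only a rough sketch: the module reports Unicode general categories, which do not line up one-to-one with the classes listed above, and the sample characters are my own choices.

```python
import unicodedata

samples = {
    "base character": "A",
    "ideographic character": "\u4e2d",   # a CJK ideograph
    "combining character": "\u0301",     # combining acute accent
    "digit": "5",
    "extender": "\u30fc",                # Japanese prolonged sound mark
}

for label, char in samples.items():
    print(label, f"U+{ord(char):04X}",
          unicodedata.category(char), unicodedata.name(char))

# A combining character attaches to the base character before it:
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), len(composed))  # 2 1
```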


http://openlib.org/home/krichel

Thank you for your attention!