
Dale & Lewis Chapter 3: Data Representation

Analog and digital information

• The real world is continuous and infinite; data on computers are finite, so we need to approximate real-world data for our computational needs

• Analog data: information represented in a continuous form
• Digital data: information represented in a discrete form


Noise in signals

Digitizing a signal

• Sample the signal in time within discrete levels
• The pieces are numbered
• The binary number system is used to represent the numbers
• n bits can represent 2^n numbers
• Q: how many bits are needed to represent m numbers?
• Actual number of bits that can be easily addressed in a computer sets some constraints
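The bit-count question above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `bits_needed` and `quantize` and the signal range of -1.0 to 1.0 are assumptions chosen for the example.

```python
import math

def bits_needed(m):
    """Smallest n such that 2**n >= m, i.e. enough bits to number m items."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (m - 1).bit_length()

def quantize(sample, n_bits, lo=-1.0, hi=1.0):
    """Map a sample in [lo, hi] to one of 2**n_bits numbered levels.

    The range [lo, hi] is an assumption made for this sketch.
    """
    levels = 2 ** n_bits
    clamped = min(max(sample, lo), hi)
    index = int((clamped - lo) / (hi - lo) * levels)
    return min(index, levels - 1)  # the top edge belongs to the last level

# Sample one period of a sine wave at 8 points in time and
# write each level number as a 3-bit binary string.
samples = [math.sin(2 * math.pi * t / 8) for t in range(8)]
codes = [format(quantize(s, 3), "03b") for s in samples]
```

With 3 bits there are 2^3 = 8 levels, so every sample is rounded into one of eight numbered pieces; more bits give finer levels at the cost of more storage.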

Representing text

• English language character set: 26 letters (both upper and lower case), punctuation, numeric digits, etc.

• How many bits can we use?
• What about other languages?

ASCII character set

• American Standard Code for Information Interchange
• Each character is coded as a byte (8 bits)
• 7-bit code (1 check bit)
  − Later all 8 bits used in the “extended character set”
  − 128 characters encoded (2^7): 95 visible characters and 33 invisible (control) characters

7-bit ASCII character set

ASCII Table

• The table above was sorted in decimal values
• These decimal values are really representing binary sequences
• So the character J is in position 74
  − This would be 01001010 in binary or 4A in hexadecimal
  − j in position 106 is 01101010 in binary or 6A in hexadecimal
  − Notice anything? There is a purpose for that!

• The Unicode character set
  − Originally a 16-bit standard, 65,536 possible codes (later versions of the standard extend beyond 16 bits)
  − Enough to cover the principal languages of the world
  − Superset of ASCII, so the first 256 codes of Unicode are the same as Extended ASCII (the Latin-1 set)
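The J/j observation and the Unicode claim can be checked directly. The sketch below uses only Python built-ins; the helper name `describe` is hypothetical, and "Extended ASCII" is assumed here to mean ISO 8859-1 (Latin-1).

```python
def describe(ch):
    """Return the code of a character as (decimal, 8-bit binary, hexadecimal)."""
    code = ord(ch)
    return code, format(code, "08b"), format(code, "02X")

upper = describe("J")   # position 74 in the ASCII table
lower = describe("j")   # position 106 in the ASCII table

# XOR exposes the bits in which the two codes differ.
case_bit = ord("J") ^ ord("j")

# Assumption: "Extended ASCII" means Latin-1. Decoding every possible byte
# value as Latin-1 gives the Unicode code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
same_codes = [ord(c) for c in latin1_text] == list(range(256))
```

The two codes differ in a single bit (value 32), so switching between upper and lower case only requires flipping that bit.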

Text compression

• Keyword encoding
  − Substitute frequently used words with single characters
  − e.g.: “as” → ^, “the” → ~, “and” → +, “that” → $, etc.
  − Problems:
    − These characters can’t be part of the text
    − Frequently used words tend to be short, so not much gain
    − Word variations are not handled, e.g. “The” vs. “the”
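Keyword encoding as described above might be sketched as follows. The substitution table holds only the four examples from the slide; the function names and the choice to reject text that already contains a reserved symbol are assumptions made for this illustration.

```python
import re

# Substitution table built from the slide's four examples.
KEYWORDS = {"as": "^", "the": "~", "and": "+", "that": "$"}
SYMBOLS = {symbol: word for word, symbol in KEYWORDS.items()}

def keyword_encode(text):
    """Replace each whole-word keyword with its single-character symbol."""
    # First problem from the slide: the symbols cannot appear in the text.
    if any(symbol in text for symbol in SYMBOLS):
        raise ValueError("text already contains a reserved symbol")
    pattern = r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b"
    return re.sub(pattern, lambda m: KEYWORDS[m.group(1)], text)

def keyword_decode(encoded):
    """Expand every symbol back into its keyword."""
    return "".join(SYMBOLS.get(ch, ch) for ch in encoded)

original = "The cat and the hat"
encoded = keyword_encode(original)
```

The capitalized "The" is left untouched because the match is case-sensitive, which is the word-variation problem listed above, and the saving on this sentence is only four characters.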

Run-length encoding

• Replace long series of a repeated character with a special short code
  − e.g.: replace “AAAAAAA” with *A7
  − This is equivalent to replacing 01000001 01000001 01000001 01000001 01000001 01000001 01000001 with 00101010 01000001 00000111
  − Note that repetitions shorter than 4 characters are not worth encoding
  − Also note that the repetition number is encoded in binary, not ASCII, so that repetitions longer than 9 can be captured
• Used in limited-palette image compression and fax machines
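The scheme above can be sketched on raw bytes. This is a minimal sketch under stated assumptions: the count is stored in one byte (so a single code covers at most 255 repetitions), and a literal `*` in the input is always written as a run so that decoding stays unambiguous. Neither choice comes from the slides, and the function names are hypothetical.

```python
from itertools import groupby

FLAG = ord("*")   # marker byte from the slide's example
MIN_RUN = 4       # a run code costs 3 bytes, so shorter runs are left alone
MAX_RUN = 255     # assumption: the count is stored in a single binary byte

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    for value, group in groupby(data):
        remaining = len(list(group))
        while remaining > 0:
            count = min(remaining, MAX_RUN)
            # Assumption: a literal flag byte is always written as a run.
            if count >= MIN_RUN or value == FLAG:
                out += bytes([FLAG, value, count])
            else:
                out += bytes([value]) * count
            remaining -= count
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == FLAG:
            value, count = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)

packed = rle_encode(b"AAAAAAA")
```

Because the count is a binary byte and not an ASCII digit, a run of twelve characters still fits in one three-byte code.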

Huffman encoding

• Generalization of Morse code
• Morse code (dots & dashes) is based on distribution of letters in general English usage
• Huffman encoding is based on distribution in a given message
• Algorithm:
  − Encoding:
    − Build frequency table of letter usage
    − Build the code and encode the message
  − Decoding:
    − Huffman code has the prefix property
    − Prefix property: no code is the front part of another code
    − Decoding processes the bit stream until a match is found
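The encoding and decoding steps above can be sketched with a priority queue. This is one common way to build a Huffman code, not necessarily the construction used in the textbook; the function names are hypothetical, and ties between equal frequencies are broken arbitrarily, so the exact bit patterns may differ from any published table.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_code(message):
    """Build a {symbol: bit string} table from the message's own frequencies."""
    freq = Counter(message)
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(n, next(tiebreak), {symbol: ""}) for symbol, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {symbol: "0" for symbol in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2]

def huffman_encode(message, code):
    return "".join(code[symbol] for symbol in message)

def huffman_decode(bits, code):
    """Read bits until the buffer matches a code; the prefix property
    guarantees that the first match is the right one."""
    lookup = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)

code = build_huffman_code("DOORBELL")
bits = huffman_encode("DOORBELL", code)
```

Frequent letters end up near the root of the merged tree and so receive shorter codes, which is the same idea Morse code applies to English as a whole.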

Example of Huffman encoding/decoding

• Message: DOORBELL
• Encoding: 1011110110111101001100100
• Compression ratio (vs ASCII): 25/64 = 0.39
• Decode:
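The decode step can be traced in code. The code table is not shown in this section, so the table below is an assumption: a prefix code chosen to be consistent with the 25-bit string above. The function name is hypothetical.

```python
# Assumed code table (not given in this section); it is prefix-free and
# reproduces the encoded string from the example.
CODE = {"E": "01", "L": "100", "O": "110", "R": "111", "B": "1010", "D": "1011"}

ENCODED = "1011110110111101001100100"

def huffman_decode(bits, code):
    """Accumulate bits until they match a code, emit the letter, start over."""
    lookup = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)

decoded = huffman_decode(ENCODED, CODE)

# 8 bits per character in ASCII versus 25 bits in the encoded stream.
ratio = len(ENCODED) / (8 * len(decoded))
```

Under this assumed table the stream splits as 1011 110 110 111 1010 01 100 100, and the ratio works out to 25/64.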

