Dale & Lewis Chapter 3: Data Representation
Analog and digital information
• The real world is continuous and infinite; data on computers are finite, so we need to approximate real-world data for our computational needs
• Analog data: information represented in a continuous form
• Digital data: information represented in a discrete form
Analog and digital information
Noise in signals
Digitizing a signal
• Sample the signal in time within discrete levels
• The pieces are numbered
• The binary number system is used to represent the numbers
• n bits can represent 2^n numbers
• Q: how many bits are needed to represent m numbers?
• Actual number of bits that can be easily addressed in a computer sets some constraints
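One way to work out the question above: the smallest n with 2^n >= m is the ceiling of log2(m). The sketch below computes it with integer arithmetic. The function names and the 8-bit addressable unit are assumptions made for illustration, not part of the slides.

```python
def bits_needed(m: int) -> int:
    """Smallest n such that 2**n >= m (m distinct numbers, m >= 1)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # (m - 1).bit_length() equals ceil(log2(m)) without floating-point error
    return (m - 1).bit_length()


def addressable_bits(m: int, unit: int = 8) -> int:
    """Round bits_needed(m) up to a whole number of units (assumed: 8-bit bytes)."""
    n = bits_needed(m)
    return -(-n // unit) * unit  # ceiling division, then back to bits


print(bits_needed(128))        # 7 bits cover 128 numbers
print(bits_needed(129))        # one more number needs an 8th bit
print(addressable_bits(1000))  # 10 bits needed, rounded up to 2 bytes
```

The rounding step reflects the last bullet: even when 10 bits would do, storage is usually allocated in whole bytes.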
Representing text
• English language character set: 26 letters (both upper and lower case), punctuation, numeric digits, etc.
• How many bits can we use?
• What about other languages?
ASCII character set
• American Standard Code for Information Interchange
• Each character is coded as a byte (8 bits)
• 7-bit code (1 check bit)
  − Later all 8 bits used in the “extended character set”
  − 128 characters encoded (2^7)
      95 visible characters
      33 invisible (control) characters
7-bit ASCII character set
ASCII Table
• The table above is sorted by decimal value
• These decimal values really represent binary sequences
• So the character J is in position 74
  − This is 01001010 in binary or 4A in hexadecimal
  − j is in position 106, which is 01101010 in binary or 6A in hexadecimal
  − Notice anything? There is a purpose for that!
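The pattern hinted at above can be checked directly: the codes for J and j differ in exactly one bit (value 32, or 20 in hexadecimal). The sketch below prints both codes and uses that bit to switch case. The helper names are made up for this example, and the case switch is only meaningful for the ASCII letters A-Z and a-z.

```python
def describe(ch: str) -> tuple:
    """Return (decimal, 8-bit binary, hexadecimal) for one character."""
    code = ord(ch)
    return code, format(code, "08b"), format(code, "02X")


def toggle_case(ch: str) -> str:
    """Flip bit 5 (value 0x20); valid only for ASCII letters."""
    return chr(ord(ch) ^ 0x20)


print(describe("J"))     # (74, '01001010', '4A')
print(describe("j"))     # (106, '01101010', '6A')
print(toggle_case("J"))  # j
```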
• The Unicode character set
  − Originally a 16-bit standard with 65,536 possible codes (later versions extend beyond 16 bits)
  − Enough to cover the principal languages of the world
  − Superset of ASCII: the first 256 codes of Unicode are the same as the extended ASCII (Latin-1) set
Text compression
• Keyword encoding
  − Substitute frequently used words with single characters
  − e.g.: “as” → ^, “the” → ~, “and” → +, “that” → $, etc.
  − Problems:
      These characters can’t be part of the text
      Frequently used words tend to be short, so not much gain
      Word variations are not handled: e.g. “The” vs. “the”
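A minimal sketch of keyword encoding, using the four substitutions listed above. The function names are hypothetical, and rejecting any text that already contains a substitution symbol is one assumed way of handling the first problem. The example also shows the third problem: “The” is left untouched because matching is case-sensitive.

```python
import re

# Substitution table taken from the examples above
KEYWORDS = {"as": "^", "the": "~", "and": "+", "that": "$"}
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b")


def keyword_encode(text: str) -> str:
    """Replace whole-word keywords with their single-character codes."""
    if any(symbol in text for symbol in KEYWORDS.values()):
        raise ValueError("text already contains a substitution symbol")
    return _PATTERN.sub(lambda match: KEYWORDS[match.group(1)], text)


def keyword_decode(encoded: str) -> str:
    """Expand each substitution symbol back into its keyword."""
    inverse = {symbol: word for word, symbol in KEYWORDS.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)


encoded = keyword_encode("The cat and the dog")
print(encoded)                  # The cat + ~ dog
print(keyword_decode(encoded))  # The cat and the dog
```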
Run-length encoding
• Replace a long series of a repeated character with a special short code
  − e.g.: replace “AAAAAAA” with *A7
  − In binary, this replaces 01000001 01000001 01000001 01000001 01000001 01000001 01000001 with 00101010 01000001 00000111
  − Note that repetitions shorter than 4 characters are not worth encoding
  − Also note that the repetition number is encoded in binary, not ASCII, so that repetitions longer than 9 can be captured
• Used in limited-palette image compression and fax machines
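The scheme above can be sketched as follows, working on bytes so that the count is stored as a binary value rather than an ASCII digit. The names, the 255 cap on a single run (one count byte), and the rule that the flag byte may not appear in the input are assumptions of this sketch, not details from the slides.

```python
FLAG = 0x2A   # the '*' marker, 00101010
MIN_RUN = 4   # shorter runs are copied through unchanged
MAX_RUN = 255  # largest count that fits in one byte (assumed limit)


def rle_encode(data: bytes) -> bytes:
    """Replace runs of MIN_RUN or more equal bytes with FLAG, byte, count."""
    if FLAG in data:
        raise ValueError("input must not contain the flag byte")
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < MAX_RUN:
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out += bytes([FLAG, data[i], run])
        else:
            out += data[i:j]
        i = j
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand every FLAG, byte, count triple back into a run."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == FLAG:
            out += bytes([encoded[i + 1]]) * encoded[i + 2]
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


print(rle_encode(b"AAAAAAA"))  # b'*A\x07'
print(rle_encode(b"AAAB"))     # b'AAAB' (run of 3 is left alone)
```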
Huffman encoding
• Generalization of Morse code
• Morse code (dots & dashes) is based on the distribution of letters in general English usage
• Huffman encoding is based on the distribution of characters in a given message
• Algorithm:
  − Encoding:
      Build a frequency table of letter usage
      Build the code and encode the message
  − Decoding:
      Huffman code has the prefix property
      Prefix property: no code is the front part of another code
      Decoding processes the bit stream until a match is found
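The steps above can be sketched as follows. This is one common way to build a Huffman code (repeatedly merging the two least frequent subtrees with a priority queue); the slides do not say how the code is built, so the construction, the function names, and the sample message are assumptions. The exact bit patterns depend on how ties between equal frequencies are broken.

```python
import heapq
from collections import Counter


def build_huffman_codes(message: str) -> dict:
    """Build a prefix code from the character frequencies of the message."""
    if not message:
        return {}
    freq = Counter(message)
    # Heap entries are (weight, tie-breaker, tree); a tree is a symbol
    # or a (left, right) pair. The unique tie-breaker keeps tuples comparable.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def huffman_encode(message: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in message)


def huffman_decode(bits: str, codes: dict) -> str:
    """Read bits until the buffer matches a code; the prefix property
    guarantees the first match is the right one."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes = build_huffman_codes("MISSISSIPPI")
bits = huffman_encode("MISSISSIPPI", codes)
print(codes)
print(bits, huffman_decode(bits, codes))
```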
Example of Huffman encoding/decoding
• Message: DOORBELL
• Encoding: 1011110110111101001100100
• Compression ratio (vs ASCII): 25/64 = 0.39 (25 bits vs. 8 characters × 8 bits)
• Decode:
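The code table for this example is not included in the transcript. The sketch below uses an assumed table, chosen because it satisfies the prefix property and reproduces the 25-bit string above; treat it as illustrative rather than as the table from the original slide.

```python
# Assumed code table (not shown in the transcript)
ASSUMED_CODES = {
    "E": "01",
    "L": "100",
    "O": "110",
    "R": "111",
    "B": "1010",
    "D": "1011",
}


def decode(bits: str, codes: dict) -> str:
    """Accumulate bits until they match a code, emit the symbol, repeat."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


bits = "1011110110111101001100100"
decoded = decode(bits, ASSUMED_CODES)
ratio = len(bits) / (8 * len(decoded))  # encoded bits vs. 8 bits per character
print(decoded)          # DOORBELL
print(round(ratio, 2))  # 0.39
```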