Data Compression

Upload: shana-evans

Post on 31-Dec-2015

TRANSCRIPT

Page 1: Data Compression. How File Compression Works If you download many programs and files off the Internet, you've probably encountered ZIP files before. This

Data Compression

How File Compression Works

If you download many programs and files off the Internet, you've probably encountered ZIP files before. This compression system is a very handy invention, especially for Web users, because it lets you reduce the overall number of bits and bytes in a file so it can be transmitted faster over slower Internet connections, or take up less space on a disk.

Once you download the file, your computer uses a program such as WinZip or Stuffit to expand the file back to its original size. If everything works correctly, the expanded file is identical to the original file before it was compressed.

Finding Redundancy

Most types of computer files are fairly redundant -- they have the same information listed over and over again.

File-compression programs simply get rid of the redundancy.

Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

As an example, let's look at a type of information we're all familiar with: words.

In John F. Kennedy's 1961 inaugural address, he delivered this famous line:

"Ask not what your country can do for you -- ask what you can do for your country."

The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies.
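The tally above can be reproduced with a short Python sketch. It is only an illustration of the slide's counting convention: the variable names are made up here, and it assumes one space between adjacent words and a dash that costs a single unit.

```python
# Tally the quote the way the slide does: letters, spaces between words,
# one dash and one period, each costing one unit of memory.
quote = "Ask not what your country can do for you -- ask what you can do for your country."

# Split on whitespace, drop the dash token, and strip the final period.
words = [token.strip(".") for token in quote.split() if token != "--"]

letters = sum(len(word) for word in words)
spaces = len(words) - 1   # assumption: one space between adjacent words
punctuation = 2           # one dash and one period, as counted on the slide
total_units = letters + spaces + punctuation

print(len(words), letters, spaces, total_units)  # 17 61 16 79
```

A literal `len(quote)` on the string as typed here would come out slightly higher, because the dash is written as two hyphens with a space on each side.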

Immediately, we notice that:

"ask" appears two times
"what" appears two times
"your" appears two times
"country" appears two times
"can" appears two times
"do" appears two times
"for" appears two times
"you" appears two times

Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote.

To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation.

Looking it Up

Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.

The system for arranging dictionaries varies, but it could be as simple as a numbered list.

When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index.

Then, we simply write the number instead of writing out the whole word.

So, if this is our dictionary:

1. ask
2. what
3. your
4. country
5. can
6. do
7. for
8. you

Our sentence now reads:

"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"

If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern.

This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up.

To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it's downloaded.
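The numbered-dictionary idea can be sketched in a few lines of Python. This is a toy illustration of the slide's word-level scheme, not how WinZip or any real LZ implementation works; the function names `encode` and `decode` are hypothetical, and the sketch ignores capitalization and leaves out the final period.

```python
# The word dictionary from the slide: number -> word.
dictionary = {1: "ask", 2: "what", 3: "your", 4: "country",
              5: "can", 6: "do", 7: "for", 8: "you"}

def encode(text, dictionary):
    """Replace every dictionary word with its number; keep other tokens."""
    index = {word: str(number) for number, word in dictionary.items()}
    return " ".join(index.get(token.lower(), token) for token in text.split())

def decode(coded, dictionary):
    """Expand each number back into the word it stands for."""
    return " ".join(dictionary[int(token)] if token.isdigit() else token
                    for token in coded.split())

phrase = "Ask not what your country can do for you -- ask what you can do for your country"
coded = encode(phrase, dictionary)
print(coded)                      # 1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4
print(decode(coded, dictionary))  # the phrase again, in lower case
```

Because the dictionary stores lower-case words only, the round trip loses the capital "A". A lossless scheme would have to record that as well.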

But how much space have we actually saved with this system? "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you -- ask what you can do for your country," but keep in mind that we need to save the dictionary itself.

In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory. We already saw that the full phrase takes up 79 units.

Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven't reduced the file size by very much.
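The arithmetic can be checked with a small Python sketch under the same one-unit-per-character model. The only assumptions are that the dash counts as a single unit and that each dictionary entry costs its number plus its word, with no separators.

```python
dictionary = {1: "ask", 2: "what", 3: "your", 4: "country",
              5: "can", 6: "do", 7: "for", 8: "you"}
coded = "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"

# Sentence cost: every character and space is one unit; "--" is one dash.
sentence_units = len(coded.replace("--", "-"))

# Dictionary cost: the digits of each number plus the letters of each word.
dictionary_units = sum(len(str(number)) + len(word)
                       for number, word in dictionary.items())

total_units = sentence_units + dictionary_units
print(sentence_units, dictionary_units, total_units)  # 37 37 74
print(79 - total_units)                               # 5 units saved
```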

But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times. And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.

Searching for Patterns

In our example, we picked out all the repeated words and put those in a dictionary.

To us, this is the most obvious way to write a dictionary.

But a compression program sees it quite differently: It doesn't have any concept of separate words -- it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary.

If we approach the phrase from this perspective, we end up with a completely different dictionary.

If the compression program scanned Kennedy's phrase, the first redundancy it would come across would be only a couple of letters long.

In "ask not what your," there is a repeated pattern of the letter "t" followed by a space -- in "not" and "what."

If the compression program wrote this to the dictionary, it could write a "1" every time a "t" were followed by a space. But in this short phrase, this pattern doesn't occur enough to make it a worthwhile entry, so the program would eventually overwrite it.

The next thing the program might notice is "ou," which appears in both "your" and "country."

If this were a longer document, writing this pattern to the dictionary could save a lot of space -- "ou" is a fairly common combination in the English language.

But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is "ou" repeated, but the entire words "your" and "country" are both repeated, and they are actually repeated together, as the phrase "your country."

In this case, the program would overwrite the dictionary entry for "ou" with the entry for "your country."
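One way to see why a longer pattern is the better dictionary entry is to estimate what each candidate would save. The Python sketch below uses a scoring rule invented for this illustration, not one taken from any real compressor: replacing an occurrence with a one-unit number saves the pattern's length minus one, and storing the entry costs the pattern's length plus one unit for its number.

```python
phrase = "ask not what your country can do for you -- ask what you can do for your country"

def net_saving(pattern, text):
    """Return (occurrences, units saved minus dictionary cost) for one pattern."""
    count = text.count(pattern)
    saved = count * (len(pattern) - 1)  # each occurrence shrinks to one unit
    cost = len(pattern) + 1             # the entry itself plus its number
    return count, saved - cost

for pattern in ["t ", "ou", "your country", " can do for you"]:
    count, net = net_saving(pattern, phrase)
    print(repr(pattern), "occurs", count, "times, net saving", net)
# 't ' occurs 3 times, net saving 0
# 'ou' occurs 6 times, net saving 3
# 'your country' occurs 2 times, net saving 9
# ' can do for you' occurs 2 times, net saving 12
```

The estimates are not additive, since the patterns overlap: once "your country" is in the dictionary, the "ou" inside it is no longer available to replace.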

The phrase "can do for" is also repeated, one time followed by "your" and one time followed by "you," giving us a repeated pattern of "can do for you."

This lets us write 15 characters (including spaces) with one number value, while "your country" only lets us write 13 characters (with spaces) with one number value, so the program would overwrite the "your country" entry as just "r country," and then write a separate entry for "can do for you."

The program proceeds in this way, picking up all repeated bits of information and then calculating which patterns it should write to the dictionary. This ability to rewrite the dictionary is the "adaptive" part of the LZ adaptive dictionary-based algorithm.
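The pattern search can be approximated with a brute-force Python sketch that looks for the longest chunk of text occurring twice without overlapping. This is only an illustration: the helper `longest_repeat` is hypothetical, and real LZ-family compressors use much faster techniques, such as a sliding window over recently seen data, instead of trying every substring.

```python
def longest_repeat(text):
    """Return the longest substring that occurs at least twice, non-overlapping."""
    n = len(text)
    # Try the longest possible length first and stop at the first hit.
    for length in range(n // 2, 0, -1):
        for start in range(n - 2 * length + 1):
            chunk = text[start:start + length]
            # Look for a second copy that begins after this one ends.
            if text.find(chunk, start + length) != -1:
                return chunk
    return ""

phrase = "ask not what your country can do for you -- ask what you can do for your country"
best = longest_repeat(phrase)
print(repr(best), len(best))  # ' can do for you' 15
```

The result matches the 15-character pattern picked out above, leading space included.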

No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words. Using the patterns we picked out above, and adding "__" for spaces, we come up with this larger dictionary:

1. ask__
2. what__
3. you
4. r__country
5. __can__do__for__you

And this smaller sentence:

"1not__2345__--__12354"

The sentence now takes up 18 units of memory, and our dictionary takes up 41 units. So we've compressed the total file size from 79 units to 59 units!

This is just one way of compressing the phrase, and not necessarily the most efficient one.

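The pattern dictionary and the shortened sentence can be expanded again with a short Python sketch. Here each "__" from the slide is written as a single space and every dictionary number is a single digit, which is what lets the decoder walk the string one character at a time. These are assumptions of this illustration, not features of a real file format.

```python
# Pattern dictionary from the slide, with "__" written as a plain space.
patterns = {"1": "ask ", "2": "what ", "3": "you",
            "4": "r country", "5": " can do for you"}
coded = "1not 2345 -- 12354"

# Expand digit by digit; anything that is not a dictionary number is copied.
decoded = "".join(patterns.get(symbol, symbol) for symbol in coded)
print(decoded)
# ask not what your country can do for you -- ask what you can do for your country

sentence_units = len(coded.replace("--", "-"))   # the dash counts as one unit
dictionary_units = sum(len(number) + len(text)
                       for number, text in patterns.items())
print(sentence_units, dictionary_units)  # 17 41
```

The sentence as printed comes to 17 units. The slide's figure of 18 presumably includes the closing period, which gives 18 + 41 = 59.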

How Much Can You Trim?

So how good is this system? The file-reduction ratio depends on a number of factors, including file type, file size and compression scheme.

In most languages of the world, certain letters and words often appear together in the same pattern. Because of this high rate of redundancy, text files compress very well.

A reduction of 50 percent or more is typical for a good-sized text file. Most programming languages are also very redundant because they use a relatively small collection of commands, which frequently go together in a set pattern.

Files that include a lot of unique information, such as graphics or MP3 files, cannot be compressed much with this system because they don't repeat many patterns.
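The effect of redundancy on the reduction ratio can be observed with Python's standard `zlib` module, which implements DEFLATE, the LZ77-plus-Huffman method commonly used inside ZIP files. The sketch below is a rough demonstration under artificial conditions: the "text" is one sentence repeated many times, which exaggerates the redundancy of real prose, and seeded pseudo-random bytes stand in for data with no repeated patterns.

```python
import random
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)

sentence = "Ask not what your country can do for you -- ask what you can do for your country. "
text = (sentence * 50).encode("ascii")

# Pseudo-random bytes of the same length, seeded so the run is repeatable.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

print(f"repetitive text: {ratio(text):.2f}")
print(f"random bytes:    {ratio(noise):.2f}")
```

The repetitive text shrinks to a small fraction of its size, while the random bytes stay at roughly their original size or grow slightly.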

Lossy and Lossless

The type of compression we've been discussing here is called lossless compression, because it lets you recreate the original file exactly.

All lossless compression is based on the idea of breaking a file into a "smaller" form for transmission or storage and then putting it back together on the other end so it can be used again.
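The defining property of lossless compression, an exact round trip, can be shown in a few lines of Python using the standard `zlib` module. This is a minimal sketch, not a full archiver.

```python
import zlib

original = b"Ask not what your country can do for you -- ask what you can do for your country."

packed = zlib.compress(original)    # the "smaller" form for storage or transmission
restored = zlib.decompress(packed)  # put back together on the other end

# Lossless means the restored bytes are identical to the original bytes.
print(restored == original)  # True
print(len(original), len(packed))
```

On a single short sentence the packed form is not much smaller, which fits the earlier observation that one sentence offers little redundancy to exploit.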

Lossy compression works very differently. These programs simply eliminate "unnecessary" bits of information, tailoring the file so that it is smaller.

This type of compression is used a lot for reducing the file size of bitmap pictures, which tend to be fairly bulky. To see how this works, let's consider how your computer might compress a scanned photograph.

A lossless compression program can't do much with this type of file.

While large parts of the picture may look the same -- the whole sky is blue, for example -- most of the individual pixels are a little bit different. To make this picture smaller without compromising the resolution, you have to change the color value for certain pixels.

If the picture had a lot of blue sky, the program would pick one color of blue that could be used for every pixel.

Then, the program rewrites the file so that the value for every sky pixel refers back to this information.

If the compression scheme works well, you won't notice the change, but the file size will be significantly reduced.
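The blue-sky idea can be mimicked with a toy Python model. It assumes a "sky" made of 10,000 one-byte blue values that differ slightly from pixel to pixel, replaces them all with their average, and then compares how well each version compresses with `zlib`. Real lossy formats such as JPEG use considerably more elaborate methods, so treat this only as an illustration of the principle.

```python
import random
import zlib

# Hypothetical sky: blue values that are almost, but not exactly, the same.
rng = random.Random(0)
sky = bytes(rng.randint(200, 215) for _ in range(10_000))

# Lossy step: pick one shade of blue and use it for every pixel.
average_blue = round(sum(sky) / len(sky))
flattened = bytes([average_blue]) * len(sky)

lossless_size = len(zlib.compress(sky))     # original pixels, compressed as-is
lossy_size = len(zlib.compress(flattened))  # flattened pixels, then compressed

print(lossless_size, lossy_size)
print(flattened == sky)  # False: the original values cannot be recovered
```

The flattened sky compresses to a tiny fraction of the other version, at the price of discarding the original pixel values for good.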

Of course, with lossy compression, you can't get the original file back after it has been compressed.

You're stuck with the compression program's reinterpretation of the original.

For this reason, you can't use this sort of compression for anything that needs to be reproduced exactly, including software applications, databases and presidential inauguration speeches.