Character Encoding and Compression
Suppose you sent an email that began with "Hello there".
Each character in that string would require one byte of storage,
and because there are 8 bits per byte, the string would require
88 bits.
There is a standard set of binary values that are used to represent
characters, called the ASCII code, or the American Standard Code
for Information Interchange. Here are the ASCII codes that are used
to represent the characters of our example string:
| Character | | ASCII Code |
| H | | 01001000 |
| e | | 01100101 |
| l | | 01101100 |
| o | | 01101111 |
| (space) | | 00100000 |
| t | | 01110100 |
| h | | 01101000 |
| r | | 01110010 |
|
| |
Using standard bit codes to represent characters is what allows
us to send email back and forth. When your email software receives
01100101 for example, it assumes that someone is sending
you an 'e'. If both sides of an email exchange didn't agree on
which characters were assigned to which bit codes, the characters
would become garbled.
|
|
Using the ASCII standard binary codes, "Hello there" looks like this:
0100100001100101011011000110110001101111001000000111010001101000011001010111001001100101
| |
The ASCII binary code is called a fixed-length code, because it uses
a fixed number of bits to represent each character, but it isn't the
only code. As long as we use the same code to encode and decode
characters, we can use any code we wish. For example, if we assigned
the values as shown on the right to the characters of our message,
then "Hello there" would be represented thus:
|
|
| Character | | Made-up Code |
| H | | 000 |
| e | | 001 |
| l | | 010 |
| o | | 011 |
| (space) | | 100 |
| t | | 101 |
| h | | 110 |
| r | | 111 |
|
000001010010011100101110001111001 = 33 bits
This is much more compact than the ASCII representation.
With 8 unique characters, using a code of only 3 bits per character
is pretty efficient, and is the best we can do with a fixed-length
code. However, what fixed-length codes do not take into account is
that some characters may occur more frequently than others. It is
possible to take advantage of that to compress the data further.
| |
Huffman coding is a technique of assigning variable length binary
codes to characters according to their frequency. The idea is to get
improved compression by using codes with fewer bits to represent
characters that occur the most often. If we assigned the values on
the right to the characters of our message, then "Hello there" could
be represented with one fewer bit than the above representation.
|
|
| Character | | Huffman Code |
| H | | 101 |
| e | | 01 |
| l | | 001 |
| o | | 111 |
| (space) | | 100 |
| t | | 0001 |
| h | | 110 |
| r | | 0000 |
|
10101001001111100000111001000001 = 32 bits
Variable length encoding does have its subtleties. It must be
decoded one bit at a time, and the first match determines which
character is represented. Note that the codes 000, 010, and 011
were not used in the above example. In fact they could not have
been used, as they would have conflicted with the other codes
when it came time to decode the stream of bits. 'e's short code
of 01 would cause the decoder to misinterpret either 010 or 011
as an 'e' followed by a bit which would in turn cause another
misinterpretation of the next code. Computing the best codes to
use for a given message is a task best left to a program.
| |
Here's a final example, along with C source code to the software
used to compute the variable length codes and to encode and decode
arbitrary disk files. The message "the rain in spain falls mainly
in the plain" is 43 characters long and contains 14 unique characters.
It can be encoded in 151 bits with variable-length per-character codes.
huffman.h
huff_enc.c
huff_dec.c
The fact that a single code might be used to represent "ain" is
a topic for another day.
|
|
| Character | Occurrences | Code |
| (space) | 8 | 000 |
| i | 6 | 010 |
| n | 6 | 011 |
| a | 5 | 100 |
| l | 4 | 0010 |
| e | 2 | 1011 |
| h | 2 | 1100 |
| p | 2 | 1101 |
| s | 2 | 1110 |
| t | 2 | 1111 |
| f | 1 | 00110 |
| m | 1 | 00111 |
| r | 1 | 10100 |
| y | 1 | 10101 |
|