Character Encoding and Compression

Suppose you sent an email that began with "Hello there". Each of the 11 characters in that string would require one byte of storage, and because there are 8 bits per byte, the string would require 88 bits.

There is a standard set of binary values used to represent characters, called ASCII, the American Standard Code for Information Interchange. Here are the ASCII codes used to represent the characters of our example string:

Character ASCII Code
H 01001000
e 01100101
l 01101100
o 01101111
(space) 00100000
t 01110100
h 01101000
r 01110010

Using standard bit codes to represent characters is what allows us to send email back and forth. When your email software receives 01100101, for example, it assumes that someone is sending you an 'e'. If both sides of an email exchange didn't agree on which characters were assigned to which bit codes, the characters would become garbled.

Using the ASCII standard binary codes, "Hello there" looks like this:
0100100001100101011011000110110001101111001000000111010001101000011001010111001001100101
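
If you'd like to check that string for yourself, a few lines of C are enough to print it bit by bit. This sketch isn't part of the programs linked below; it just shifts out the 8 bits of each character, most significant bit first:

#include <stdio.h>

int main(void)
{
    const char *msg = "Hello there";

    /* For each character, print its 8 ASCII bits from most to least significant. */
    for (const char *p = msg; *p != '\0'; p++) {
        for (int bit = 7; bit >= 0; bit--)
            putchar(((*p >> bit) & 1) ? '1' : '0');
    }
    putchar('\n');   /* prints the 88-bit string shown above */
    return 0;
}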

The ASCII binary code is called a fixed-length code, because it uses a fixed number of bits to represent each character, but it isn't the only code. As long as we use the same code to encode and decode characters, we can use any code we wish. For example, if we assigned the values shown below to the characters of our message, then "Hello there" would be represented thus:
Character Made-up Code
H 000
e 001
l 010
o 011
(space) 100
t 101
h 110
r 111

000001010010011100101110001111001   =   33 bits

This is much more compact than the ASCII representation. With 8 unique characters, using a code of only 3 bits per character is pretty efficient, and is the best we can do with a fixed-length code. However, what fixed-length codes do not take into account is that some characters may occur more frequently than others. It is possible to take advantage of that to compress the data further.
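
To make the table-lookup idea concrete, here is a small C sketch of encoding with the made-up 3-bit code. It is purely illustrative; the code_for helper is invented for this example and is not taken from the programs linked below:

#include <stdio.h>
#include <string.h>

/* The made-up 3-bit fixed-length code from the table above. */
static const char *code_for(char c)
{
    switch (c) {
    case 'H': return "000";
    case 'e': return "001";
    case 'l': return "010";
    case 'o': return "011";
    case ' ': return "100";
    case 't': return "101";
    case 'h': return "110";
    case 'r': return "111";
    default:  return "";   /* characters outside our 8-symbol alphabet */
    }
}

int main(void)
{
    const char *msg = "Hello there";
    size_t bits = 0;

    for (const char *p = msg; *p != '\0'; p++) {
        const char *code = code_for(*p);
        fputs(code, stdout);
        bits += strlen(code);
    }
    printf("   =   %zu bits\n", bits);   /* 33 bits, as shown above */
    return 0;
}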

Huffman coding is a technique for assigning variable-length binary codes to characters according to their frequency. The idea is to get improved compression by using codes with fewer bits to represent the characters that occur most often. If we assigned the values shown below to the characters of our message, then "Hello there" could be represented with one fewer bit than the above representation.
Character Huffman Code
H 101
e 01
l 001
o 111
(space) 100
t 0001
h 110
r 0000

10101001001111100000111001000001   =   32 bits
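
You can see where the saving comes from by counting how often each character occurs and multiplying by the length of its code. The sketch below is only an illustration, with the symbols and code lengths from the table above typed in by hand:

#include <stdio.h>

int main(void)
{
    const char *msg = "Hello there";
    /* The characters of the message and their Huffman code lengths from the table above. */
    const char   symbols[]  = { 'H', 'e', 'l', 'o', ' ', 't', 'h', 'r' };
    const size_t code_len[] = {  3,   2,   3,   3,   3,   4,   3,   4  };
    size_t total = 0;

    for (size_t i = 0; i < sizeof symbols; i++) {
        size_t count = 0;
        for (const char *p = msg; *p != '\0'; p++)
            if (*p == symbols[i])
                count++;
        total += count * code_len[i];
    }
    printf("total = %zu bits\n", total);   /* 32 bits */
    return 0;
}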

Variable-length encoding does have its subtleties. It must be decoded one bit at a time, and the first match determines which character is represented. Note that the codes 000, 010, and 011 were not used in the above example. In fact they could not have been used, because they would have conflicted with the other codes when it came time to decode the stream of bits. The short code 01 assigned to 'e' would cause the decoder to read 010 or 011 as an 'e' followed by a stray bit, which would in turn garble the interpretation of every code that follows. Computing the best codes to use for a given message is a task best left to a program.
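
Here is what that bit-at-a-time, first-match decoding looks like in C. This is only a sketch, not the huff_dec.c linked below; it keeps the codes as strings and compares the bits collected so far against each one:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The Huffman code table from above, as parallel character and code arrays. */
    static const char  symbols[] = { 'H',   'e',  'l',   'o',   ' ',   't',    'h',   'r'    };
    static const char *codes[]   = { "101", "01", "001", "111", "100", "0001", "110", "0000" };
    const char *bits = "10101001001111100000111001000001";

    char buf[16];      /* bits collected since the last match; codes are at most 4 bits */
    size_t len = 0;

    for (const char *p = bits; *p != '\0'; p++) {
        buf[len++] = *p;               /* take one more bit */
        buf[len] = '\0';

        /* The first (and, for a prefix code, only) match decides the character. */
        for (size_t i = 0; i < sizeof symbols; i++) {
            if (strcmp(buf, codes[i]) == 0) {
                putchar(symbols[i]);
                len = 0;
                break;
            }
        }
    }
    putchar('\n');     /* prints "Hello there" */
    return 0;
}

Because no code is a prefix of another, the decoder never has to back up once a match is found; that is exactly why 010 and 011 could not be used alongside 01.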

Here's a final example, along with the C source code for the software used to compute the variable-length codes and to encode and decode arbitrary disk files. The message "the rain in spain falls mainly in the plain" is 43 characters long and contains 14 unique characters. It can be encoded in 151 bits with variable-length per-character codes.

huffman.h
huff_enc.c
huff_dec.c

The fact that a single code might be used to represent "ain" is a topic for another day.

Character Occurrences Code
(space) 8 000
i 6 010
n 6 011
a 5 100
l 4 0010
e 2 1011
h 2 1100
p 2 1101
s 2 1110
t 2 1111
f 1 00110
m 1 00111
r 1 10100
y 1 10101
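
To give a rough idea of what a program like the huff_enc.c linked above has to do, the following sketch (not the linked code) builds a Huffman tree from the counts in the table by repeatedly merging the two lowest-weight nodes, then walks the tree to read off a code for each character. Because of ties in the counts, the exact codes it prints may differ from those in the table, but the total for the message still comes out to 151 bits:

#include <stdio.h>
#include <stdlib.h>

/* A node is either a leaf holding a character or an internal node with two children. */
struct node {
    long         weight;
    int          symbol;                 /* -1 for internal nodes */
    struct node *left, *right;
};

static struct node *new_node(long weight, int symbol,
                             struct node *left, struct node *right)
{
    struct node *n = malloc(sizeof *n);
    n->weight = weight;
    n->symbol = symbol;
    n->left   = left;
    n->right  = right;
    return n;
}

/* The path of 0s (left) and 1s (right) from the root to a leaf is that character's code. */
static void print_codes(const struct node *n, char *path, int depth)
{
    if (n->symbol >= 0) {
        path[depth] = '\0';
        if (n->symbol == ' ')
            printf("(space) %s\n", path);
        else
            printf("%c %s\n", n->symbol, path);
        return;
    }
    path[depth] = '0'; print_codes(n->left,  path, depth + 1);
    path[depth] = '1'; print_codes(n->right, path, depth + 1);
}

int main(void)
{
    /* Character counts for "the rain in spain falls mainly in the plain". */
    const char chars[]  = { ' ', 'i', 'n', 'a', 'l', 'e', 'h', 'p', 's', 't', 'f', 'm', 'r', 'y' };
    const long counts[] = {  8,   6,   6,   5,   4,   2,   2,   2,   2,   2,   1,   1,   1,   1  };
    enum { N = sizeof counts / sizeof counts[0] };

    struct node *pool[N];
    int live = N;
    for (int i = 0; i < N; i++)
        pool[i] = new_node(counts[i], chars[i], NULL, NULL);

    /* Repeatedly merge the two lowest-weight nodes until a single tree remains. */
    while (live > 1) {
        int lo1 = -1, lo2 = -1;          /* indexes of the smallest and second-smallest */
        for (int i = 0; i < live; i++) {
            if (lo1 < 0 || pool[i]->weight < pool[lo1]->weight) { lo2 = lo1; lo1 = i; }
            else if (lo2 < 0 || pool[i]->weight < pool[lo2]->weight) { lo2 = i; }
        }
        struct node *merged = new_node(pool[lo1]->weight + pool[lo2]->weight, -1,
                                       pool[lo1], pool[lo2]);
        if (lo1 > lo2) { int t = lo1; lo1 = lo2; lo2 = t; }
        pool[lo2] = pool[--live];        /* drop the two merged nodes ... */
        pool[lo1] = merged;              /* ... and put the new node in their place */
    }

    char path[64];
    print_codes(pool[0], path, 0);
    return 0;
}

Real implementations typically use a priority queue or a sorted list rather than this linear scan to find the two smallest nodes, but the merging rule is the same.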