Huffman Coding

In much of computing, we run up against a very practical problem: the complicated devices we use to store digital information are expensive. We prefer to live in the world of the theoretical as much as possible, but eventually accounting has to have a say in how our programs work.

At some point every developer learns that the hardest thing in computer science has nothing to do with programming, and everything to do with avoiding getting killed by an angry finance department.

So, say we're trying to store a ton of text. We may at first reach for some of the encodings we've used before, such as ASCII, UTF-8, etc. These are fine, but they may eventually make accounting rather mad at us—they take up way too much space!

Let's just focus on ASCII for now. Every single time we have to store a character, we're taking up a whole 8 bits! That may not sound like a ton, but with enough text it becomes rather huge.

Wouldn't it be nice if we could make things smaller? We've already seen that if we make every character the same size, we have to use our same 8 bits just to fit all the symbols we need. So, things are going to have to be different sizes.

Thinking about it for awhile, we may realize that we use the letter e rather often, but Z is rather uncommon—why can't we "steal" some size off of one and give it to the other? Our code for Z can be a good bit longer, and as long as that lets us make e pretty short, we'll save space overall!

Letter    Code
e         0
z         110100101110101
(Maybe not quite this intense)

So, we now have two new problems:

1. If our codes are all different lengths, how do we know where one character's code ends and the next one begins?
2. How do we decide which characters get the short codes, and which get stuck with the long ones?

You may also realize pretty quickly that the second one is going to be different for different text. An article about pizza is going to have a ton of z's, so we can't universally say that z isn't used often.

How do we do that?

This is a pretty complex problem! Luckily for us, we have an elegant solution found by David A. Huffman in 1952, while he was a student at MIT.

We start by taking some block of text we want to encode, such as:

Huffman Coding

Our first step is to count characters, and see how often each one shows up. We'll then sort by frequency:

Here we see that "n" and "f" occur the most frequently (twice each), while everything else, "H" included, shows up just once.
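If you want to do this counting yourself, a few lines of Python will do it. This is just a quick sketch of the idea; characters tied on count may come out in a different order than in the visual above.

```python
from collections import Counter

text = "Huffman Coding"

# Count how many times each character appears,
# then list them from most to least frequent.
counts = Counter(text)
for char, count in counts.most_common():
    print(repr(char), count)
```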

In fact, let's look at the two at the very bottom of the sorted list: u and H, each of which appears just once in the text.

We'll squish these together and create a structure like this: we put each letter and its frequency in its own node, and then take their combined count and put it in a new node above them.

Now that we've created this little tree, we're gonna use it! Because we built it out of the smallest elements in our list, we know that the added-up frequency is as small as it can be. We'll place our whole brand-new tree into the list as a single "letter" with the added-up frequency, making sure to keep the list sorted.

We then simply repeat this process over and over again: take the bottom two things off the list, combine them under a new node with their added-up frequency, and put that node back in the list. Step through the building process below:

At the first step, we take every character in the text, find how many times each appeared, and sort the list by frequency. Then we take the two least frequent elements (#10 and #11) and combine them into a new subtree. This new subtree has a frequency weight of 2, and gets sorted back into the list. (The full build takes 12 steps.)
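If it helps to see that loop written out, here's a minimal Python sketch of the same idea, assuming we represent a tree as either a bare character (a leaf) or a (left, right) pair of smaller trees. When frequencies tie, the tree can come out shaped a little differently from the one in the visual, but any of those trees compresses the text equally well.

```python
from collections import Counter

text = "Huffman Coding"

# One (frequency, tree) entry per character, kept sorted by frequency.
# A "tree" here is either a single character or a (left, right) pair.
nodes = [(count, char) for char, count in Counter(text).items()]
nodes.sort(key=lambda entry: entry[0])

while len(nodes) > 1:
    # Take the two least frequent entries off the bottom of the list...
    freq_a, tree_a = nodes.pop(0)
    freq_b, tree_b = nodes.pop(0)

    # ...combine them under a new node with their added-up frequency...
    merged = (freq_a + freq_b, (tree_a, tree_b))

    # ...and put it back, keeping the list sorted.
    nodes.append(merged)
    nodes.sort(key=lambda entry: entry[0])

huffman_tree = nodes[0][1]
print(huffman_tree)
```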

Wait, so why did we do all that?

It still isn't super clear why we built this tree, but bear with us a little longer. First, we're gonna label all of the branches. Any time we branch left, we'll label it 0. If we branch right, we'll label it 1. So, we get:

Now, notice that if we "go down" the first 0 branch and ignore everything outside of it, we cut everything on the 1 side away. If we start by going down that left branch, we're guaranteed that we cannot reach anything on the right.

So, we keep cutting away branches like this, until finally we reach a node with no other nodes below it. At that point, we can't get to anything other than the letter contained inside that node, and so we stop.

Therefore, with a specific sequence of 0s and 1s, we can reach any letter in our tree. That sounds like an encoding! Let's write out all of these paths:

Char      Path
i         000
g         001
f         010
n         011
H         1000
u         1001
m         1010
a         1011
(space)   1100
C         1101
o         1110
d         1111
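If you'd like to see how those paths fall out of the tree, here's a short Python sketch. The nested tuples below are just one way of writing down the tree we built (reconstructed from the table above): a bare character is a leaf, and anything else is a (left, right) pair.

```python
# The tree from above: a character is a leaf, a tuple is a branch.
tree = (
    (("i", "g"), ("f", "n")),
    ((("H", "u"), ("m", "a")), ((" ", "C"), ("o", "d"))),
)

def paths(node, prefix=""):
    """Collect the 0/1 path down to every letter in the tree."""
    if isinstance(node, str):
        # We hit a letter: the bits we collected on the way are its code.
        return {node: prefix}
    left, right = node
    codes = paths(left, prefix + "0")           # branching left adds a 0
    codes.update(paths(right, prefix + "1"))    # branching right adds a 1
    return codes

for char, path in paths(tree).items():
    print(repr(char), path)
```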

If you just looked at these codes on their own, it might feel hard to prove that there isn't any overlap. In a long string of bits, how would we know for sure where one code ends and the next one begins?

But if you think back to the tree, every letter sits at the very end of a path, with nothing below it, so no letter's code can be the beginning of another letter's code. As we "take a path" one bit at a time, there's no way to get confused about where we're going: the moment we land on a letter, that code is finished, and we start over from the top of the tree for the next one.
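That "follow bits until you land on a letter" idea is all a decoder needs. Here's a small Python sketch using the paths from the table above; it works purely because no code is the start of another.

```python
# The paths from the table above.
codes = {
    "i": "000", "g": "001", "f": "010", "n": "011",
    "H": "1000", "u": "1001", "m": "1010", "a": "1011",
    " ": "1100", "C": "1101", "o": "1110", "d": "1111",
}
by_path = {path: char for char, path in codes.items()}

# Encode the text into one long string of bits...
bits = "".join(codes[char] for char in "Huffman Coding")

# ...then decode it by reading one bit at a time. Because no code is
# a prefix of another, the first time our collected bits match a path,
# that character is done and we start collecting again.
decoded, current = [], ""
for bit in bits:
    current += bit
    if current in by_path:
        decoded.append(by_path[current])
        current = ""

print("".join(decoded))  # -> Huffman Coding
```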


Earlier, we presented two problems. We've solved the first: we know where characters start and end.

Sneakily, though, we also solved the second! Because we kept our list sorted by frequency, things that didn't occur much got shoved to the very bottom of the tree. Thus, they have the longest paths!


So, how'd we do?

We've done all this, but how much has it actually helped?

The ASCII-encoded version of our string takes 112 bits:

While the Huffman-coded version takes just 50:

That's a pretty major reduction!
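You can check those numbers yourself in a couple of lines of Python, using the paths from the table above and the usual 8 bits per ASCII character:

```python
codes = {
    "i": "000", "g": "001", "f": "010", "n": "011",
    "H": "1000", "u": "1001", "m": "1010", "a": "1011",
    " ": "1100", "C": "1101", "o": "1110", "d": "1111",
}
text = "Huffman Coding"

ascii_bits = 8 * len(text)                             # 14 characters * 8 bits = 112
huffman_bits = sum(len(codes[char]) for char in text)  # adds up to 50

print(ascii_bits, huffman_bits)
```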

Note that we also have to store our encodings alongside the text somehow, mapping something more standard like ASCII to our Huffman-encoded characters. We omit those considerations here, as once you get to sufficiently large amounts of text where encoding matters, the small block of mappings becomes negligible.

So, now that we've made accounting a bit happier, we'll live to see another day! While you're here, go ahead and change the text used in this article. Put anything in below, and scroll back up to see the changes!


Here from SoME?

Thanks for checking out our project! Please consider Donating if you found this lesson interesting and want to see more like it. Please also check out the rest of our site!