Optimal Jargon


Everywhere you look, you find mini-fads and their accompanying industries. Each comes with a bunch of characteristic phrases that often give rise to a kind of popular jargon. To send a text message to someone becomes “to text them.” To find information using the Google search engine becomes “to google it.” Such shortening of phrases is not unique to our time. Movie sets shorten the phrase “we are about to start cameras so actors prepare yourselves and everyone else be quiet” to the single word “camera.” Languages borrow jargon from one another because of the economy of expression it offers. Even linguistically proud France borrows “milestone” and “IP” from English, whereas it would never think of borrowing, say, “spoon.”

As puzzlists, we can help make better jargon. To show you how, we’ll abstract the phrases into letters and find an encoding in terms of bits. The marketeers can find words to fit later. We’ll be very concrete to avoid having to estimate probabilities.

Suppose that every message communicated has 60 characters and each character is one of A, B, C, or D. If you have to encode these in bits and you know nothing else, you might encode them as follows:

 A–00 B–01 C–10 D–11

Because there are 60 characters and each character is encoded in two bits, the whole message requires 120 bits (15 bytes). Although this is already far more succinct than, say, the one-byte-per-character encoding of ASCII, more information might improve things still further.
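
As a quick sanity check, here is a minimal sketch of the fixed-length scheme (a Python illustration of ours; the book itself gives no code, and the names FIXED and encode are our own):

# Two bits per character, so any 60-character message costs 120 bits.
FIXED = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode(message):
    """Concatenate the two-bit codeword for each character."""
    return "".join(FIXED[ch] for ch in message)

message = "A" * 30 + "B" * 15 + "C" * 10 + "D" * 5  # one possible message
print(len(encode(message)))  # 120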

Warm-Up

Suppose that we knew that every message of 60 characters consisted of exactly 30 As, 15 Bs, 10 Cs, and 5 Ds. What would be a good encoding in bits of the characters then?

Solution to Warm-Up

Intuition tells us that a good encoding should render A in the fewest bits, then B, then C, and then D. It is that intuition that guided Samuel Morse in his invention of the code that bears his name. But Information Theory could conceivably help us do better.

As originally conceived by Claude Shannon, Information Theory was closely inspired by the statistical mechanics interpretation of thermodynamics. Shannon defined a notion of “entropy” as a formula involving the probability of each character (e.g., A has probability 30/60 = 1/2, whereas D has probability 5/60 = 1/12) and the logs of those probabilities: the entropy is the sum, over all characters, of p log(1/p), where p is the character’s probability. It gives a lower bound on the average number of bits per character that any encoding can achieve. The entropy formula in this case yields (1/2)log(2) + (1/4)log(4) + (1/6)log(6) + (1/12)log(12), where all the logs are to base 2. This comes out to about 1.73 bits per character. Of course, a designer is free to choose any encoding he or she wishes, and we choose one that uses a whole number of bits for each character, as follows:

 A–1 B–01 C–000 D–001

In our encoding, no codeword is a prefix of any other, so a message can be decoded unambiguously. Sending A 30 times costs 30 bits, sending B 15 times costs 30 bits as well, and so on. The total length is 30 + (15 × 2) + (10 × 3) + (5 × 3) = 105 bits, which comes out to 105/60 = 1.75 bits per character. So the intuitive design is quite good: only about 0.02 bits per character above the entropy bound.
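
To check these numbers, here is a small sketch (ours; the book presents no code) that computes both the entropy and the cost of the intuitive prefix code:

from math import log2

# Character counts from the warm-up: 30 As, 15 Bs, 10 Cs, and 5 Ds.
counts = {"A": 30, "B": 15, "C": 10, "D": 5}
total = sum(counts.values())  # 60 characters per message

# Shannon entropy: the sum of p * log2(1/p) over all characters.
entropy = sum((n / total) * log2(total / n) for n in counts.values())
print(round(entropy, 2))  # about 1.73 bits per character

# Cost of the intuitive prefix code A-1, B-01, C-000, D-001.
code = {"A": "1", "B": "01", "C": "000", "D": "001"}
bits = sum(counts[ch] * len(word) for ch, word in code.items())
print(bits, bits / total)  # 105 bits, or 1.75 bits per character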

So far, we’ve considered only single-character frequencies. We may in fact know more. For example, in English, “q” is rare, but when “q” does occur, “u” almost always follows it. So we might consider encoding “qu” as a single character. Could this help us?

  1. Suppose that in addition to the frequencies given in the warm-up, you know that in every message B is always followed by A and D is always followed by C. Could you arrive at a shorter encoding of the message?
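
One way to start exploring this puzzle is to re-read the message as a stream of super-symbols. The sketch below (again our own illustration and names, not the book’s) greedily fuses the guaranteed digrams BA and DC into single tokens before any codewords are chosen:

from collections import Counter

def tokenize(message):
    """Fuse the digrams BA and DC, which the givens guarantee,
    into single tokens; leave every other character alone."""
    tokens, i = [], 0
    while i < len(message):
        if message[i:i + 2] in ("BA", "DC"):
            tokens.append(message[i:i + 2])
            i += 2
        else:
            tokens.append(message[i])
            i += 1
    return tokens

msg = "BA" * 15 + "DC" * 5 + "A" * 15 + "C" * 5  # one message matching the givens
print(Counter(tokenize(msg)))  # 15 BA, 15 A, 5 DC, 5 C: 40 tokens instead of 60 characters

Fewer tokens to transmit means the frequency table changes, and with it the best choice of codewords.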

To be really useful, jargon must reduce long phrases to a word or two. So, let’s see if we can simulate that case.

  2. Suppose that in addition to the frequencies given in the warm-up, you know that in every message, D is always followed by C, C is always followed by B, and B is always followed by A. Could you arrive at an even shorter encoding of a 60-character message? Which phrases should be rendered as jargon?

English is famous for its exceptions to almost all rules (e.g., the country “Iraq” has a “q” that is not followed by a “u”). Such exceptions could force an optimal encoding to include a codeword for “q” as well as one for “qu.” But if there were many exceptions, then the extra codewords might not be worth it. Let’s see how this plays out in our mini-language.

  3. Suppose that in addition to the frequencies given in the warm-up, you know that in every message, D is always followed by C, C is always followed by B, and B is followed by A 13 out of 15 times. What is the shortest encoding you could find of a 60-character message?
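
To experiment with these puzzles mechanically, the following sketch (ours; the book does not prescribe one) builds an optimal prefix code for any table of token counts. It uses Huffman’s classic algorithm, which the text never names but which yields the shortest prefix code for given frequencies. As a usage example, it is fed the 40 token counts produced by the earlier tokenization sketch:

from heapq import heapify, heappop, heappush
from itertools import count

def huffman(freqs):
    """Return an optimal prefix code (token -> bit string) by
    repeatedly merging the two least frequent subtrees."""
    tie = count()  # tie-breaker so the heap never compares the code dicts
    heap = [(n, next(tie), {tok: ""}) for tok, n in freqs.items()]
    heapify(heap)
    while len(heap) > 1:
        n1, _, left = heappop(heap)
        n2, _, right = heappop(heap)
        merged = {t: "0" + w for t, w in left.items()}
        merged.update({t: "1" + w for t, w in right.items()})
        heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

# Token counts for puzzle 1: 15 A, 15 BA, 5 C, 5 DC (40 tokens in all).
freqs = {"A": 15, "BA": 15, "C": 5, "DC": 5}
code = huffman(freqs)
print(code, sum(freqs[t] * len(w) for t, w in code.items()))
# 75 bits for this tokenization, noticeably below the 105 of the warm-up

The same function can be rerun with any tokenization you care to try for puzzles 2 and 3.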



