
Basics of Error-Correcting Codes and Error-Correcting Encoding

Personal computers, bits, and bytes have so inundated our everyday life that most people have ceased to think about information-encoding theory altogether, regarding it as something self-evident. But things are not as simple as they might seem at first glance.

In fact, encoding involves nothing more than the conversion of a message into a sequence of code symbols, also called codewords. Any discrete message contains a finite number of elements: a text is made up of letters, an image of pixels, a machine program of commands, and so on. Together, these building blocks form the message source alphabet. In the course of encoding, message elements are converted into corresponding numbers, the code symbols, where each message element is assigned a unique sequence of code symbols, called a code combination. The set of code combinations that makes up the message is the code. The set of possible code symbols is called the code alphabet, while their total number (later on designated by a lowercase m) is the code base.

Most likely, however, you are aware of these facts already (and even if you aren't, you can find a comprehensive explanation of encoding basics in any textbook on computer science). But are you familiar with the term Hamming distance? The minimum Hamming distance between any two valid codewords plays a fundamental role in the theory of error-correcting encoding. A formal definition is as follows: the Hamming distance is a measure of the difference between two messages, each consisting of a finite string of characters, expressed as the number of characters that need to be changed to obtain one message from the other. For instance, let us consider the following four-bit code:

Listing 2.1: An example of the simplest four-bit code with Hamming distance equal to one. Such a code is widely used in computing, despite its inability to detect errors
 0     0000;    4     0100;    8     1000;   12     1100;
 1     0001;    5     0101;    9     1001;   13     1101;
 2     0010;    6     0110;   10     1010;   14     1110;
 3     0011;    7     0111;   11     1011;   15     1111;

This is a normal binary code such as might be encountered in some microcontrollers; it packs 16 symbols into 4 bits (meaning that, using this code, it is possible to encode 16 alphabetical characters). As can easily be checked, any two symbols differ in at least one bit, and some differ in exactly one. Consequently, the Hamming distance of such a code is equal to 1 (i.e., d = 1).

Here is another four-bit code:

Listing 2.2: An example of a four-bit code with Hamming distance equal to 2. This code is already capable of detecting single errors
 0     0000;    4     1001;
 1     0011;    5     1010;
 2     0101;    6     1100;
 3     0110;    7     1111;

This time, any two arbitrarily taken symbols differ in at least two positions. Because of this, the information capacity of the code has been reduced from 16 to 8 symbols. "But wait," some of you might cry out in surprise, "what gibberish is this? Where are combinations such as 0001 or 0010, for instance?" This isn't gibberish, though: the code simply doesn't contain these combinations. To be more precise, they are present, but are declared code violations, also known as noncode symbols or prohibited symbols. Thanks to this circumstance, our code is capable of detecting any single error. For instance, let us take the symbol 1010 and invert any one arbitrary bit, say the second bit from the left. The corrupted symbol will then appear as follows: 1110. Since the combination 1110 is not a valid codeword, the decoder will report the occurrence of an error. It can, alas, only detect the error; it is unable to fix it. In order to correct even a single erroneous bit, it is necessary to increase the Hamming distance to at least 3. Since a four-bit code with d = 3 can comprise only two different symbols, it is not very illustrative; therefore, it is better to choose a wider code. Let's try a 10-bit code with d = 5.
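Before moving on to that wider code, it is worth seeing the single-error detection just demonstrated in running form. Below is a minimal sketch in C; the codebook table and the helper names (dist4, is_valid) are my own illustration, not code from the book:

#include <stdio.h>

/* the eight valid codewords of Listing 2.2 (d = 2) */
static const unsigned char codebook[8] = {
    0x0 /* 0000 */, 0x3 /* 0011 */, 0x5 /* 0101 */, 0x6 /* 0110 */,
    0x9 /* 1001 */, 0xA /* 1010 */, 0xC /* 1100 */, 0xF /* 1111 */
};

/* Hamming distance between two 4-bit words: XOR marks the differing
   positions, then the set bits are counted */
static int dist4(unsigned char a, unsigned char b)
{
    unsigned char x = a ^ b;
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* a word is valid only if it appears in the codebook */
static int is_valid(unsigned char w)
{
    int i;
    for (i = 0; i < 8; i++)
        if (codebook[i] == w)
            return 1;
    return 0;
}

int main(void)
{
    unsigned char sent = 0xA;            /* 1010, a valid codeword       */
    unsigned char received = sent ^ 0x4; /* invert the second bit: 1110  */

    printf("1110 is %s\n", is_valid(received) ? "valid" : "a code violation");
    printf("distance between 1010 and 1110: %d\n", dist4(sent, received));
    return 0;
}

The program reports 1110 as a code violation lying at distance 1 from the transmitted word, which is exactly why the decoder can announce the error but has no way to repair it.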

Listing 2.3: An example of 10-bit code with Hamming distance equal to 5. This code can detect 4-bit errors and correct 2-bit errors
 0000000000   0000011111   1111100000   1111111111 

For example, let us take the symbol 0000011111 and invert any two bits, obtaining, say, 1000111111. Since this combination is not a valid codeword, the decoder detects that an error has occurred. Moreover, if the number of erroneous bits is less than half the Hamming distance, the decoder is guaranteed to restore the initial symbol. Indeed, if any two valid symbols differ in no fewer than 5 positions, then corrupting k bits of a symbol produces a combination whose Hamming distance from the original symbol equals the number of inverted bits k (two, in our case), while its distance to the nearest other valid symbol is at least d - k (3, in our case). In other words, as long as d - k > k, the decoder is guaranteed to restore the original symbol. In cases when d > k >= d - k, successful restoration is not guaranteed; however, under favorable conditions it is possible in principle.

Returning to our symbol 0000011111, let us corrupt 4 bits, turning it, say, into 1100000111, and then try to restore the original symbol. Let us represent the recovery process graphically:

Listing 2.4: An attempt at correcting a 4-bit error
 1100000111     1100000111     1100000111     1100000111
 0000000000     0000011111     1111100000     1111111111
 ----------     ----------     ----------     ----------
 5 differences  4 differences  6 differences  5 differences

Roughly speaking, after detecting an error, the decoder sequentially compares the corrupted symbol with all of the valid symbols of the alphabet, looking for the one that has the smallest number of differences from the corrupted one; to be more precise, for a symbol that differs from the corrupted one in no more than (d - 1) bits. As can easily be seen, in this case we were lucky, and the restored symbol actually matched the original one. However, if the four corrupted bits had been distributed as follows: 1111011111, the decoder would have taken it for 1111111111, and the restoration would have been incorrect.
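This brute-force search takes only a few lines of C. The sketch below is my own illustration (the names dist10 and decode are not from the book); it decodes both corrupted words discussed above against the alphabet of Listing 2.3:

#include <stdio.h>

/* the four valid 10-bit codewords of Listing 2.3 (d = 5) */
static const unsigned short codebook[4] = {
    0x000 /* 0000000000 */, 0x01F /* 0000011111 */,
    0x3E0 /* 1111100000 */, 0x3FF /* 1111111111 */
};

/* Hamming distance between two 10-bit words */
static int dist10(unsigned short a, unsigned short b)
{
    unsigned short x = a ^ b;
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* return the valid codeword nearest to the received word */
static unsigned short decode(unsigned short received)
{
    int i, best = 0;
    for (i = 1; i < 4; i++)
        if (dist10(received, codebook[i]) < dist10(received, codebook[best]))
            best = i;
    return codebook[best];
}

int main(void)
{
    /* 1100000111 (0x307): restored correctly to 0000011111 (0x01F) */
    printf("%03X\n", (unsigned)decode(0x307));
    /* 1111011111 (0x3DF): taken for 1111111111 (0x3FF), restoration fails */
    printf("%03X\n", (unsigned)decode(0x3DF));
    return 0;
}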

Thus, the correcting capability of a specific code is defined according to the following rule: to detect r errors, the Hamming distance must be greater than r, while to correct r errors, the Hamming distance must exceed twice the number of errors r by at least one.

Listing 2.5: The correcting capabilities of a simple Hamming code
 Error detection capability:   d > r
 Error correction capability:  d > 2r
 Information capacity:         2^n / d
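
Plugged into numbers, these formulas say that a code of distance d detects up to d - 1 errors and corrects up to (d - 1)/2. A quick loop (my sketch, not the book's) tabulates this:

#include <stdio.h>

int main(void)
{
    int d;
    /* r_detect = d - 1 and r_correct = (d - 1) / 2 follow
       directly from the conditions d > r and d > 2r */
    for (d = 1; d <= 5; d++)
        printf("d = %d: detects up to %d errors, corrects up to %d\n",
               d, d - 1, (d - 1) / 2);
    return 0;
}

For d = 1, 2, and 5 this yields exactly the capabilities claimed for Listings 2.1, 2.2, and 2.3.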

Theoretically, the number of detectable errors is unlimited; in practice, however, the information capacity of codewords dwindles rapidly as d grows. Suppose that we have 24 bytes of data and would like to correct up to two errors per block. In this case, we would have to add 49 more bytes to the block, as a result of which its information capacity would drop to roughly 33 percent! Is this a positive development? This deplorable result can be explained by the fact that the bits of the codeword are isolated from one another, and changing one of them has no effect on the others. Is there any way out of this situation? There is!

Let all bits whose position numbers are powers of two play the role of check bits, while the other bits remain normal bits carrying the information of the message. Each check bit must be responsible for the parity [i] of a specific group of bits controlled by it, and the same information bit can belong to several different groups. Since one information bit then contributes to several check bits at once, a handful of check bits can protect many information bits, and the information capacity of the codeword grows considerably (even tremendously). After that, it only remains to choose the optimal distribution of the spheres of influence.

According to the method of error-protected encoding suggested by Hamming, in order to determine which check bits control the information bit in position k, it is necessary to decompose k into a sum of powers of two (a short code sketch after the table illustrates the rule).

Table 2.1: Dividing bits into check bits and data bits

 Position   Decomposition                     Controlled by bits
 --------   -------------------------------   ----------------------------------------------
 1 (A)      2^0 = 1                           This is a check bit; no other bit controls it.
 2 (B)      2^1 = 2                           This is a check bit; no other bit controls it.
 3          2^0 + 2^1 = 1 + 2 = 3             Controlled by check bits 1 and 2.
 4 (C)      2^2 = 4                           This is a check bit; no other bit controls it.
 5          2^0 + 2^2 = 1 + 4 = 5             Controlled by check bits 1 and 4.
 6          2^1 + 2^2 = 2 + 4 = 6             Controlled by check bits 2 and 4.
 7          2^0 + 2^1 + 2^2 = 1 + 2 + 4 = 7   Controlled by check bits 1, 2, and 4.
 8 (D)      2^3 = 8                           This is a check bit; no other bit controls it.
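In binary terms, the rule of the table is nothing but a bit mask: check bit number 2^i controls every position whose number has bit i set, and the positions that are exact powers of two are the check bits themselves. The throwaway sketch below (mine, not the book's) reproduces the table programmatically:

#include <stdio.h>

int main(void)
{
    int pos, i;
    for (pos = 1; pos <= 8; pos++) {
        /* a position that is an exact power of two is a check bit */
        if ((pos & (pos - 1)) == 0) {
            printf("%d: check bit\n", pos);
            continue;
        }
        /* otherwise, every set bit of pos names a controlling check bit */
        printf("%d: controlled by check bits", pos);
        for (i = 1; i <= pos; i <<= 1)
            if (pos & i)
                printf(" %d", i);
        printf("\n");
    }
    return 0;
}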

Let's try to get a live taste of Hamming codes and manually calculate the check bits of the four-bit symbol 0101. After reserving places for the check bits A, B, and C in positions 1, 2, and 4, our symbol will appear as follows: AB0C101. Now, it simply remains to calculate the values of bits A, B, and C:

  • Bit A controlling bits 3, 5, and 7 is equal to zero, because their sum (0 + 1 + 1) is even.

  • Bit B controlling bits 3, 6, and 7 is equal to one, because their sum (0 + 0 + 1) is odd.

  • Bit C controlling bits 5, 6, and 7 is equal to zero, because their sum (1 + 0 + 1) is even.

Thus, the newly created codeword will appear as follows: 0100101, where the check bits occupy positions 1, 2, and 4.

Listing 2.6: The codeword with check bits
 A B 0 C 1 0 1
 0 1 0 0 1 0 1
 1 2 3 4 5 6 7

Now let us suppose that, in the course of transmission, one bit of our codeword was inverted, so that it reads 0100111. Can we detect such an error? Let's try. Bit A must be equal to (0 + 1 + 1) % 2 = 0, which is true. Bit B must be equal to (0 + 1 + 1) % 2 = 0, but in our codeword it is equal to one. Let us note the position of this incorrect check bit and continue. Bit C must be equal to (1 + 1 + 1) % 2 = 1; however, it is equal to zero. Aha! The check bits in positions 2 (bit B) and 4 (bit C) do not match reality. Their sum (2 + 4 = 6) gives the position of the erroneous bit; and indeed, in this case the mangled bit is bit 6. Inverting it returns our codeword to its original state.
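The entire worked example condenses into a short C sketch. The encode and syndrome helpers below are my own illustration (bit positions are numbered from 1, as in Table 2.1), not the author's code:

#include <stdio.h>

/* codeword bits live in c[1..7]; positions 1, 2, 4 are check bits A, B, C */
static void encode(int c[8])
{
    c[1] = c[3] ^ c[5] ^ c[7];   /* A covers positions 3, 5, 7 */
    c[2] = c[3] ^ c[6] ^ c[7];   /* B covers positions 3, 6, 7 */
    c[4] = c[5] ^ c[6] ^ c[7];   /* C covers positions 5, 6, 7 */
}

/* returns the position of a single corrupted bit, or 0 if all checks hold */
static int syndrome(const int c[8])
{
    int s = 0;
    if (c[1] != (c[3] ^ c[5] ^ c[7])) s += 1;
    if (c[2] != (c[3] ^ c[6] ^ c[7])) s += 2;
    if (c[4] != (c[5] ^ c[6] ^ c[7])) s += 4;
    return s;
}

int main(void)
{
    /* data bits 0101 placed into positions 3, 5, 6, 7 */
    int c[8] = { 0, 0, 0, 0, 0, 1, 0, 1 };
    int i, err;

    encode(c);            /* yields 0100101, as computed by hand above */
    c[6] ^= 1;            /* transmission corrupts bit 6: 0100111      */

    err = syndrome(c);    /* failing checks B and C: 2 + 4 = 6         */
    if (err) c[err] ^= 1; /* invert the offending bit back             */

    for (i = 1; i <= 7; i++) printf("%d", c[i]);
    printf("   (error was at position %d)\n", err);
    return 0;
}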

What will happen if a check bit, rather than an information bit, is corrupted? A quick test will show that such corruption is detected just as successfully, and the bit can be restored using the above-described technique (although, does this make any sense? After all, check bits are discarded in the course of decoding anyway).

At first glance, it might seem that Hamming codes are horribly inefficient, since for every 4 data bits there are 3 check bits. However, since check bits occupy only positions that are powers of two, they become more and more sparse as the codeword length grows. Thus, check bit D, the nearest to bit C, is located in position 8 (i.e., three steps away). Check bit E, however, is separated from bit D by 2^4 - 2^3 - 1 = 7 steps, while check bit F lies 2^5 - 2^4 - 1 = 15 steps beyond bit E.

Thus, the efficiency of Hamming codes rapidly increases with the growth of the length of the processed block. The program provided below clearly illustrates this:

Listing 2.7: Calculation of the effective information capacity of Hamming codes for codewords of different length
#include <stdio.h>

#define MAX_POW 20     /* number of block-size doublings to tabulate */

int main(void)
{
    int a;
    int _pow = 1;      /* current block size, a power of two          */
    int old_pow = 1;   /* previous block size                         */
    int N, old_N = 1;  /* data bits in the current/previous block     */

    printf("* * * Hamming code efficiency test * * * by Kris Kaspersky\n"
           " BLOCK_SIZE    FUEL UP   EFFICIENCY\n"
           " --------------------------------------------------\n");

    for (a = 0; a < MAX_POW; a++)
    {
        /* each doubling of the block adds exactly one new check bit,
           located at the position equal to the new power of two      */
        N = _pow - old_pow - 1 + old_N;
        printf("%8d   %8d   %8.1f%%\n", _pow, N, (float) N / _pow * 100);

        /* NEXT */
        old_pow = _pow; _pow = _pow * 2; old_N = N;
    }
    printf(" --------------------------------------------------\n");
    return 0;
}
Listing 2.8: The output of the program from Listing 2.7: the effective information capacity of Hamming codes for codewords of different length
 BLOCK_SIZE   FUEL UP     EFFICIENCY
 -----------------------------------
       1          0           0.0%
       2          0           0.0%
       4          1          25.0%
       8          4          50.0%
      16         11          68.8%
      32         26          81.3%
      64         57          89.1%
     128        120          93.8%
     256        247          96.5%
     512        502          98.0%
    1024       1013          98.9%
    2048       2036          99.4%
    4096       4083          99.7%
    8192       8178          99.8%
   16384      16369          99.9%
   32768      32752         100.0%
   65536      65519         100.0%
  131072     131054         100.0%
  262144     262125         100.0%
  524288     524268         100.0%
 -----------------------------------

From the listing provided above, it follows that for blocks of 1,024 bits or longer, the overhead of the check bits can safely be neglected.

Unfortunately, Hamming codes can correct only single errors, which means that they tolerate the corruption of no more than one bit per processed block. Naturally, the probability of multiple errors increases with block size, so choosing the optimal codeword length is a non-trivial task that requires, at the very least, knowledge of the types of errors and the frequency at which they occur in the communication channel. For tape drives, CDs, hard disks, and other similar devices in particular, Hamming codes prove to be extremely inefficient. Why, then, do we consider them? For the simple reason that advanced encoding systems, including Reed-Solomon codes, are based on rather complicated mathematics and are nearly impossible to understand if you attack them from scratch; after working through simpler methods such as Hamming codes, however, they become approachable.

[i] That is, if the sum of the bits being checked is even, the parity bit is equal to zero; if it is odd, the parity bit is set to 1.


