Hamming Distance and Error Correcting Codes

You may already be aware that one transmits information, whether it be over the internet, or from a DVD to a DVD Player, -- the information is first translated to a sequence of $1$'s and $0$'s, and then transmitted using some sort of physical system (e.g., ethernet cable, laser light, wi-fi signal, etc).

Now, there are various ways we can associate the various "chunks" of information with their related binary sequences. For example, if the "chunks" of information represented letters of the alphabet, one might wish to follow the ASCII standard, where one starts with 'A' = 65, 'B' = 66, 'C' = 67, ... and then each number is turned into its binary equivalent. (i.e., 'A' = 01000001, 'B' = 01000010, 'C' = 01000011, ...)

Of course, how many digits we need will ultimately depend upon the number of letters (or symbols, "chunks", etc.) that we need to encode.

Unfortunately, sometimes the transmission experiences some form of interference. This interference might be due to scratches on the DVD, or electrical equipment in near proximity to a router or cable, etc. and has the effect of altering the sequences (most often binary sequences) in one or more places.

To combat this interference, we often have to build in some redundancy in our transmission. This is so that either errors can be detected and a request for a retransmission can be made, or so that errors can be not only detected, but corrected. These redundancies come at a price, however. They necessarily increase even further the number of digits needed to transmit each letter or "chunk".

To attach some verbiage to all of this, we define a block code of length $n$ containing $M$ codewords over an alphabet $A$ as the set of $n$-tuples where each $n$-tuple takes its components from $A$. For brevity, we refer to such a block code as an $[n,M]$-code over $A$.

So, for example $C = \{a, b, c, d\}$ whose elements are shown below is a $[5,4]$-code over $\{0, 1\}$:

$a = (00000)$
$b = (10110)$
$c = (01011)$
$d = (11101)$

We can think of each codeword in a block code as a "point", and define a "distance" between these points by simply counting the number of coordinate positions in which they differ. This number is called the Hamming distance $d(x,y)$ between two codewords $x$ and $y$, and can easily be shown to be a metric.

As an example of calculating the Hamming distance between codewords, note that (using the block code $C$ above) $d(a,b) = d(c,d) = 3$, while $d(b,c) = 4$.

Having a metric between codewords allows us to consider which codewords are "near" others, and which are "farther away".

Specifically, we can consider the set of all codewords at distance less than or equal to some given length from some given codeword. In doing so, we are essentially considering a "circle" (and its interior), where the given codeword is the center of that circle and the given length is the length of the circle's radius.

If a codeword experiences an error in transmission, where some small number of its digits have been changed, the resulting sequence should still be "near" its intended sequence. As such, it should fall within a circle of some small radius about some codeword. If there is only one such codeword, then the error-containing sequence can be "corrected" by assuming the center codeword's sequence was actually intended.

To maximize the number of corrections that can be made, one must determine the largest radius for circles centered at codewords that will produce no circles that overlap.

To this end, we make one more definition:

Let $C$ bye an $[n,M]$-code. The Hamming distance $d$ of code $C$ is given by

$$d = \textrm{min } \{ d(x,y): x, y \in C, x \ne y \}$$

As an example, again using the code $C$ given above, we see that since

$$\begin{array}{rcl}
d(a,b) &=& 3\\
d(a,c) &=& 3\\
d(a,d) &=& 4\\
d(b,c) &=& 4\\
d(b,d) &=& 3\\
d(c,d) &=& 3\\
\end{array}$$

its related Hamming distance must be the minimum distance seen above, $d=3$.

Once we have computed the Hamming distance, $d$, for some given $[n,M]$-code $C$, we can be assured that $C$ will be able to correct up to $\left\lfloor \frac{d-1}{2} \right\rfloor$ errors per "chunk" transmitted.

◆ ◆ ◆