A deeper dive into classical error correction#

Linear classical codes#

The previous notebook showed an example method, called coding, to protect information. Now we formalize these notions. In the following, the words in bold are the ones being defined.

The definition of a code#

We define the abstract notion of a code using the notions of linear algebra. The binary field is just the numbers $\set{0,1}$, along with their addition and multiplication.

+ 0 1
0 0 1
1 1 0
$\times$ 0 1
0 0 0
1 0 1

The binary field is labelled $\mathbb{F}_2$. A $m$-dimensional vector space over the binary field is labelled as $\mathbb{F}^m_2$.

The following demonstrates how to work with binary fields in python.

import numpy as np

# declare your vectors and matrices as numpy.array
# with a data-type of int, rather than the default
# type of float
M = np.array([[1, 0], [1,1]], dtype=int)
v = np.array([1, 1], dtype=int)

# if you add things as
print('M + M =\n', M+M, 'incorrect')
# you will get the wrong answer. '2' is not part of our number system.

# Instead mod everything 2. This is done with '% 2' in python
print('M + M =\n', (M+M) % 2, 'correct')
# The parenthesis in (M+M) % 2 are important

# Similarly,
print('Mv = ', M@v, 'incorrect')
print('Mv = ', M@v %2, 'correct')
M + M =
 [[2 0]
 [2 2]] incorrect
M + M =
 [[0 0]
 [0 0]] correct
Mv =  [1 2] incorrect
Mv =  [1 0] correct

As discussed above, the information to be sent exists in the form a some long data stream. However, linear codes usually break up the stream into blocks of fixed size, and treats each block independently. In the repetition code example, the blocks were of size 1. Therefore, for our analysis, we only need to consider the blocks. We establish the following terminology.

The information to be protected is a bit string of length $k$, and each such string is called a message.

The message space is the vector space with all $2^k$ possible messages. This vector space is clearly defined over $\mathbb{F}_2$. Hence, it is the space $\mathbb{F}^k_2$.

Question: What is the dimension of the message space?

As in the repetition code, for linear codes each message is transformed into a longer bit-string before being transmitted. This longer bit-string is called a codeword, and the tranformation from message to codeword is called encoding the message.

A message is transformed or encoded to a codeword, which is a bit string of length $n$.

The codewords are vectors in a larger space, defined as follows.

The codespace is a $n$-dimensional vector space over $\mathbb{F}_2$, labelled $\mathbb{F}^n_2$. We require that $n \ge k$.

Finally, we can abstractly define the notion of a code.

A binary linear code $\vecs{C}$ of dimension $k$ and length $n$ is a $k$-dimensional subspace of $\mathbb{F}^n_2$. Compactly, we can write $\vecs{C} \subset \mathbb{F}^n_2$. Such a code is labeled as a $[n,k]$ code.

On it's own, the above definition allows us to identify if a particular construction is a valid code.

Question: Check that the repetition code described above is indeed a code. Identify the message space, the codewords, and the codespace.

Other codes that you will encounter can be defined as above. However, we will need more structure in order to understand and make use of codes.

Encoding process#

The definition of a binary linear codes is abstract. To encode messages we need a more concrete definition that gives well-defined calculational procedures. This is done by given an explicit definition of the subspace at the heart of the linear code definition.

For any code, we can identify a basis set of the subspace. For instance, for the repetition code, the basis set is $\set{000,111}$. In general, there will be $k$ such vectors because the subspace is $k$-dimensional, and each vector will be of length $n$. We can collect these $k$ vectors into the rows of a matrix, called the generator matrix.

The generator matrix, $G$, is a $k \times n$ matrix whose rows are a basis of the code.

Any message $m$ can be encoded into a codeword $c$ using the generator matrix. The codewords of the code are all in the row space of its generator matrix $G$, i.e. the set of codewords is \begin{equation} \set{c = mG : m \in F^k_2}, \end{equation} where $m$ is the message in the form of a row vector. Note that we are doing matrix-vector multiplication in the opposite way that we usually do.

Example: For the repetition code, a generator matrix is \begin{equation} G = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}. \end{equation} Using this, we find the codewords to be \begin{align} \begin{pmatrix} 0 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \end{pmatrix}, \\ \begin{pmatrix} 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}, \end{align} which is what we had before. In this way, we have an explicit encoding process for any code.

A code is specified entirely using a generator matrix.

Because the basis for the code $C$ is not unique, there can be many possible generator matrices for the same code $C$, and as we will see, some of matrices will have more calculational utility than others. We can convert between different codes by doing row operations on the generator matrix, which will not change the row space. Among the forms of the generator matrices, there is a standard form, which is when $G = [I_k| A]$, where $I_k$ is the $k\times k$ identity matrix and $A$ is a $k \times (n-k)$ matrix. The standard form can be obtained via row operation from any other form.

Hamming codes#

One of the most interesting codes is the Hamming codes. This is actually a family of codes, but we will discuss the simplest among them. In this code, messages of length $3$ are encoded into codewords of length $7$. The convention is to call this the $[7,4]$ Hamming code.

Question: Identify the message space and the codespace of the Hamming code.

To understand the Hamming code, we need to first define the notion of parity.

The parity of a bit string is the number of ones in it. If there are even number of ones, the parity of the string is said to be even, otherwise it is odd.

Note that $4C3 = 3$. In the $[7,4]$ Hamming code, for each combination of three of the four data bits, we calculate the parity. This is visually depicted in the diagram below where $b1-b4$ are the data bits and $p1-p3$ are parity bits.

Hamming code

We now construct the generator matrix $G$ for the Hamming code, using the description above. The first four bits have to be copied as is, so the first four columns of $G$ must be the identity matrix. The next three columns must be the various ways of summing up the data bits. Hence, we get \begin{equation} G = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ \end{pmatrix}. \end{equation} This is clearly in the standard form.

# a function to create random messages
from random import randint
def random_message(k):
    return np.array([randint(0,1) for _ in range(k)], dtype=int)
    
random_message(4)

Task 1#

Complete the function hamming_encode that when passed a bit string, encodes it according to the Hamming code.

  • Parameters:
    • message a numpy.array, of shape (1,4) and dtype=int guaranteed to only contain 0 or 1.
  • Returns:
    • A numpy.array of shape (1,7) and dtype=int that is the the encoded version of the message
G = np.array([
    [ 1 , 0 , 0 , 0 , 1 , 1 , 0 ],
    [ 0 , 1 , 0 , 0 , 1 , 1 , 1 ],
    [ 0 , 0 , 1 , 0 , 1 , 0 , 1 ],
    [ 0 , 0 , 0 , 1 , 0 , 1 , 1 ]], dtype=int)

def hamming_encode(message):
    pass


m = random_message(4)
c = hamming_encode(m)

print('m =', m)
print('c =', c)

Identifying and correcting errors#

The previous section provided us with a way of encoding any message $m$ into a codeword $c$. Now, we discuss how to identify any errors that occur when the codeword is sent through the noisy channel.

Suppose that Bob receives the block $\tilde{c}$. This block could or could not be corrupted in some way. Every code has a certain tolerance for how many errors it can correctly identify and subsequently correct. The reason for this is that all Bob can do is try to determine the closest codeword $c$ to the received $\tilde{c}$. This is the best guess under the assumption that the each bit has an independent probability of being flipped. If there are too many errors, then the closest codeword to $\tilde{c}$ is no longer the one sent by Alice, and error correction fails.

The notion of closeness of bit strings can be formalized via the Hamming distance (not to be confused with the Hamming code; Hamming was just a very prolific scientist) .

The Hamming distance between two bit strings $s,t$ is the number of places where they have different symbols. This will be denoted $d_H(s,t)$.

The Hamming distance determines the number of substitutions or errors required to transform $s$ into $t$ or vice versa. Hence, it is a very natural definition for our purposes. The Hamming distance can be calculated as \begin{equation} d_H(s,t) = \sum_i (s_i \oplus t_i)\mod 2 \end{equation}

The brute force method of checking if an error and which error has occured is calculating the Hamming distance between $\tilde{c}$ and every possible codeword $\set{c^i}$. However, there are $2^k$ codewords and we would like a more efficient process.

This is done by exploiting the fact that the code has some structure in built. For instance, in the Hamming code we have recorded the parity of various subsets of the message bits. Bob should be checking if in $\tilde{c}$ the parity bits are consistent with the information bits, and if not what is the smallest change that resolves consistency. Generically, we need a set of constraints that all codewords obey.

These contraints naturally emerge when we recall that the code is a subspace of a vector space, and therefore it is the kernel of some linear transformation. This transformation, which we will call $H$, maps the codewords to zero.

The matrix $H \in \mathbb{F}_2^{(n-k) \times n}$ is a parity-check matrix for code $C$ if $C = \set{x \in \mathbb{F}_2^n: Hx^T = 0}$.

As there are $n-k$ rows to this matrix, the matrix places $n-k$ constraints on the elements of the codespace, leaving behind a $n - (n-k) = k$-dimensional space.

How does one obtain the parity-check matrix from the generator matrix? As the rows of the generator matrix form the basis of the code, we can use them to place sufficient constraints on the parity-check matrix.

Lemma: For a code $C$ with generator matrix $G$ and parity-check matrix $H$, we have the constraint $HG^T = 0$.

Proof: Recall that any codeword $c = mG$, so $c^T = G^Tm$. Now, by the definition of $H$, we have $Hc^T= 0$, or $HG^Tm = 0$. But this last statement is true for all $m$, which is only possible if $HG^T=0$.

Using this theorem, we give a procedure when $G$ is in standard form, using the following lemma.

Lemma: If $G = [I_k|A]$, then $H = [A^T | I_{n-k}]$, where $I_k$ is $k \times k$, $A$ is $k \times (n-k)$, $A^T$ is $ (n-k) \times k$ and $I_{n-k}$ is $(n-k) \times (n-k)$.

Proof: We need to show that $HG^T= 0$. We can compute, \begin{equation} HG^T= \begin{pmatrix} A^T & I_{n-k} \end{pmatrix} \begin{pmatrix} I_k \\ A^T \end{pmatrix} = A^TI_k + I_{n-k}A^T = A^T + A^T = 0. \end{equation} Hence, proved. The above lemma gives a very straightforward procedure for computing the parity check matrix.

For the Hamming code, the parity check matrix is \begin{equation} H=\begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \\ \end{pmatrix}. \end{equation}

Question: Verify that $HG^T = 0$ for the Hamming code.

H = np.array([
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1]], dtype=int)
    

Error syndromes#

We have learned that the parity-check matrix can be used to determine if $\tilde{c}$ is a codeword or not. But does it tell us more? Imagine Alice sends the codeword $c$ and it gets distorted by error $e$ to $\tilde{c} = c + e$. Now, the effect of the parity-check matrix is, \begin{equation} s = H\tilde{c}^T = Hc^T + He^T = 0 + He^T = He^T. \end{equation} The quantity $s$ informs us of what error has occured, and is called an error syndrome. It is a vector of length $n-k$. One can precompute a look-up table for $s$ vs $e$ and use this to determine the error at runtime.

Task 2#

Complete the function noise_channel that with probability p, flips each element of the input.

  • Parameters:
    • 'c': the transmitted codeword, a numpy.array of shape (1,n) and dtype=int guaranteed to only contain 0 or 1.
    • 'p': the probability of error, 'float' guaranteed to be between 0 and 1 inclusive.
  • Returns:
    • The corrupted codeword, A numpy.arrary of shape (1,n) and dtype=int guaranteed to only contain 0 or 1.
def noise_channel(c, p):
    pass

c = random_message(7)
corrupted_c = noise_channel(m, 0.5)
print('corrupted c =', corrupted_c)

Task 3#

Complete the function hamming_syndrome that when passed a corrupted codeword, determines the error syndrome.

  • Parameters:
    • corrupted_c: a numpy.array, of shape (1,7) and dtype=int guaranteed to only contain 0 or 1.
  • Returns:
    • Error syndrome: a numpy.array of shape (1,3) and dtype=int that is the error syndrome
def hamming_syndrome(corrupted_c):
    pass

Correcting the error#

Once one knows the error $e$ that has occured, to correct the codeword, one can simply note that $\tilde{c} + e = c + e + e = c$. So the codeword is straightforwardly obtained.

Task 4#

Compute a table of all possible syndromes and their corresponding one bit errors.

# write your code here

Task 5#

Complete the function hamming_correct that when passed a corrupted codeword, identifies the error syndrome, infers an error, and corrects for it.

  • Parameters:
    'corrupted_c': a numpy.array, of shape (1,7) and dtype=int guaranteed to only contain 0 or 1.
  • Returns:
    The fixed codeword, a numpy.arrary of shape (1,7) and dtype=int guaranteed to only contain 0 or 1.
def hamming_correct(corrupted_c):
    pass

Distance of a code#

A crucial quantity of interest when discussing codes, is their distance. The distance determines the number of errors that can be corrected. The distance is the minimum Hamming distance between any two codewords.

For code $C$ the code distance $d$ is

\begin{equation} d = \min_{c_1,c_2\in C} d_H(c_1,c_2). \end{equation}

This seems to be a tedious optimization problem. However, we can simplify the calculation in the following way. First, we define the weight of a bit string.

The weight of a string $s$, denoted $wt(s)$, is the number of ones in the string.

The weight can also be thought of as as the Hamming distance of $s$ from the all $0$ string, i.e. \begin{equation} wt(s) = d_H(s,0^n). \end{equation}

Next note that the zero message, $m = 0^k$, maps to the zero codeword $c_0 = 0^kG = 0^n$. Then, using the properties of the Hamming distance and the fact that the codespace is a vector space, and letting $c_1 = x_1G \ne c_2=x_2G$, we have \begin{align} d_H(c_1,c_2) &= d_H(x_1G, x_2G), \\ &= d_H(x_1G-x_2G,0^k), \\ &= d_H((x_1-x_2)G,0^k), \\ &= d_H(c_3,0^k), \\ &= wt(c_3), \end{align} where in the second last step we have defined $c_3 = (x_1-x_2)G$, which must exist in $C$ because $x_1-x_2$ is part of the message space, and $c_3 \ne 0$. So we have discovered that distance of a code, \begin{equation} d = \min_{c \in C} wt(c), \end{equation} is just the minimum weight string.

Another theorem simplifies the calculation even more.

Theorem: The distance $d$ of a code $C$ is equal to the minimum number of linearly dependent columns of the parity check matrix $H$ of the code.

The use of the distance of a code is given by the following theorem.

Theorem: A code with distance $d$ can

  • Correct at most $d-1$ erasure errors,
  • Detect at most $d-1$ bit flip errors,
  • Correct at most $\left\lfloor \frac{d-1}{2}\right\rfloor$ bit flip errors.

Therefore, a single quantity tells us the power of a code at detecting and correcting various types of errors.

Question: Compute the distance of the $[7,4]$ Hamming code, using both methods outlined above. Show $d = 3$.

If the distance $d$ of a code $C$ is known, then $C$ is denoted as $[n,k,d]$.

Question: Determine the distance $d$ of the $[n,1,d]$ repetition code.

Dual codes#

Recall that the generator matrix $G$ is a matrix whose rows form the basis of the code $C$. In other words, the only defining feature of $G$ is that it's rows are independent. Now, we also know that the rows of the parity-check matrix $H$ are also independent; otherwise, the kernel of $H$ will have dimension less than $k$ and could not be equal to the code $C$. Therefore, $H$ can also form a code, and such a code is called the dual code $C^\perp$ of $C$.

Given code $C$ with parity-check matrix $H$, the code with generator matrix $H$ is called the dual code to $C$ and denoted by $C^\perp$.

Another equivalent and perhaps more intuitive definition of the dual code is given by the fact that $HG^T=0$, which leads to the fact that

The dual code $C^\perp = \set{x \in \mathbb{F}_2^n : x\cdot c = 0, \forall c \in C}$.

Question: If code $C$ is a $[n,k]$ code, then what is the type of $C^\perp$. You can use the dimensions of the matrix $H$ to determine this.

An interesting category of codes are those which are self-dual. These are codes for which $C = C^\perp$, i.e. the generator matrix $G$ of $C$ and the parity-check matrix $H$ of $C$ are equal to each other. From the dimension of these matrices, this clearly implies that $n-k = k$ or $k = n/2$. Hence, $n$ must be even.

Question: The extended Hamming code is a $[8,4]$ code that is equal to the Hamming code with an additional column in the parity check matrix that computes the parity of all four message bits. Show that this code is self-dual.