# Advanced MNIST Example

In the previous example, we naively squashed the 2D images into 1D vectors. By doing so, we lost some relevant information encoded in the spatial correlations between pixels. To include spatial correlations (or temporal correlations in time series and image sequences), one typically resorts to "convolutional" layers, which essentially scan the input for particular patterns.
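To make "scanning the input for particular patterns" concrete, here is a minimal NumPy sketch that is separate from the Keras code used in this tutorial. The helper `correlate2d_valid`, the 5x5 toy image, and the hand-picked kernel are illustrative assumptions; in a real convolutional layer the kernel values are learned during training.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x5 image: dark on the left, bright in the two rightmost columns
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Hand-picked kernel that responds to a dark-to-bright transition (left to right)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = correlate2d_valid(image, kernel)
print(response)
```

The response is zero where the window only covers the uniform dark region and positive where it straddles the edge, which is the sense in which a filter "detects" a pattern.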

In this tutorial we will revisit the MNIST analysis with a convolutional neural network (CNN). The first steps are the same:

In [None]:
# Import the libraries
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

import os
# Order GPUs by PCI bus ID, then hide all of them so that TensorFlow runs on the CPU
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""


# Load the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Scale the input data
train_images = train_images / 255.0
test_images = test_images / 255.0

The MNIST images are greyscale, so they have only one channel. Colour images typically have 3 channels (red, green, blue), so in general the model expects each input to have shape $height \times width \times channels$. We therefore need to add a channel dimension to our train and test images:

In [None]:
train_images = np.expand_dims(train_images, 3)
test_images = np.expand_dims(test_images, 3)
print(train_images.shape, test_images.shape)
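As a small aside, the effect of `np.expand_dims` can be checked on a dummy array without loading MNIST. The array `batch` below is a hypothetical stand-in for a batch of four 28x28 greyscale images; the sketch shows three equivalent ways of appending a channel axis of size 1.

```python
import numpy as np

# Hypothetical stand-in for a batch of 4 greyscale images of 28x28 pixels
batch = np.zeros((4, 28, 28))

# Three equivalent ways to append a channel axis of size 1
a = np.expand_dims(batch, 3)    # explicit axis index, as used above
b = np.expand_dims(batch, -1)   # "last axis", independent of the number of dims
c = batch[..., np.newaxis]      # indexing shorthand

print(a.shape, b.shape, c.shape)
```

All three results have shape `(4, 28, 28, 1)`; the pixel values themselves are unchanged.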

## Model construction

The procedure for constructing a convolutional neural network is the same as for a dense (fully-connected) neural network. Convolutional layers (`Conv2D`) can be added one by one, with pooling layers (`MaxPooling2D`) in between to reduce the spatial size of the data. With `Conv2D`, we have to specify the number of filters, the size of the kernel, the type of activation, and the rule for handling the boundaries. In the example below, each convolutional layer has 32 filters, a kernel size of 3x3 (`kernel_size=3`), and ReLU activation. The boundaries are padded such that the spatial size of the output is the same as that of the input (`padding="same"`). After 3 convolutional layers, we flatten the resulting feature maps and feed them into a single fully-connected layer with softmax activation.

In [None]:
model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.relu, padding="same", input_shape=train_images[0].shape),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.relu, padding="same"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, kernel_size=3, activation=tf.nn.relu, padding="same"),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Compile and print a summary
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"
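The output shapes listed in the summary can be traced by hand. The sketch below is plain Python, not a Keras API: the helper names `conv_same` and `max_pool` are made up for illustration, and it assumes stride-1 convolutions with `padding="same"` and the default 2x2 pooling window.

```python
import math

def conv_same(size, stride=1):
    """Spatial output size of a convolution with padding="same"."""
    return math.ceil(size / stride)

def max_pool(size, pool=2):
    """Spatial output size of pooling with window and stride `pool`, no padding."""
    return (size - pool) // pool + 1

size = 28  # MNIST images are 28x28 pixels
trace = [size]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    size = conv_same(size) if layer == "conv" else max_pool(size)
    trace.append(size)

flattened = size * size * 32  # 32 filters in the last convolutional layer
print(trace, flattened)
```

The convolutions leave the spatial size unchanged while each pooling layer halves it, so the sequence is 28, 28, 14, 14, 7, 7, and the flattened vector has 1568 entries.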

Note that the total number of parameters is only a third of what we had in the case of only fully-connected layers, even though we have more layers. This is because convolutional layers "reuse" their parameters when going through the input data, and so the number of parameters does not depend on the size of the input. The number of parameters for a single convolutional layer can be calculated as $\left(K_w \times K_l \times C_{in} + 1 \right) \times C_{out}$, where $K_w$ and $K_l$ are the kernel width and length, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. The $+1$ accounts for the bias that is added to each output channel. So for the first layer, we have $\left(3 \times 3 \times 1 + 1 \right) \times 32 = 320$ parameters, and for the second and third layers we have $\left(3 \times 3 \times 32 + 1 \right) \times 32 = 9248$ parameters each. The two pooling layers each halve the spatial size ($28 \to 14 \to 7$), so the fully-connected layer takes an input of size $7 \times 7 \times 32 = 1568$ and produces an output of size $10$. There are therefore $1568 \times 10 + 10 = 15690$ weights and biases involved in the last layer.
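The arithmetic above can be reproduced with a few lines of plain Python. The helper names `conv2d_params` and `dense_params` are hypothetical and simply encode the formulas from the text; they are not Keras functions.

```python
def conv2d_params(kernel_w, kernel_l, c_in, c_out):
    """Weights plus one bias per output channel of a 2D convolutional layer."""
    return (kernel_w * kernel_l * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully-connected layer."""
    return n_in * n_out + n_out

counts = [
    conv2d_params(3, 3, 1, 32),    # first Conv2D: 1 input channel
    conv2d_params(3, 3, 32, 32),   # second Conv2D: 32 input channels
    conv2d_params(3, 3, 32, 32),   # third Conv2D: 32 input channels
    dense_params(7 * 7 * 32, 10),  # Dense layer after flattening
]
total = sum(counts)
print(counts, total)
```

The pooling and flatten layers have no trainable parameters, so the total of these four numbers (34506) should match the total reported by `model.summary()`.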

Because convolutional layers typically carry fewer parameters around than dense layers, you can keep stacking them at relatively low computational cost. This is what makes deep learning truly "deep". It is not uncommon to have CNN architectures with several tens of layers. 

## Training

Let's see how our CNN architecture performs during training:

In [None]:
model.fit(
    train_images, 
    train_labels, 
    validation_data=(test_images, test_labels),
    verbose=1,
    epochs=5)

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
predictions = model.predict(test_images)
print("Test accuracy: %.4f" % test_acc)

A test accuracy of around 99% is much better than the 97% we had before. To put this into perspective: we first had an error rate of about 3%, and now it is only about 1%, a threefold reduction! Exploiting the spatial correlations using convolutional layers really helps (as you could have intuitively guessed).
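In absolute terms, the difference can be expressed as a number of misclassified test images. The sketch below uses the rounded accuracies quoted above (your own runs will differ slightly) and the 10,000 images of the standard MNIST test set.

```python
n_test = 10_000  # number of images in the MNIST test set

# Rounded accuracies quoted in the text: dense network vs. CNN
errors = {acc: round((1 - acc) * n_test) for acc in (0.97, 0.99)}
ratio = errors[0.97] / errors[0.99]

print(errors, ratio)
```

With these rounded values, the dense network misclassifies about 300 images and the CNN about 100.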

## Visualisation

Again, we can visualise the performance by plotting the images and corresponding labels/predictions, but this time it will be even harder to find any misclassifications.

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=10)
for i in range(5):
    for j in range(10):
        n = i*10 + j + 100  # show test images 100 to 149
        pred_num = np.argmax(predictions[n])
        if pred_num == test_labels[n]:
            colour = "g"
        else:
            colour = "r"
        axes[i, j].set_title("%d / %d" % (test_labels[n], pred_num), c=colour)
        axes[i, j].imshow(test_images[n, :, :, 0], cmap="gray")
        axes[i, j].axis("off")
plt.tight_layout()
plt.show()
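Since misclassifications are now rare, it can be easier to look them up directly than to search for them by eye. The sketch below uses small made-up arrays, `probs` and `labels`, as stand-ins for `predictions` and `test_labels`; the same two NumPy calls should apply to the real arrays.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes
probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
# Hypothetical true labels for the same 4 samples
labels = np.array([0, 1, 2, 2])

pred_labels = np.argmax(probs, axis=1)          # most probable class per sample
wrong = np.flatnonzero(pred_labels != labels)   # indices of misclassified samples
accuracy = 1 - len(wrong) / len(labels)

print(wrong, accuracy)
```

The indices in `wrong` could then be used in place of `n` in the plotting loop above to display only the misclassified digits.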

## Final note

CNNs are not limited to image analysis. They are equally applicable to 1D data such as time series, and to 3D data such as volumetric data or image sequences (movies). We will continue to use CNNs later when we analyse seismograms and GPS data.
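As a rough illustration of the 1D case, the sketch below slides a two-sample filter along a toy time series with `np.correlate`. The signal and the hand-picked kernel are assumptions made for this example; a 1D convolutional layer would learn its kernels from data instead.

```python
import numpy as np

# Hypothetical time series: a boxcar pulse between samples 3 and 5
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Hand-picked kernel that responds to a change between neighbouring samples
kernel = np.array([-1.0, 1.0])

# Slide the kernel along the signal without padding
response = np.correlate(signal, kernel, mode="valid")

onset = int(np.argmax(response))   # largest positive response: signal rises
offset = int(np.argmin(response))  # largest negative response: signal drops
print(response, onset, offset)
```

The response is non-zero only where the signal changes, so the same "pattern scanning" idea carries over directly from images to time series.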

## Exercise

Just like in the previous tutorial, experiment with various hyperparameters: number of `Conv2D` layers, `kernel_size`, number of filters, activations, etc.