This project is a two-layer neural network that learns to recognise handwritten digits by being fed the famous MNIST dataset as a .csv file. It uses no TensorFlow, PyTorch or other ML frameworks, solely math.
The script includes a set of tweakable constants (an illustrative sketch follows the list):
- ITERATIONS: How many iterations the neural network will loop.
- DISPLAY_REG: How often the script displays the neural network's current accuracy.
- IMG_SIZE: The number of pixels in the input image.
- DATASET_PARTITION: Where the data is split for cross-validation.
- TEST_PREDICTIONS: How many predictions the neural network will be tested on after training completes.
- DATASET_FILE: Input path for the data.
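A minimal sketch of how these constants might look in the script; the values and filename below are purely illustrative, not the script's actual defaults:

```python
ITERATIONS = 500            # gradient-descent iterations (illustrative value)
DISPLAY_REG = 10            # report accuracy every 10 iterations
IMG_SIZE = 784              # 28 x 28 pixels per MNIST image
DATASET_PARTITION = 1000    # rows held out for cross-validation
TEST_PREDICTIONS = 4        # sample predictions shown after training
DATASET_FILE = "mnist_train.csv"  # path to the MNIST .csv (hypothetical filename)
```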
During forward propagation, the neural network takes images and learns to create predictions out of them (a code sketch follows the list):
- A0: The input layer (layer 0) of the neural network. It simply receives the image's IMG_SIZE pixels, one per node.
- Z1: Unactivated first layer. Z1 is obtained by applying the weights of the connections from the prior layer (W1) and a bias (b1) to the input layer (A0). Or, Z1 = W1 * A0 + b1.
- A1: First layer. A1 is obtained by putting Z1 through an activation function. The activation function I use is the Exponential Linear Unit, or ELU.
- Z2: Unactivated second layer. Z2 is obtained by applying the weights of the connections from the prior layer (W2) and a bias (b2) to the prior layer (A1). Or, Z2 = W2 * A1 + b2.
- A2: Second and final layer. A2 is obtained by passing Z2 through an activation function. This time we're using softmax, which assigns a probability to each node in this output layer.
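The forward pass can be sketched in NumPy roughly as follows. The helper names (elu, softmax, forward_prop) and the one-image-per-column layout are assumptions for illustration, not necessarily the script's actual code:

```python
import numpy as np

def elu(z, a=1.0):
    # ELU: z for z > 0, a * (e^z - 1) otherwise (clipped to avoid overflow in exp)
    return np.where(z > 0, z, a * (np.exp(np.minimum(z, 0)) - 1))

def softmax(z):
    # Subtract the column-wise max for numerical stability
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, A0):
    Z1 = W1 @ A0 + b1    # unactivated first layer
    A1 = elu(Z1)         # first layer (ELU activation)
    Z2 = W2 @ A1 + b2    # unactivated second layer
    A2 = softmax(Z2)     # output layer: one probability per digit
    return Z1, A1, Z2, A2
```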
Backward propagation is the method by which the algorithm improves as it learns. This is done by taking the prediction, measuring how much it deviated from the image's label, and working backwards (a code sketch follows the list):
- dZ2: A measure of the error in the second layer. It's obtained by taking the predictions and subtracting the labels from them. For that, we one-hot encode the label as Y. dZ2 = A2 - Y
- dW2: The derivative of the loss function with respect to the weights in layer 2. dW2 = 1/m * dZ2 * A1.T, where m is the number of training examples and .T is the transpose of a matrix or vector.
- db2: The derivative of the loss with respect to the biases in layer 2, i.e. the average of the error. db2 = 1/m * Σ dZ2.
- dZ1: A measure of the error in the first layer. This formula essentially performs forward propagation in reverse. dZ1 = W2.T * dZ2 * g'(Z1), where g'() is the derivative of the activation function (ELU) applied element-wise.
- dW1: The derivative of the loss function with respect to the weights in layer 1. dW1 = 1/m * dZ1 * X.T, where X is the input layer A0.
- db1: Likewise for the biases in layer 1: db1 = 1/m * Σ dZ1.
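A matching backward-pass sketch, with the same caveats: elu_deriv, one_hot and backward_prop are hypothetical helper names, and the data is assumed to be laid out one example per column:

```python
def elu_deriv(z, a=1.0):
    # g'(z): 1 for z > 0, a * e^z otherwise (clipped to avoid overflow in exp)
    return np.where(z > 0, 1.0, a * np.exp(np.minimum(z, 0)))

def one_hot(labels, num_classes=10):
    # Y has shape (num_classes, m): a 1 in the row matching each example's label
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def backward_prop(Z1, A1, A2, W2, A0, labels):
    m = labels.size                           # number of training examples
    Y = one_hot(labels)
    dZ2 = A2 - Y                              # error in the output layer
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * elu_deriv(Z1)        # push the error back through ELU
    dW1 = (1 / m) * dZ1 @ A0.T
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```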
After successful forward & backward propagation, the algorithm updates its weights and biases using a hyperparameter α, in this fashion (see the sketch at the end of this section):
- W1 := W1 - αdW1
- b1 := b1 - αdb1
- W2 := W2 - αdW2
- b2 := b2 - αdb2
α, being a hyperparameter, isn't set by gradient descent, but by the end-user. α can be interpreted as the learning rate.
After that, the algorithm loops back to forward propagation.
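Tying these pieces together, a rough sketch of the update step and the outer training loop, reusing the hypothetical functions from the sketches above:

```python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # Step each parameter against its gradient, scaled by the learning rate α
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

def gradient_descent(A0, labels, W1, b1, W2, b2, alpha):
    # Outer loop: forward pass, backward pass, parameter update, repeat.
    # A0 and labels would come from the loaded MNIST .csv.
    for i in range(ITERATIONS):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, A0)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, A2, W2, A0, labels)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
    return W1, b1, W2, b2
```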