Neural network with numpy

Neural networks are a wonderful machine learning algorithm. They are called neural networks because they are loosely based on how the brain’s neurons work, which can make them seem intimidating. However, they are essentially just a group of linear models. There is a lot of good information about the math and structure of these algorithms, so I will skip that here. Instead, I will outline the steps to writing one in Python with numpy. The code here is heavily based on the neural network code provided in ‘Programming Collective Intelligence’; I tweaked it a little to make it usable with any dataset as long as the input data is formatted correctly.

First, we can think of every neuron as having an activation function. This function determines whether the neuron is ‘on’ or ‘off’ – fires or not. We will use the sigmoid function, which should be very familiar because of logistic regression. Unlike logistic regression, we will also need the derivative of the sigmoid function when using a neural net.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid: sigmoid(x) * (1.0 - sigmoid(x))
# the y passed in here has already been through the sigmoid, so this is just y * (1.0 - y)
def dsigmoid(y):
    return y * (1.0 - y)
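
As a quick sanity check (not part of the original code), we can confirm that dsigmoid applied to an already-sigmoided value matches the analytic derivative:

# both lines print the same number (about 0.235 for x = 0.5)
x = 0.5
a = sigmoid(x)
print(dsigmoid(a))
print(sigmoid(x) * (1.0 - sigmoid(x)))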

Much like logistic regression, the sigmoid function in a neural network will generate the end point (activation) of inputs multiplied by their weights. For example, let’s say we had two columns (features) of input data and one hidden node (neuron) in our neural network. Each feature would be multiplied by its corresponding weight value, then added together and passed through the sigmoid (just like a logistic regression). To take that simple example and turn it into a neural network we just add more hidden units. In addition to adding more hidden units, we add a path from every input feature to each of those hidden units where it is multiplied by its corresponding weight. Each hidden unit takes the sum of its inputs * weights and passes that through the sigmoid, resulting in that unit’s activation.
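
To make that concrete, here is a tiny made-up example (the numbers are arbitrary) of what a single hidden unit computes for two input features, using the sigmoid defined above:

# two made-up features and their made-up weights going into one hidden unit
features = np.array([0.5, 2.0])
weights = np.array([0.1, -0.3])
activation = sigmoid(np.sum(features * weights)) # sigmoid(-0.55), about 0.366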

Next we will set up the arrays that hold the data for the network and initialize some parameters.

class MLP_NeuralNetwork(object):
    def __init__(self, input, hidden, output):
        """
        :param input: number of input neurons
        :param hidden: number of hidden neurons
        :param output: number of output neurons
        """
        self.input = input + 1 # add 1 for bias node
        self.hidden = hidden
        self.output = output
        # set up array of 1s for activations
        self.ai = [1.0] * self.input
        self.ah = [1.0] * self.hidden
        self.ao = [1.0] * self.output
        # create randomized weights
        self.wi = np.random.randn(self.input, self.hidden)
        self.wo = np.random.randn(self.hidden, self.output)
        # create arrays of 0 for changes
        self.ci = np.zeros((self.input, self.hidden))
        self.co = np.zeros((self.hidden, self.output))

We are going to do all of these calculations with matrices because they are fast and super easy to read. Our class will take three inputs: the size of the input layer (# of features), the size of the hidden layer (a parameter to be tuned), and the size of the output layer (# of possible classes). We set up an array of 1s as a placeholder for the unit activations and an array of 0s as a placeholder for the layer changes. One important thing to note is that we initialized all of the weights to random numbers. It’s important for the weights to be random, otherwise we won’t be able to tune the network. If all of the weights start out the same then all of the hidden units will compute the same thing and you’ll be screwed.
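
For example, a network for the sklearn digits data used later in this post could be created like this (the hidden layer size of 100 is just an arbitrary choice for illustration):

nn = MLP_NeuralNetwork(64, 100, 10) # 64 pixel features, 100 hidden units, 10 digit classes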

So now it’s time to make some predictions. What we will do is feed all of the data forward through the network with the random weights and generate some (bad) predictions. Then, each time predictions are made, we calculate how wrong they are and in what direction the weights need to change in order to make them better (i.e. the error). We will do this many, many … MANY times as the weights get updated, so we’ll make a feed forward function that can be called over and over again.

def feedForward(self, inputs):
    if len(inputs) != self.input - 1:
        raise ValueError('Wrong number of inputs you silly goose!')
    # input activations
    for i in range(self.input - 1): # -1 is to avoid the bias node
        self.ai[i] = inputs[i]
    # hidden activations
    for j in range(self.hidden):
        total = 0.0
        for i in range(self.input):
            total += self.ai[i] * self.wi[i][j]
        self.ah[j] = sigmoid(total)
    # output activations
    for k in range(self.output):
        total = 0.0
        for j in range(self.hidden):
            total += self.ah[j] * self.wo[j][k]
        self.ao[k] = sigmoid(total)
    return self.ao[:]

The input activations are just the input features. But for every other layer, the activations are the sum of the previous layer’s activations multiplied by their corresponding weights, fed into the sigmoid.
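
The same idea can be written without the explicit loops by letting numpy handle the sums. Here is just a sketch of that vectorized form (feed_forward_vectorized is a hypothetical standalone function, not a method of the class above):

# vectorized sketch: ai is the input vector, wi is (input x hidden), wo is (hidden x output)
def feed_forward_vectorized(ai, wi, wo):
    ah = sigmoid(np.dot(wi.T, ai)) # hidden activations
    ao = sigmoid(np.dot(wo.T, ah)) # output activations
    return ah, ao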

On the first pass our predictions will be pretty bad. So we’ll use a very familiar concept, gradient descent. This is the part that I get excited about because I think the math is really clever. Unlike gradient descent for a linear model, we need a little bit of calculus for a neural network, which is why we wrote the function for the derivative of the sigmoid at the beginning.

Our backpropagation algorithm begins by computing the error of our predicted output against the true output. We then take the derivative of the sigmoid on the output activations (predicted values) to get the direction (slope) of the gradient and multiply that value by the error. This gives us the magnitude of the error and the direction in which the hidden-to-output weights need to be changed in order to correct it. We then move on to the hidden layer and calculate the error for the hidden units based on the output deltas we just calculated and the weights connecting the hidden layer to the output layer.

Using that error and the derivative of the sigmoid on the hidden layer activations, we calculate how much and in which direction the weights feeding into the hidden layer (the input-to-hidden weights) need to change.

Now that we have the values for how much we want to change the weights and in what direction, we move on to actually doing that by updating the weights connecting each layer. We subtract from each weight the change (magnitude and direction) calculated for that layer, multiplied by a learning rate constant. Just like in linear models, we use a learning rate constant to make small changes at each step so that we have a better chance of finding the true values for the weights that minimize the cost function.

def backPropagate(self, targets, N):
    """
    :param targets: y values
    :param N: learning rate
    :return: current error (the weights are updated in place)
    """
    if len(targets) != self.output:
        raise ValueError('Wrong number of targets you silly goose!')
    # calculate error terms for output
    # the deltas tell you which direction to change the weights
    output_deltas = [0.0] * self.output
    for k in range(self.output):
        error = -(targets[k] - self.ao[k])
        output_deltas[k] = dsigmoid(self.ao[k]) * error
    # calculate error terms for hidden
    # delta tells you which direction to change the weights
    hidden_deltas = [0.0] * self.hidden
    for j in range(self.hidden):
        error = 0.0
        for k in range(self.output):
            error += output_deltas[k] * self.wo[j][k]
        hidden_deltas[j] = dsigmoid(self.ah[j]) * error
    # update the weights connecting hidden to output
    for j in range(self.hidden):
        for k in range(self.output):
            change = output_deltas[k] * self.ah[j]
            self.wo[j][k] -= N * change + self.co[j][k] # self.co holds the previous change, acting as a crude momentum term
            self.co[j][k] = change
    # update the weights connecting input to hidden
    for i in range(self.input):
        for j in range(self.hidden):
            change = hidden_deltas[j] * self.ai[i]
            self.wi[i][j] -= N * change + self.ci[i][j] # self.ci holds the previous change, acting as a crude momentum term
            self.ci[i][j] = change
    # calculate error
    error = 0.0
    for k in range(len(targets)):
        error += 0.5 * (targets[k] - self.ao[k]) ** 2
    return error

Alright, let’s tie it all together and create training and prediction functions. The steps to training the network are pretty straightforward and intuitive. We first call the ‘feedForward’ function, which gives us the outputs with the randomized weights that we initialized. Then we call the backpropagation algorithm to tune and update the weights to make better predictions. Then the feedForward function is called again, but this time it uses the updated weights and the predictions are a little better. We keep this cycle going for a predetermined number of iterations, during which we should see the error drop close to 0.

def train(self, patterns, iterations = 3000, N = 0.0002):
    # N: learning rate
    for i in range(iterations):
        error = 0.0
        for p in patterns:
            inputs = p[0]
            targets = p[1]
            self.feedForward(inputs)
            error += self.backPropagate(targets, N) # accumulate the error over the full pass
        if i % 500 == 0:
            print('error %-.5f' % error)

Finally, the predict function. We simply call the feedForward function, which will return the activation of the output layer. Remember, the activation of each layer is a linear combination of the previous layer’s output and the corresponding weights, pushed through the sigmoid.

def predict(self, X):
    """
    return list of predictions after training algorithm
    """
    predictions = []
    for p in X:
        predictions.append(self.feedForward(p))
    return predictions

That’s basically it! You can see the full code here.

I ran this code on the digit recognition dataset provided by sklearn and it finished with an accuracy of 97%. I’d say that’s pretty successful!

             precision    recall  f1-score   support

          0       0.98      0.96      0.97        49
          1       0.92      0.97      0.95        36
          2       1.00      1.00      1.00        43
          3       0.95      0.88      0.91        41
          4       0.98      1.00      0.99        47
          5       0.96      1.00      0.98        46
          6       1.00      1.00      1.00        47
          7       0.98      0.96      0.97        46
          8       0.93      0.80      0.86        49
          9       1.00      0.91      0.95        46

avg / total       0.97      0.95      0.96       450
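
For reference, here is a minimal sketch of how the digits data can be formatted and fed to this network. The exact preprocessing I used may have differed, and the hidden layer size and iteration count below are arbitrary choices for illustration:

from sklearn.datasets import load_digits
import numpy as np

digits = load_digits()
X = digits.data / 16.0 # scale the pixel values down to the [0, 1] range
y = digits.target

# one-hot encode the targets and pair each input row with its target vector
targets = np.zeros((len(y), 10))
targets[np.arange(len(y)), y] = 1.0
patterns = [[X[i], targets[i]] for i in range(len(y))]

nn = MLP_NeuralNetwork(64, 100, 10)
nn.train(patterns, iterations=100) # the pure python loops are slow, so keep iterations modest
predictions = nn.predict(X)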

Improving our neural network performance

Show me the code!

Now that we have created our artisan handcrafted neural network, we should improve it with some modern techniques that a bunch of really smart people came up with. When I was making these improvements I used the kaggle competition on the MNIST dataset for my benchmarks. That way I could compare my performance to that of other people’s neural networks. I could also check my scores against other tried and true methods listed here. The original neural network that I created for the last post got 86% on the full MNIST dataset and this new one gets 96%, which is right in line with the multilayer perceptron benchmarks on LeCun’s website and paper. And I am very happy about that! There is still a lot of room for improvement, but starting at 96% without using any feature engineering or deep learning techniques is very encouraging.

Nice!

The new code can be found here. I will go over the improvements in the order they appear as you scroll through the script from top to bottom, so it should be easy to follow along. One thing that I won’t talk about here is the optimization via numpy. What that means is that I took out a lot of the for loops and replaced them with numpy functions like numpy.dot(), or just used + - * on the arrays directly because numpy will take care of the looping internally.

One thing to keep in mind is that most of these improvements have the effect of keeping the weights low (closer to 0). For the same reason that regularization works for regression, having low weight values in a neural network can help it generalize better. Since it’s very easy to overfit with a neural network, we’ll take whatever help we can get.

Activation functions

The first thing that I did was add two more activation (transfer) functions that we can use. Each one has certain advantages over the logistic sigmoid that we started with. The biggest improvement came from changing the hidden layer activation function from the logistic sigmoid to the hyperbolic tangent. Both are considered sigmoid functions, but the logistic has a range of (0, 1) and the hyperbolic tangent (tanh) has a range of (-1, 1). The idea here is that since the tanh function is centered around 0, the outputs that it produces will, on average, be closer to 0. The outputs from the logistic sigmoid will always be greater than 0, so the mean of the outputs will also be greater than 0.

The next activation function is called softmax. This one is only beneficial in the output layer and only when the classes are mutually exclusive. It forces the outputs of the neural network to sum to 1, so that they can represent a probability distribution across the classes. This way the network ‘knows’ that it can’t give full probability to more than one class in its output. Pretty neat!

def softmax(w):
    # shift by the max value for numerical stability before exponentiating
    e = np.exp(w - np.amax(w))
    dist = e / np.sum(e)
    return dist

def tanh(x):
    return np.tanh(x)

# derivative of tanh; like dsigmoid, the y passed in has already been through tanh
def dtanh(y):
    return 1 - y*y

The best part is that we can swap these activations into our back propagation algorithm with very few changes. In order to use the tanh function in our hidden layer all we have to do is swap it out for the sigmoid.

total = np.dot(self.wi.T, self.ai)
self.ah = tanh(total)

When we calculate the gradient for the tanh hidden units we will just use the new tanh derivative that we defined earlier in place of the logistic sigmoid derivative.

error = np.dot(self.wo, output_deltas)
hidden_deltas = dtanh(self.ah) * error

To use the softmax output layer we will make the most drastic changes. We will go from this

output_deltas = dsigmoid(self.ao) * -(targets - self.ao)

to this

output_deltas = -(targets - self.ao)

Again, I am skipping the math in these posts and just focusing on readable python code and a higher level of understanding. But, essentially this is what is happening: if you work out the gradient for a softmax output layer (with its matching cross-entropy cost), the terms cancel and you arrive at -(t - yhat) for the error calculation, where t is the true value and yhat is the predicted value. Awesome!
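
As a tiny made-up illustration of that delta (the numbers are arbitrary): with a one-hot target, the delta for each class ends up being the predicted probability minus the target:

# hypothetical three-class example where the true class is the first one
targets = np.array([1.0, 0.0, 0.0])
ao = softmax(np.array([2.0, 1.0, 0.1])) # roughly [0.66, 0.24, 0.10]
output_deltas = -(targets - ao) # roughly [-0.34, 0.24, 0.10]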

If I remember right just switching out these activation functions gave me a few percentage points of improvement.

Initializing weights

In our previous neural network we simply initialized the weights with some random numbers. That is good because it breaks the symmetry, but there is still a better way. We want to try to activate the sigmoid functions in their linear region so that their derivatives provide enough gradient for our learning to continue. In other words, if the output of a unit is close to the minimum or maximum of the sigmoid function, its derivative will be flat and our network will learn really slowly (there will be no gradient to descend). So how do we fix this?

This part requires some ‘coordination’ between the input data and the weights for it to be effective. For the input data, we just need to scale it to have a mean of 0. Then we draw the weights randomly again, but this time we tell numpy to give them a mean of 0 and a standard deviation equal to the inverse of the square root of the size of the layer feeding into the node.

input_range = 1.0 / self.input ** 0.5 # 0.5 instead of (1/2) so the exponent is not truncated to 0 under python 2
output_range = 1.0 / self.hidden ** 0.5
self.wi = np.random.normal(loc = 0, scale = input_range, size = (self.input, self.hidden))
self.wo = np.random.normal(loc = 0, scale = output_range, size = (self.hidden, self.output))
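
The matching input-side step, scaling the features to a mean of 0, could look something like this (a sketch that assumes X is a 2d numpy array of the training inputs; dividing by the standard deviation is optional but common):

X = X - X.mean(axis=0) # every feature now has a mean of 0
X = X / (X.std(axis=0) + 1e-8) # optionally scale to unit variance, guarding against division by zero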

Shuffling training examples

This next tip was probably the most simple and effective improvement in the code. On each iteration during training we will now shuffle the order of the data before it is fed into the network. Networks learn the fastest from the most unexpected sample. Let’s say that all of our data was neat and organized, with all of our ‘ones’, ‘twos’, and ‘threes’ grouped together. If we fed the data into the network like this it would get really good at classifying ‘ones’, but then once it gets its first ‘two’ it would have no way of even getting close to classifying it. The network would then have to start learning ‘twos’ and forget about ‘ones’. If we randomize the inputs on every iteration the network will have an easier time creating weights that can generalize between all of the classes.

Adding this to our code is as easy as …

import random

def fit(self, patterns):
    for i in range(self.iterations):
        error = 0.0
        random.shuffle(patterns)
        for p in patterns:
            inputs = p[0]
            targets = p[1]
            self.feedForward(inputs)
            error += self.backPropagate(targets)

Where patterns is the list of X and y values for the training dataset.

Regularization

Keeping in line with the overall theme of low weight values another handy trick is to add regularization in the form of weight decay. This is very similar to the l2 regularization used in linear models. For this neural network I initialized a regularization term for both the input to hidden weights and the hidden to output weights. This way there is more flexibility for fine tuning. Basically, regularization introduces a penalty for large weights which will in turn push the values of the weights toward zero. Adding this to the network is very easy and straightforward.

# update the weights connecting hidden to output, change == partial derivative
change = output_deltas * np.reshape(self.ah, (self.ah.shape[0],1))
regularization = self.l2_out * self.wo
self.wo -= self.learning_rate * (change + regularization) + self.co * self.momentum
self.co = change

# update the weights connecting input to hidden, change == partial derivative
change = hidden_deltas * np.reshape(self.ai, (self.ai.shape[0], 1))
regularization = self.l2_in * self.wi
self.wi -= self.learning_rate * (change + regularization) + self.ci * self.momentum
self.ci = change

As you can see, we just added a regularization term in the back propagation algorithm. It gets added to the change (partial derivative) variable, which in turn increases the amount that each weight is decreased by on every iteration. The self.l2_in and self.l2_out terms need to be tuned with some cross validation, so it can get pretty time consuming to find the optimal values. With the right data it can be well worth it though.
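
One straightforward way to pick those values is a small grid search against a held-out validation set. This is only a sketch: it assumes the improved class accepts l2_in and l2_out as constructor arguments and that train_patterns, X_valid, and y_valid have already been prepared.

# hypothetical grid search over the two regularization strengths
best = None
for l2_in in [0.0, 0.001, 0.01]:
    for l2_out in [0.0, 0.001, 0.01]:
        nn = MLP_NeuralNetwork(64, 100, 10, l2_in=l2_in, l2_out=l2_out) # assumed signature
        nn.fit(train_patterns)
        preds = np.argmax(nn.predict(X_valid), axis=1) # predicted class for each validation row
        accuracy = np.mean(preds == y_valid)
        if best is None or accuracy > best[0]:
            best = (accuracy, l2_in, l2_out)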

No more overfitting!

So there are the four things that have greatly improved the performance of my neural network. Obviously there is still a lot that can be added but these offer pretty big improvements for very little effort.

Just like the last neural network post, I did not go into the math behind all of this. If you would like to take your understanding of neural networks to the next level, the Stanford deep learning tutorial is my favorite website right now. It offers a much more in-depth look at all of the algorithms for neural networks than my posts here. I find it very helpful to match each equation in the ‘Multi-Layer Neural Network’ tutorial with each snippet of code in my neural network script. It makes the underlying math much easier to digest.

Additionally, many more ways to improve the training of neural networks are outlined in ‘Efficient Backprop’ by LeCun et al. Most of what I outlined here came from that paper.

The biggest takeaway from all of these tips is that a mean near zero will make you a hero.

My machine learning library

I am in the process of creating a machine learning library geared toward new users. It will not be the fastest library but the idea is to write the code in a very clear and easy to understand way so that people can go through and see exactly what each algorithm is doing. Hopefully it will be a resource that some people find useful.

Written on February 13, 2016