Lesson 2 Notes
In this lesson, we'll go over Jeremy's approach to entering the dogs vs. cats competition. We'll also directly look at the architecture of a neural network, talk about how weights are initialized and improved to provide accurate results, and we'll discuss building linear models in Keras. Finally, we'll talk about the role of activation functions in neural networks, and we'll learn how to fine-tune a pre-trained Vgg16 network to classify cats and dogs.
- 1 Teaching Approach
- 2 Jeremy's Solution to Dogs vs Cats Redux Competition
- 3 Pre-Trained Weights
- 4 Neural Network Basics
- 5 Gradient Descent
- 6 Cats vs Dogs and Finetuning
Teaching Approach
Before we start this lesson, we wanted to mention our approach to teaching this course. In the first lesson, we learned how to put together a state-of-the-art image classification tool in seven lines of code. You may be wondering why we asked you to do things like "finetune" and "fit" your model, before actually explaining what those things are.
Our philosophy in teaching this course is inspired by David Perkins's [Making Learning Whole], which advocates a top-down approach to teaching, as opposed to the bottom-up approach ubiquitous in mathematics. As an example, let's think about how baseball is typically taught. If we taught baseball the way mathematics is taught, you would start by learning the physics behind a parabolic trajectory. After mastering classical mechanics, you would then move on to learn the materials science behind baseball and baseball bat design, and this approach would continue until, decades later, you understood every law governing the event we call baseball. Of course, this is not how baseball is actually taught (thank goodness). Baseball is taught by going down to the baseball diamond and starting to play.
This approach of throwing ourselves into the deep end and learning the details along the way is how we're approaching this course. We will eventually understand everything we need to know about finetuning and fitting models, but first we thought it necessary to teach you how to work with Keras in constructing an image classification model at the macroscopic level.
Tips for Learning
We purposely did not tell you how to do certain things in our first lesson, such as preparing the Kaggle submission or using the Kaggle-CLI. We do this because we believe that discovering how to do these things on your own is an important part of the learning process. We recommend the following when you encounter a problem:
- Spend 30 minutes on a problem on your own
- Afterwards, you must ask somebody for help
- If you have questions or problems, describe them in detail to accelerate the troubleshooting process
- Use the provided resources in this wiki
For more on the best practices in asking for help, please see [How to ask for Help].
Jeremy's Solution to Dogs vs Cats Redux Competition
Downloading the Data
If you're using a remote instance/server, it is highly recommended that you use the Kaggle-CLI (https://github.com/floydwch/kaggle-cli). After using the Kaggle-CLI to download the data, you'll need to unzip the files (train.zip and test.zip). This can be done using the unzip package. For a more detailed explanation, refer to this Wiki page.
When it comes to your data science projects, it's suggested that you prepare a "to do" list of the required steps, as it makes life far easier.
For the Cats vs. Dogs Redux competition, Jeremy's to-do list is as follows:
- Create validation and sample set
- Move images to appropriate directories
- Finetune and train
If you're using a Jupyter notebook (we highly recommend that you do), make markdown headings for each item on your to-do list.
Jupyter notebooks have some handy tools:
- magic commands
- start with a % and select a command, e.g. cd, mkdir, etc.
- bash commands
- start with a ! and type any bash command
- why would you use it instead of the terminal?
- keeps a record of your bash commands
- allows for easily reproducible work and debugging
Doing all your work in a notebook is useful because it keeps a record of exactly what you did.
Preparing the Data
Looking at the data we just downloaded from the competition, you'll notice two folders, test and train, as well as one file sample_submission.csv. Inside of test and train are the samples we'll be using.
If you recall the section on data structure from the previous lesson:
- Training set: This is the data that our algorithm is going to use to fit parameters in order to make predictions
- Validation set: The data we use to fine tune our parameters
- Test set: The data we use to test our final model against. Since this data has not been seen by the model before, it is meant to simulate how it will perform against new data.
- Create a valid folder and move 2000 random samples from the train folder, into it.
- Create a sample folder.
- Inside of your sample folder, create a train folder and copy 200 random samples to it from train.
- Again, inside of your sample folder, create a valid folder and copy 50 random samples to it from train.
- For your valid, sample, and train folders:
- Create a cat folder and a dog folder
- Move all of your cat samples into the cat folder
- Move all of your dog samples into the dog folder
In the accompanying notebook for this lesson, Jeremy does this in Python, using the os and numpy modules to move or copy randomly selected files, and the glob module to grab file names. This can also be done in Bash using the ls, mv, cp, shuf, and head commands in conjunction with appropriate piping.
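The folder preparation steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from Jeremy's notebook: the helper names `move_random_sample` and `sort_into_class_folders` are hypothetical, the standard library's `random` module stands in for numpy's random permutation, and we assume the training files are named like `cat.123.jpg` and `dog.456.jpg`.

```python
import glob
import os
import random
import shutil


def move_random_sample(src_dir, dst_dir, n, copy=False, seed=None):
    """Move (or copy) n randomly chosen .jpg files from src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    files = sorted(glob.glob(os.path.join(src_dir, "*.jpg")))
    chosen = random.Random(seed).sample(files, n)
    transfer = shutil.copy if copy else shutil.move
    for path in chosen:
        transfer(path, os.path.join(dst_dir, os.path.basename(path)))
    return chosen


def sort_into_class_folders(directory, classes=("cat", "dog")):
    """Move files named like 'cat.123.jpg' into directory/cat, and so on."""
    for cls in classes:
        cls_dir = os.path.join(directory, cls)
        os.makedirs(cls_dir, exist_ok=True)
        # Assumed naming scheme: <class>.<number>.jpg
        for path in glob.glob(os.path.join(directory, cls + ".*.jpg")):
            shutil.move(path, os.path.join(cls_dir, os.path.basename(path)))
```

For example, `move_random_sample('train', 'valid', 2000)` would build the validation set, and `move_random_sample('train', 'sample/train', 200, copy=True)` would build the sample training set, after which `sort_into_class_folders` can be called on each folder.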
Finetune and Train
Thankfully, and perhaps unexpectedly, this step isn't too difficult.
All we need to do is copy and paste the seven lines of code from Lesson 1 to build an image classifier for this new data set. This time, however, we will also introduce one new line of code, which saves the model's weights after fitting.
All we're doing here is saving the weights we learned after fine-tuning and fitting our model to this dataset. This has the obvious advantage of allowing us to skip that step every time we want to work with this model. It also allows us to keep track of different versions of our model: as we continue to train it, we may over-fit or end up with a model that is less desirable than a previous iteration. As such, it's beneficial to save your model weights after every unique fitting process.
Kaggle expects you to submit a CSV with:
- an id column, equivalent to a file/sample number
- a label/prediction column, equivalent to the model's prediction of the image being a dog
Most likely, you'll be submitting a lot. To make things easier, it is suggested you make a submission function, which you may call as many times as you wish.
The easiest way to submit your CSV when using mini-batches is to use Keras's model.predict_generator() which takes your mini-batches and returns all of the model's predictions/labels in a single array. Note that when creating your batches for submission, we need to set shuffle=False, because we need to make sure that the results are received in the same order as the files in the directory so we can correctly match the id to its predicted response.
We recommend that you read the [Keras] documentation as often as possible to familiarize yourself with these function calls, and what they are returning.
Note: Kaggle wants probabilities, not class labels, and probabilities are what model.predict_generator() returns by default. Because the test images are unlabeled, pass class_mode=None to gen.flow_from_directory() when you make your mini-batches, so that the generator yields only images rather than (image, label) pairs.
Once we get our results, preparing them for submission is a simple task of obtaining the probabilities we want (in this case the probability that the photo is a dog), and then formatting the results with the image id's into the desired submission csv. The accompanying redux notebook outlines Jeremy's approach to doing this.
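As a rough illustration of the formatting step (not the code from the redux notebook), the sketch below turns a list of file names and dog probabilities into the two-column CSV Kaggle expects. The function name `write_submission` is hypothetical, and we assume the test file names look like `unknown/12.jpg`, so that the id is the number in the file name.

```python
import csv
import io
import os


def write_submission(filenames, dog_probs, out_file):
    """Write Kaggle-style rows (id,label) to an open text file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["id", "label"])
    rows = []
    for name, prob in zip(filenames, dog_probs):
        # Assumed naming scheme: 'unknown/12.jpg' -> id 12
        image_id = int(os.path.splitext(os.path.basename(name))[0])
        rows.append((image_id, float(prob)))
    # Sort by id so the file is easy to inspect
    for image_id, prob in sorted(rows):
        writer.writerow([image_id, "%.5f" % prob])


buffer = io.StringIO()
write_submission(["unknown/2.jpg", "unknown/1.jpg"], [0.95, 0.05], buffer)
print(buffer.getvalue())
```

In practice you would pass an open file (for example `open('submission.csv', 'w')`) instead of the in-memory buffer used here.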
Once you have your submission file ready, you could use the Kaggle-CLI to upload the file from your remote server. However, a much simpler approach is to use the following:
from IPython.display import FileLink
FileLink('data/redux/subm98.csv')
This will return a link to the file on your remote server in your browser. You can then click on it to download and submit in the normal fashion.
Dealing with Log Loss
The score that Kaggle gives you after submitting is based on the Log Loss function (which is described in depth in this article: Log Loss).
In Keras, this is called "binary cross-entropy", and it is the two-class special case of "categorical cross-entropy".
We need to be cautious when submitting our results to Kaggle. The model that we've built is likely to be overconfident in its predictions, and as a result the probabilities it returns will be very close to 1 or 0. This is great when our predictions are correct, as the logarithm of 1 is zero. However, when we have misclassified with that level of confidence, we are taking the logarithm of 0. This is undefined, and we would expect it to throw an error. Fortunately, the wizards over at Kaggle avoid returning an error for such confident predictions by instead taking the logarithm of some arbitrarily small non-zero amount. But the logarithm of an arbitrarily small epsilon is very large in magnitude, so this overconfidence is heavily penalized in our score. A smarter choice is to compensate for the overconfidence by clipping our results, e.g. to a range of 0.05 to 0.95. While this means we won't get a loss of exactly zero for our correct predictions, we will avoid generating large values in response to a confident incorrect prediction, and this results in a better score.
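To make the effect of clipping concrete, here is a small NumPy sketch of the binary log loss. The `log_loss` function and the `eps` value of 1e-15 are our own assumptions for illustration; the exact epsilon Kaggle uses is not stated here.

```python
import numpy as np


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; predictions are clipped to [eps, 1 - eps]."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


y_true = [1, 0, 1, 0]
overconfident = [1.0, 0.0, 1.0, 1.0]   # the last prediction is confidently wrong
clipped = np.clip(overconfident, 0.05, 0.95)

raw_loss = log_loss(y_true, overconfident)
clipped_loss = log_loss(y_true, clipped)
print(raw_loss, clipped_loss)
```

With one confidently wrong prediction out of four, the unclipped loss is dominated by that single example, while the clipped predictions give a much smaller average loss.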
A few notes about training over multiple epochs (one full pass through the data set):
- Running more epochs can improve performance
- If you use multiple epochs, try to monitor and track the training results by separating them
- If your accuracy doesn't settle well, try decreasing the "learning rate"
We'll go over what learning rate is later, but you can change it for a vgg object with:
vgg.model.optimizer.lr = 0.01
Visualization of Results
It's often a good idea to analyze how well your model is doing through visualization. Fortunately, since we're dealing with images this is exceptionally easy. We'd like to look at the following examples:
- A few correct labels at random
- A few incorrect labels at random
- The most correct labels of each class
- The most incorrect labels of each class
- The most uncertain labels (those with probabilities near 0.5)
Below we do just this:
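import numpy as np
from numpy.random import permutation
# preds, probs, val_labels, n_view and plots_idx come from the accompanying notebook

#1. A few correct labels at random
correct = np.where(preds==val_labels[:,1])[0]
idx = permutation(correct)[:n_view]
plots_idx(idx, probs[idx])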
These results are what we'd expect.
#2. A few incorrect labels at random
incorrect = np.where(preds!=val_labels[:,1])[0]
idx = permutation(incorrect)[:n_view]
plots_idx(idx, probs[idx])
This helps us see what it's getting wrong, and we can see that in these instances it is reasonable for our model to have done so due to some odd angles or obscured images. One of these actually has both a cat and a dog!
#3. The images we were most confident were cats, and are actually cats
correct_cats = np.where((preds==0) & (preds==val_labels[:,1]))[0]
most_correct_cats = np.argsort(probs[correct_cats])[::-1][:n_view]
plots_idx(correct_cats[most_correct_cats], probs[correct_cats][most_correct_cats])
# as above, but dogs
correct_dogs = np.where((preds==1) & (preds==val_labels[:,1]))[0]
most_correct_dogs = np.argsort(probs[correct_dogs])[:n_view]
plots_idx(correct_dogs[most_correct_dogs], 1-probs[correct_dogs][most_correct_dogs])
#4. The images we were most confident were cats, but are actually dogs
incorrect_cats = np.where((preds==0) & (preds!=val_labels[:,1]))[0]
most_incorrect_cats = np.argsort(probs[incorrect_cats])[::-1][:n_view]
plots_idx(incorrect_cats[most_incorrect_cats], probs[incorrect_cats][most_incorrect_cats])
This makes sense, these images are fairly unusual.
#4. The images we were most confident were dogs, but are actually cats
incorrect_dogs = np.where((preds==1) & (preds!=val_labels[:,1]))[0]
most_incorrect_dogs = np.argsort(probs[incorrect_dogs])[:n_view]
plots_idx(incorrect_dogs[most_incorrect_dogs], 1-probs[incorrect_dogs][most_incorrect_dogs])
These are even stranger images, but again make sense.
#5. The most uncertain labels (i.e. those with probability closest to 0.5).
most_uncertain = np.argsort(np.abs(probs-0.5))
plots_idx(most_uncertain[:n_view], probs[most_uncertain])
And these images aren't very clear either.
Perhaps the most common way to analyze the result of a classification model is to use a confusion matrix. Scikit-learn has a convenient function we can use for this purpose:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(val_classes, preds)
We can just print out the confusion matrix, or we can show a graphical view (which is mainly useful for dependent variables with a larger number of categories).
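If it helps to see what a confusion matrix actually counts, here is a minimal NumPy sketch. It is not scikit-learn's implementation; the function name `confusion_counts` and the toy labels are hypothetical. It follows the same layout as scikit-learn's function: rows are the true classes and columns are the predicted classes.

```python
import numpy as np


def confusion_counts(true_classes, predicted_classes, n_classes=2):
    """Count how often each true class (row) received each prediction (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_classes, predicted_classes):
        cm[t, p] += 1
    return cm


toy_true = [0, 0, 0, 1, 1, 1]    # hypothetical labels: 0 = cat, 1 = dog
toy_preds = [0, 0, 1, 1, 1, 0]
print(confusion_counts(toy_true, toy_preds))
```

The diagonal holds the correctly classified images, and the off-diagonal entries show which class was mistaken for which.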
Jeremy's solution is very easy to generalize to other image classification tasks such as the "State Farm Distracted Driver Detection" competition on Kaggle. Refer to the redux notebook to see step by step how Jeremy did this.
Pre-Trained Weights
Why start with an ImageNet network?
Training a convolutional neural network with random weights would take significantly longer than finetuning a pre-trained set of weights. Pre-trained weights are useful because they already have features of the world (ImageNet) encoded in their weights. They typically include lines, edges, curves, and many more useful low-level filters that identify particular parts of images. For example, below we can see a layer that is finding things like stripes and circles.
Obviously, these low level filters are useful in any image identification process, not just ImageNet. Therefore we use pre-trained weights because they have already learned these filters, and we can simply "fine-tune" the higher layers to change the mapping to whatever classification task we're attempting to do.
If you want a better idea of what an ImageNet CNN's layers might look like, this paper goes in depth on the subject: https://arxiv.org/abs/1311.2901
Before we understand what exactly fine-tuning is, we need to know what a neural network is first.
Neural Network Basics
A standard fully-connected neural network is essentially a series of matrix products. Here we'll break down the matrix multiplication in a neural network using Jeremy's spreadsheet method. Keep in mind that we're only going to demonstrate the matrix multiplication a few times, to condense the material.
As detailed above, a neural network is at its core a sequence of matrices that map an input vector to an output vector through matrix multiplication. The intermediate vectors in between each matrix are the activations, and the matrices themselves are the layers. Through a process we'll learn about called "fitting", our goal is to adjust the values of the matrices, which we call "weights", so that when our input vectors are passed into the neural network we are able to produce an output vector that is as close as possible to the true output vector, and we do this across multiple labeled input vectors. This is what makes up a training set.
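The idea of a network as a sequence of matrix products can be sketched in NumPy. This is only a toy illustration with made-up sizes (3 inputs, 4 intermediate activations, 2 outputs), random weights, and a hypothetical target vector `y`; it is not the spreadsheet from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(3)             # input vector with 3 features
w1 = rng.normal(size=(3, 4))  # first layer: 3 inputs -> 4 activations
w2 = rng.normal(size=(4, 2))  # second layer: 4 activations -> 2 outputs

activations = x @ w1          # intermediate vector between the two matrices
output = activations @ w2     # output vector, to be compared with the target

y = np.array([1.0, 0.0])      # hypothetical target vector
print(output, np.sum((output - y) ** 2))
```

With random weights the output is typically far from the target, which is exactly the gap that fitting is meant to close.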
Above, we started with randomly generated weights as our matrix elements. After performing all the operations and observing the outcome, notice how the activations output is significantly different than our target vector y. Our goal is to get as close to the target vector y as possible using some sort of optimization algorithm. Before using the optimization algorithm, it's suggested to start your weight values in a manner that makes the activations output at least relatively close to the target vector. This method is called weight initialization.
There are many weight initializers to choose from. In the lecture, Jeremy uses Xavier Initialization (also known as Glorot Initialization). However, it's important to note that most modern deep learning libraries will handle weight initialization for you.
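As an illustration, the uniform variant of Glorot initialization draws each weight from a range that shrinks as the layer gets wider. The sketch below is our own minimal version, with a hypothetical helper name `glorot_uniform`; the formula used in the lecture's spreadsheet may differ in detail, and in practice your deep learning library will do this for you.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
w = glorot_uniform(4096, 1000, rng)
print(w.shape, w.std())
```

The point of scaling by the layer sizes is to keep the magnitude of the activations roughly stable from one layer to the next, so that the initial outputs are not wildly far from the targets.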
Gradient Descent
In the previous section, we mentioned that the process of fitting weights so that the network's output approximates the expected output depends upon an optimization algorithm. The most common optimization algorithm, and one that is ubiquitous throughout deep learning, is gradient descent.
Standard Gradient Descent
Standard gradient descent is a method of iteratively selecting "parameters" (what we call weights in deep learning) that successively reduce what is known as a "loss function." The loss function is simply some method of determining how different the predicted output, as determined by the prediction "parameters" and a given input, is from the true output associated with the same input. A common loss function is the sum of squared errors, which is simply the sum of the squared differences between predicted responses and true responses. This loss function is common in processes like linear regression. Another common loss function is the log loss, which we described above. We commonly use this loss function in neural networks.
Since our loss function is essentially a measure of how well our predictions match with the expected values, the goal of our optimization algorithm is to minimize this value. Our prediction function has at a minimum two kinds of numerical values, the inputs that the function acts upon, and the "parameters" that dictate what we do to the input. Since we cannot change the inputs, we necessarily must minimize the loss function by choosing parameters that produce predictions that are closer to the expected values.
Gradient descent is a method of iteratively "improving" upon our initial parameter values (initialized through some process) in order to minimize the loss function. We do this by calculating the partial derivatives of the loss function with respect to each parameter, and updating each parameter by taking some step in the direction opposite its derivative. When we do this across all parameters, we have (hopefully) updated our parameters in such a way that the predictions determined by these new parameters decrease the loss function. We shall see in later lessons that some problems can arise using this method, and how we can address them. The "gradient" in this process is the vector that describes how the loss function changes with respect to each parameter. Another value to be mindful of, which we mentioned earlier, is the learning rate. This value dictates how large a step we take when updating our parameters, and it is what's known as a "hyper-parameter".
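Written as a formula, one step of gradient descent looks like the following. The symbols are our own notation rather than anything from the lecture: $\theta_i$ is a single parameter (weight), $L$ is the loss function, and $\eta$ is the learning rate.

```latex
\theta_i \leftarrow \theta_i - \eta \, \frac{\partial L(\theta)}{\partial \theta_i}
\qquad \text{for every parameter } \theta_i
```

The minus sign is the "step in the opposite direction of the derivative", and $\eta$ controls how large that step is.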
Linear Regression Example
This may sound more complicated than it actually is. As an example, we can see gradient descent in action in linear regression (fitting a line). This is explained in great detail in the SGD notebook.
In short, if our line is ax+b, where a and b are the "parameters", our goal is to calculate what a and b need to be in order to minimize our loss function. As mentioned earlier, the loss function is essentially a mathematical function that will be high if your guess (in our case a and b), is bad and low if the guess is good. In linear regression, we use sum of square errors as our loss function. At each iteration, we calculate the derivative of this function with respect to a and to b. This tells us how the loss function changes with respect to these two parameters. If the derivative with respect to a is positive, then increasing a increases the loss function. Therefore, we decrease a. If it is negative, increasing a decreases the loss function, and so we increase a. In either case, we are moving in the opposite direction of the derivative of a. We do this for both a and b over and over again until we are satisfied with our choices. The sgd notebook allows you to see this in action, and we highly recommend you spend some time playing with it to familiarize yourself with gradient descent.
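Here is a minimal NumPy sketch of that procedure for a line ax+b. It is not the code from the SGD notebook: the "true" values a = 3 and b = 8, the starting guesses, the learning rate, and the number of iterations are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(30)
y = 3.0 * x + 8.0            # hypothetical "true" line: a = 3, b = 8

a, b = -1.0, 1.0             # arbitrary starting guesses
lr = 0.01                    # learning rate

for _ in range(5000):
    err = (a * x + b) - y
    # Derivatives of the sum of squared errors with respect to a and b
    dloss_da = 2.0 * np.sum(err * x)
    dloss_db = 2.0 * np.sum(err)
    # Step in the direction opposite each derivative
    a -= lr * dloss_da
    b -= lr * dloss_db

print(a, b)
```

Try changing the learning rate: a much larger value makes the updates overshoot and diverge, while a much smaller one makes convergence slow.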
Extending to Neural Networks
The power of this optimization technique is that we can initialize our linear model's parameters randomly and, through iterations of gradient descent, arrive at an optimal solution. The key takeaway here is that the same process we use to estimate parameters for something as simple as a linear function can also be used to estimate parameters for something as complicated as a neural network, with some caveats. In linear regression, we are typically able to arrive at the best solution. Due to the complexity of neural networks, which have millions of parameters, this is almost never the case. More often than not, we will never find the global minimum of the loss function. This is why we do not simply run gradient descent on neural networks until some termination condition is satisfied, as we would with linear regression. Rather, we run gradient descent until we are satisfied with the results.
Stochastic Gradient Descent
Another key distinction is that we have so far only talked about "standard" gradient descent. In standard gradient descent, the loss function is evaluated over all the training data available. Unfortunately, due to computational limits this is impractical with neural networks. We therefore use what is known as stochastic gradient descent. This works by taking a random sample, or "mini-batch", of our data, producing predictions on that mini-batch, and using the true values to evaluate the loss function. We then update our weights as usual, move to the next mini-batch, and repeat the process. The "stochastic" element of this process is the randomness introduced by evaluating the loss function on different mini-batches, as opposed to the entire training set. For each mini-batch, the loss function will be slightly different, and it will also be different from the loss function on the entire training set. However, it turns out that this doesn't matter much! The magic of stochastic gradient descent is that you can update the weights on random mini-batches of the training set, and because each mini-batch gives a noisy but unbiased estimate of the full gradient, in practice you end up with results comparable to updating on the true loss function across the whole training set.
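One way to see why mini-batches work is to compare their gradients with the gradient on the full data set. The NumPy sketch below uses a toy linear model and mean squared error; the data, batch size, and helper name `mse_gradient` are assumptions for illustration. Each mini-batch gradient differs from the full gradient, but with equal-sized batches their average matches it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 2))
y = x @ np.array([2.0, 3.0]) + 1.0
w = np.zeros(2)
b = 0.0


def mse_gradient(xb, yb, w, b):
    """Gradient of the mean squared error on one batch with respect to w and b."""
    err = xb @ w + b - yb
    return 2.0 * xb.T @ err / len(xb), 2.0 * err.mean()


full_gw, full_gb = mse_gradient(x, y, w, b)

# Split a shuffled copy of the data into 8 mini-batches of 4 samples each.
order = rng.permutation(len(x))
batches = np.split(order, 8)
batch_grads = [mse_gradient(x[idx], y[idx], w, b) for idx in batches]

mean_gw = np.mean([g[0] for g in batch_grads], axis=0)
mean_gb = np.mean([g[1] for g in batch_grads])
print(full_gw, mean_gw)
```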
As an example, we will show how keras implements gradient descent, and we'll use it in the context of linear regression.
import numpy as np
from numpy.random import random

x = random((30,2))
y = np.dot(x, [2., 3.]) + 1.
x[:5]
All we've done here is create values y that are linearly related to x1, x2 through the relationship y = 2*x1 + 3*x2 + 1. In keras, a simple linear model is known as a Dense layer. Passing our inputs and expected outputs x and y, Keras will initialize some form of random weights. We will tell it to optimize using SGD with a learning rate of 0.1, to minimize loss function mean squared error (mse):
lm = Sequential([ Dense(1, input_shape=(2,)) ])
lm.compile(optimizer=SGD(lr=0.1), loss='mse')
We can see how far off the initialized weights are by evaluating our loss function.
lm.evaluate(x, y, verbose=0)
Next, we will run stochastic gradient descent. The fit function, as called below, will at each iteration of sgd use one input/output pair in the training set to evaluate the loss function and update the parameters. A single pass through the entire training set is called an epoch. Below we run through 5 epochs.
lm.fit(x, y, nb_epoch=5, batch_size=1)
Let's check our evaluation function:
lm.evaluate(x, y, verbose=0)
Which is less than it was before, as expected!
We can also take a look at the weights after fitting. We would expect them to be very close to the true parameters (2.0, 3.0, and 1.0).
And indeed they are. Had we used larger batch sizes than 1, we would anticipate our weights to converge to the true weights much faster.
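To demystify what happens during fitting, here is a plain NumPy sketch of stochastic gradient descent with a batch size of 1 on the same kind of data. It is our own illustration of the idea, not how Keras implements it; the function names, the starting weights, and the choice of 200 epochs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((30, 2))
y = x @ np.array([2.0, 3.0]) + 1.0


def mse(w, b):
    """Mean squared error of the linear model on the whole data set."""
    return float(np.mean((x @ w + b - y) ** 2))


def sgd_fit(w, b, lr=0.1, epochs=5):
    """Plain SGD with a batch size of 1: one update per training sample."""
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            err = x[i] @ w + b - y[i]
            w = w - lr * 2.0 * err * x[i]
            b = b - lr * 2.0 * err
    return w, b


w0, b0 = rng.normal(scale=0.1, size=2), 0.0
loss_before = mse(w0, b0)
w, b = sgd_fit(w0, b0, epochs=200)
print(loss_before, mse(w, b), w, b)
```

After enough epochs the weights approach 2.0 and 3.0 and the bias approaches 1.0, mirroring what we saw with the Keras model.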
Cats vs Dogs and Finetuning
We now know enough to understand how to modify Vgg16 to create a model that will output predictions for Cats and Dogs. Follow along in the lesson 2 notebook to be able to reproduce what we're going to do.
Adding a Dense Layer
The dense layer we used in the last section mapped input vectors to a single output. We can easily change this to output to a vector of arbitrary length, noting that the structure of the weights for such an output will be just a matrix like we talked about earlier.
The last layer of Vgg16 outputs a vector of 1000 categories, because that is the number of categories the ImageNet competition asked for. Of these categories, some of them certainly correspond to cats and dogs, but at a much more granular level (specific breeds). We could manually figure out which of these categories are cats and which are dogs, and simply write some code that will translate the imagenet classifications into a cat and dog classification. But that would be inefficient, and we would miss some key information.
A better approach would be to simply add a Dense layer on top of the imagenet layer, and train the model to map the imagenet classifications of input images of cats and dogs to cat and dog labels. Why is this better than manually doing it? Because the neural network will utilize all information available to it from the imagenet classifications, as opposed to simply mapping cat categories to cat and dog categories to dog. For example, a picture of a German Shepherd with a bone would likely have strong probabilities in the German Shepherd category and the bone category. If we only map the dog categories to dog, and throw out the rest, then we lose this other information useful in classification, such as whether or not a bone is in the image.
Our overall approach here will be:
- Get the true labels for every image
- Get the 1,000 imagenet category predictions for every image
- Feed these predictions as input to a simple linear model.
The lesson 2 notebook shows us again how to grab our batches. One important point to note is that the labels we get from our batches need to be one-hot encoded. One-hot encoding simply takes categorical variables and converts them into a matrix where each column represents a category. If an image belongs to category A, that image's row in the matrix has a 1 in the category A column and 0's everywhere else. One important reason we convert our labels to these vectors is that this is the same shape as the output of our dense layer. Therefore for training purposes we need to one-hot encode them. Now we can predict.
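One-hot encoding is easy to write by hand, which makes it clearer what the labels look like. The sketch below is a minimal NumPy version with a hypothetical helper name `one_hot`; we assume class 0 is cat and class 1 is dog.

```python
import numpy as np


def one_hot(class_indices, n_classes):
    """Return a (n_samples, n_classes) matrix with a single 1 in each row."""
    class_indices = np.asarray(class_indices)
    encoded = np.zeros((len(class_indices), n_classes))
    encoded[np.arange(len(class_indices)), class_indices] = 1.0
    return encoded


# Assumed class indices: 0 = cat, 1 = dog
print(one_hot([0, 1, 1, 0], 2))
```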
trn_features = model.predict(trn_data, batch_size=batch_size)
val_features = model.predict(val_data, batch_size=batch_size)
If we pass our cats vs. dogs training/validation to our model using predict, we'll end up with the 1000 imagenet classification probabilities. Next, we can make the following model:
# 1000 inputs, since that's the saved features, and 2 outputs, for dog and cat
lm = Sequential([ Dense(2, activation='softmax', input_shape=(1000,)) ])
lm.compile(optimizer=RMSprop(lr=0.1), loss='categorical_crossentropy', metrics=['accuracy'])
This will take our 1000 imagenet probabilities, and map them to two outputs, one for a cat and one for a dog. Upon initialization, the model has no idea how to do this. However, using our one-hot encoded labels, we can train this layer using our imagenet predictions as input. Now all we have to do is fit as we did before.
batch_size=64
lm.fit(trn_features, trn_labels, nb_epoch=3, batch_size=batch_size, validation_data=(val_features, val_labels))
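Under the hood, a dense layer with a softmax activation is just a matrix product followed by a normalization that turns each row into probabilities. The NumPy sketch below shows that forward computation only; the random `features`, `weights`, and `bias` are stand-ins for illustration, not the trained values.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
features = rng.random((5, 1000))                  # stand-in for 5 rows of imagenet probabilities
weights = rng.normal(scale=0.01, size=(1000, 2))  # 1000 inputs -> 2 outputs
bias = np.zeros(2)

probs = softmax(features @ weights + bias)        # shape (5, 2): one column per class
print(probs)
```

Fitting the layer means adjusting `weights` and `bias` so that these two columns line up with the one-hot encoded labels.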
Easy! This method of simply attaching dense layers to pre-trained models to classify desired categories is often how many deep learning amateurs achieve astonishing results, such as classifying skin lesions. It's important to note how simple this was; we really didn't do anything magical with some fancy library, we simply trained a linear model from a pre-trained model's output. By doing just that, we've achieved a classification accuracy greater than 97%.
Can we do even better than this? The answer is yes, and it involves finetuning. Before we get to that, however, we need to introduce one last piece to our understanding of neural networks: the activation layers.
When we described a neural network earlier, we explained it as a series of matrix multiplications, which we called layers, that transform an input vector. We called the intermediate output of each layer its activations. However, upon reflection, this seems strange. If a neural network were simply a sequence of matrix multiplications, then the entire process would be linear and could be represented by a single matrix.
This is of course not the case. The reason we brought attention to the intermediate vectors before is because in a neural network, there are actually nonlinear functions in these activation layers that operate on the output of the previous layer. This new vector is then fed into the next matrix. Some commonly used activation functions are tanh, the sigmoid function, and relu (short for rectified linear unit). Relu is in fact the most common, and while it sounds like some arcane function, it is simply the function max(0,x). Observe now that every layer in Vgg16 has activation functions, which simply tells keras how to transform the output of a particular linear layer.
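Both points can be checked with a few lines of NumPy. In this sketch (toy sizes and random weights, chosen only for illustration) two layers without an activation collapse into a single matrix, while inserting relu between them breaks that equivalence in general. The `relu` and `sigmoid` helpers are our own one-line definitions.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=(4, 2))

# Without an activation, two layers collapse into the single matrix w1 @ w2.
linear_out = (x @ w1) @ w2
collapsed_out = x @ (w1 @ w2)

# With relu between the layers, the network is no longer a single matrix product.
nonlinear_out = relu(x @ w1) @ w2

print(np.allclose(linear_out, collapsed_out))
print(relu(np.array([-2.0, 0.0, 3.0])), np.tanh(0.0), sigmoid(0.0))
```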
It turns out that this combination of linear transformations and non-linear activations is capable of approximating any continuous function to arbitrary accuracy, given enough hidden units.
If we observe the last layer of Vgg16, we can see that the last layer is simply a dense layer that outputs 1000 elements, which is as we'd expect. Therefore, it seems somewhat unreasonable to stack a dense layer meant to find cats and dogs on top of one that's meant to find imagenet categories, in that we're limiting the information available to us by first coercing the neural network to classify to imagenet before cats and dogs.
Instead, let's remove that last layer.
model.pop()
for layer in model.layers:
    layer.trainable = False
And now let's add on a new layer for cats and dogs.
NB: The above 3 steps are all that the vgg16's finetune() method does!
We're now able to use all 4096 activations (whatever they may be!) from the previous layer to classify to cats and dogs. This implicitly should provide us with richer information with which to classify.
The next steps are to simply train as we've done before, and if you do this you'll find that your model will produce better results than if you had simply added another layer on top.