Watch the video embedded here
Some (not particularly complete) notes on the CNN portion of the lecture.
2) He does what he calls "correlating" or "convolving". This means that as the filter slides across the image, at each position he multiplies the overlapping pixel values element-wise and sums the results into a single output value.
Here's the image to be convolved; specifically, looking only at the upper left edge of an image of a number '7':
The linear transformation we get as a result of this process looks like this:
Quote from the video, "It has highlighted the top edges," which we can see clearly. This is a major part of a CNN -- the transformation of the image to bring out recognizable features for the network to learn.
Continuing, here's how the Sobel filter transformed the whole image of the number '7' -- you can see it clearly (27:41 on the video). The whole image:
got turned into:
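The slide-multiply-sum process above can be sketched in a few lines of NumPy. This is a minimal sketch, not the lecture's actual code; the exact Sobel values and the toy image are my assumptions:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image`; at each position, sum the
    element-wise products ('valid' region only, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# A Sobel-style filter that responds to horizontal edges.
sobel_top = np.array([[ 1,  2,  1],
                      [ 0,  0,  0],
                      [-1, -2, -1]])

# Toy 5x5 "image": a single bright horizontal bar on a dark background.
img = np.zeros((5, 5))
img[1, :] = 1.0

edges = cross_correlate2d(img, sobel_top)  # strong response along the bar
```

Strictly speaking this is cross-correlation (no kernel flip), which is why the lecturer hedges between "correlating" and "convolving" -- deep learning libraries use the same shortcut.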
3) Next, he rotates his matrix (or filter, whatever you want to call it) 90 degrees each time, thus creating a total of 4 'edge filters.' These are to be used to find edges on the top, left, bottom, right:
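The four rotations can be generated mechanically with `np.rot90`. A sketch, assuming a Sobel-style top-edge kernel as the starting point (the exact matrix in the video may differ):

```python
import numpy as np

# One edge filter to start from (responds to top edges).
top = np.array([[ 1,  2,  1],
                [ 0,  0,  0],
                [-1, -2, -1]])

# np.rot90 rotates 90 degrees counter-clockwise each time, giving
# the four edge filters: top, left, bottom, right.
edge_filters = [np.rot90(top, k) for k in range(4)]
```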
4) (29:25) Then he creates a 'diagonal filter', which I believe is a totally different matrix from the Sobel filter (the output of the convolution that emphasizes diagonals is harder to recognize, at least to me).
5) (30:09) Now, as a final step in preprocessing, he uses a technique called Max Pooling to down-sample each of the eight transformed images of the '7' he's created.
To do Max Pooling, he breaks each filtered image into four equal-sized squares, and replaces each square with the value of its single brightest pixel. This makes more sense when you look at these screenshots:
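A generic max-pooling sketch: the window size is a parameter (in the lecture the windows are quite large -- the four quadrants of the image -- but 2x2 is the more common choice, and the mechanics are the same):

```python
import numpy as np

def max_pool(image, size=2):
    """Non-overlapping max pooling: split the image into size x size
    windows and keep only the brightest pixel from each window."""
    h, w = image.shape
    h -= h % size  # trim edges that don't fill a full window
    w -= w % size
    windows = image[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

img = np.array([[1, 2, 0, 0],
                [3, 4, 0, 1],
                [0, 0, 5, 0],
                [1, 0, 0, 6]])

out = max_pool(img)   # 4x4 down-sampled to 2x2
```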
6) Now, pretend that we did all the previous steps with the number '8' instead of '7'. ('7' was nice to show edges and max pooling, but we want to look at '8' now for the next step). And also, say we did this for the number '1' as well.
7) (31:57) At this point we have many different images of the number '8', all having been convolved and pooled, and we want to put them all together in some way. He aggregates all the pooled images (grouped by filter type) and takes an average for each pixel value in each group. In other words, he takes all the pooled '8's and from them creates an 'average' (which he also calls 'ideal') representation of the number 8. Here's what that looks like:
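The averaging step is just a per-pixel mean over the stack of pooled images for one filter type. A sketch with hypothetical stand-in data (the shapes and values are placeholders, not from the video):

```python
import numpy as np

# Stand-in data: a stack of pooled '8' images for one filter type,
# shaped (n_images, height, width).
pooled_eights = np.stack([np.full((4, 4), v) for v in (0.0, 0.5, 1.0)])

# The 'ideal' 8 for this filter type is the per-pixel mean of the stack.
ideal_eight = pooled_eights.mean(axis=0)   # every pixel is 0.5 here
```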
(32:38) Then he does this same process for the '1's.
8) Next he goes back to the raw images, where no transformation has been done. He takes a raw number image, in this case one from the set of '8's. Since we have already built the set of ideal representations for the number '8', AND we have done the same for the number '1', we can check which set of ideals the raw image most resembles. Hopefully the raw '8' looks more similar to the '8' ideals than to the '1' ideals. We can measure this using a simple sum of squared differences between the raw image and each ideal image.
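The comparison itself is just a nearest-ideal classification under squared error. A sketch with hypothetical stand-in arrays (the real ideals would come from step 7):

```python
import numpy as np

def sum_sq_error(a, b):
    """Pixel-wise sum of squared differences between two images."""
    return float(np.sum((a - b) ** 2))

# Hypothetical ideals, one per digit class (stand-in values).
ideal_8 = np.full((4, 4), 0.8)
ideal_1 = np.full((4, 4), 0.1)

raw = np.full((4, 4), 0.7)   # a stand-in raw '8'

# Classify by whichever ideal gives the smaller error.
errors = {8: sum_sq_error(raw, ideal_8), 1: sum_sq_error(raw, ideal_1)}
prediction = min(errors, key=errors.get)   # -> 8 here
```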
9) Then he applies this error-measuring process to all of the raw '8' images available, and creates a confusion matrix out of it. Looks like we did pretty well in only a few lines of code.
(46:50) The section ends with the question, "How do we make it better?"
- Improve features by choosing better 3x3 matrices/filters
- Find corners by letting the output of a completed filter become the input to another filter (aka, add more layers to your network. Save the max pooling for the end, though.)
- Not every filter is equally important
- Simple averaging isn't the best way to combine them