# Lesson 5 Outline

From Deep Learning Course Wiki

## Intro:

### Explaining how Jeremy's model scored (0.98944):

- The notebook containing this model is called `dogscats-ensemble`.
- This score was reached using nearly all the same techniques we have covered.
- The full dataset was not used, there was no pseudo-labeling, and the data-augmentation approach was not the best.
- The important step here was adding Batch Normalization to the VGG model.
- Batch Normalization reduces overfitting and increases training speed by 10x or more.
- Since Batch Norm helps reduce overfitting, we can use less Dropout.
- Dropout can destroy information passing between the layers.
- Why hadn't Batch Norm been added to VGG already?
- Ans: Because it didn't exist at the time.
- Normalizing uses the mean and variance: subtract the mean and divide by the standard deviation.

### Batch Normalization:

- It normalizes each layer's activations, the same way we normalize the data at the input layer.
- Batch Norm adds some extra steps to make this work with SGD (on the activations):
- It adds 2 more trainable parameters to each layer:
- One multiplies the activations, to set an arbitrary standard deviation.
- The other is added to the activations, to set an arbitrary mean.
- BN (Batch Norm) doesn't change all the weights, only these two parameters acting on the activations, which makes it more stable in practice.

- With Batch Norm added, we can increase the learning rate, since BN keeps the activations from going really high or really low.
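The two extra parameters can be sketched in numpy. This is an illustrative forward pass only (the notebook relies on Keras's `BatchNormalization` layer), with `gamma` and `beta` standing in for the trainable scale and shift:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift.

    gamma and beta are the two trainable parameters: gamma sets an
    arbitrary standard deviation, beta sets an arbitrary mean.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0` this just standardizes each activation over the mini-batch; during training the network learns whatever scale and shift work best.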

## Collaborative Filtering:

- Previously, on the Excel sheet, we showed how to compute a dot product plus bias over user reviews and movie ids, and fit the recommendations by gradient descent using Excel's Solver.
- We transformed the MovieLens dataset into this shape:
- Movies are given as columns.
- Each row is a user id, with that user's reviews given with respect to the movies.

- The Lesson 4 notebook shows how we created the bias terms with the embedding layers:
- user_in
- movie_in

- This model gives us a decent result after adding a little regularization.
- Loss: 0.7979
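The dot-product-plus-bias model from the Excel sheet can be sketched in a few lines of numpy. The factor and bias values here are random placeholders (in the notebook they are learned by gradient descent):

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_factors = 5, 4, 3

U = rng.normal(scale=0.1, size=(n_users, n_factors))   # user latent factors
M = rng.normal(scale=0.1, size=(n_movies, n_factors))  # movie latent factors
user_bias = np.zeros(n_users)
movie_bias = np.zeros(n_movies)

def predict_rating(user, movie):
    # dot product of the two factor vectors, plus both bias terms
    return U[user] @ M[movie] + user_bias[user] + movie_bias[movie]
```

Training adjusts `U`, `M`, and the biases to minimize the squared error between these predictions and the actual ratings.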

### What is an embedding?:

- Brief: it's a computational shortcut for a one-hot encoding followed by a matrix product.
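A quick numpy check of that claim: multiplying a one-hot vector by a weight matrix gives the same answer as simply looking up one row of the matrix, just with a lot of wasted work:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(10, 4)  # embedding matrix: 10 ids, 4 latent factors

idx = 3
one_hot = np.zeros(10)
one_hot[idx] = 1

# One-hot encoding followed by a matrix product...
via_matmul = one_hot @ W
# ...is just an array lookup (what an Embedding layer actually does)
via_lookup = W[idx]
```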

### Compare & Contrast Between the NN model and the other model built:

- This time we take the bias terms for the top 2000 movies.
- Then we pick components from the latent factors.
- What are latent factors?
- Ans: latent factors can be thought of as 'hidden' features, to distinguish them from observed features.

- What about the components (PCA)?
- PCA looks through a matrix for columns we can add together, because they seem to point in the same direction.
- In this case, we use sklearn's implementation to pick 3 components from the 50 latent factors.

- The important thing to understand about PCA is that it lets us squish the 50 columns down to just 3 while still capturing the information.
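A sketch of that squishing step. The notebook uses sklearn's `PCA`; to keep this self-contained, here is the same idea via numpy's SVD, on random stand-in data shaped like the movie latent-factor matrix (2000 movies x 50 factors, down to 3 components):

```python
import numpy as np

def pca_components(x, n_components):
    """Project the rows of x onto its top principal components."""
    xc = x - x.mean(axis=0)                        # centre each column
    U, S, Vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ Vt[:n_components].T                # (rows, n_components)

# Stand-in for the movie latent-factor matrix
latent = np.random.RandomState(0).randn(2000, 50)
components = pca_components(latent, 3)             # squish 50 cols down to 3
```

Each of the 3 output columns is the combination of original columns that captures the most remaining variance, in decreasing order.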

- Here is an example of Component 1 from our PCA:
- Figure: top 10 movies for Component 1

- Figure: bottom 10 movies for Component 1

- Figure: PCA visualized

### Keras Functional API:

- We are shifting away from building sequential models, since our problem fits a different structure.

- We flatten the output of the Embedding layer to fit into the model, which we build using the functional API.

- Why not use only 3 latent factors, rather than 50?
- Ans: our model wouldn't be as accurate. We need the extra information to make compact, useful components for the recommendations.
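What the functional-API model computes can be sketched as a plain forward pass in numpy. All sizes and weights here are illustrative stand-ins, not the trained model: look up the two embeddings, flatten and concatenate them, then run them through a dense hidden layer to a single output:

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_factors = 5, 4, 50

user_emb = rng.randn(n_users, n_factors)    # Embedding for user_in
movie_emb = rng.randn(n_movies, n_factors)  # Embedding for movie_in
W1 = rng.randn(2 * n_factors, 70) * 0.1     # dense hidden layer weights
W2 = rng.randn(70, 1) * 0.1                 # output layer weights

def nn_predict(user, movie):
    x = np.concatenate([user_emb[user], movie_emb[movie]])  # lookup + flatten
    h = np.maximum(x @ W1, 0)                               # hidden layer, ReLU
    return (h @ W2).item()                                  # predicted rating
```

The functional API exists precisely because this graph has two inputs that merge partway through, which a `Sequential` model cannot express.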

## Natural Language Processing:

- The goal of this section was not mainly to do collaborative filtering, but to use it as a way to learn about embeddings, so that we can learn to do NLP.
- The goal for fast.ai and this course is to introduce the shape and structure of these problems so they can be applied elsewhere. Collaborative filtering is a powerful approach in industry, but NLP can go much further.
- For example, by reading lots of journals and patient medical-history data, a fantastic diagnostic tool could be created.

### Sentiment Analysis:

- Sentiment analysis means taking a piece of text (a sentence, a paragraph, or even a word) and determining whether it expresses negative or positive sentiment.
- We use a dataset that has already been widely tested and researched: the IMDB movie review dataset.
- Here is a paper about the IMDB dataset: Maas, Ng, et al., "Learning Word Vectors for Sentiment Analysis".
- Note: the IMDB dataset is given to us as word indexes, meaning each word has been replaced by an id representing its index. These indexes are ordered by how frequently the words appear in the corpus.

- Figure: IMDB vectors

- Our task is to take these reviews and determine whether they are positive or negative.
- Labels accompany the reviews: 0 and 1 indicate negative and positive sentiment respectively.

- We change a few things to make our data simpler:
- Truncate the vocabulary to 5,000 words.
- The reviews vary in length, so we make them a similar length: 500 words each.
- We need a rectangular matrix to pass to our model. We use the pad_sequences tool from Keras, which adds zeros to short reviews and truncates longer ones to that length.
- The shape of our dataset will be:
- 25000 rows
- 500 columns
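A minimal stand-in for Keras's `pad_sequences` showing the idea. Keras's default is to pad and truncate at the front; the real utility has more options:

```python
def pad_sequences(seqs, maxlen, value=0):
    """Pad/truncate each sequence of word indexes to exactly maxlen.

    Short sequences get `value` prepended; long ones keep their last
    maxlen items (front-padding/truncation, as in Keras's default).
    """
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]
        out.append([value] * (maxlen - len(s)) + s)
    return out
```

Applied to the 25,000 reviews with `maxlen=500`, this yields the rectangular 25000 x 500 matrix described above.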

- Here we build a simple model from the data we pre-processed:
- Figure of the model: (single hidden layer NN)

- Figure of the conv net model with 1 dimension: (single conv layer with max pooling)
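To make the 1-dimensional conv net concrete, here is an illustrative numpy sketch of one conv layer (filters sliding along the word-embedding sequence, with ReLU) followed by max pooling. The shapes and filter values are made up for the example:

```python
import numpy as np

def conv1d_maxpool(x, filters, pool=2):
    """1-D convolution over a sequence of embeddings, ReLU, then max pool.

    x: (seq_len, emb_dim) word embeddings for one review.
    filters: (n_filters, width, emb_dim) convolutional filters.
    """
    n_filters, width, _ = filters.shape
    n_pos = x.shape[0] - width + 1
    out = np.zeros((n_pos, n_filters))
    for i in range(n_pos):
        window = x[i:i + width]  # (width, emb_dim) slice of the sequence
        out[i] = np.maximum((filters * window).sum(axis=(1, 2)), 0)  # conv + ReLU
    n = (n_pos // pool) * pool   # drop the tail that doesn't fill a pool window
    return out[:n].reshape(-1, pool, n_filters).max(axis=1)
```

The convolution is 1-dimensional because the filters only slide along the sequence axis; each filter always spans the full embedding dimension.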

### Pre-Trained Word Embedding:

- What if we could find pre-trained networks, or in our case pre-trained embeddings, for text?
- Something similar to the ImageNet weights that come with the VGG models, which we can build on top of?
- Ans: Yes, such things exist, for example the GloVe embeddings used below.

- Here is a great tutorial on pre-trained word embeddings.

- The goal with pre-trained embeddings is to collect large bodies of text, then capture the structure of that data.
- This is a step towards unsupervised learning.

- We will use t-SNE, a dimensionality-reduction algorithm, to reduce the number of dimensions and let us visualize our words: it takes, say, 50 dimensions down to 2, placing similar words close together.

- It's very important to mention that most of the time we will be using pre-trained word embeddings for NLP.
- There are tools provided to turn the GloVe pre-trained embeddings into the same shape we worked with in the previous steps.

## Recurrent neural network (RNN):

- We need RNNs so we can deal with the following:
- Variable Length.
- Long-term Dependency.
- Stateful Representations.
- Memory.

- Great Blog Post on RNNs: The Unreasonable Effectiveness of Recurrent Neural Networks

### How To Think Of Neural Networks as Computational Graphs:

- Figure: basic NN with a single hidden layer

- Figure: conv net

- Figure: capture state, after merge

- What if, instead of applying the above merge steps one after another, we do them all in one recurring layer?
- This is called an RNN:

- Since all of the merge layers before the RNN shared the same weight matrices, it is easy to fold them into a recurrent layer.
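The shared-weights idea is the whole trick. Here is a numpy sketch of a recurrent layer, where the same `W_x`, `W_h`, and `b` are reused at every timestep (illustrative only, with random stand-in weights):

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a simple recurrent layer over a sequence of input vectors.

    The same weight matrices are applied at every step; the hidden
    state h carries memory forward through the sequence.
    """
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)  # merge input with prior state
        hs.append(h)
    return np.stack(hs)
```

Because the weights are shared across timesteps, the layer handles sequences of any length with a fixed number of parameters, which is exactly what the variable-length and memory requirements above call for.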