Deep Learning Glossary

From Deep Learning Course Wiki


Supervised Learning

Training a model when we already know what the output should look like, then using that model to predict outputs for new data points. There are two types of supervised learning problems: classification and regression. In classification we map inputs into buckets or discrete categories (e.g. Is the sign red, blue, or yellow?). In regression we predict the value of an observation within a continuous range of outputs (e.g. What will the stock price be tomorrow?).

Unsupervised Learning

Given a dataset, a program automatically finds patterns and relationships in the unlabeled data (e.g. clustering emails by topic with no prior knowledge).

Deep Learning

Using neural network architectures with multiple hidden layers of neurons to build predictive models.



Reinforcement Learning

Learning the behaviors that maximize a reward through trial and error. Liping Yang provides a great example:

Think e.g. Mario Bros. video game; reinforcement learning would, by trial and error, determine that certain movements and button pushes would advance the player's standing in the game. - Liping Yang


Neural Networks

Neural networks form the foundation of deep learning. I can't describe this better than Liping Yang:

Neural networks are made up of numerous interconnected conceptualized artificial neurons, which pass data between themselves, and which have associated weights which are tuned based upon the network's "experience." Neurons have activation thresholds which, if met by a combination of their associated weights and data passed to them, are fired; combinations of fired neurons result in "learning." - Liping Yang




Classification

Predict a categorical response (e.g. yes or no? blue, green, or red?)


Regression

Predict a continuous response (e.g. price, sales).


Clustering

Unsupervised grouping of data into baskets.




Model

A structure that stores a generalized, internal representation of a dataset for description or prediction. When you train a machine learning algorithm on a dataset, the output is a model.


Algorithm

A method, or series of instructions, devised to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

Model vs Algorithm

  • Model - The representation learned from a dataset and used to make future predictions.
  • Algorithm - The process for learning it.
Model = Algorithm(TrainingData)
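
To make the equation concrete, here is a toy sketch (the "algorithm" and data are made up for illustration): the algorithm is a function that consumes training data and returns a model, and the model is the artifact you later call to make predictions.

```python
# A minimal sketch of Model = Algorithm(TrainingData): the algorithm is a
# function; the model it returns is a reusable artifact that makes predictions.

def mean_predictor_algorithm(training_data):
    """A trivially simple 'algorithm': learn the mean of the training labels."""
    labels = [y for _, y in training_data]
    learned_mean = sum(labels) / len(labels)

    def model(x):
        # The 'model' ignores its input and always predicts the learned mean.
        return learned_mean

    return model

training_data = [(1, 2.0), (2, 4.0), (3, 6.0)]
model = mean_predictor_algorithm(training_data)  # Model = Algorithm(TrainingData)
prediction = model(10)
```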


Attribute

A quality describing an observation (e.g. color, size, weight). In Excel we call these column headers.

Two Types of Attributes:

  • Categorical - Categories! Either ordinal (order matters, high-to-low) or nominal (no ordering between values, e.g. color or gender)
  • Continuous - A continuous scale of numbers (e.g. sales predictions, lifespan)


Feature

A feature combines an attribute with a value (e.g. Color is an attribute; "Color is blue" is a feature). In Excel we call this a cell.

Feature Vector

A list of features describing an instance or observation. It is another way to describe an observation with multiple attributes. In Excel we call this a row.


Instance

A data point, row, or observation containing feature(s).


Observation

A data point, row, or instance containing feature(s).


Dimension

An attribute, or several attributes that together describe a property. For example, a geographical dimension might consist of three attributes: country, state, city. A time dimension might include five attributes: year, month, day, hour, minute.

Data Cleaning

Improving the quality of the data by modifying its form or content, for example by filling in NULL values or fixing values that are incorrect.


Classifier

A mapping from unlabeled instances to discrete categories (e.g. Dog or Cat? Blue, Red, or Green?). Some classifiers also provide probability estimates or scores.


Accuracy

Accuracy describes the percentage of correct predictions made by the model: Accuracy = # correct / total predictions. Error Rate is its complement, the percentage of incorrect predictions (1 - Accuracy).

Null Accuracy

Accuracy that could be achieved by always predicting the most frequent class in the dataset. A useful baseline to compare your model against.
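
A quick sketch of both metrics on made-up labels: accuracy counts correct predictions, while null accuracy is the baseline you would get by always guessing the most frequent class.

```python
# Accuracy and null accuracy on a toy set of labels (illustrative values only).
y_true = ["B", "B", "B", "A", "B", "A", "B", "B", "A", "B"]
y_pred = ["B", "B", "A", "A", "B", "A", "B", "B", "B", "B"]

# Accuracy = # correct / total predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Null accuracy: always predict the most frequent class ("B" here).
most_frequent = max(set(y_true), key=y_true.count)
null_accuracy = y_true.count(most_frequent) / len(y_true)
```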


Specificity

In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. "No"). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric by predicting everything as negative.

S = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}


Precision

In the context of binary classification (Yes/No), precision measures the model's performance at classifying positive observations (i.e. "Yes"). In other words, when a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the single observation we are most confident in.

P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}


Recall

Also called sensitivity. In the context of binary classification (Yes/No), recall measures how "sensitive" the classifier is at detecting positive instances. In other words, for all the truly positive observations in our sample, how many did we "catch"? We could game this metric by always classifying observations as positive.

R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

Precision vs Recall

Say we are looking at brain scans and trying to predict whether a person has a tumor (True) or not (False). We feed the scans into our model and it starts making predictions.

  • Precision is the % of "True" guesses that were actually correct. If we guess that 1 image out of 100 is True, and that image really is True, then our precision is 100%! Our results aren't very helpful, however, because we missed the other 9 brain tumors. We were super precise when we tried, but we didn't try hard enough.
  • Recall, or "Sensitivity", provides another lens with which to view how good our model is. Again, say there are 100 images, 10 with brain tumors, and we correctly guessed that 1 had a brain tumor. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors!
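
The tumor example worked in code, with counts following the scenario above (100 scans, 10 real tumors, one correct positive guess):

```python
# The tumor example in numbers: 100 scans, 10 real tumors, and a model that
# flags exactly one scan, correctly.
true_positives = 1   # flagged and really a tumor
false_positives = 0  # flagged but actually healthy
false_negatives = 9  # real tumors the model missed
true_negatives = 90  # healthy scans correctly left alone

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
```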

Type 1 Errors

False Positives. E.g. a company optimizes hiring practices to reduce false positives for job offers. In this case, the candidate seemed good and we hired him, but he was actually bad.

Type 2 Errors

False Negatives. E.g. the candidate was GREAT but we passed on him.

Classification Threshold

The lowest probability value at which you're comfortable asserting a positive classification (e.g. If our predicted probability of being diabetic is > 50%, return True, otherwise return False).
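
A minimal sketch of applying a threshold; the probabilities below are made up, and 0.5 matches the diabetes example above:

```python
# Applying a classification threshold of 0.5 to predicted probabilities:
# return True only when the predicted probability exceeds the threshold.
threshold = 0.5
predicted_probabilities = [0.91, 0.40, 0.62, 0.50, 0.13]
predictions = [p > threshold for p in predicted_probabilities]
```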

Confusion Matrix

Table that describes the performance of a classification model.


  • True Positives: we correctly predicted that they do have diabetes
  • True Negatives: we correctly predicted that they don't have diabetes
  • False Positives: we incorrectly predicted that they do have diabetes (Type I error)
  • False Negatives: we incorrectly predicted that they don't have diabetes (Type II error)
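
The four cells of the confusion matrix can be counted directly from the true and predicted labels; the labels below are made up for illustration:

```python
# Building the 2x2 confusion matrix counts from true/predicted labels
# (1 = has diabetes, 0 = does not; the labels are illustrative only).
y_true = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type I errors
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type II errors
```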

Linear Regression

A technique for modeling a continuous output as a linear (constant-slope) function of the inputs. At its most basic, it takes the form of:

y = ax + b
  • Pros: fast, no tuning required, highly interpretable, well-understood
  • Cons: unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)


A more complex linear equation might look like this:

y = B_0 + B_1 x + B_2 z + B_3 j + B_4 k

A linear regression model would try to "learn" the correct values for B_0, B_1, B_2, and so on. The independent variables x, z, j, and k represent the various attributes of each observation in our sample. For sales predictions, these attributes might include: day of the week, employee count, inventory levels, and store location.

y = B_0 + B_1 Day + B_2 Employees + B_3 Inventory + B_4 Location
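
A sketch of how the simple y = ax + b case can be fit with the closed-form least-squares solution. The toy data is noise-free, generated from y = 3x + 2, so the fit recovers the coefficients exactly:

```python
# Fitting y = a*x + b by ordinary least squares (single-feature closed form).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]  # exactly 3*x + 2, no noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = covariance(x, y) / variance(x); intercept makes the line pass
# through the point of means.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
```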


Evaluating Performance of Linear Regression

  • Mean Absolute Error (MAE) is the mean of the absolute errors among observations
  • Mean Squared Error (MSE) is the mean of the squared errors
  • Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors

Comparison of performance measures:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
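
The three metrics side by side, on made-up predictions:

```python
import math

# MAE, MSE, and RMSE on toy regression predictions (illustrative values only).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = sum(abs(e) for e in errors) / len(errors)    # average absolute error
mse = sum(e ** 2 for e in errors) / len(errors)    # punishes larger errors
rmse = math.sqrt(mse)                              # back in the units of y
```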

Universal Approximation Theorem

A stack of linear functions and non-linear activations can approximate any continuous function as closely as we want, so we can use neural networks to model almost anything.


Machine learning algorithms, such as Linear Regression, Random Forests, SVMs, and CNNs, are used to generate models and make predictions.


An algorithm that takes a sample of observations and produces a model that generalizes beyond those observations.


Induction

A "bottom-up" approach to answering questions or solving problems. A logic technique that goes from observations to theory. E.g. We keep observing X, so we infer that Y must be True.


Deduction

A "top-down" approach to answering questions or solving problems. A logic technique that starts with a theory and tests that theory with observations to derive a conclusion. E.g. We suspect X, but we need to test our hypothesis before coming to any conclusions.


Interpolation

Making predictions within the range of your dataset.


Extrapolation

Making predictions outside the range of your dataset. E.g. My dog barks, so all dogs must bark. In machine learning we often run into trouble when we extrapolate outside the range of our training data.

Feature Selection

Choosing the subset of attributes most relevant to your model's predictive power and discarding the rest.

Overfitting

Overfitting occurs when your model learns the training data too well and incorporates details and noise specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation set, but poorly on your test set (or new real-world data).


Underfitting

Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.

Bias and Variance

Bias and Variance provide a way to measure the predictive power of your "architecture." By architecture I mean everything you do before training your model: algorithm selection, data augmentation, hyperparameter tuning, feature selection, etc. In other words, how successful is your architecture at making predictions when the underlying model is changing due to changes in the training set? If your bias and variance scores are high, you should consider changing your architecture.


Bias

After generating multiple predictions for a single observation in a test set, each time training your model with different data:

  • What is the average difference between your predictions and the correct value for that observation?

Low vs High
Low bias could mean that every prediction is correct. It could also mean that half of your predictions are above their actual values and half are below, in equal proportion, resulting in low average difference. High bias (combined with low variance) suggests your model may be underfitting and that you're using the wrong architecture for the job.


Variance

After generating multiple predictions for a single observation in a test set, each time training your model with different data:

  • How tightly distributed are your predictions for that observation relative to each other?

Low vs High
Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration. High variance (combined with low bias) suggests your model may be overfitting and reading too deeply into the noise found in every training set.

Bias-Variance Tradeoff

  • Bias: What is the average difference between your predictions and the correct value for a particular observation?
  • Variance: How tightly packed are your predictions for a particular observation relative to each other?
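
The two definitions can be simulated directly: train the same simple "architecture" on many different noisy training sets and examine the predictions it makes for one fixed observation. The model here (predict the training mean) and the data are made up for illustration:

```python
import random

# Simulating bias and variance: repeatedly "train" a trivial model (predict
# the training-set mean) on fresh noisy samples of the same process, then
# look at the spread of its predictions for a single test observation.
random.seed(0)

TRUE_VALUE = 10.0  # the correct value for our test observation

predictions = []
for trial in range(200):
    # Each trial draws a new noisy training sample around the true value.
    training_sample = [TRUE_VALUE + random.gauss(0, 2) for _ in range(30)]
    model_prediction = sum(training_sample) / len(training_sample)
    predictions.append(model_prediction)

mean_prediction = sum(predictions) / len(predictions)
bias = mean_prediction - TRUE_VALUE  # average difference from the truth
variance = sum((p - mean_prediction) ** 2 for p in predictions) / len(predictions)
```

Because this model is unbiased for the mean, the measured bias is near zero, while the variance reflects how much the prediction jumps around as the training set changes.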


Training Set

A set of examples used for learning. The model takes the input parameters and iteratively learns how important each parameter is (the weights) for predicting the output. After each iteration (an adjustment of all the weights) it computes its error rate using a specified loss function and tries to reduce this value.

Validation Set

A validation set is used to avoid "contaminating" your model with information about the test set. It acts like a mini test set that provides feedback during training on how well the current weights generalize beyond the training set.

The validation set does not adjust the weights directly, but it impacts them indirectly by providing a stopping point: if training error keeps decreasing while validation error starts to increase, your model is overfitting (memorizing the training set) and you should stop training. Validation is typically run after every epoch.

You can also use the validation set to tune hyperparameters. If you tuned hyperparameters against the test set instead, you would introduce a dependency on the test set into training, which could mask overfitting when you start processing real-world data.
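
A sketch of the early-stopping logic this describes, with made-up per-epoch validation errors (in practice these would come from running the model on the validation set after each epoch):

```python
# Validation-based early stopping: stop when validation error has not
# improved for `patience` consecutive epochs, and keep the best epoch's
# weights. The error values below are illustrative only.

def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights we should keep."""
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < val_errors[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation error stopped improving: overfitting has begun
    return best_epoch

# Training error might keep falling, but validation error bottoms out here
# at epoch 3 and then rises again.
validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_at = early_stopping_epoch(validation_errors)
```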

Test Set

A set of examples used at the end of training and validation to assess the predictive power of your model (i.e. how generalizable is it to unseen data?)

Model Validation

The process of evaluating a trained model on held-out data (a validation or test set) to estimate how well it will generalize to unseen observations.

Learning Rate

A hyperparameter that controls the size of each weight update during training. A rate that is too high can cause training to diverge; a rate that is too low makes training slow and can leave it stuck in a poor solution.

Backpropagation

Backpropagation is an algorithm that allows us to compute the partial derivatives of the cost function with respect to any weight (w) or bias (b) in the network we are training.[1] This is an essential part of the learning process of the model we build.

[1] Neural Networks and Deep Learning by Michael Nielsen
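
A minimal illustration of the idea: for a single sigmoid neuron with squared-error loss, the chain rule gives the partial derivatives of the cost with respect to w and b, which we can confirm against a numerical estimate. All values below are made up:

```python
import math

# Backpropagation on the smallest possible "network": one sigmoid neuron
# with squared-error loss. The chain rule gives dLoss/dw and dLoss/db.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

x, y = 1.5, 1.0   # a single training example (input, target)
w, b = 0.4, 0.1   # current weight and bias

# Forward pass, then backward pass via the chain rule:
a = sigmoid(w * x + b)
dloss_da = a - y
da_dz = a * (1 - a)              # derivative of the sigmoid
dloss_dw = dloss_da * da_dz * x  # dz/dw = x
dloss_db = dloss_da * da_dz      # dz/db = 1

# Numerical check of dLoss/dw by central differences:
eps = 1e-6
numeric_dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
```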

Dense Layer

Dense Layer is the name given to regular fully connected neural network layers in several frameworks, such as Keras and Lasagne.

Usually, a fully connected neural network layer consists of the following components:

  • an activation function
  • weights
  • a bias
  • regularizers for the above
  • input and output dimensions (set if necessary)
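
A dense layer's forward pass is just activation(W·x + b) computed for each output neuron. Below is a minimal sketch with made-up weights; frameworks like Keras handle this, plus regularizers and shape inference, for you:

```python
# A dense (fully connected) layer forward pass: every output neuron sees
# every input, weighted, plus a bias, passed through an activation function.

def relu(z):
    return max(0.0, z)

def dense_forward(weights, bias, inputs, activation):
    """weights: one row of input weights per output neuron."""
    outputs = []
    for row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z))
    return outputs

# A layer mapping 3 inputs to 2 outputs (values are illustrative only).
W = [[0.5, -1.0, 0.25],
     [1.0, 0.5, -0.5]]
b = [0.1, -2.0]
x = [2.0, 1.0, 4.0]

y = dense_forward(W, b, x, relu)
```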