Calculus for Deep Learning
WORK IN PROGRESS!
- 1 Derivatives
- 2 Gradients
- 3 Chain Rule
- 4 Resources
- 5 Sources
A derivative can be defined in two ways:
- Instantaneous rate of change (Physics)
- Slope of a line at a specific point (Geometry)
Both represent the same principle, but for our purposes it’s easier to explain using the geometric definition.
In geometry, slope represents the steepness of a line. It answers the question: how much does y change given a specific change in x?
Using this definition we can easily calculate the slope between two points:

slope = (y2 - y1) / (x2 - x1)
But what if I asked you, instead of the slope between two points, what is the slope at a single point on the line? In this case there isn’t any obvious “rise over run” to calculate. Derivatives help us answer this question.
A derivative outputs an expression we can use to calculate the “instantaneous rate of change” or slope at a single point on a line. After solving for the derivative you can use it to calculate the slope at every other point on the line.
Consider the graph below, where f(x) = x^2 + 3.
The slope between (1, 4) and (3, 12) would be:

slope = (12 - 4) / (3 - 1) = 4
But how do we calculate the slope at point (1, 4) to reveal the rate of change at that specific point?
- Option 1: Find the two nearest points, calculate their slopes relative to x and take the average.
- Option 2: Calculate the slope from x to another point an infinitesimally small distance away from x
- Option 3: Calculate the derivative
Options 2 and 3 are essentially the same. What happens to the output when we make a very, very tiny increase to x? In this way, derivatives estimate the slope between two points that are an infinitesimally small distance away from each other. A very, very small distance, but large enough to calculate the slope.
In math language we represent this infinitesimally small increase using a limit. A limit is defined as the output value a function "approaches" as the input value approaches another value. In our case the target value is the specific point at which we want to calculate slope.
Walk Through Example
Calculating the derivative is the same as calculating normal slope, however in this case we calculate the slope between our point and a point infinitesimally close to it. We use the variable h to represent this infinitesimally small distance.
1. Given the function: f(x) = x^2
2. Increment x by a very small value h (h = Δx): f(x + h) = (x + h)^2
3. Apply the slope formula: (f(x + h) - f(x)) / h
4. Simplify the equation: ((x + h)^2 - x^2) / h = (x^2 + 2xh + h^2 - x^2) / h = (2xh + h^2) / h = 2x + h
5. Set h to 0 (the limit as h heads toward 0): f'(x) = 2x
So what does this mean? It means that for f(x) = x^2, the slope at any point equals 2x.
Let's write code to calculate the derivative for f(x) = x^2. We know the derivative should be 2x.
```python
def xSquared(x):
    return x**2

def getDeriv(func, x):
    # Approximate the derivative with a small finite step h
    h = 0.0001
    return (func(x + h) - func(x)) / h

x = 3
derivative = getDeriv(xSquared, x)
actual = 2 * x

derivative, actual  # = 6.0001, 6
```
Derivatives in Deep Learning
Machine learning uses derivatives to find optimal solutions to problems. They are useful in optimization algorithms like Gradient Descent because they tell us whether to increase or decrease our weights in order to maximize or minimize some metric (e.g. loss).
It also helps us model nonlinear functions as linear functions (tangent lines), which have constant slopes. With a constant slope we can decide whether to move up or down the slope (increase or decrease our weights) to get closer to the target value (class label).
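The tangent-line idea can be made concrete with a small sketch (illustrative, not from the original text): linearize f(x) = x^2 at a point and compare the tangent approximation with the true function value after a small step.

```python
def f(x):
    return x**2

def f_prime(x):
    return 2 * x   # analytic derivative of x^2

a = 3.0            # point where we build the tangent line
step = 0.1

# Tangent-line (linear) approximation: f(a + step) ~ f(a) + f'(a) * step
approx = f(a) + f_prime(a) * step
exact = f(a + step)
print(approx)  # 9.6
print(exact)   # 9.61
```

The gap between the two values shrinks as the step shrinks, which is why the tangent line is a good local stand-in for the curve.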
A gradient is a vector that stores the partial derivatives of multivariable functions. It helps us calculate the slope at a specific point on a curve for functions with multiple independent variables. In order to calculate this more complex slope, we need to isolate each variable to determine how it impacts the output on its own. To do this we iterate through each of the variables and calculate the derivative of the function after holding all other variables constant. Each iteration produces a partial derivative which we store in the gradient.
In functions with 2 or more variables, the partial derivative is the derivative of the function with respect to one variable, with the others held constant. If we change x, but hold all other variables constant, how does f(x, z) change? That's one partial derivative. The next variable is z. If we change z but hold x constant, how does f(x, z) change? We store partial derivatives in a gradient, which represents the full derivative of the multivariable function.
Here is an example of calculating the gradient for a multivariable function.
1. Given a multivariable function: f(x, z) = 2z^3 * x^2
2. Calculate the derivative with respect to x: df/dx(x, z)
3. Swap 2z^3 with a constant value b: f(x, z) = b * x^2
4. Calculate the derivative with b constant: df/dx = lim(h -> 0) (b(x + h)^2 - b*x^2) / h = lim(h -> 0) (2bxh + bh^2) / h = lim(h -> 0) 2bx + bh = 2bx (as h -> 0, the bh term drops away)
5. Swap 2z^3 back into the equation: 2bx = 2(2z^3)x = 4z^3 * x
6. Arrive at the partial derivative with respect to x: ∂f/∂x = 4z^3 * x
7. Repeat the steps above to calculate the derivative with respect to z: ∂f/∂z = 6z^2 * x^2
8. Store the partial derivatives in a gradient: ∇f(x, z) = [∂f/∂x, ∂f/∂z] = [4z^3 * x, 6z^2 * x^2]
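Assuming the multivariable function in the example above is f(x, z) = 2z^3 * x^2 (with partials ∂f/∂x = 4z^3 * x and ∂f/∂z = 6z^2 * x^2), this sketch checks those partials numerically with finite differences:

```python
def f(x, z):
    # Multivariable function from the worked example: f(x, z) = 2z^3 * x^2
    return 2 * z**3 * x**2

def partial_x(func, x, z, h=1e-6):
    # Hold z constant, nudge x: approximates df/dx
    return (func(x + h, z) - func(x, z)) / h

def partial_z(func, x, z, h=1e-6):
    # Hold x constant, nudge z: approximates df/dz
    return (func(x, z + h) - func(x, z)) / h

x, z = 3.0, 2.0
gradient = [partial_x(f, x, z), partial_z(f, x, z)]
analytic = [4 * z**3 * x, 6 * z**2 * x**2]  # [df/dx, df/dz]
print(gradient)  # close to the analytic values
print(analytic)  # [96.0, 216.0]
```

Each partial is computed exactly as the text describes: iterate through the variables, nudge one, and hold the rest constant.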
How does this gradient relate to the original function? How can it be used?
Another important concept is directional derivatives. When calculating the partial derivatives of multivariable functions we use our old technique of analyzing the impact of infinitesimally small increases to each of our independent variables. By increasing each variable we alter the function output in the direction of the slope.
But what if we want to change directions? For example, imagine we’re traveling north through mountainous terrain on a 3-dimensional plane. The gradient we calculated above tells us we’re traveling north at our current location. But what if we wanted to travel southwest? How can we determine the steepness of the hills in the southwest direction? Directional derivatives help us find the slope if we move in a direction different from the one specified by the gradient.
The directional derivative is computed by taking the dot product of the gradient of f and a unit vector v of "tiny nudges" representing the direction. The unit vector describes the proportions we want to move in each direction. The output of this calculation is a scalar number representing how much f will change if the current input moves with vector v.
Khan Academy explains this well. I'll replay it here:
Let's say you have the function f(x, y, z) and you want to compute its directional derivative along the following vector:

v = [2, 3, -1]

As described above, we take the dot product of the gradient and the directional vector:

∇f · v = [∂f/∂x, ∂f/∂y, ∂f/∂z] · [2, 3, -1]

We can rewrite the dot product as:

∇_v f = 2(∂f/∂x) + 3(∂f/∂y) - 1(∂f/∂z)
This should make sense because a tiny nudge along can be broken down into two tiny nudges in the x-direction, three tiny nudges in the y-direction, and a tiny nudge backwards, by −1 in the z-direction.
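To make the calculation concrete, here is a sketch using a hypothetical function f(x, y, z) = x^2 + y^2 + z^2 (the text does not specify one) and the direction vector [2, 3, -1] from the example:

```python
def f(x, y, z):
    # Hypothetical function for illustration (not specified in the text)
    return x**2 + y**2 + z**2

h = 1e-6
x, y, z = 1.0, 2.0, 3.0

# Partial derivatives via tiny nudges to each variable
df_dx = (f(x + h, y, z) - f(x, y, z)) / h   # ~ 2x = 2
df_dy = (f(x, y + h, z) - f(x, y, z)) / h   # ~ 2y = 4
df_dz = (f(x, y, z + h) - f(x, y, z)) / h   # ~ 2z = 6

v = (2, 3, -1)  # direction vector from the example
directional = df_dx * v[0] + df_dy * v[1] + df_dz * v[2]
print(directional)  # ~ 2*2 + 4*3 + 6*(-1) = 10
```

To get a true unit-vector directional derivative, divide v by its length first; the proportions between the components are what set the direction.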
Properties of Gradients
There are two additional properties of gradients that are especially useful in deep learning. A gradient:
- Always points in the direction of greatest increase of a function (explained below)
- Is zero at a local maximum or local minimum
Why Direction of Steepest Ascent?
Let's start with a simple linear function like f(x) = 3x. Remember that the derivative of a linear function equals the function's slope; in this case the derivative equals 3. Since linear functions have only one independent variable, we have only two levers for manipulating the output: increase or decrease x. If the derivative (slope) of our function is positive, increasing x increases the output of our function, while moving in the negative direction decreases it. A positive slope requires a positive change in x to increase f(x).
For negative slopes the logic is reversed: to increase the function's output we need to decrease x. A negative slope requires a negative change in x to increase f(x). From this we can see how, for linear functions, the direction of steepest ascent equals the direction of the derivative.
[Graph] and [Graph of Derivative]
The same logic extends to higher-order functions. Let's consider a multivariable function f(x, y) in a 3-dimensional space. The gradient of this function would be:

∇f(x, y) = [∂f/∂x, ∂f/∂y]
[INSERT EXPLANATION HERE]
Here is some code for calculating the gradient in python.
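The original code isn't shown here, so the following is a minimal sketch: a finite-difference gradient for a function of several variables, using the nudge-one-variable-at-a-time idea described above.

```python
def numerical_gradient(func, point, h=1e-5):
    # Approximate the gradient of func at point by nudging each
    # variable by a tiny amount h while holding the others constant.
    grad = []
    for i in range(len(point)):
        nudged = list(point)
        nudged[i] += h
        grad.append((func(*nudged) - func(*point)) / h)
    return grad

# Example: f(x, y) = x^2 + y^2 has gradient [2x, 2y]
f = lambda x, y: x**2 + y**2
print(numerical_gradient(f, (3.0, 4.0)))  # close to [6.0, 8.0]
```

Each loop iteration produces one partial derivative, and the list returned is the gradient vector.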
Gradients in Deep Learning
Gradients are used in neural networks to adjust weights and optimize cost functions.
The relevant concept in deep learning is called gradient descent. In gradient descent, neural networks use property #1 above to determine how to adjust their weights for each variable (feature) in the model. Rather than moving in the direction of greatest increase, as specified by the gradient, neural networks move in the opposite direction to minimize a loss function, like error percent or Log Loss. After adjusting their weights, neural networks compute the gradient again and move in the direction opposite to the one specified by the gradient.
Neural networks use the concept of directional derivatives to adjust their weights. After computing the gradient of the current weights (features), neural networks identify the direction of greatest decrease to the loss function, and then multiply the current weights by a vector (or matrix) containing the direction and magnitude that minimizes the loss function. The networks can change their weights with varying magnitudes, but the changes must be proportional to maintain the proper direction of greatest decrease.
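The update loop described above can be sketched with a toy one-weight example (assumed here, not from the original text): repeatedly step the weight opposite the derivative of the loss.

```python
def loss(w):
    # Toy convex loss with its minimum at w = 5
    return (w - 5) ** 2

def grad(w):
    # Analytic derivative of the loss: d/dw (w - 5)^2 = 2(w - 5)
    return 2 * (w - 5)

w = 0.0               # initial weight
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad(w)  # step opposite the gradient

print(round(w, 4))  # converges toward 5.0
```

With many weights, w and grad(w) become vectors, but the update rule is the same: move every weight a small step in the direction of greatest decrease.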
The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions that contain other functions nested inside them.
How It Works
Given a composite function f(x) = h(g(x)), the derivative of f(x) equals the product of the derivative of h with respect to g(x) and the derivative of g with respect to x.

For example, given a composite function f(x), where:

f(x) = h(g(x))

The chain rule tells us that the derivative of f(x) equals:

f'(x) = h'(g(x)) * g'(x)
Say f(x) is composed of two functions, h(x) = x^3 and g(x) = x^2. And that:

f(x) = h(g(x))

The derivative of f(x) would equal:

f'(x) = h'(g(x)) * g'(x)

1. Solve for the inner derivative of g(x) = x^2: g'(x) = 2x
2. Solve for the outer derivative of h(x) = x^3, using a placeholder b to represent the inner function g(x): h'(b) = 3b^2
3. Swap out the placeholder variable b for the inner function: 3(g(x))^2 = 3(x^2)^2 = 3x^4
4. Return the product of the two derivatives: f'(x) = 3x^4 * 2x = 6x^5
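The steps above can be checked numerically. This sketch assumes the inner function is g(x) = x^2 and the outer is h(x) = x^3, so the chain rule gives f'(x) = 6x^5:

```python
def g(x):
    return x**2        # inner function

def h(x):
    return x**3        # outer function

def f(x):
    return h(g(x))     # composite: (x^2)^3 = x^6

def numeric_deriv(func, x, h_step=1e-6):
    # Finite-difference approximation of the derivative
    return (func(x + h_step) - func(x)) / h_step

x = 2.0
chain_rule = 3 * g(x)**2 * 2 * x   # h'(g(x)) * g'(x) = 6x^5
print(chain_rule)                  # 6 * 2**5 = 192.0
print(numeric_deriv(f, x))         # close to 192
```

The finite-difference estimate agrees with the chain-rule product, which is the whole point: the chain rule gives the exact answer without nudging anything.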
In the above example we assumed a composite function containing a single inner function. But the chain rule can also be applied to higher-order functions like:

f(x) = A(B(C(x)))

The chain rule tells us that the derivative of this function equals:

f'(x) = A'(B(C(x))) * B'(C(x)) * C'(x)

We can also write this derivative equation in Leibniz notation:

df/dx = dA/dB * dB/dC * dC/dx
Given the function f(x) = A(B(C(x))), let's assume (choosing simple powers for illustration):

A(x) = x^2, B(x) = x^3, C(x) = x^4

The derivatives of these functions would be:

A'(x) = 2x, B'(x) = 3x^2, C'(x) = 4x^3

We can calculate the derivative of f(x) using the following formula:

f'(x) = A'(B(C(x))) * B'(C(x)) * C'(x)

We then input the derivatives and simplify the expression:

f'(x) = 2(B(C(x))) * 3(C(x))^2 * 4x^3 = 2(x^12) * 3(x^8) * 4x^3 = 24x^23
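As a standalone check of the three-function chain rule, this sketch assumes A(x) = x^2, B(x) = x^3, and C(x) = x^4 (purely illustrative choices), and compares the chain-rule product against a finite difference:

```python
def A(x): return x**2
def B(x): return x**3
def C(x): return x**4

def f(x):
    # Composite of three functions: A(B(C(x))) = ((x^4)^3)^2 = x^24
    return A(B(C(x)))

def A_prime(x): return 2 * x
def B_prime(x): return 3 * x**2
def C_prime(x): return 4 * x**3

x = 1.1
# Chain rule: f'(x) = A'(B(C(x))) * B'(C(x)) * C'(x) = 24x^23
chain = A_prime(B(C(x))) * B_prime(C(x)) * C_prime(x)
numeric = (f(x + 1e-8) - f(x)) / 1e-8
print(chain)    # equals 24 * x**23
print(numeric)  # close to chain
```

Note the nesting in the chain-rule product: each derivative is evaluated at the output of the functions inside it, not at the raw input x.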
Chain Rule in Deep Learning
- Wikipedia Derivatives
- Wikipedia Partial Derivatives
- Wikipedia Gradients
- Understanding The Gradient
- Khan Academy Partial Derivative And Gradient
- Khan Academy Directional Derivative
- Derivatives Basic Intro
- Definition of Derivative
- Algebra Review
- Intro To Derivatives
- Latex Symbols
- Chain Rule Wikipedia
- Khan Academy Chain Rule
- Chain Rule Lamar.edu