Calculus for Deep Learning

From Deep Learning Course Wiki
Revision as of 17:37, 28 January 2017 by Bfortuner (talk | contribs)



Derivatives

A derivative can be defined in two ways:

  1. Instantaneous rate of change (Physics)
  2. Slope of a line at a specific point (Geometry)

Both represent the same principle, but for our purposes it’s easier to explain using the geometric definition.


In geometry, slope represents the steepness of a line. It answers the question: how much does y (or f(x)) change given a specific change in x?

[Image: Calculus Slope Basic.png]

Using this definition we can easily calculate the slope between two points:

[Image: Calculus Slope Example.png]

But what if I asked you, instead of the slope between two points, what is the slope at a single point on the line? In this case there isn’t any obvious “rise over run” to calculate. Derivatives help us answer this question.

A derivative outputs an expression we can use to calculate the “instantaneous rate of change” or slope at a single point on a line. After solving for the derivative you can use it to calculate the slope at every other point on the line.


Consider the graph below, where f(x) = x^2 + 3.

[Image: Calculus Slope Calculation Example Secant.png]

The slope between (1,4) and (3,12) would be:

\frac{y_2-y_1}{x_2-x_1} = \frac{12-4}{3-1} = 4

But how do we calculate the slope at point (1,4) to reveal the change in slope at that specific point?

  • Option 1: Find the two nearest points, calculate their slopes relative to x, and take the average.
  • Option 2: Calculate the slope from x to another point an infinitesimally small distance away from x.
  • Option 3: Calculate the derivative.

Options 2 and 3 are essentially the same. What happens to the output f(x) when we make a very tiny increase to x? In this way, derivatives estimate the slope between two points an infinitesimally small distance apart: a very, very small distance, but one large enough to calculate the slope.

In math language we represent this infinitesimally small increase using a limit. A limit is defined as the output value a function "approaches" as the input value approaches another value. In our case the target value is the specific point at which we want to calculate slope.

Walk-Through Example

Calculating the derivative is the same as calculating normal slope, except in this case we calculate the slope between our point and a point infinitesimally close to it. We use the variable h to represent this infinitesimally small distance.


1. Given the function

f(x) = x^2

2. Increment x by a very small value h (h = Δx)

f(x + h) = (x + h)^2

3. Apply the slope formula

\frac{f(x + h) - f(x)}{h}

4. Simplify the equation

\frac{x^2+2xh+h^2-x^2}{h} = \frac{2xh+h^2}{h} = 2x+h

5. Set h to 0 (the limit as h heads toward 0)

2x + 0 = 2x


So what does this mean? It means for f(x) = x^2, the slope at any point can be calculated using the function 2x.


\frac{df}{dx} = \lim_{h\to0}\frac{f(x+h) - f(x)}{h}


Let's write code to calculate the derivative for f(x) = x^2. We know the derivative should be 2x.

def xSquared(x):
    return x**2

def getDeriv(func, x):
    h = 0.0001
    return (func(x + h) - func(x)) / h

x = 3
derivative = getDeriv(xSquared, x)
actual = 2*x

print(derivative, actual)  # approximately 6.0001 6

Derivatives in Deep Learning

Machine learning uses derivatives to find optimal solutions to problems. They're useful in optimization algorithms like Gradient Descent because they help us decide whether to increase or decrease our weights in order to maximize or minimize some metric (e.g. loss).

It also helps us model nonlinear functions as linear functions (tangent lines), which have constant slopes. With a constant slope we can decide whether to move up or down the slope (increase or decrease our weights) to get closer to the target value (class label).



Gradients

A gradient is a vector that stores the partial derivatives of multivariable functions. It helps us calculate the slope at a specific point on a curve for functions with multiple independent variables. In order to calculate this more complex slope, we need to isolate each variable to determine how it impacts the output on its own. To do this we iterate through each of the variables and calculate the derivative of the function after holding all other variables constant. Each iteration produces a partial derivative which we store in the gradient.

Partial Derivatives

In functions with 2 or more variables, a partial derivative is the derivative of the function with respect to one variable, with all other variables held constant. If we change x, but hold all other variables constant, how does f(x,z) change? That's one partial derivative. The next variable is z. If we change z but hold x constant, how does f(x,z) change? We store the partial derivatives in a gradient, which represents the full derivative of the multivariable function.

Walk-Through Example

Here is an example of calculating the gradient for a multivariable function.


1. Given a multivariable function

f(x,z) = 2z^3x^2

2. Calculate the derivative with respect to x

\frac{df}{dx}(x,z) =  ??

3. Swap 2z^3 with a constant value b

f(x,z) = bx^2

4. Calculate the derivative with b constant

\lim_{h\to0}\frac{f(x+h) - f(x)}{h}
\lim_{h\to0}\frac{b(x+h)^2 - bx^2}{h}
\lim_{h\to0}\frac{b(x^2 + 2xh + h^2) - bx^2}{h}
\lim_{h\to0}\frac{bx^2 + 2bxh + bh^2 - bx^2}{h}
\lim_{h\to0}\frac{2bxh + bh^2}{h}
\lim_{h\to0} (2bx + bh)

As h → 0:

2bx + 0 = 2bx

5. Swap 2z^3 back in for b

2bx = 2(2z^3)x = 4z^3x

6. Arrive at the partial derivative with respect to x

\frac{df}{dx}(x,z) = 4z^3x

7. Repeat steps above to calculate the derivative with respect to z

\frac{df}{dz}(x,z) = 6x^2z^2

8. Store the partial derivatives in a gradient

     \nabla f(x,z)=\begin{bmatrix}
         \frac{df}{dx} \\
         \frac{df}{dz}
         \end{bmatrix}
         =\begin{bmatrix}
         4z^3x \\
         6x^2z^2
         \end{bmatrix}
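The partial derivatives above can be sanity-checked numerically with a short Python sketch. The central-difference helpers and the sample point (x=2, z=3) are my own additions, not from the article:

```python
def f(x, z):
    # the walk-through function f(x,z) = 2z^3x^2
    return 2 * z**3 * x**2

def partial_wrt_x(func, x, z, h=1e-6):
    # hold z constant, nudge x (central difference)
    return (func(x + h, z) - func(x - h, z)) / (2 * h)

def partial_wrt_z(func, x, z, h=1e-6):
    # hold x constant, nudge z (central difference)
    return (func(x, z + h) - func(x, z - h)) / (2 * h)

x, z = 2.0, 3.0
numeric = [partial_wrt_x(f, x, z), partial_wrt_z(f, x, z)]
analytic = [4 * z**3 * x, 6 * x**2 * z**2]  # [216.0, 216.0]
```

The numeric approximations should agree with the analytic gradient to several decimal places.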


How does this gradient relate to the original function? How can it be used?

Directional Derivatives

Another important concept is directional derivatives. When calculating the partial derivatives of multivariable functions we use our old technique of analyzing the impact of infinitesimally small increases to each of our independent variables. By increasing each variable we alter the function output in the direction of the slope.

But what if we want to change direction? For example, imagine we're traveling north through mountainous terrain in 3-dimensional space. The gradient we calculated above describes the slope at our current location as we head north. But what if we wanted to travel southwest? How can we determine the steepness of the hills in the southwest direction? Directional derivatives help us find the slope when moving in a direction different from the one specified by the gradient.


The directional derivative is computed by taking the dot product of the gradient of f and a unit vector \vec{v} of "tiny nudges" representing the direction. The unit vector describes the proportions we want to move in each direction. The output of this calculation is a scalar number representing how much f will change if the current input moves with vector \vec{v}.

Khan Academy explains this well; I'll summarize it here:

Let's say you have the function f(x,y,z) and you want to compute its directional derivative along the following vector:

     \vec{v}=\begin{bmatrix}
         2 \\
         3 \\
         -1
         \end{bmatrix}

As described above, we take the dot product of the gradient and the directional vector:

     \nabla_\vec{v} f = \begin{bmatrix}
         \frac{df}{dx} \\
         \frac{df}{dy} \\
         \frac{df}{dz}
         \end{bmatrix}
         \cdot
         \begin{bmatrix}
         2 \\
         3 \\
         -1
         \end{bmatrix}

We can rewrite the dot product as:

\nabla_\vec{v} f = 2 \frac{df}{dx} + 3 \frac{df}{dy} - 1 \frac{df}{dz}

This should make sense because a tiny nudge along \vec{v} can be broken down into two tiny nudges in the x-direction, three tiny nudges in the y-direction, and a tiny nudge backwards, by −1 in the z-direction.
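The dot-product computation can be sketched in a few lines of Python. The gradient values below are made-up placeholders for illustration, not derived from a specific function:

```python
def directional_derivative(grad, v):
    # dot product of the gradient and the direction vector v
    return sum(g * vi for g, vi in zip(grad, v))

# hypothetical gradient [df/dx, df/dy, df/dz] evaluated at some point
grad = [1.0, 2.0, 0.5]
v = [2, 3, -1]
result = directional_derivative(grad, v)  # 2*1.0 + 3*2.0 + (-1)*0.5 = 7.5
```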

Properties of Gradients

There are two additional properties of gradients that are especially useful in deep learning. A gradient:

  1. Always points in the direction of greatest increase of a function
  2. Is zero at a local maximum or local minimum

Why Direction of Steepest Ascent?

Let's start with a simple linear function like f(x) = 3x. Remember the derivative of a linear function equals the function's slope. In this case the derivative equals 3. Since linear functions only have one independent variable, we only have two levers to manipulate f(x): increase or decrease x. If the derivative (slope) of our function is positive it means we should increase x to increase the output of our function. Moving x in a negative direction would decrease the output. A positive slope requires a positive change to increase f(x).

For negative slopes the logic is reversed. To increase the function output we need to decrease x. A negative slope requires a negative change to increase f(x). From this we can see how, for linear functions, the direction of steepest ascent equals the direction of the derivative.

[Graph] and [Graph of Derivative]

The same logic extends to higher-order functions. Let's consider a multivariable function f(x,y) = 2x^2 - y^2 in a 3-dimensional space. The gradient of this function would be:

     \nabla f(x,y)=\begin{bmatrix}
         \frac{df}{dx} \\
         \frac{df}{dy}
         \end{bmatrix}
         =\begin{bmatrix}
         4x \\
         -2y
         \end{bmatrix}



Here is some code for calculating the gradient in Python.
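A minimal numerical sketch using central differences. The helper name and sample point are my own, not the wiki's:

```python
def gradient(func, point, h=1e-5):
    # approximate each partial derivative by nudging one
    # variable at a time while holding the others constant
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad

# f(x,y) = 2x^2 - y^2 from the example above; its gradient is [4x, -2y]
def f(p):
    x, y = p
    return 2 * x**2 - y**2

g = gradient(f, [1.0, 2.0])  # approximately [4.0, -4.0]
```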

Gradients in Deep Learning

Gradients are used in neural networks to adjust weights and optimize cost functions.

Gradient Descent

The relevant concept in deep learning is called gradient descent. In gradient descent, neural networks use property #1 above to determine how to adjust their weights for each variable (feature) in the model. Rather than moving in the direction of greatest increase, as specified by the gradient, neural networks move in the opposite direction to minimize a loss function, like error percent or Log Loss. After adjusting their weights, neural networks compute the gradient again and move in the direction opposite to the one specified by the gradient.
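The idea can be sketched in a few lines of Python. The learning rate and step count below are arbitrary choices for illustration, not values from the article:

```python
def gradient_descent(deriv, x0, lr=0.1, steps=100):
    # repeatedly step opposite the derivative to minimize the function
    x = x0
    for _ in range(steps):
        x -= lr * deriv(x)
    return x

# minimize f(x) = x^2, whose derivative is 2x; the minimum is at x = 0
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)  # close to 0.0
```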

Adjusting Weights

Neural networks use the concept of directional derivatives to adjust their weights. After computing the gradient of the loss with respect to the current weights (features), neural networks identify the direction of greatest decrease of the loss function, and then update the weights by a step (a vector or matrix) whose direction and magnitude reduce the loss. The networks can change their weights with varying magnitudes, but the changes must stay proportional to maintain the direction of greatest decrease.

Chain Rule

The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions that contain one or more functions inside them.

How It Works

Given a composite function f(x) = A(B(x)), the derivative of f(x) equals the product of the derivative of A with respect to B(x) and the derivative of B with respect to x.

\mbox{composite function derivative} = \mbox{outer function derivative} * \mbox{inner function derivative}

For example, given a composite function f(x), where:

f(x) = h(g(x))

The chain rule tells us that the derivative of f(x) equals:

\frac{df}{dx} = \frac{dh}{dg} \cdot \frac{dg}{dx}

Walk-Through Example

Say f(x) is composed of two functions h(x) = x^3 and g(x) = x^2. And that:

f(x) = h(g(x))
f(x) = (x^2)^3

The derivative of f(x) would equal:

\frac{df}{dx}  =  \frac{dh}{dg} \frac{dg}{dx}  =  \frac{dh}{d(x^2)} \frac{dg}{dx}


1. Solve for the inner derivative of g(x) = x^2

\frac{dg}{dx} = 2x

2. Solve for the outer derivative of h(x) = x^3, using a placeholder b to represent the inner function x^2

\frac{dh}{db} = 3b^2

3. Swap out the placeholder variable for the inner function

\frac{dh}{dg} = 3(x^2)^2 = 3x^4

4. Return the product of the two derivatives

3x^4 \cdot 2x = 6x^5
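We can sanity-check this result numerically in Python. The finite-difference helper and test point are my own additions:

```python
def f(x):
    # the composite function (x^2)^3 = x^6
    return (x**2) ** 3

def numeric_deriv(func, x, h=1e-6):
    # central-difference approximation of the derivative
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
numeric = numeric_deriv(f, x)
analytic = 6 * x**5  # 192.0, from the chain rule above
```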

Higher Dimensions

In the above example we assumed a composite function containing a single inner function. But the chain rule can also be applied to higher-order functions like:

f(x) = A(B(C(x)))

The chain rule tells us that the derivative of this function equals:

\frac{df}{dx} = \frac{dA}{dB} \frac{dB}{dC} \frac{dC}{dx}

We can also write this derivative equation in f' notation:

f' = A'(B(C(x))) \cdot B'(C(x)) \cdot C'(x)


Given the function f(x) = A(B(C(x))), lets assume:

A(x) = sin(x)
B(x) = x^2
C(x) = 4x

The derivatives of these functions would be:

A'(x) = cos(x)
B'(x) = 2x
C'(x) = 4

We can calculate the derivative of f(x) using the following formula:

f'(x) = A'( (4x)^2) \cdot B'(4x) \cdot C'(x)

We then input the derivatives and simplify the expression:

f'(x) = cos((4x)^2) \cdot 2(4x) \cdot 4
f'(x) = cos(16x^2) \cdot 8x \cdot 4
f'(x) = 32x\cos(16x^2)
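The simplified derivative can be checked numerically in Python using the standard math module. The test point x = 0.1 is an arbitrary choice:

```python
import math

def f(x):
    # f(x) = sin((4x)^2)
    return math.sin((4 * x) ** 2)

def f_prime(x):
    # the chain-rule result: 32x * cos(16x^2)
    return 32 * x * math.cos(16 * x**2)

# compare against a central-difference approximation at x = 0.1
x, h = 0.1, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = f_prime(x)  # the two should agree to several decimal places
```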


Chain Rule in Deep Learning