Linear Regression


A supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range (e.g. sales, price, height) rather than to classify them into categories (e.g. cat, dog, chipmunk). At its most basic, it takes the form: $y = ax + b$

A more complex linear equation might look like this: $y = B_0 + B_1 x + B_2 z + B_3 j + B_4 k$

A linear regression model would try to "learn" the correct values for $B_0, B_1, B_2, \ldots$ The independent variables $x, z, j, k$ represent the various attributes of each observation in our sample. For sales predictions, these attributes might include: day of the week, employee count, inventory levels, and store location: $y = B_0 + B_1 Day + B_2 Employees + B_3 Inventory + B_4 Location$

There are two types of linear regression:

1. Simple Linear Regression - one independent variable (y = mx + b)
2. Multiple Linear Regression - multiple independent variables (y = 3a + 4b + .25c + 5)
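The two forms above can be sketched in Python. This is just an illustration: the coefficients below are the example numbers from the equations above, not learned values, and the function names are made up for this sketch:

```python
def simple_predict(x, m=3, b=5):
    # Simple linear regression: one independent variable (y = mx + b)
    return m*x + b

def multiple_predict(a, b, c):
    # Multiple linear regression with the example coefficients
    # from y = 3a + 4b + .25c + 5
    return 3*a + 4*b + 0.25*c + 5
```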

Use Cases

Pros: fast runtimes, easy-to-interpret results
Cons: can be less accurate than other models because real-world data relationships are often non-linear


Simple Linear Regression

Let's say we are given a dataset with the following columns (features): how much a company spends on Radio advertising each year and its annual Sales in terms of units sold. We are trying to develop an equation that will let us predict units sold based on how much a company spends on radio advertising. The rows (observations) represent companies.

   Radio   Sales
0   37.8    22.1
1   39.3    10.4
2   45.9     9.3
3   41.3    18.5
4   10.8    12.9

Prediction Function

$Sales = Weight*Radio + Bias$

Weight is the coefficient for the Radio independent variable. In machine learning we call coefficients "weights". Bias is the intercept, where our line crosses the y-axis. In machine learning we call intercepts "bias"; bias offsets all the predictions we make. Radio is our independent variable. In machine learning we call these variables "features".

Our algorithm will try to “learn” the correct values for Weight and Bias.

At the end of our training, our function will approximate the "line of best fit”.

def predict_sales(radio, weight, bias):
    return weight*radio + bias


Cost Function

The predict function is nice, but for our purposes we don't really need it. What we need is a cost function so we can start optimizing our weights. A cost function is a wrapper around our model function that tells us "how good" our model is at making predictions for a given set of parameters. The cost function has its own curve and its own derivatives. The slope of this curve tells us the direction we should update our weights to make the model more accurate!

Let's use Mean Squared Error as our cost function.

Math

MSE measures the average squared difference between an observation's actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.

For our simple linear equation: $y = mx + b$

MSE can be calculated with the formula: $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - (mx_i + b))^2$
• $N$ is the total number of observations (data points)
• $\frac{1}{N} \sum_{i=1}^{N}$ is the mean
• $y_i$ is the actual value of an observation and $(mx_i + b)$ is our prediction

Code

Calculating the mean squared error in Python.

# MSE cost for sales = weight*radio + bias
def cost_function(radio, sales, weight, bias):
    companies = len(radio)
    total_error = 0.0
    for i in range(companies):
        total_error += (sales[i] - (weight*radio[i] + bias))**2
    return total_error / companies
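For comparison, the same MSE can be computed in vectorized form with NumPy (a sketch; `mse_vectorized` is an illustrative name, and `radio` and `sales` are assumed to be NumPy arrays of equal length):

```python
import numpy as np

def mse_vectorized(radio, sales, weight, bias):
    # Mean of squared residuals, computed without an explicit loop
    predictions = weight * radio + bias
    return ((sales - predictions) ** 2).mean()

# Example: two companies, weight=1, bias=1
# residuals are [1, 2], so MSE = (1 + 4) / 2 = 2.5
cost = mse_vectorized(np.array([1.0, 2.0]), np.array([3.0, 5.0]), 1.0, 1.0)
```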


Gradient Descent

To minimize MSE we need to calculate the gradient of our cost function. A good introduction to gradient descent can be found on our wiki.

Math

There are two "parameters" (i.e. coefficients) in our cost function we can control: $weight$ and $bias$. Since we need to consider the impact each one has on the final prediction, we use partial derivatives. To find the partial derivatives, we use the chain rule. We need the chain rule because $(y - (mx + b))^2$ is really two nested functions: the inner function $y - (mx + b)$ and the outer function $x^2$. An explanation of the math behind this can be found here. An overview of the chain rule more generally can be found on our wiki.

Given the cost function: $f(m,b) = \frac{1}{N} \sum_{i=1}^{n} (y_i - (mx_i + b))^2$

The gradient of this cost function would be: $f'(m,b) = \begin{bmatrix} \frac{df}{dm}\\ \frac{df}{db}\\ \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum -2x_i(y_i - (mx_i + b)) \\ \frac{1}{N} \sum -2(y_i - (mx_i + b)) \\ \end{bmatrix}$

Code

To solve for the gradient, we iterate through our data points using our new $weight$ and $bias$ values and take the average of the partial derivatives. The resulting gradient tells us the slope of our cost function at our current position (i.e. weight and bias) and the direction we should update to reduce our cost function (we move in the direction opposite the gradient). The size of our update is controlled by the learning rate.

def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)

    for i in range(companies):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        weight_deriv += -2*radio[i]*(sales[i] - (weight*radio[i] + bias))

        # -2(y - (mx + b))
        bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))

    # We subtract because the derivatives point in direction of steepest ascent
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate

    return weight, bias
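The same update can be written in vectorized NumPy form (a sketch; `update_weights_vectorized` is an illustrative name, and the inputs are assumed to be NumPy arrays):

```python
import numpy as np

def update_weights_vectorized(radio, sales, weight, bias, learning_rate):
    # Residuals: y - (mx + b) for every company at once
    residuals = sales - (weight * radio + bias)
    # Average partial derivatives, matching the gradient formulas above
    weight_deriv = (-2 * radio * residuals).mean()
    bias_deriv = (-2 * residuals).mean()
    # Step opposite the gradient, scaled by the learning rate
    weight -= weight_deriv * learning_rate
    bias -= bias_deriv * learning_rate
    return weight, bias
```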


Train

Training a model is the process of iteratively improving your prediction equation by looping through the dataset multiple times, each time updating the weight and bias values in the direction indicated by the slope of the cost function (gradient). Training is complete when we reach some predetermined "acceptable error" threshold, or when subsequent training iterations fail to reduce our cost.

Before training we need to initialize our weights (set default values), set our hyperparameters (learning rate and number of iterations), and prepare to log our progress over each iteration.

Code

def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []

    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)

        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)
        cost_history.append(cost)

        # Log Progress
        if i % 10 == 0:
            print("iter: " + str(i) + " cost: " + str(cost))

    return weight, bias, cost_history


Results

If our model is working, we should see our cost decrease after every iteration.

iter=1   weight=.03  bias=.0014  cost=197.25
iter=10  weight=.28  bias=.0116  cost=74.65
iter=20  weight=.39  bias=.0177  cost=49.48
iter=30  weight=.44  bias=.0219  cost=44.31
iter=40  weight=.46  bias=.0249  cost=43.28

Cost Function Over Time

Conclusion

Now that we have discovered our weight (.46) and bias (.025), our model has created an equation we can use to predict sales in the future based on radio spend: $Sales = .46*Radio + .025$
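With the learned parameters plugged in, making a prediction is a one-liner (`predict_sales` is an illustrative name; .46 and .025 are the values learned in the training run above):

```python
def predict_sales(radio, weight=0.46, bias=0.025):
    # Sales = .46*Radio + .025 with the learned parameters
    return weight * radio + bias

# A company spending 37.8 on radio advertising (first row of the dataset)
estimate = predict_sales(37.8)  # 0.46*37.8 + 0.025 = 17.413
```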


How would our model perform in the real world? I’ll let you think about it :)