# Linear Regression

WORK IN PROGRESS!

# Overview

Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. Is used to predict values within a continuous range. (e.g. sales, price, height) rather than trying to classify them into categories (e.g. cat, dog, chipmunk).

Simple Linear Regression
Simple linear regression uses traditional slope-intercept form, where ${\displaystyle m}$ and ${\displaystyle b}$ are the variables our algorithm will try to "learn" to produce the most accurate predictions. ${\displaystyle x}$ represents our input data and ${\displaystyle y}$ represents our prediction.

${\displaystyle y=mx+b}$

Multiple Linear Regression
A more complex, multi-variable linear equation might look like this, where ${\displaystyle W}$'s represent the coefficients, or weights, our model will try to learn.

${\displaystyle f(x,y,z)=W_{1}x+W_{2}y+W_{3}z}$

The variables ${\displaystyle x,y,z}$ represent the attributes, or distinct pieces of information, we have about each observation. For sales predictions, these attributes might include a company's advertising spend on radio, TV, and newspapers.

${\displaystyle Sales=W_{1}Radio+W_{2}TV+W_{3}News}$

• Pros: fast runtimes, easy to understand results
• Cons: can be less accurate than other models b/c data relationships in the real world are often non-linear

# Simple Linear Regression

Let’s say we are given a dataset with the following columns (features): how much a company spends on Radio advertising each year and its annual Sales in terms of units sold. We are trying to develop an equation that will let us to predict units sold based on how much a company spends on radio advertising. The rows (observations) represent companies.

Amazon 37.8 22.1
Apple 41.3 18.5

## Predict Function

Our prediction function outputs an estimate of sales given a company's radio advertising spend and our current values for Weight and Bias.

Math

${\displaystyle Sales=Weight*Radio+Bias}$
• Weight is the coefficient for the Radio independent variable. In machine learning we call coefficients weights.
• Bias is the intercept where our line intercepts the y-axis. In machine learning we can call intercepts bias. Bias offsets all predictions that we make.
• Radio is our independent variable. In machine learning we call these variables features.

Our algorithm will try to "learn" the correct values for Weight and Bias. By the end of our training, our equation will approximate the line of best fit.

Code
Here is our equation rewritten in python:

def predict_sales(radio, weight, bias):


## Cost Function

The predict function is nice, but for our purposes we don't really need it. What we need is a cost function so we can start optimizing our weights. A cost function is a wrapper around our model function that tells us "how good" our model is at making predictions for a given set of parameters. The cost function has its own curve and its own derivatives. The slope of this curve tells us the direction we should update our weights to make the model more accurate!

Math
Let's use Mean Squared Error as our cost function. MSE measures the average squared difference between an observation's actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.

For our simple linear equation:

${\displaystyle y=mx+b}$

MSE can be calculated with the formula:

${\displaystyle MSE={\frac {1}{N}}\sum _{i=1}^{n}(y_{i}-(mx_{i}+b))^{2}}$
• ${\displaystyle N}$ is the total number of observations (data points)
• ${\displaystyle {\frac {1}{N}}\sum _{i=1}^{n}}$ is the mean
• ${\displaystyle y_{i}}$ is the actual value of an observation and ${\displaystyle (mx_{i}+b)}$ is our prediction

Code
Calculating the mean squared error in python.

# MSE for sales = weight*radio + bias
total_error = 0.0
for i in range(companies):
total_error += (sales[i] - (weight*radio[i] + bias))**2


To minimize MSE we need to calculate the gradient of our cost function. A good introduction to gradient descent can be found on our wiki.

Math
There are two "parameters" (i.e. coefficients) in our cost function we can control: ${\displaystyle weight}$ and ${\displaystyle bias}$. Since we need to consider the impact each one has on the final prediction, we use partial derivatives. To find the partial derivatives, we use the chain rule. We need the chain rule because ${\displaystyle (y-(mx+b))^{2}}$ is really 2 nested functions, inner ${\displaystyle y-mx+b}$ and outer ${\displaystyle x^{2}}$. An explanation of the math behind this can be found here. An overview of the chain rule more generally can be found on our wiki.

Given the cost function:

${\displaystyle f(m,b)={\frac {1}{N}}\sum _{i=1}^{n}(y_{i}-(mx_{i}+b))^{2}}$

The gradient of this cost function would be:

${\displaystyle f'(m,b)={\begin{bmatrix}{\frac {df}{dm}}\\{\frac {df}{db}}\\\end{bmatrix}}={\begin{bmatrix}{\frac {1}{N}}\sum -2x_{i}(y_{i}-(mx_{i}+b))\\{\frac {1}{N}}\sum -2(y_{i}-(mx_{i}+b))\\\end{bmatrix}}}$

Code
To solve for the gradient, we iterate through our data points using our new ${\displaystyle weight}$ and ${\displaystyle bias}$ values and take the average of the partial derivatives. The resulting gradient tells us the slope of our cost function at our current position (i.e. weight and bias) and the direction we should update to reduce our cost function (we move in the direction opposite the gradient). The size of our update is controlled by the learning rate.

def update_weights(radio, sales, weight, bias, learning_rate):
weight_deriv = 0
bias_deriv = 0
for i in range(companies):
# Calculate partial derivatives
# -2x(y - (mx + b))

# -2(y - (mx + b))
bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))

# We subtract because the derivatives point in direction of steepest ascent
weight -= (weight_deriv / companies) * learning_rate
bias -= (bias_deriv / companies) * learning_rate

return weight, bias


## Training

Training a model is the process of iteratively improving your prediction equation by looping through the dataset multiple times, each time updating the weight and bias values in the direction indicated by the slope of the cost function (gradient). Training is complete when we reach some predetermined “acceptable error” threshold, or when subsequent training iterations fail to reduce our cost.

Before training we need to initializing our weights (set default values), set our hyper-parameters (learning rate and number of iterations), and prepare to log our progress over each iteration.

Code

def train(radio, sales, weight, bias, learning_rate, iters):
cost_history = []

for i in range(iters):
weight,bias = update_weights(radio, sales, weight, bias, learning_rate)

#Calculate cost for auditing purposes
cost = cost_function(features, targets, weights)
cost_history.append(cost)

# Log Progress
if i % 10 == 0:
print "iter: "+str(i) + " cost: "+str(cost)

return weight,bias,cost_history


## Results

If our model is working, we should see our cost decrease after every iteration.

Logging

 iter=1     weight=.03    bias=.0014    cost=197.25
iter=10    weight=.28    bias=.0116    cost=74.65
iter=20    weight=.39    bias=.0177    cost=49.48
iter=30    weight=.44    bias=.0219    cost=44.31
iter=30    weight=.46    bias=.0249    cost=43.28


Visualization

Cost Function

## Conclusion

By learning the best values for weight (.46) and bias (.25), we now have an equation that predicts future sales based on radio advertising investment.

${\displaystyle Sales=.46Radio+.025}$

How would our model perform in the real world? I’ll let you think about it :)

# Multiple Linear Regression

Let’s say we are given data on TV, radio, and newspaper advertising spend for a list of companies, and our goal is to predict sales (units sold).

Amazon 230.1 37.8 69.2 22.1
Apple 151.5 18.5 13.2 28.2

Growing Complexity
As the number of features grows, the complexity of our model increases and it becomes increasingly difficult to visualize, or even comprehend, our data.

from mpl_toolkits.mplot3d import Axes3D

def plot_3d_plane(x,y,z,xlabel,ylabel,zlabel):
fig = plt.figure()
n = 100
for c, m, zl, zh in [('g', 'o', -50, -25)]:
ax.plot_trisurf(x, y, z, cmap=plt.cm.cool)
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
ax.set_zlabel(zlabel)
plt.show()


One solution is to break the data apart and compare 1-2 features at a time. In this example we explore how Radio and TV investment impacts Sales.

## Normalization

As the number of features grows, calculating gradient takes longer to compute. We can speed this up by "normalizing" our input data to ensure all values are within the same range. This is especially important for datasets with high standard deviations or differences in the ranges of the attributes. Our goal now will be to normalize our features so they are all in the range -1 to 1.

Pseudocode

 For each feature column {
#1 Subtract the mean of the column (mean normalization)
#2 Divide by the range of the column (feature scaling)
}


Code
Our input is a 200 x 3 matrix containing TV, Radio, and Newspaper data. Our output is a normalized matrix of the same shape with all values between -1 and 1.

def normalize(features):
'''
features     -   (200, 3)
features.T   -   (3, 200)
'''
# We transpose our matrix, swapping columns
# and rows to make vector math easier
for feature in features.T:
fmean = np.mean(feature)
frange = np.amax(feature) - np.amin(feature)

#Vector Subtraction
feature -= fmean

#Vector Division
feature /= frange

return features


## Matrix Math

Before we continue, it's helpful to understand how basic matrix operations work in Numpy. For example, Numpy dot product numpy.dot() can be used on both 1D and 2D arrays.

## Predict Function

Our predict function outputs an estimate of sales given our current weights (coefficients) and a company's TV, radio, and newspaper spend. Our model will try to identify weight values that most reduce our cost function.

${\displaystyle Sales=W_{1}TV+W_{2}Radio+W_{3}Newspaper}$

Initialize Weights

W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
[W1],
[W2],
[W3]
])


Predict Function

def predict(features, weights):
'''
features - (200, 3)
weights - (3, 1)
predictions - (200,1)
'''
return np.dot(features,weights)


## Cost Function

As we discussed above, we need a cost function to audit how our model is performing.

Math
The math is the same, except we swap our ${\displaystyle mx+b}$ expression for ${\displaystyle W_{1}x_{1}+W_{2}x_{2}+W_{3}x_{3}}$. We also divide our expression by 2 to make our derivative calculation simpler (don't ask questions).

${\displaystyle MSE={\frac {1}{2N}}\sum _{i=1}^{n}(y_{i}-(W_{1}x_{1}+W_{2}x_{2}+W_{3}x_{3}))^{2}}$

Code
Calculating the mean squared error in python.

def cost_function(features, targets, weights):
'''
Features:(200,3)
Targets: (200,1)
Weights:(3,1)
Returns 1D matrix of predictions
'''
N = len(targets)

predictions = predict(features, weights)

# Matrix math lets use do this without looping
sq_error = (predictions - targets)**2

# Return average squared error among predictions
return 1.0/(2*N) * sq_error.sum()


Again using the chain rule we can compute the gradient--a vector of partial derivatives describing the slope of the cost function for each weight.

${\displaystyle {\frac {d}{dW_{1}}}=-x_{1}(y-(W_{1}x_{1}+W_{2}x_{2}+W_{3}x_{3}))}$
${\displaystyle {\frac {d}{dW_{2}}}=-x_{2}(y-(W_{1}x_{1}+W_{2}x_{2}+W_{3}x_{3}))}$
${\displaystyle {\frac {d}{dW_{3}}}=-x_{3}(y-(W_{1}x_{1}+W_{2}x_{2}+W_{3}x_{3}))}$

Code

def update_weights(features, targets, weights, lr):
'''
Features:(200, 3)
Targets: (200, 1)
Weights:(3, 1)
'''
predictions = predict(features, weights)

#Extract our features
x1 = features[:,0]
x2 = features[:,1]
x3 = features[:,2]

# Use matrix cross product (*) to simultaneously
# calculate the derivative for each weight
d_w1 = -x1*(targets - predictions)
d_w2 = -x2*(targets - predictions)
d_w3 = -x3*(targets - predictions)

# Multiply the mean derivative by the learning rate
# and subtract from our weights (remember gradient points in direction of steepest ASCENT)
weights[0][0] -= (lr * np.mean(d_w1))
weights[1][0] -= (lr * np.mean(d_w2))
weights[2][0] -= (lr * np.mean(d_w3))

return weights


The above gradient descent code has a lot of duplication. Can we improve it somehow? One way to refactor would be to loop through our features and weights--allowing our function handle any number of features. However there is another even better technique: vectorized gradient descent.

In VGD we use the same formula as above, but instead of operating on a single feature at a time, we use matrix multiplication to operative on all features and weights simultaneously! Math explained here.

Old way

${\displaystyle d/dW_{1}=-x_{1}(targets-predictions)}$

New way
Replace ${\displaystyle x_{1},x_{2},x_{3}}$ with our entire features matrix.

${\displaystyle gradient=-features(targets-predictions)}$

Inputs

features = [
[x1, x2, x3]
[x1, x2, x3]
[x1, x2, x3]
]
targets-predictions = [
[1],
[2],
[3]
]


def update_weights_vectorized(features, targets, weights, lr):
'''
gradient = features.T * (predictions - targets) / N
Features: (200, 3)
Targets: (200, 1)
Weights: (3, 1)
'''
companies = len(features)

#1 - Get Predictions
predictions = predict(features, weights)

#2 - Calculate error/loss
error = targets - predictions

#3 Transpose features from (200, 3) to (3, 200)
# So we can multiply w the (200,1)  error matrix.
# Returns a (3,1) matrix holding 3 partial derivatives --
# one for each feature -- representing the aggregate
# slope of the cost function across all observations

#4 Take the average error derivative for each feature

#5 - Multiply the gradient by our learning rate

#6 - Subtract from our weights to minimize cost

return weights


Our train function is the same as for simple linear regression, however we're going to make one final tweak before running: add a bias term to our feature matrix. There is a good explanation for why on Quora:

In linear regression, without the bias term your solution has to go through the origin. That is, when all of your features are zero, your predicted value would also have to be zero. However, that may not be the answer the training data suggests. Adding a bias weight that does not depend on any of the features allows the hyperplane described by your learned weights to more easily fit data that doesn't pass through the origin.

Essentially, when you're deciding whether to add a bias term you have to ask yourself: if all my feature values were 0, would my output also be zero? Is it possible there is some base value upon which my features values have an effect? In our example, it's very unlikely that sales would be zero if companies stopped advertising. Possible reasons for this might include past advertising, existing customer relationships, retail locations, and salespeople.

# Here we add a constant bias term of 1
# By setting this value to 1, it turns our bias
# weight into a constant.
bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)


## Results

After training our model through 1000 iterations with a learning rate of .0005, we finally arrive at a set of weights we can use to make predictions:

${\displaystyle Sales=4.7TV+3.5Radio+.81Newspaper+13.9}$

Our MSE cost dropped from 110.86 to 6.25.