# Logistic Regression

INCOMPLETE. WORK IN PROGRESS!!!

## Overview

Logistic regression is a classification algorithm, used to estimate probabilities (Binary values like 0/1, yes/no, true/false) based on given set of independent variable(s). Its output values lies between 0 and 1. Prior to building a model, the features values are transformed using the logistic function (Sigmoid) to produce probability values that can be mapped to two or more classes.

### Linear vs Logistic

Given data on time spent studying and exam scores. Linear Regression and logistic regression can predict different things:

• Linear Regression could help us predict the student's test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).
• Logistic Regression could help use predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view probability scores underlying the model's classifications.

### Variations

• Binary (Pass/Fail)
• Multi (Cats, Dogs, Sheep)
• Ordinal (Low, Medium, High)

• Pros: Easy to implement, fast to train, returns probability scores
• Cons: Bad when too many features or too many classifications

## Binary Classification

Say we're given a dataset about student exam results and our goal is to predict whether a student will pass or fail based on number of hours slept and hours spent studying. We have two features (hours slept, hours studied) and two classes: passed (1) and failed (0).

Studied Slept Passed
4.85 9.63 1
8.62 3.23 0
5.43 8.23 1
9.21 6.34 0

Graphically we could represent our data with a scatter plot.

## Sigmoid Function

In order to map predicted values to probabilities, we use the Sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use Sigmoid to map predictions to probabilities.

Math

${\displaystyle z=w_{0}+w_{1}x_{1}+w_{2}x_{2}}$
${\displaystyle s(z)={\frac {1}{1+e^{-z}}}}$
• ${\displaystyle s(z)}$ = output between 0 and 1 (probability estimate)
• ${\displaystyle z}$ = input to the function (your algorithm's prediction e.g. mx + b)
• ${\displaystyle e}$ = base of natural log (wikipedia)

Visualize
Here is a graphical representation of the sigmoid function:

Code
Transforming a real number to a value between 0 and 1.

def sigmoid(z):
return  1 / (1 + np.exp(-z))


## Decision Boundary

A decision boundary is a pretty simple concept. Logistic regression is a classification algorithm, the output should be a category: Yes/No, True/False, Red/Yellow/Orange. Our prediction function however returns a probability score between 0 and 1. A decision boundary is a threshold or tipping point that helps us decide which category to choose based on probability.

${\displaystyle p\geq 0.5,class=1}$
${\displaystyle p<0.5,class=0}$

For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.

## Prediction Function

Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A prediction function in logistic regression returns the probability of our observation being positive, True, or "Yes". We call this class 1 and its notation is P(class=1). As the probability gets closer to 1, our model is more confident that the observation is in class 1.

Math
Let's use the same multiple linear equation from our Linear Regression wiki.

${\displaystyle z=W_{0}+W_{1}Studied+W_{2}Slept}$

This time however we will transform the output using the sigmoid function to return a probability value between 0 and 1.

${\displaystyle P(class=1)={\frac {1}{1+e^{-z}}}}$

If the model returns .4 it believes there is only a 40% chance of passing. If our decision boundary was .5, we would categorize this observation as "Fail."

Code
We wrap the sigmoid function over our old linear regression prediction function.

def predict(features, weights):
'''
Returns 1D array of probabilities
that the class label == 1
'''
return 1 / (1 + np.exp(-np.dot(features,weights)))


## Cost Function

Unfortunately we can't (or at least shouldn't) use the same cost function (MSE) as we did for linear regression. Why? There is a great math explanation here and here, but for now I'll simply say it's because our prediction function is non-linear (due to sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minimums. If our cost function has many local minimums, gradient descent may not find the optimal global minimum.

Math
Instead of Mean Squared Error, we use a cost function called Log Loss, also known as Negative Log-Likelihood or Cross-Entropy Loss. Log Loss can be divided into two separate cost functions, one for y=1 and one for y=0.

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. Source.

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is increasing prediction accuracy (closer to 0 or 1) has diminishing returns on reducing cost due to the logistic nature of our cost function.

The above functions compressed into one

Multiplying by y and (1-y) in the above equation is a sneaky trick that let's us use the same equation to solve for both y=1 and y=0 cases. If y=0, the first side cancels out. If y=1, the second side cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function

Code

# Using Mean Absolute Error
def cost_function(features, labels, weights):
'''
Features:(100,3)
Labels: (100,1)
Weights:(3,1)
Returns 1D matrix of predictions
Cost = ( log(predictions) + (1-labels)*log(1-predictions) ) / len(labels)
'''
observations = len(labels)

predictions = predict(features, weights)

#Take the error when label=1
class1_cost = -labels*np.log(predictions)

#Take the error when label=0
class2_cost = (1-labels)*np.log(1-predictions)

#Take the sum of both costs
cost = class1_cost - class2_cost

#Take the average cost
cost = cost.sum()/observations

return cost


To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other more sophisticated optimization algorithms out there such as conjugate gradient, BFGS, and L-BFGS, but you don't have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!

Math
One of the neat properties of the sigmoid function is its derivative is easy to calculate. You can find a walk-through of the derivation here and a more detailed overview here

${\displaystyle z=w_{0}+w_{1}x_{1}+w_{2}x_{2}}$
${\displaystyle s(z)={\frac {1}{1+e^{-z}}}}$
${\displaystyle s'(z)=s(z)(1-s(z))}$

This leads to an equally beautiful and convenient log loss derivative:

${\displaystyle C'=x(s(z)-{\hat {y}})}$
• ${\displaystyle C'}$ is the derivative of cost with respect to weights
• ${\displaystyle {\hat {y}}}$ is the actual class label (y=0 or y=1)
• ${\displaystyle z}$ is your model's prediction prior to applying sigmoid (${\displaystyle w_{0}+w_{1}x_{1}+w_{2}x_{2}}$)
• ${\displaystyle x}$ is your feature or feature vector.

Notice how this gradient is the same as the Mean Squared Error gradient in Linear Regression! The only difference is the hypothesis function.

Pseudocode

2. Multiply by learning rate
3. Subtract from weights

Code

# Vectorized Gradient Descent
# gradient = X.T * (X*W - y) / N
# gradient = features.T * (predictions - labels) / N

def update_weights(features, labels, weights, lr):
'''
Features:(200, 3)
Labels: (200, 1)
Weights:(3, 1)
'''
N = len(features)

#1 - Get Predictions
predictions = predict(features, weights)

#2 Transpose features from (200, 3) to (3, 200)
# So we can multiply w the (200,1)  cost matrix.
# Returns a (3,1) matrix holding 3 partial derivatives --
# one for each feature -- representing the aggregate
# slope of the cost function across all observations
gradient = np.dot(features.T,  predictions - labels)

#3 Take the average cost derivative for each feature

#4 - Multiply the gradient by our learning rate

#5 - Subtract from our weights to minimize cost

return weights


## Classify

The final step is to convert assign predicted probabilities into class labels (0 or 1)

Code

def decision_boundary(prob):
return 1 if prob >= .5 else 0

def classify(preds):
'''
preds = N element array of predictions between 0 and 1
returns N element array of 0s (False) and 1s (True)
'''
decision_boundary = np.vectorize(decision_boundary)  #vectorized function
return decision_boundary(predictions).flatten()

#Example
Probabilities = [ 0.967  0.448   0.015  0.780  0.978  0.004]
Classifications = [1 0 0 1 1 0]


## Training

Our training process and code is the same as we used for linear regression.

Code

def train(features, labels, weights, lr, iters):
cost_history = []

for i in range(iters):
weights = update_weights(features, labels, weights, lr)

#Calculate error for auditing purposes
cost = cost_function(features, labels, weights)
cost_history.append(cost)

# Log Progress
if i % 1000 == 0:
print "iter: "+str(i) + " cost: "+str(cost)

return weights, cost_history


Logging
If our model is working, we should see our cost decrease after every iteration.

 iter: 0 cost: 0.635
iter: 1000 cost: 0.302
iter: 2000 cost: 0.264

• Final Cost: 0.2487
• Final Weights: [-8.197, .921, .738]

## Model Evaluation

### Log Loss

The Log Loss cost function.

### Accuracy

Also referred to as a "score."

def accuracy(predicted_labels, actual_labels):
diff = predicted_labels - actual_labels
return 1.0 - (float(np.count_nonzero(diff)) / len(diff))


### Decision Boundary

We can also visualize our models performance by graphically comparing our probability estimates to the actual labels. This involves splitting our observations by class (0 and 1) and assigning each observation its predicted probability.

def plot_decision_boundary(trues_preds, falses_preds, db):
fig = plt.figure()

no_of_preds = len(trues_preds) + len(falses_preds)

ax.scatter([i for i in range(len(trues_preds))],trues_preds, s=25, c='b', marker="o", label='Trues')
ax.scatter([i for i in range(len(falses_preds))],falses_preds, s=25, c='r', marker="s", label='Falses')

plt.legend(loc='upper right');
ax.set_title("Decision Boundary")
ax.set_xlabel('N/2')
ax.set_ylabel('Predicted Probability')
plt.axhline(.5, color='black')
plt.show()


## Multiclass Classification

Instead of ${\displaystyle y={0,1}}$ we will expand our definition so that ${\displaystyle y={0,1...n}}$. Basically we re-run binary classification multiple times, once for each class.

Steps

1. Divide the problem into n+1 binary classification problems (+1 because the index starts at 0?).
2. For each class...
3. Predict the probability the observations are in that single class.
4. prediction = ${\displaystyle max(probabilityoftheclasses)}$

For each sub-problem, we select one class (YES) and lump all the others into a second class (NO). Then we take the class with the highest predicted value.

Visualize
What does it looks like with multiple classes and lines?

## Code Examples

### Scikit-learn

Let's compare our performance to the LogisticRegression model provided by scikit-learn .

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

# Normalize grades to values between 0 and 1 for more efficient computation
normalized_range = sklearn.preprocessing.MinMaxScaler(feature_range=(-1,1))

# Extract Features + Labels
labels.shape =  (100,) #scikit expects this
features = normalized_range.fit_transform(features)

# Create Test/Train
features_train,features_test,labels_train,labels_test = train_test_split(features,labels,test_size=0.4)

# Scikit Logistic Regression
scikit_log_reg = LogisticRegression()
scikit_log_reg.fit(features_train,labels_train)

#Score is Mean Accuracy
scikit_score = clf.score(features_test,labels_test)
print 'Scikit score: ', scikit_score

#Our Mean Accuracy
observations, features, labels, weights = run()
probabilities = predict(features, weights).flatten()
classifications = classifier(probabilities)
our_acc = accuracy(classifications,labels.flatten())
print 'Our score: ',our_acc


Scikit score: 0.88
Our score: 0.89