- 1 Overview
- 2 Binary Classification
- 3 Sigmoid Function
- 4 Decision Boundary
- 5 Prediction Function
- 6 Cost Function
- 7 Gradient Descent
- 8 Classify
- 9 Training
- 10 Model Evaluation
- 11 Multiclass Classification
- 12 Code Examples
- 13 Resources
Logistic regression is a classification algorithm used to estimate the probability of a binary outcome (values like 0/1, yes/no, true/false) from a given set of independent variables. Its output values lie between 0 and 1. The model passes a linear combination of the input features through the logistic (sigmoid) function to produce probability values that can then be mapped to two or more discrete classes.
Linear vs Logistic
Given data on time spent studying and exam scores, linear regression and logistic regression can predict different things:
- Linear Regression could help us predict the student's test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).
- Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model's classifications.
Types of logistic regression:
- Binary (Pass/Fail)
- Multi (Cats, Dogs, Sheep)
- Ordinal (Low, Medium, High)
- Pros: Easy to implement, fast to train, returns probability scores
- Cons: Performs poorly when there are too many features or too many classes
Say we're given a dataset of student exam results, and our goal is to predict whether a student will pass or fail based on the number of hours slept and the number of hours spent studying. We have two features (hours slept, hours studied) and two classes: passed (1) and failed (0).
Graphically we could represent our data with a scatter plot.
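As a rough sketch (the numbers below are made up purely for illustration), such a dataset can be represented as a NumPy feature matrix with one row per student and a column of 0/1 labels:

```python
import numpy as np

# Hypothetical data: each row is a student, columns are [hours slept, hours studied]
features = np.array([
    [4.5, 1.0],
    [6.0, 2.5],
    [7.0, 4.0],
    [8.0, 5.5],
])

# Class labels: 1 = passed, 0 = failed
labels = np.array([[0], [0], [1], [1]])

print(features.shape, labels.shape)  # (4, 2) (4, 1)
```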
In order to map predicted values to probabilities, we use the Sigmoid function. The function maps any real value into another value between 0 and 1. In machine learning, we use Sigmoid to map predictions to probabilities.
$$S(z) = \frac{1}{1 + e^{-z}}$$

- $S(z)$ = output between 0 and 1 (probability estimate)
- $z$ = input to the function (your algorithm's prediction, e.g. $mx + b$)
- $e$ = base of the natural log
Here is a graphical representation of the sigmoid function:
Transforming a real number to a value between 0 and 1.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
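For example, passing a few raw scores through the `sigmoid` defined above shows how large negative inputs map close to 0 and large positive inputs map close to 1 (values rounded):

```python
import numpy as np

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(sigmoid(z))
# ~[0.0025  0.2689  0.5  0.7311  0.9975]
```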
A decision boundary is a pretty simple concept. Logistic regression is a classification algorithm, so the output should be a category: Yes/No, True/False, Red/Yellow/Orange. Our prediction function, however, returns a probability score between 0 and 1. A decision boundary is a threshold or tipping point that helps us decide which category to choose based on that probability.
For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.
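A minimal sketch of that thresholding rule (the 0.5 cutoff here is just the conventional default and can be tuned):

```python
import numpy as np

probabilities = np.array([0.7, 0.2])
threshold = 0.5

# Anything at or above the threshold is classified as positive (1)
classes = (probabilities >= threshold).astype(int)
print(classes)  # [1 0] -> positive, negative
```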
Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A prediction function in logistic regression returns the probability of our observation being positive, True, or "Yes". We call this class 1 and its notation is P(class=1). As the probability gets closer to 1, our model is more confident that the observation is in class 1.
Let's use the same multiple linear equation from our Linear Regression wiki:

$$z = W_0 + W_1 \cdot \text{studied} + W_2 \cdot \text{slept}$$

This time, however, we will transform the output using the sigmoid function to return a probability value between 0 and 1:

$$P(\text{class}=1) = S(z) = \frac{1}{1 + e^{-z}}$$
If the model returns .4 it believes there is only a 40% chance of passing. If our decision boundary was .5, we would categorize this observation as "Fail."
We wrap the sigmoid function over our old linear regression prediction function.
```python
def predict(features, weights):
    '''
    Returns 1D array of probabilities
    that the class label == 1
    '''
    z = np.dot(features, weights)
    return sigmoid(z)
```
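For example, with made-up weights (one per feature column, chosen only for illustration) and toy features like the ones sketched earlier, `predict` returns one probability per observation:

```python
import numpy as np

# Hypothetical weights for illustration only
weights = np.array([[0.2], [0.9]])

toy_features = np.array([[4.5, 1.0],
                         [8.0, 5.5]])

print(predict(toy_features, weights))
# ~[[0.858], [0.999]] -- each row is P(class=1) for that student
```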
Unfortunately we can't (or at least shouldn't) use the same cost function (MSE) as we did for linear regression. Why? There are good mathematical explanations of this elsewhere, but for now I'll simply say it's because our prediction function is non-linear (due to the sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the optimal global minimum.
Instead of Mean Squared Error, we use a cost function called Log Loss, also known as Negative Log-Likelihood or Cross-Entropy Loss. Log Loss can be divided into two separate cost functions, one for y=1 and one for y=0:

$$\text{Cost}(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$

where $\hat{y}$ is the model's predicted probability.
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost.
The key thing to note is that the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is that increasing prediction accuracy (moving closer to 0 or 1) has diminishing returns on reducing cost, due to the logarithmic nature of our cost function.
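A quick numeric check makes the asymmetry concrete (these prediction values are arbitrary). With a true label of y=1, the per-observation cost is -log(prediction):

```python
import numpy as np

for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(p, round(-np.log(p), 3))
# 0.99 -> 0.01, 0.9 -> 0.105, 0.5 -> 0.693, 0.1 -> 2.303, 0.01 -> 4.605
```

Moving from 0.9 to 0.99 only saves about 0.1 in cost, while a confidently wrong prediction of 0.01 costs over 4.6.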
The above functions can be compressed into a single cost function:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Multiplying by y and (1-y) in the above equation is a sneaky trick that lets us use the same equation to solve for both the y=1 and y=0 cases. If y=0, the first term cancels out. If y=1, the second term cancels out. In both cases we only perform the operation we need to perform.
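A tiny check of that cancellation (the prediction value 0.7 is arbitrary):

```python
import numpy as np

p = 0.7  # predicted probability for a single observation
for y in (0, 1):
    cost = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(y, round(cost, 3))
# y=0 keeps only -log(1-p) = 1.204;  y=1 keeps only -log(p) = 0.357
```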
Vectorized cost function, where $\hat{y} = S(XW)$:

$$J = -\frac{1}{m} \left( y^{T} \log(\hat{y}) + (1 - y)^{T} \log(1 - \hat{y}) \right)$$
```python
def cost_function(features, labels, weights):
    '''
    Cross-Entropy (Log Loss)

    Features:(100,3)
    Labels: (100,1)
    Weights:(3,1)
    Returns the average cost (a scalar)
    Cost = -( labels*log(predictions) + (1-labels)*log(1-predictions) ) / len(labels)
    '''
    observations = len(labels)

    predictions = predict(features, weights)

    # Take the error when label=1
    class1_cost = -labels*np.log(predictions)

    # Take the error when label=0
    class2_cost = (1-labels)*np.log(1-predictions)

    # Take the sum of both costs
    cost = class1_cost - class2_cost

    # Take the average cost
    cost = cost.sum() / observations

    return cost
```
To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other more sophisticated optimization algorithms out there such as conjugate gradient, BFGS, and L-BFGS, but you don't have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!
This leads to an equally beautiful and convenient log loss derivative:

$$C' = x \, (S(z) - y)$$

- $C'$ is the derivative of cost with respect to the weights
- $y$ is the actual class label (0 or 1)
- $z$ is your model's prediction prior to applying sigmoid ($S(z)$ is the predicted probability)
- $x$ is your feature or feature vector.
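One way to sanity-check that derivative is to compare it against a numerical (finite-difference) gradient of the log loss for a single observation; the feature values, label, and weights below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, -0.5])   # hypothetical feature vector
y = 1.0                          # actual class label
w = np.array([0.1, -0.2, 0.3])   # hypothetical weights

def cost(w):
    p = sigmoid(np.dot(x, w))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient: C' = x * (S(z) - y)
analytic = x * (sigmoid(np.dot(x, w)) - y)

# Numerical gradient: perturb each weight by a small epsilon
eps = 1e-6
numeric = np.array([(cost(w + d) - cost(w - d)) / (2 * eps)
                    for d in np.eye(3) * eps])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```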
Notice how this gradient is the same as the Mean Squared Error gradient in Linear Regression! The only difference is the hypothesis function.
To minimize cost, each iteration of gradient descent does the following:
- Calculate the gradient average
- Multiply by the learning rate
- Subtract from the weights
Vectorized Gradient Descent
```python
def update_weights(features, labels, weights, lr):
    '''
    gradient = features.T * (predictions - labels) / N

    Features:(200, 3)
    Labels: (200, 1)
    Weights:(3, 1)
    '''
    N = len(features)

    # 1 - Get Predictions
    predictions = predict(features, weights)

    # 2 - Transpose features from (200, 3) to (3, 200)
    # so we can multiply with the (200, 1) cost matrix.
    # Returns a (3, 1) matrix holding 3 partial derivatives --
    # one for each feature -- representing the aggregate
    # slope of the cost function across all observations
    gradient = np.dot(features.T, predictions - labels)

    # 3 - Take the average cost derivative for each feature
    gradient /= N

    # 4 - Multiply the gradient by our learning rate
    gradient *= lr

    # 5 - Subtract from our weights to minimize cost
    weights -= gradient

    return weights
```
The final step is to convert the predicted probabilities into class labels (0 or 1).
```python
def decision_boundary(prob):
    return 1 if prob >= .5 else 0

def classify(preds):
    '''
    preds = N element array of predictions between 0 and 1
    returns N element array of 0s (False) and 1s (True)
    '''
    boundary = np.vectorize(decision_boundary)  # vectorized function
    return boundary(preds).flatten()

# Example
# Probabilities   = [0.967  0.448  0.015  0.780  0.978  0.004]
# Classifications = [1      0      0      1      1      0    ]
```
Our training process and code is the same as we used for linear regression.
```python
def train(features, labels, weights, lr, iters):
    cost_history = []

    for i in range(iters):
        weights = update_weights(features, labels, weights, lr)

        # Calculate error for auditing purposes
        cost = cost_function(features, labels, weights)
        cost_history.append(cost)

        # Log Progress
        if i % 1000 == 0:
            print("iter: " + str(i) + " cost: " + str(cost))

    return weights, cost_history
```
If our model is working, we should see our cost decrease after every iteration.
```
iter: 0     cost: 0.635
iter: 1000  cost: 0.302
iter: 2000  cost: 0.264
```
- Final Cost: 0.2487
- Final Weights: [-8.197, .921, .738]
To evaluate the model we can track the Log Loss cost as training proceeds and compute the accuracy of our predictions. Accuracy, also referred to as a "score," measures how close our predicted labels are to the actual labels.
```python
def accuracy(predicted_labels, actual_labels):
    diff = predicted_labels - actual_labels
    return 1.0 - (float(np.count_nonzero(diff)) / len(diff))
```
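For example, with made-up label arrays where 4 of 6 predictions match the actual labels:

```python
import numpy as np

predicted = np.array([1, 0, 0, 1, 1, 0])
actual    = np.array([1, 0, 1, 1, 0, 0])

print(accuracy(predicted, actual))  # 0.666...
```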
We can also visualize our model's performance by graphically comparing our probability estimates to the actual labels. This involves splitting our observations by class (0 and 1) and assigning each observation its predicted probability.
```python
import matplotlib.pyplot as plt

def plot_decision_boundary(trues_preds, falses_preds, db):
    fig = plt.figure()
    ax = fig.add_subplot(111)

    no_of_preds = len(trues_preds) + len(falses_preds)

    ax.scatter([i for i in range(len(trues_preds))], trues_preds,
               s=25, c='b', marker="o", label='Trues')
    ax.scatter([i for i in range(len(falses_preds))], falses_preds,
               s=25, c='r', marker="s", label='Falses')

    plt.legend(loc='upper right')
    ax.set_title("Decision Boundary")
    ax.set_xlabel('N/2')
    ax.set_ylabel('Predicted Probability')
    plt.axhline(db, color='black')
    plt.show()
```
Instead of $y \in \{0, 1\}$ we will expand our definition so that $y \in \{0, 1, \dots, n\}$. Basically we re-run binary classification multiple times, once for each class.
- Divide the problem into n+1 binary classification problems (n+1 because the class index starts at 0).
- For each class...
- Predict the probability the observations are in that single class.
- prediction = the maximum of these per-class probabilities
For each sub-problem, we select one class (YES) and lump all the others into a second class (NO). Then we take the class with the highest predicted value.
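A rough one-vs-rest sketch built on top of the train and predict functions above; the learning rate, iteration count, and function names introduced here are assumptions for illustration, not part of the original code:

```python
import numpy as np

def train_one_vs_rest(features, labels, n_classes, lr=0.1, iters=1000):
    '''Fit one binary classifier per class: class k = "YES", everything else = "NO".'''
    all_weights = []
    for k in range(n_classes):
        binary_labels = (labels == k).astype(float).reshape(-1, 1)
        weights = np.zeros((features.shape[1], 1))
        weights, _ = train(features, binary_labels, weights, lr, iters)
        all_weights.append(weights)
    return all_weights

def predict_multiclass(features, all_weights):
    '''Pick the class whose classifier assigns the highest probability.'''
    probs = np.hstack([predict(features, w) for w in all_weights])  # (N, n_classes)
    return np.argmax(probs, axis=1)
```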
What does it look like with multiple classes and decision boundaries?
Let's compare our performance to the LogisticRegression model provided by scikit-learn.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Normalize grades to values between -1 and 1 for more efficient computation
normalized_range = MinMaxScaler(feature_range=(-1, 1))

# Extract Features + Labels
labels.shape = (100,)  # scikit expects this
features = normalized_range.fit_transform(features)

# Create Test/Train
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.4)

# Scikit Logistic Regression
scikit_log_reg = LogisticRegression()
scikit_log_reg.fit(features_train, labels_train)

# Score is Mean Accuracy
scikit_score = scikit_log_reg.score(features_test, labels_test)
print('Scikit score: ', scikit_score)

# Our Mean Accuracy
observations, features, labels, weights = run()  # run() is our own training routine
probabilities = predict(features, weights).flatten()
classifications = classify(probabilities)
our_acc = accuracy(classifications, labels.flatten())
print('Our score: ', our_acc)
```
Scikit score: 0.88
Our score: 0.89