Lesson 3: Logistic Regression and Classification

All vs. One and One vs. All

by CAIS++

Last week, we structured our lesson primarily around written content in an effort to condense the information from the first few chapters of Professor Andrew Ng’s course on machine learning. (The videos were optional because watching them all would have taken too long.) This week, since you hopefully already have a grasp on the fundamental ideas behind machine learning, you should be able to get through most of his videos on classification without too much trouble. As a result, this week’s lesson is more video-focused rather than text-focused. Please watch all the videos below, and read our written recaps to make sure you have the core ideas behind logistic regression (a type of classification) down. The videos will take a little less than an hour total, so plan accordingly! (Feel free to speed up the videos if you’re short on time.)

Video: Motivation and Intro (8:09)

Recap: As you may have already realized, linear regression is not the best approach for all problems, specifically those that involve classification. A problem can be deemed a classification problem if the outputs are different discrete outcomes. For example, if we wanted to determine if a patient has cancer, we can label the outcomes as either yes or no. Or, if we wanted to label an image, we could classify it has an image of a person, snake, car, etc.. The point is that these outputs are not continuous – as a result, we need a special function to be able to approximate these discrete outputs. At first, one might think that a step function might do the job. However, a step function is not continuous, which makes doing necessary computations on it, such as finding partial derivatives, very difficult.

In the next video, you will see that the logistic regression uses a function that roughly approximates the step function called the sigmoid function. This function is smooth and continuous, and has very convenient derivative properties (i.e. very easy to take the derivative).

Video: Hypothesis Representation (7:24)

Recap: The most important aspect of the sigmoid function is that it can take the product of our input parameters and weights and squeeze it into outputs between 0 and 1. We can then set a threshold value (between 0 and 1) to then map these outputs to the appropriate class label. Everything above the threshold would be mapped to the class associated with the value 1. Likewise, everything below the threshold would be mapped to the class associated with 0. Next, we will see how we can visually divide our data based on their label. These visual descriptions are often referred to as “decision boundaries.”

Video: Decision Boundary (14:49)

Recap: Often times, we will want to visually divide our data based on the corresponding labels. For example, if we have data on whether or not someone has cancer or not, we will want to determine a function that marks a visual boundary that divides our data. This function will not only allow us to see how we are dividing and classifying our data, but will also allow us to predict outputs for new inputs that our model has not seen yet. Next arises the question on how we train our model so that it can accurately classify our data. Like the linear regression model, logistic regression also includes a cost function that achieves the same goal as linear regression. However, there are some fundamental differences due to the logistic regression being a model that classifies its input data.


Logistic Regression Cost Function (11:25),

Simplified Cost Function, Gradient Descent (10:14)

Recap: Since logistic regression addresses the inherently different problem of classification, we will need a different function in order to measure the error in our predictions. Basically, the logistic regression cost function takes into account the correct label, and penalizes the function accordingly. The second video shows how we can simplify our logistic cost function to make it easier to use and implement.

Video: Multi-Class Classification (6:15)

Recap: If we want to classify items between multiple classes (not just binary ones), we can set up multiple classification problems: one for each class we want to classify. For each class \(i\), we will train a classifier to distinguish between \(i\) and \(\text{not } i\). We will end up with multiple trained classifiers, each capable of predicting the probability of an item belonging to its own class. To find out which class a new item belongs to, we simply pick the class with the highest probability at the end.

Main Takeaways:

  1. What it means for a problem to be considered a “classification” problem
  2. What the sigmoid function is and why it helps us solve classification problems
  3. How we visualize our logistic regression classifiers
  4. Differences between the logistic regression cost function and the linear regression cost function
  5. How we can apply “one vs. all” logistic regression to allow for classification between multiple classes