Logistic Regression

Difficulty: Advanced · Reading Time: 15 minutes

When the Outcome Is Yes or No

Linear regression works beautifully when you are predicting a continuous number, like house price, temperature, or test score. But what happens when the thing you want to predict has only two possible outcomes? Will the customer buy or not buy? Will the patient recover or not recover? Will the email be spam or not spam? For these binary outcomes, linear regression breaks down, and logistic regression steps in.

The core problem with using linear regression for binary outcomes is that it can produce predictions below 0 or above 1, which make no sense as probabilities. If you tried to draw a straight line through data where the outcome is either 0 or 1, the line would inevitably extend into impossible territory. Logistic regression solves this by using a different shape entirely.

The Sigmoid Curve

Instead of fitting a straight line, logistic regression fits an S-shaped curve called the sigmoid (or logistic) function. This curve starts near 0 on the left, rises through 0.5 in the middle, and approaches 1 on the right, but never actually reaches 0 or 1. This means the predicted values are always valid probabilities, between 0 and 1.

[Figure: scatter plot of a binary outcome (y-axis, 0 to 1) against a continuous predictor (x-axis, roughly 0 to 15)]

In the scatter plot above, imagine the x-axis represents years of experience and the y-axis represents whether someone passed a certification exam (1 = pass, 0 = fail). The raw data shows a clear pattern: more experience makes passing more likely. A logistic regression model would fit a sigmoid curve through these points, giving you the estimated probability of passing at any level of experience.

Mathematically, the model takes a linear combination of your input variables (just like regular regression) but then wraps it inside the sigmoid function. This means you get all the familiar concepts of coefficients and predictors but with an output that behaves as a probability.
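To make this concrete, here is a minimal sketch of that "linear score wrapped in a sigmoid" idea. The intercept and slope below are made-up values for the years-of-experience example, not coefficients from any real fitted model:

```python
import math

def sigmoid(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients: intercept -4.0, slope 0.8 per year of experience.
intercept, slope = -4.0, 0.8

for years in [0, 5, 10]:
    # Linear combination first (just like linear regression)...
    score = intercept + slope * years
    # ...then squashed through the sigmoid to get a probability.
    p = sigmoid(score)
    print(f"{years} years -> P(pass) = {p:.3f}")
```

With these invented numbers, someone at 5 years of experience sits exactly at the middle of the curve (a score of 0 maps to a probability of 0.5), while less or more experience pushes the probability toward 0 or 1 without ever leaving that range.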

Understanding Odds and Odds Ratios

Internally, logistic regression does not work with probabilities directly. Instead, it works with odds. If the probability of an event is 0.8, the odds are 0.8 / 0.2 = 4, meaning the event is four times more likely to happen than not. The model actually predicts the logarithm of the odds (called the log-odds or logit), which is why it is sometimes called logit regression.
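The round trip between probability, odds, and log-odds is just arithmetic, and it is worth seeing once. A small sketch using the 0.8 example from the text:

```python
import math

p = 0.8                        # probability of the event
odds = p / (1 - p)             # 0.8 / 0.2 = 4: four times more likely than not
log_odds = math.log(odds)      # the logit -- the scale the model actually fits

# Going back: invert the logit to recover the probability.
p_back = 1 / (1 + math.exp(-log_odds))

print(f"odds = {odds:.2f}, log-odds = {log_odds:.3f}, probability = {p_back:.2f}")
```

The inversion in the last step is exactly the sigmoid function: the sigmoid and the logit are inverses of each other, which is why wrapping the linear predictor in a sigmoid is equivalent to modeling the log-odds linearly.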

The coefficients in a logistic regression are expressed as log-odds, which are not intuitive. To make them interpretable, researchers convert them to odds ratios by taking e raised to the power of the coefficient. An odds ratio of 2.5 for a variable means that a one-unit increase in that variable multiplies the odds of the outcome by 2.5. An odds ratio of 1 means no effect, greater than 1 means higher odds, and less than 1 means lower odds.
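The conversion from coefficient to odds ratio is a single exponentiation. A sketch with an invented coefficient chosen so the odds ratio lands near the 2.5 mentioned above:

```python
import math

coef = 0.916                   # hypothetical log-odds coefficient from some fitted model
odds_ratio = math.exp(coef)    # e^coef converts log-odds to an odds ratio

print(f"odds ratio = {odds_ratio:.2f}")
# An odds ratio near 2.5: each one-unit increase in the predictor
# multiplies the odds of the outcome by about 2.5.
# A coefficient of 0 would give e^0 = 1 (no effect);
# a negative coefficient gives an odds ratio below 1 (lower odds).
```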

Example

A hospital builds a logistic regression model to predict whether a patient will be readmitted within 30 days. The model finds that each additional chronic condition a patient has increases the odds of readmission by a factor of 1.4 (odds ratio = 1.4). A patient with 3 chronic conditions has roughly 1.4 times 1.4 times 1.4 = 2.74 times the odds of readmission compared to a patient with no chronic conditions. This gives doctors a clear, quantifiable risk factor.
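Because odds ratios multiply, the effect of several chronic conditions compounds. The readmission arithmetic above can be checked directly (the 1.4 odds ratio is the example's figure, not a real clinical estimate):

```python
import math

or_per_condition = 1.4          # odds ratio per additional chronic condition
conditions = 3

# Odds ratios multiply across units of the predictor...
combined = or_per_condition ** conditions
print(f"combined odds ratio = {combined:.3f}")   # about 2.744

# ...which is the same as adding on the log-odds scale and exponentiating.
combined_via_logs = math.exp(conditions * math.log(or_per_condition))
print(f"via log-odds = {combined_via_logs:.3f}")
```

The equivalence of the two calculations is the whole point of working on the log-odds scale: effects that multiply as odds ratios simply add as log-odds.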

When to Choose Logistic Over Linear Regression

The decision is straightforward: if your outcome variable is binary (two categories), use logistic regression. If your outcome is continuous, use linear regression. Trying to force a binary outcome into a linear model will give you misleading results, nonsensical predictions, and violated assumptions.

There are extensions of logistic regression for outcomes with more than two categories. Multinomial logistic regression handles cases where the outcome is one of three or more unordered categories (like choosing between bus, car, or bicycle). Ordinal logistic regression handles ordered categories (like rating something as low, medium, or high). But the standard binary version is by far the most common.

[Figure: scatter plot of a binary outcome (y-axis, 0 to 1) against a continuous predictor (x-axis, roughly 20 to 70)]

The second scatter plot above might represent age (x-axis) versus whether a person has a particular health condition (y-axis). Notice how a straight line would be a poor fit, but an S-shaped curve would capture the transition from low probability at younger ages to high probability at older ages.

Interpreting and Evaluating the Model

Unlike linear regression, logistic regression does not use R-squared to measure fit. Instead, you evaluate it by how well it classifies cases. Common metrics include accuracy (the percentage of predictions that were correct), sensitivity (the proportion of actual positives the model catches), specificity (the proportion of actual negatives it correctly identifies), and the area under the ROC curve (AUC), which summarizes overall classification ability on a scale from 0.5 (random guessing) to 1.0 (perfect).
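Accuracy, sensitivity, and specificity all come from the same confusion-matrix counts. A minimal sketch on a toy set of labels (the labels here are invented, purely for illustration):

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),     # fraction correct overall
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
    }

# Toy example: 8 cases, one missed positive and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_metrics(y_true, y_pred))
```

AUC is left out of the sketch because it requires the predicted probabilities, not just the hard labels; it measures how well those probabilities rank positives above negatives across all possible thresholds.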

You also need to choose a classification threshold. The model outputs a probability, but to make a yes/no decision, you need to pick a cutoff. Typically 0.5 is used: if the predicted probability is above 0.5, predict "yes." But in some contexts, you might lower the threshold. A medical screening test might use 0.3 to catch more true cases, accepting more false alarms as a trade-off.
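Turning probabilities into decisions is a one-line comparison, and lowering the cutoff visibly changes which cases get flagged. A sketch with invented probabilities:

```python
def classify(probabilities, threshold=0.5):
    """Convert predicted probabilities into yes (1) / no (0) decisions."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.2, 0.35, 0.6, 0.9]          # hypothetical model outputs

print(classify(probs))                 # default 0.5 cutoff -> [0, 0, 1, 1]
print(classify(probs, threshold=0.3))  # screening cutoff   -> [0, 1, 1, 1]
```

With the lower 0.3 threshold, the 0.35 case is now flagged: one more true case potentially caught, at the cost of more false alarms among cases that would have been (correctly) left alone.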

Logistic regression assumes a linear relationship between the input variables and the log-odds of the outcome. It also assumes that observations are independent of each other. It is relatively simple compared to advanced machine learning methods, which is actually a strength: the results are interpretable, the odds ratios are meaningful, and the model is easy to explain to non-technical audiences.

Logistic Regression in the Real World

Logistic regression is everywhere. Banks use it to decide whether to approve a loan (default vs no default). Email providers use it to classify spam. Marketers use it to predict which customers will churn. Medical researchers use it to identify risk factors for disease. Its popularity comes from a combination of simplicity, interpretability, and strong performance on many real-world problems.

When you read a study that reports odds ratios, you are looking at the output of a logistic regression. Understanding what those numbers mean (that an odds ratio of 1.8 means 80% higher odds, not 80% higher probability) is essential for correctly interpreting medical and social science research.
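The odds-versus-probability distinction is easy to verify with arithmetic. Assuming a hypothetical baseline probability of 0.10, an odds ratio of 1.8 raises the probability to about 0.167, not to 0.18:

```python
# Hypothetical baseline: 10% probability of the outcome.
p0 = 0.10
odds0 = p0 / (1 - p0)           # baseline odds, about 0.111

odds1 = odds0 * 1.8             # odds ratio of 1.8: 80% higher odds
p1 = odds1 / (1 + odds1)        # convert the new odds back to a probability

print(f"probability goes from {p0:.3f} to {p1:.3f}")   # 0.100 -> 0.167
```

The probability rises by about 67% in relative terms, not 80%, and the gap between the two interpretations grows as the baseline probability gets larger.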

Key Takeaway

Logistic regression is the standard method for predicting binary outcomes. It uses the sigmoid function to keep predictions between 0 and 1, and its coefficients are interpreted as odds ratios. Use it whenever your outcome is yes/no, pass/fail, or any two-category variable. While the math involves log-odds, the practical interpretation is clear: each predictor either increases or decreases the odds of the outcome by a quantifiable amount.