Regression Basics

Difficulty: Intermediate
Reading Time: 15 minutes

From Relationships to Predictions

In the correlation lesson, we learned how to measure whether two things move together. But correlation just tells you there's a relationship - it doesn't let you make specific predictions. That's where regression comes in.

Regression takes the relationship between two variables and draws a line through it. That line becomes a prediction tool: give me one number, and I'll estimate the other.

The Line of Best Fit

Imagine you have data on 50 houses - each house's size (in square feet) and its selling price. If you plot these on a graph, you'll see a scatter of dots trending upward: bigger houses generally cost more.

Regression finds the single straight line that comes closest to all those dots. This is called the line of best fit (or regression line). It doesn't pass through every point - real data is too messy for that. Instead, it keeps the vertical gaps between the line and the points as small as possible overall (the exact rule is covered under What Makes the Line "Best"? below).

Example

You collect data on house sizes and prices in your neighborhood:

  • 1,000 sq ft house sold for $180,000
  • 1,400 sq ft house sold for $230,000
  • 1,800 sq ft house sold for $290,000
  • 2,200 sq ft house sold for $340,000
  • 2,600 sq ft house sold for $385,000

Regression draws the best line through these points. For these five sales, the least-squares line works out to: Price = $51,000 + ($130 x Square Feet).

Now you can predict: a 2,000 sq ft house would be approximately $51,000 + ($130 x 2,000) = $311,000, which sits sensibly between the 1,800 and 2,200 sq ft sales. That's the power of regression - it turns a pattern into a specific prediction.

The Equation of the Line

Every regression line can be written as a simple equation:

Y = a + bX

Where:

  • Y is what you're trying to predict (the "outcome" or "dependent variable") - like house price.
  • X is what you're using to make the prediction (the "predictor" or "independent variable") - like house size.
  • b is the slope - how much Y changes for each one-unit increase in X. In our example, each additional square foot adds $130 to the predicted price.
  • a is the intercept - the predicted value of Y when X is zero. This sometimes makes practical sense (a theoretical "zero-size house" would cost $51,000 for the land) and sometimes doesn't, especially when zero lies far outside the range of your data.
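
As a minimal sketch, the equation can be written as a short Python function. The helper names `fit_line` and `predict` are illustrative, not taken from any library, and the data are the five house sales listed earlier. `fit_line` uses the standard closed-form least-squares formulas for one predictor.

```python
# Five example sales: (square feet, price in dollars).
SALES = [(1000, 180_000), (1400, 230_000), (1800, 290_000),
         (2200, 340_000), (2600, 385_000)]


def fit_line(points):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    # Intercept: chosen so the line passes through (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


def predict(x, a, b):
    """Apply Y = a + bX."""
    return a + b * x


a, b = fit_line(SALES)
print(f"intercept a = {a:,.0f}, slope b = {b:,.0f}")
print(f"predicted price at 2,000 sq ft: {predict(2000, a, b):,.0f}")
# One extra square foot changes the prediction by exactly the slope.
print(predict(2001, a, b) - predict(2000, a, b))
```

For these five points the fit gives an intercept of $51,000 and a slope of $130 per square foot, so the 2,000 sq ft prediction is $311,000.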

What Makes the Line "Best"?

There are infinitely many lines you could draw through a scatter of points. Regression picks the one that minimizes the sum of squared errors. What does that mean in plain language?

For each data point, the "error" is the vertical distance between the point and the line. Some points fall above the line (the line underestimated) and some fall below (it overestimated). Regression squares each error (which makes all errors positive), adds them all up, and finds the line that makes this total as small as possible.

This method is called least squares regression, and it's been used for over 200 years.
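
To make "as small as possible" concrete, here is a rough sketch that tries many candidate lines for the five house sales and keeps the one with the smallest sum of squared errors. The search grid is a hypothetical choice made for illustration. Real software does not search like this; it solves for the best line directly with a formula.

```python
# Five example sales: (square feet, price in dollars).
SALES = [(1000, 180_000), (1400, 230_000), (1800, 290_000),
         (2200, 340_000), (2600, 385_000)]


def sum_squared_errors(points, a, b):
    """Total of squared vertical gaps between each point and Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


# Hypothetical search grid: intercepts in $1,000 steps, slopes in $5 steps.
candidates = [(a, b)
              for a in range(30_000, 70_001, 1_000)
              for b in range(100, 161, 5)]

best_a, best_b = min(candidates,
                     key=lambda line: sum_squared_errors(SALES, *line))
best_total = sum_squared_errors(SALES, best_a, best_b)
print(best_a, best_b, best_total)

# Nudging the winning line in any direction makes the total worse.
for da, db in [(1_000, 0), (-1_000, 0), (0, 5), (0, -5)]:
    nudged = sum_squared_errors(SALES, best_a + da, best_b + db)
    print(f"a{da:+}, b{db:+}: total rises by {nudged - best_total:,}")
```

On this grid the winner is an intercept of $51,000 and a slope of $130 per square foot.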

Reading Regression Output

When software runs a regression, it gives you several key numbers. Here's what to look for:

R-squared (R²)

This tells you how much of the variation in your outcome is explained by your predictor. It ranges from 0 to 1 (or 0% to 100%).

  • R² = 0.85 means house size explains 85% of the variation in price. That's strong - size is a good predictor.
  • R² = 0.15 means the predictor only explains 15% of the variation. Other factors matter much more.
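
R-squared can be computed by hand as one minus the share of variation the line leaves unexplained. The sketch below does this for the five house sales from earlier, using their least-squares line (intercept $51,000, slope $130 per square foot). The function name `r_squared` is illustrative.

```python
# Five example sales: (square feet, price in dollars).
SALES = [(1000, 180_000), (1400, 230_000), (1800, 290_000),
         (2200, 340_000), (2600, 385_000)]


def r_squared(points, a, b):
    """Share of the variation in Y accounted for by the line Y = a + bX."""
    mean_y = sum(y for _, y in points) / len(points)
    # Variation left over after using the line.
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in points)
    # Variation around the plain average, i.e. with no predictor at all.
    total = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - unexplained / total


# Least-squares line for the five sales.
print(round(r_squared(SALES, 51_000, 130), 3))
# A flat line at the average price explains none of the variation.
print(r_squared(SALES, 285_000, 0))
```

The value for the fitted line comes out very close to 1 because these five tidy points lie almost exactly on a line; real data sets are rarely this clean.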

The Slope (and Its P-Value)

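The slope tells you the direction and size of the relationship. The p-value attached to the slope tells you whether the relationship is statistically significant: it is the chance of seeing a slope at least this far from zero in your sample if there were truly no relationship. A small p-value (by convention, below 0.05) suggests the relationship is real rather than a fluke of the sample.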

Standard Error

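This tells you roughly how far off your predictions will typically be. (Software may label it the residual standard error or the standard error of the estimate; it is a different number from the standard error reported next to the slope.) If the errors are roughly bell-shaped, a standard error of $20,000 on house price predictions means about two-thirds of your estimates will be within $20,000 of the actual price and about 95% will be within $40,000 - but a few will be further off.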
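
As a sketch of where this number comes from, the code below computes the residual standard error for the five house sales from earlier, using their least-squares line (intercept $51,000, slope $130 per square foot). The function name is illustrative. Dividing by n - 2 rather than n is the usual convention for a line with one predictor, because two quantities (a and b) were estimated from the data.

```python
import math

# Five example sales: (square feet, price in dollars).
SALES = [(1000, 180_000), (1400, 230_000), (1800, 290_000),
         (2200, 340_000), (2600, 385_000)]


def residual_standard_error(points, a, b):
    """Typical size of a prediction miss for the line Y = a + bX."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in points)
    # n - 2: two numbers (a and b) were estimated from the data.
    return math.sqrt(sse / (len(points) - 2))


se = residual_standard_error(SALES, 51_000, 130)
print(f"standard error: {se:,.0f}")

# Compare each actual miss with the standard error.
misses = [abs(y - (51_000 + 130 * x)) for x, y in SALES]
within = sum(miss <= se for miss in misses)
print(f"{within} of {len(misses)} sales fall within one standard error")
```

Here the standard error is a little under $4,500, and four of the five sales land within that distance of the line.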

Example

A manager wants to predict monthly sales based on advertising spend. After collecting 24 months of data, regression gives:

  • Equation: Sales = $10,000 + ($5 x Ad Spend)
  • R² = 0.72 - Ad spending explains 72% of the variation in sales.
  • Slope p-value = 0.001 - The relationship is very unlikely to be a coincidence.
  • Standard error = $3,500 - Predictions will typically be off by about $3,500.

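If the company spends $8,000 on advertising next month, the prediction is $10,000 + ($5 x 8,000) = $50,000 in sales. As a rough guide, about two-thirds of actual results should land within one standard error of that figure (between $46,500 and $53,500), and about 95% within two standard errors (between $43,000 and $57,000).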
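
The manager's calculation can be sketched in a few lines of Python. The function name `forecast` and its `n_errors` parameter are hypothetical. The band is only a rule of thumb: it assumes roughly bell-shaped errors and ignores the extra uncertainty in the fitted line itself.

```python
INTERCEPT = 10_000       # predicted sales with zero ad spend
SLOPE = 5                # extra sales dollars per advertising dollar
STANDARD_ERROR = 3_500   # typical size of a prediction miss


def forecast(ad_spend, n_errors=1):
    """Return (low, point, high) for a band of n_errors standard errors."""
    point = INTERCEPT + SLOPE * ad_spend
    margin = n_errors * STANDARD_ERROR
    return point - margin, point, point + margin


print(forecast(8_000))      # one standard error either side
print(forecast(8_000, 2))   # two standard errors either side
```

The first call gives a band of $46,500 to $53,500 around the $50,000 prediction; widening to two standard errors gives $43,000 to $57,000.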

Limitations and Cautions

Regression is incredibly useful, but it has important limitations:

  • Don't extrapolate too far. If your data covers houses from 800 to 3,000 sq ft, don't use the line to predict the price of a 10,000 sq ft mansion. The relationship may not continue in a straight line beyond your data range.
  • Correlation, not causation. Just because you can predict Y from X doesn't mean X causes Y. Ice cream sales predict drowning rates, but buying ice cream doesn't cause drowning: hot weather drives both.
  • One predictor is often not enough. House price depends on size, but also location, age, condition, and many other things. Simple regression uses one predictor; multiple regression (a topic for a later lesson) uses several.
  • Outliers can distort the line. A single unusual data point - like a tiny house that sold for millions because of its location - can pull the entire line off course.
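
Two of these cautions are easy to demonstrate on the five house sales from earlier. In the sketch below, the helper names and the outlier (a 900 sq ft house selling for $2,000,000) are hypothetical, made up purely for illustration.

```python
# Five example sales: (square feet, price in dollars).
SALES = [(1000, 180_000), (1400, 230_000), (1800, 290_000),
         (2200, 340_000), (2600, 385_000)]


def fit_line(points):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def in_data_range(x, points):
    """True if x lies between the smallest and largest X in the data."""
    xs = [px for px, _ in points]
    return min(xs) <= x <= max(xs)


# Extrapolation check: 2,000 sq ft is covered by the data, 10,000 is not.
print(in_data_range(2_000, SALES), in_data_range(10_000, SALES))

# Outlier check: refit after adding one made-up, very unusual sale.
a, b = fit_line(SALES)
a_out, b_out = fit_line(SALES + [(900, 2_000_000)])
print(f"slope without outlier: {b:,.0f}; with outlier: {b_out:,.0f}")
```

With the single outlier added, the slope flips from $130 per square foot to a negative value (about -$474), so the line would claim that bigger houses sell for less.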

Regression in Real Life

Regression is one of the most widely used statistical tools in the world:

  • Real estate: Estimating home values from size, location, and features.
  • Healthcare: Predicting patient outcomes from age, weight, and lifestyle factors.
  • Business: Forecasting sales from advertising budgets, season, and economic indicators.
  • Education: Predicting student performance from study hours, attendance, and prior grades.

When a website shows an "estimated delivery time" or a "predicted price," there is often a regression model, or something built on the same idea, running behind the scenes.

Key Takeaway

Regression finds the best straight line through your data, turning a relationship between two variables into a prediction tool. The equation Y = a + bX gives you a specific forecast for any value of X. R-squared tells you how much of the outcome the predictor explains, and the standard error tells you how accurate your predictions tend to be. It's one of the most practical tools in statistics, but remember: predictions work best within the range of your original data, and predicting something is not the same as causing it.