Understanding Outliers

Difficulty: Beginner · Reading Time: 8 minutes

What Is an Outlier?

An outlier is a value that is noticeably different from the rest of the data. It sits far away from where most of the other values cluster. Outliers are not automatically errors - sometimes they are the most interesting part of your data.

Example

Nine students take a quiz. Their scores are: 62, 65, 67, 68, 70, 71, 72, 74, 98

Most scores are bunched between 62 and 74. The score of 98 stands out - it is much higher than everything else. That is an outlier.

Outliers can appear on either end. A value can be unusually high or unusually low. Sometimes there is more than one outlier in a dataset.

How Outliers Affect the Mean

As we learned in the lesson on mean, median, and mode, the mean is sensitive to extreme values: a single outlier can pull it far away from where most of the data sits. This pull on the mean is the most important practical consequence of outliers.

Example

A small company has 6 employees with these annual salaries:

$38,000 · $40,000 · $42,000 · $44,000 · $45,000 · $250,000

With the outlier ($250,000):

  • Mean = $76,500
  • Median = $43,000

Without the outlier:

  • Mean = $41,800
  • Median = $42,000

Removing the one high salary drops the mean by $34,700 (from $76,500 to $41,800), while the median moves by only $1,000. This is why the median is often preferred when outliers are present.
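The salary comparison above can be reproduced in a few lines. This is a minimal sketch using Python's standard statistics module; the variable names and the summarize helper are illustrative choices, not part of the lesson.

```python
from statistics import mean, median

# Annual salaries from the example above
salaries = [38_000, 40_000, 42_000, 44_000, 45_000, 250_000]

# The same data with the one extreme salary left out
without_outlier = [s for s in salaries if s != 250_000]

def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)

mean_with, median_with = summarize(salaries)
mean_without, median_without = summarize(without_outlier)

print(f"With outlier:    mean = ${mean_with:,.0f}, median = ${median_with:,.0f}")
print(f"Without outlier: mean = ${mean_without:,.0f}, median = ${median_without:,.0f}")
print(f"Mean shifts by ${mean_with - mean_without:,.0f}")
print(f"Median shifts by ${median_with - median_without:,.0f}")
```

Printing both versions side by side makes the contrast easy to see: the mean moves by tens of thousands of dollars, the median by one thousand.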

How Outliers Affect Other Statistics

It is not just the mean. Outliers can also inflate the range, variance, and standard deviation, making the data appear more spread out than it really is for most values.

Example

Daily customers at a small bakery over 7 days: 45, 48, 50, 52, 47, 51, 310

On six of those days, traffic was steady around 45-52 customers. But on one day, a local event brought in 310 people.

Range with the outlier: 310 − 45 = 265

Range without it: 52 − 45 = 7

The outlier makes the bakery look wildly inconsistent when, in reality, it has very steady daily traffic.
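The same comparison works for measures of spread. The sketch below recomputes the range for the bakery data and also checks the standard deviation, which the example does not show; the sample standard deviation (statistics.stdev) is an assumption here, since the lesson does not say which version it has in mind.

```python
from statistics import stdev

# Daily customers from the bakery example
daily_customers = [45, 48, 50, 52, 47, 51, 310]

# The six ordinary days, with the event day left out
typical_days = [c for c in daily_customers if c != 310]

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

range_with = value_range(daily_customers)
range_without = value_range(typical_days)

# Sample standard deviation (assumption: the lesson does not specify
# sample vs. population)
sd_with = stdev(daily_customers)
sd_without = stdev(typical_days)

print(f"Range: {range_with} with the outlier, {range_without} without")
print(f"Standard deviation: {sd_with:.1f} with the outlier, {sd_without:.1f} without")
```

Both measures tell the same story: one unusual day dominates the spread of the whole week.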

Where Do Outliers Come From?

Understanding why an outlier exists helps you decide what to do with it. There are several common causes:

1. Data Entry Errors

Someone types 1000 instead of 100. A sensor glitches and records a temperature of 500°C in a room. These are mistakes, and they should be corrected or removed.

2. Measurement Errors

A scale was not calibrated properly, or a survey question was confusing and someone misunderstood it. Again, these outliers do not represent real information and can usually be set aside.

3. Genuine Extreme Values

Sometimes reality produces extreme numbers: a professional athlete in a recreational league, a mansion in a neighborhood of modest homes, a viral social media post among hundreds of normal ones. These outliers are real and meaningful.

4. Different Populations Mixed Together

If you accidentally combine data from two very different groups - say, salaries of part-time workers and CEOs in the same dataset - the CEO salaries will look like outliers. This often means the data should be analyzed in separate groups.

When to Keep Outliers

Outliers should be kept when they represent genuine, accurate data points that are part of the story you are trying to understand.

Example

A hospital tracks how long patients wait in the emergency room. Most wait 20-45 minutes, but one patient waited 6 hours due to a system failure.

That 6-hour wait is an outlier, but it is real. Removing it would hide a serious problem. In this case, the outlier is arguably the most important data point.

In general, keep outliers when:

  • They are accurate measurements (not errors)
  • They represent important events or patterns
  • Removing them would hide information your audience needs
  • You are trying to understand the full range of what is possible

When to Remove (or Separate) Outliers

Sometimes outliers distort your analysis so much that they prevent you from understanding the main pattern in your data.

Example

You are analyzing typical grocery spending in a neighborhood. Most households spend $300-$600 per month. One household spends $8,000 because they run a catering business from home.

Including that household would skew your averages and give a misleading picture of typical spending. You might report the results both ways: "The average household spends $420 per month, excluding one commercial buyer who spends $8,000."

Consider removing or separately reporting outliers when:

  • They are caused by errors (typos, equipment malfunctions)
  • They come from a different population than the one you are studying
  • They distort the analysis of the main group so much that patterns become invisible
  • You can clearly note their removal, so your analysis stays honest

The Golden Rule: Always Report What You Did

Whether you keep outliers or remove them, transparency is essential. If you remove data points, say so. Explain why. Show the results both with and without the outliers when possible. Quietly dropping inconvenient data points is one of the most common ways statistics get manipulated, even unintentionally.

Simple Methods for Identifying Outliers

How do you decide if a value qualifies as an outlier? Here are two straightforward approaches:

The standard deviation method: Any value more than 2 or 3 standard deviations from the mean is often considered an outlier. Which cutoff to use is a judgment call. By the 68-95-99.7 rule, about 5% of values in bell-shaped data fall more than 2 standard deviations from the mean, and only about 0.3% fall more than 3.
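As a rough sketch of the standard deviation method, the function below flags any value that lies more than a chosen number of standard deviations from the mean. The function name sd_outliers and its threshold parameter are hypothetical, and the sample standard deviation is again an assumption. It is applied to the quiz scores from the first example.

```python
from statistics import mean, stdev

def sd_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (needs at least two values)."""
    center = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - center) > threshold * spread]

# Quiz scores from the first example in this lesson
quiz_scores = [62, 65, 67, 68, 70, 71, 72, 74, 98]

flagged_at_2 = sd_outliers(quiz_scores, threshold=2.0)
flagged_at_3 = sd_outliers(quiz_scores, threshold=3.0)

print("Flagged at 2 standard deviations:", flagged_at_2)
print("Flagged at 3 standard deviations:", flagged_at_3)
```

With these nine scores, 98 is flagged at the 2 standard deviation cutoff but not at 3. The outlier itself inflates the standard deviation, so in a small dataset the stricter cutoff can miss a value that clearly stands out.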

The IQR method: Find the first quartile (Q1) and the third quartile (Q3), which mark the edges of the middle 50% of your data. The interquartile range, or IQR, is the distance between them: Q3 − Q1. Any value more than 1.5 times the IQR below the first quartile or above the third quartile is flagged as an outlier. This is the method behind the "whiskers" in box-and-whisker plots.
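The IQR method can be sketched the same way. The function name iqr_outliers and the parameter k are hypothetical. Quartiles can be calculated in several slightly different ways; this sketch uses the default of Python's statistics.quantiles, so the exact cutoffs may differ a little from those of other tools or textbooks.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below the first
    quartile or above the third quartile."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Quiz scores and bakery traffic from the earlier examples
quiz_scores = [62, 65, 67, 68, 70, 71, 72, 74, 98]
daily_customers = [45, 48, 50, 52, 47, 51, 310]

print("Quiz outliers:", iqr_outliers(quiz_scores))
print("Bakery outliers:", iqr_outliers(daily_customers))
```

Because quartiles depend on the middle of the data rather than its extremes, the cutoffs are barely affected by the outlier itself, and both 98 and 310 are flagged.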

You do not need to memorize these formulas right now. The important thing is knowing that there are systematic ways to identify outliers - it is not just a gut feeling.

Key Takeaway

Outliers are data points that sit far from the rest of your values. They can be caused by errors, genuine extreme events, or mixed populations. Outliers pull the mean, inflate the range, and increase the standard deviation. The right response depends on context: keep them when they are real and important, remove or report them separately when they distort your understanding of the main pattern. Whatever you decide, always be transparent about it.