Effect Size

Difficulty: Intermediate · Reading time: 12 minutes

The Problem with P-Values Alone

You run a study, get a p-value of 0.03, and declare your result "statistically significant." But what does that actually tell you? A p-value tells you how surprising your results would be if there were truly no effect. It does not tell you how big or important the effect is.

Here is the problem: with a large enough sample, almost any difference -- no matter how trivially small -- will become statistically significant. If you compare the average height of 100,000 people who drink coffee with 100,000 people who do not, you might find a statistically significant difference of 0.2 centimeters. The p-value might be tiny (p = 0.001), but the difference is meaningless in practical terms. Nobody cares about a fifth of a centimeter.
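
To see this concretely, here is a minimal simulation sketch. The numbers are invented for illustration (a true 0.2 cm difference against a typical ~7 cm height spread), but the pattern is the point: at n = 100,000 per group, the p-value is tiny even though the difference is trivial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented numbers: two populations whose true means differ by just
# 0.2 cm, against a realistic height spread of about 7 cm.
coffee = rng.normal(loc=170.2, scale=7.0, size=100_000)
no_coffee = rng.normal(loc=170.0, scale=7.0, size=100_000)

result = stats.ttest_ind(coffee, no_coffee)
print(f"mean difference: {coffee.mean() - no_coffee.mean():.2f} cm")
print(f"p-value: {result.pvalue:.1e}")  # tiny, despite a trivial difference
```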

This is where effect size comes in. Effect size measures the magnitude of a difference or relationship, independent of sample size. It answers the question that really matters: how big is this effect, and does it matter in the real world?

Cohen's d: Measuring the Difference

The most widely used effect size measure for comparing two groups is Cohen's d. It expresses the difference between two group means in terms of standard deviations. The formula is straightforward: take the difference between the two means and divide by the pooled standard deviation.

For example, if Group A has a mean of 75 and Group B has a mean of 80, and the pooled standard deviation is 10, then Cohen's d = (80 - 75) / 10 = 0.5. This means the two groups are separated by half a standard deviation.
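
If you are computing from raw data rather than summary numbers, a minimal sketch looks like this (the function name cohens_d is our own; the formula is the standard one using the degrees-of-freedom-weighted pooled variance):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means over the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # The pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)
```

Applied to data matching the summary statistics above, this reduces to (80 - 75) / 10 = 0.5.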

[Figure: standard normal curve with the central overlap region shaded; x-axis from -3 to +3 standard deviations]

The visualization above shows a standard normal curve. The shaded region in the center represents the overlap zone between two groups separated by a small-to-medium effect. The more the curves overlap, the smaller the practical difference between the groups.
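
This overlap can be quantified. For two equal-variance normal distributions whose means differ by d standard deviations, the shared area under the curves is 2Φ(-|d|/2), where Φ is the standard normal CDF. A quick sketch (overlap_coefficient is our own name for this quantity):

```python
from scipy.stats import norm

def overlap_coefficient(d):
    """Shared area under two equal-variance normal curves whose means differ by d SDs."""
    return 2 * norm.cdf(-abs(d) / 2)

print(f"{overlap_coefficient(0.5):.0%}")  # about 80% overlap at d = 0.5
```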

Small, Medium, and Large Effects

Jacob Cohen, the psychologist who popularized this measure, proposed rough benchmarks for interpreting effect sizes:

  • Small effect (d = 0.2): The difference is real but hard to see with the naked eye. The two groups overlap almost completely. Example: the difference in height between 15-year-old and 16-year-old girls.
  • Medium effect (d = 0.5): The difference is noticeable to careful observers. There is meaningful separation between the groups, though substantial overlap remains. Example: the difference in height between 14-year-old and 18-year-old girls.
  • Large effect (d = 0.8): The difference is obvious and practically significant. The groups are clearly different, though some overlap exists. Example: the difference in height between 13-year-old and 18-year-old girls.
[Chart: bars illustrating the three benchmarks: small (d = 0.2), medium (d = 0.5), large (d = 0.8)]

These benchmarks are guidelines, not rigid rules. In some fields, a "small" effect size is enormously important. A medication that reduces heart attack risk by a small amount (d = 0.2) could save thousands of lives when applied to millions of people. Context determines whether an effect is practically meaningful.
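
If it helps to encode these benchmarks for quick triage, here is a tiny sketch. The binning below (for example, treating 0.2 up to 0.5 as "small") is one common convention layered on Cohen's thresholds, not part of his original proposal:

```python
def label_effect(d):
    """Rough triage using Cohen's benchmarks (guidelines, not rigid rules)."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    return "large"
```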

Why Effect Size Matters for Decision-Making

Consider two scenarios. Study A tests a new employee training program on 20 people and finds a 10-point improvement in performance scores (p = 0.08, d = 0.9). Study B tests the same program on 20,000 people and finds a 1-point improvement (p = 0.001, d = 0.05). Which study provides stronger evidence that the program is worth adopting?

If you only look at p-values, Study B "wins" -- its result is highly significant. But the effect size tells a different story. Study A found a large, meaningful improvement. Study B found a trivially small improvement that just happened to reach significance because of the massive sample size. A thoughtful decision-maker would take Study A's result more seriously, while recognizing it needs replication with a larger sample.
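
The sample-size effect is easy to reproduce from summary statistics alone. In this sketch, the same hypothetical 1-point improvement with an SD of 20 (so d = 0.05 in both cases) is tested at a small and a very large sample size:

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary stats: a 1-point improvement, SD = 20, i.e. d = 0.05.
for n_per_group in (10, 10_000):
    result = ttest_ind_from_stats(mean1=51, std1=20, nobs1=n_per_group,
                                  mean2=50, std2=20, nobs2=n_per_group)
    print(f"n = {n_per_group:>6} per group: p = {result.pvalue:.3g}")
# The effect size is identical in both runs; only the sample size
# drags the p-value from non-significant to highly significant.
```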

This is why many scientific journals now require effect sizes to be reported alongside p-values. The American Psychological Association has recommended reporting effect sizes since 1994. A complete picture of a finding requires both: the p-value tells you whether the effect is likely real, and the effect size tells you whether it is worth caring about.


Other Measures of Effect Size

Cohen's d is not the only effect size metric. Different situations call for different measures. Pearson's r (the correlation coefficient) is itself an effect size for the strength of a relationship between two variables, with benchmarks of 0.1 (small), 0.3 (medium), and 0.5 (large). Eta-squared and partial eta-squared are used with ANOVA to express how much of the total variance is explained by group membership. Odds ratios are common in medical research for comparing the likelihood of outcomes between groups.

The choice of measure depends on your analysis type. For comparing two means, use Cohen's d. For correlations, use r. For ANOVA, use eta-squared. For binary outcomes, use odds ratios. What matters is that you always report some measure of effect magnitude, not just a p-value.
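
For reference, here are hedged sketches of two of these alternatives (the function names are ours; the formulas are the standard definitions). Pearson's r is already available via np.corrcoef, so it is omitted:

```python
import numpy as np

def eta_squared(*groups):
    """Eta-squared for one-way ANOVA: between-group SS divided by total SS."""
    all_values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

def odds_ratio(table):
    """Odds ratio from a 2x2 table [[events_a, non_events_a], [events_b, non_events_b]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)
```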

Practical Applications

Effect sizes are essential for power analysis -- determining how many participants you need before running a study. If you expect a small effect, you need a much larger sample to detect it reliably than if you expect a large effect. Planning sample size without considering effect size is like packing for a trip without knowing the destination.
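
statsmodels ships a power calculator for the independent-samples t-test, which makes this concrete. A short sketch of how sample-size requirements scale with the expected effect size (80% power, alpha = 0.05):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"d = {d}: about {n:.0f} participants per group")
# Smaller expected effects demand dramatically larger samples:
# roughly 394, 64, and 26 per group, respectively.
```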

Effect sizes also make meta-analysis possible. When researchers combine results from many studies on the same topic, they convert each study's results into a common effect size metric. This allows them to synthesize evidence across studies that used different sample sizes, different scales, and different populations. A single study might be inconclusive, but the pooled effect size across 50 studies can be very informative.
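
A minimal fixed-effect pooling sketch, using the standard large-sample approximation for the variance of d (the function name is ours, and real meta-analyses would also consider random-effects models and bias corrections):

```python
import numpy as np

def pooled_cohens_d(ds, n1s, n2s):
    """Fixed-effect (inverse-variance) pooled Cohen's d across studies."""
    ds, n1s, n2s = (np.asarray(x, dtype=float) for x in (ds, n1s, n2s))
    # Standard large-sample approximation to the sampling variance of d.
    var_d = (n1s + n2s) / (n1s * n2s) + ds**2 / (2 * (n1s + n2s))
    weights = 1.0 / var_d
    return float((weights * ds).sum() / weights.sum())

# Three hypothetical studies: larger studies receive more weight.
print(pooled_cohens_d(ds=[0.9, 0.05, 0.4], n1s=[10, 10_000, 150], n2s=[10, 10_000, 150]))
```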

[Chart: effect sizes for three hypothetical drugs: Drug A (d = 0.3), Drug B (d = 0.6), Drug C (d = 0.1)]

The chart above compares hypothetical effect sizes for three drugs treating the same condition. All three might have statistically significant p-values, but the practical differences are dramatic. Drug B has twice the effect of Drug A and six times the effect of Drug C. A doctor choosing among them should focus on effect size, not just significance.

Key Takeaway

Statistical significance tells you whether an effect is likely real, but effect size tells you whether it matters. Cohen's d is the standard metric for comparing two groups, with benchmarks of 0.2 (small), 0.5 (medium), and 0.8 (large). Always report effect sizes alongside p-values. With large samples, even trivial differences become "significant," so effect size is essential for sound decision-making, power analysis, and comparing results across studies.