A/B Testing

Difficulty: Intermediate · Reading Time: 12 minutes

The Simplest Experiment

An A/B test is one of the simplest and most powerful forms of experiment. You take two versions of something, show version A to one group of people and version B to another, and measure which one performs better. Tech companies use A/B tests to optimize everything from button colors to pricing pages to entire product features. But the same logic applies in medicine (drug vs placebo), education (teaching method A vs B), and marketing (email subject line A vs B).

The power of A/B testing comes from randomization. By randomly assigning people to group A or group B, you balance confounding variables across the two groups, at least in expectation. Any systematic difference in outcomes between the groups can then be attributed to the change you made rather than to pre-existing differences between the people. This is the same principle behind randomized controlled trials in medicine, which are considered the gold standard of evidence.
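In practice, web experiments usually implement randomization as deterministic hash-based bucketing rather than a coin flip at request time: the same user always sees the same variant, and different experiments split independently. A minimal sketch (the function and experiment names are illustrative):

```python
import hashlib

def assign_variant(experiment: str, user_id: str) -> str:
    """Deterministically assign a user to 'A' or 'B'.

    Hashing the experiment name together with the user id gives a
    stable, effectively random 50/50 split: the same user always
    lands in the same group, and each experiment gets its own split.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_variant("button-color", "user-12345"))
```

Because assignment depends only on the hash, no per-user state needs to be stored, and a returning visitor never flips between groups mid-experiment.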

Designing the Experiment

A good A/B test starts with a clear hypothesis and a single measurable metric. "We believe that changing the sign-up button from green to blue will increase the click-through rate." The metric is click-through rate. The control (A) is the green button. The treatment (B) is the blue button. Everything else stays exactly the same.

This "change one thing" principle is critical. If you change the button color, the text, and the page layout all at once, and conversions go up, you have no idea which change caused the improvement. Multivariate testing exists for testing multiple changes simultaneously, but it requires much larger samples and more complex analysis.

You also need to decide in advance how long the test will run. This depends on your sample size calculation, which accounts for your current baseline conversion rate, the minimum detectable effect (the smallest improvement you care about), and your desired confidence level. Running a test without a predetermined sample size is one of the most common mistakes in A/B testing.

Sample Size: Why It Matters So Much

Sample size determines the statistical power of your test, which is its ability to detect a real effect when one exists. With too few visitors, you might miss a genuine improvement because the results are too noisy to be conclusive. With too many, you waste time and resources running the test longer than necessary.

[Chart: observed conversion rates. Control (A): 3.2%, Variant (B): 3.8%]

Suppose your current conversion rate is 3.2% and you want to detect at least a 0.5 percentage point improvement. Depending on your confidence level and power requirements, you might need 15,000 to 30,000 visitors per group. If you only have 1,000 visitors per group, the test will be underpowered and you will likely get an inconclusive result, even if the new version truly is better.
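The standard normal-approximation formula for a two-proportion test makes this arithmetic concrete. A sketch in Python (standard library only):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_base, min_effect, alpha=0.05, power=0.8):
    """Visitors needed per group to detect an absolute lift of
    `min_effect` over baseline rate `p_base` (normal approximation)."""
    p_new = p_base + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_base * (1 - p_base)
                                  + p_new * (1 - p_new))) ** 2
    return ceil(numerator / min_effect ** 2)

# Baseline 3.2%, minimum detectable effect 0.5 percentage points:
print(sample_size_per_group(0.032, 0.005))  # ≈ 21,000 per group
```

At 95% confidence and 80% power this lands near 21,000 visitors per group, inside the 15,000 to 30,000 range quoted above; stricter confidence or power requirements push it toward the upper end.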

Group         Conversion rate confidence interval
Control (A)   2.8% to 3.6%
Variant (B)   3.3% to 4.3%

The confidence intervals above show the estimated conversion rates for each group. Notice that they overlap slightly. Overlap alone does not settle the question: whether the difference is statistically significant depends on a test of the difference itself, which accounts for the exact sample sizes. But when confidence intervals barely overlap or do not overlap at all, you have stronger evidence that the difference is real.

Statistical Significance in A/B Tests

After collecting enough data, you run a statistical test (usually a two-proportion z-test or a chi-square test) to determine whether the difference between groups is statistically significant. The result is a p-value. If the p-value is below your threshold (typically 0.05), you conclude that the difference is unlikely to be due to chance alone.
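A two-proportion z-test is short enough to write by hand. A sketch using only the standard library (the counts below are illustrative):

```python
from math import erf, sqrt

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test with a pooled standard error.
    Returns (z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 320/10,000 conversions for A vs 380/10,000 for B:
z, p = two_proportion_ztest(320, 10_000, 380, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z ≈ 2.31, p ≈ 0.021
```

With a p-value around 0.021, this difference would clear the conventional 0.05 threshold.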

But significance does not tell you the whole story. A statistically significant improvement of 0.02 percentage points is real in the statistical sense but probably not worth the engineering effort to implement. Always pair your significance test with a look at the actual effect size. Does a 0.5 percentage point increase in conversion translate to meaningful revenue? That depends on your business context.

Some teams use Bayesian approaches instead of frequentist p-values. Bayesian A/B testing gives you a direct probability statement: "there is a 94% probability that variant B is better than variant A." Many practitioners find this more intuitive than the standard p-value, which answers a subtly different question.
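The Bayesian version is simple to sketch with Monte Carlo sampling: put a Beta prior on each conversion rate, draw from the two posteriors, and count how often B beats A. A minimal illustration with flat Beta(1, 1) priors and made-up counts:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A).

    With a Beta(1, 1) prior, the posterior for each rate is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

print(prob_b_beats_a(320, 10_000, 380, 10_000))  # ≈ 0.99
```

The output reads directly as "the probability that B's true rate exceeds A's," which is the statement many stakeholders assume a p-value is making.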

Common Pitfalls

Peeking at results too early. This is the most common and most damaging mistake. If you check your results every day and stop the test the first time you see significance, you will dramatically increase your false positive rate. Statistical tests are designed to be evaluated once, at a predetermined sample size. If you must monitor results as they come in, use sequential testing methods that account for repeated looks.
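A small A/A simulation makes the damage from peeking visible: both arms share the same true rate, so every "significant" result is by construction a false positive. The parameters below are illustrative.

```python
import random
from math import erf, sqrt

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-proportion z-test on two equal-sized groups of n each."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return False
    z = abs(conv_b - conv_a) / (n * se)
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return p_value < alpha

def simulate(trials=1000, looks=10, chunk=100, rate=0.1, seed=1):
    """Compare the false positive rate of stopping at the first
    significant peek vs testing once at the full sample size."""
    rng = random.Random(seed)
    peeking_fp = final_fp = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        flagged = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(chunk))
            conv_b += sum(rng.random() < rate for _ in range(chunk))
            n += chunk
            if not flagged and is_significant(conv_a, conv_b, n):
                flagged = True  # the peeker stops here and "ships"
        peeking_fp += flagged
        final_fp += is_significant(conv_a, conv_b, n)  # single look at the end
    return peeking_fp / trials, final_fp / trials

peek, final = simulate()
print(f"false positives with peeking: {peek:.0%}, single final look: {final:.0%}")
```

With ten looks, the peeking strategy typically flags a false positive in roughly 15 to 20 percent of A/A tests, while the single final look stays near the nominal 5 percent.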

Running too many variants. Testing five versions at once (A/B/C/D/E) sounds efficient, but it multiplies the chances of a false positive. With five variants and a 5% significance threshold, you have roughly a 19% chance of at least one false positive. You need to apply corrections for multiple comparisons or run larger samples.
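The 19% figure comes from compounding the per-test error rate across the four variant-vs-control comparisons; a Bonferroni-style correction is the simplest countermeasure:

```python
alpha = 0.05
comparisons = 4  # four variants, each compared against the control

# Probability of at least one false positive across the family,
# assuming independent comparisons:
fwer = 1 - (1 - alpha) ** comparisons
print(f"family-wise error rate: {fwer:.1%}")  # 18.5%

# Bonferroni correction: test each comparison at alpha / k instead.
print(f"per-comparison threshold: {alpha / comparisons}")  # 0.0125
```

Bonferroni is conservative; less strict alternatives exist, but the qualitative point stands: more variants demand stricter thresholds or larger samples.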

Example

A SaaS company runs an A/B test on their pricing page. After three days, the product manager checks and sees that variant B has a 15% higher conversion rate with a p-value of 0.03. Excited, they stop the test and roll out variant B. Two weeks later, they realize conversions have not actually improved. What happened? The early peeking caught a random fluctuation. If they had waited for the full planned sample size of 10,000 visitors per group, the effect would have shrunk to 2% and would not have been significant.

Ignoring segments. An A/B test might show no overall difference, but variant B could be performing much better for mobile users while performing worse for desktop users. These effects cancel out in the aggregate. Segment analysis can reveal valuable insights, but be careful: testing many segments also increases false positive risk.

Testing without enough traffic. Small websites or products with low traffic often cannot reach the required sample sizes within a reasonable timeframe. Running a test for three months introduces seasonal effects and other confounders. If your traffic is too low for the effect you want to detect, consider testing a larger change (which needs fewer samples to detect) or using qualitative methods instead.

Key Takeaway

A/B testing is a randomized experiment that compares two versions to find which performs better. Good tests require a clear hypothesis, a single key metric, a pre-calculated sample size, and the discipline to wait for full results before drawing conclusions. The biggest pitfalls are peeking at results too early, testing too many variants without correction, and confusing statistical significance with practical importance. Done right, A/B testing gives you causal evidence rather than guesswork.