
How to Calculate Sample Size for an A/B Test

TL;DR

Calculate the sample size per variant with n = 16 × p × (1−p) / δ², where p is your baseline conversion rate and δ is the absolute lift you want to detect. The 16 bakes in 95% confidence and ~80% power. For a 3% baseline detecting a 1-point lift, that's about 4,700 visitors per variant. Always set this before the test starts, not after.

The sections below show where that formula comes from, how to compute it, and why the number has to be fixed up front: the sample size is your honest stopping point.

Where the formula comes from

The full sample-size formula for comparing two proportions is:

n per group = (z_α/2 + z_β)² × [ p₁(1−p₁) + p₂(1−p₂) ] / (p₂ − p₁)²

For the usual choices, 95% confidence (z_α/2 = 1.96) and 80% power (z_β = 0.84), the multiplier is (1.96 + 0.84)² = 7.84. Approximating p₁ ≈ p₂ ≈ p, the bracket becomes 2p(1−p), and writing δ = p₂ − p₁ for the absolute lift gives:

n ≈ 2 × 7.84 × p(1−p) / δ²  ≈  16 × p(1−p) / δ²

That's the rule of thumb: the 16 is 2 × (1.96 + 0.84)² = 15.68, rounded up. It's accurate enough for planning and easy to remember.
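
To see how much the shortcut gives up, here is a minimal sketch that puts the full formula next to the rule of thumb. The names exactSampleSize and ruleOfThumb are made up for this illustration, and the z-values are the rounded ones quoted above.

```javascript
// Full two-proportion formula, per group.
// zAlpha and zBeta default to the rounded values used in this article.
function exactSampleSize(p1, p2, zAlpha = 1.96, zBeta = 0.84) {
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  const delta = p2 - p1
  return ((zAlpha + zBeta) ** 2 * variance) / (delta * delta)
}

// Rule of thumb: both groups are treated as if they convert at the baseline p.
function ruleOfThumb(p, delta) {
  return (16 * p * (1 - p)) / (delta * delta)
}

const constant = 2 * (1.96 + 0.84) ** 2   // about 15.68, rounded up to 16
const exact = exactSampleSize(0.03, 0.04) // about 5,292
const approx = ruleOfThumb(0.03, 0.01)    // about 4,656
```

For a 3% baseline and a 4% target, the full formula asks for roughly 5,300 per variant against about 4,700 from the shortcut. Most of that gap comes from treating both groups as if they had the baseline variance, so it tends to widen as the lift gets large relative to the baseline.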

A runnable function

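// Visitors per variant for 95% confidence, ~80% power.
// baseline: conversion rate (e.g. 0.03). mde: absolute lift to detect (e.g. 0.01).
function sampleSizePerGroup(baseline, mde) {
  // Reject inputs the formula cannot handle (rates outside 0..1, zero or negative lift).
  if (!(baseline > 0 && baseline < 1) || !(mde > 0)) {
    throw new RangeError("baseline must be in (0, 1) and mde must be > 0")
  }
  const p = baseline
  const raw = (16 * p * (1 - p)) / (mde * mde)
  // Subtract a tiny epsilon so floating-point noise cannot push ceil up by one.
  return Math.ceil(raw - 1e-9)
}

sampleSizePerGroup(0.03, 0.01) // 4656  → ~4,700 per variant
sampleSizePerGroup(0.05, 0.01) // 7600  → ~7,600 per variant
sampleSizePerGroup(0.03, 0.005) // 18624 → ~18,600 per variant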
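
The same rule of thumb can be turned around when traffic is the fixed quantity. Solving n = 16 × p(1−p) / δ² for δ gives δ = 4 × √(p(1−p) / n). A minimal sketch, where detectableLift is a hypothetical helper name rather than anything from a library:

```javascript
// Smallest absolute lift you can expect to detect with a given number of
// visitors per variant, at 95% confidence and ~80% power (constant 16).
function detectableLift(baseline, visitorsPerVariant) {
  const p = baseline
  return 4 * Math.sqrt((p * (1 - p)) / visitorsPerVariant)
}

const liftAtPlannedSize = detectableLift(0.03, 4656) // about 0.01 (1 point)
const liftAtSmallSize = detectableLift(0.03, 1000)   // about 0.0216 (2.2 points)
```

With only 1,000 visitors per variant on a 3% baseline, anything smaller than roughly a 2-point lift is likely to be missed, which is a useful check before committing to a test.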

Worked example

Say your landing page converts at 3% and you only care about a lift if it's at least 1 percentage point (to 4%). Then p = 0.03, δ = 0.01:

n = 16 × 0.03 × 0.97 / (0.01)²
  = 16 × 0.0291 / 0.0001
  = 0.4656 / 0.0001
  = 4,656  ≈ 4,700 per variant

So you need about 4,700 visitors per variant, roughly 9,400 in total, before you evaluate. At 2,000 visitors a week split across two variants, that's about five weeks. Now you know the commitment before you start, not after.
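
Turning a sample size into a run time is a single division. The sketch below assumes weeklyVisitors is the total traffic entering the test across all variants. The function name and the traffic figures in the usage lines are hypothetical.

```javascript
// Weeks needed to reach the planned sample, rounded up to whole weeks.
// perVariant: planned sample size per variant.
// weeklyVisitors: total weekly traffic in the test, across all variants.
function weeksToRun(perVariant, weeklyVisitors, variants = 2) {
  return Math.ceil((perVariant * variants) / weeklyVisitors)
}

const busySite = weeksToRun(4656, 3000)  // 4 (9,312 / 3,000 ≈ 3.1, rounded up)
const quietSite = weeksToRun(4656, 1500) // 7 (9,312 / 1,500 ≈ 6.2, rounded up)
```

Rounding up to whole weeks is a deliberate choice here: it keeps each variant exposed to complete weekly cycles rather than stopping mid-week.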

Two parameters worth understanding

  • Confidence (95%) controls false positives: declaring a winner that isn't real. Raising it raises the sample; at 99% confidence and 80% power, the constant goes from 16 to about 23.
  • Power (80%) controls false negatives — missing a real winner. The 16 assumes 80%; if you want 90% power, the constant rises to about 21.

The defaults (95% / 80%) are standard for a reason; only change them deliberately.
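
If you do change the defaults, the constant can be recomputed as 2 × (z_α/2 + z_β)². JavaScript has no built-in inverse normal function, so this sketch relies on a small lookup table of textbook z-values. The table and the name ruleConstant are assumptions made for illustration.

```javascript
// Standard normal quantiles for a few common settings (textbook values).
const Z_CONFIDENCE = { 0.9: 1.645, 0.95: 1.96, 0.99: 2.576 } // two-sided
const Z_POWER = { 0.8: 0.842, 0.9: 1.282, 0.95: 1.645 }

// Constant k that replaces 16 in n = k × p(1−p) / δ².
function ruleConstant(confidence, power) {
  const zAlpha = Z_CONFIDENCE[confidence]
  const zBeta = Z_POWER[power]
  if (zAlpha === undefined || zBeta === undefined) {
    throw new RangeError("unsupported confidence or power level")
  }
  return 2 * (zAlpha + zBeta) ** 2
}

const standard = ruleConstant(0.95, 0.8)  // about 15.7, rounded up to 16
const highPower = ruleConstant(0.95, 0.9) // about 21.0
const strict = ruleConstant(0.99, 0.8)    // about 23.4
```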

Calculate first, evaluate once

The single most important rule: compute the sample size before the test and treat it as a fixed stopping point. That's what stops you from peeking and calling a noisy early result a win. Calculating the number afterward to rationalise a result you already saw is just dressing up a guess. For why small lifts are so expensive to detect, see "how many visitors you need"; for checking the result, see "is it significant".
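
To get a feel for why peeking hurts, here is a rough simulation sketch of an A/A test, where both variants convert at the same rate and every "significant" result is therefore a false positive. It compares checking at ten interim looks with checking once at the end. Everything in it (the seeded generator, the pooled z-test, the parameter values) is an illustrative assumption, not a description of any particular testing tool.

```javascript
// Small seeded PRNG (mulberry32) so the run is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Pooled two-proportion z statistic for a and b conversions out of n each.
function zStat(a, b, n) {
  const pooled = (a + b) / (2 * n)
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / n)
  return se === 0 ? 0 : (b / n - a / n) / se
}

// A/A simulation: returns the share of runs flagged as "significant"
// when you check at every look versus only at the final look.
function simulate({ rate, perVariant, looks, runs, seed }) {
  const rand = mulberry32(seed)
  const step = Math.floor(perVariant / looks)
  let peekingFalsePositives = 0
  let fixedFalsePositives = 0
  for (let r = 0; r < runs; r++) {
    let a = 0
    let b = 0
    let z = 0
    let flaggedAtAnyLook = false
    for (let look = 1; look <= looks; look++) {
      for (let i = 0; i < step; i++) {
        if (rand() < rate) a++
        if (rand() < rate) b++
      }
      z = zStat(a, b, look * step)
      if (Math.abs(z) > 1.96) flaggedAtAnyLook = true
    }
    if (flaggedAtAnyLook) peekingFalsePositives++
    if (Math.abs(z) > 1.96) fixedFalsePositives++
  }
  return {
    peeking: peekingFalsePositives / runs,
    fixed: fixedFalsePositives / runs,
  }
}

const result = simulate({ rate: 0.03, perVariant: 4650, looks: 10, runs: 300, seed: 1 })
console.log(result)
```

Because the peeking strategy also counts the final look, its false-positive rate can never come out lower than the fixed one. With ten looks it typically lands well above the nominal 5%, while the single final check should stay close to it.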

If you'd rather not run the traffic maths at all and have high-impact fixes found and shipped as Pull Requests, that's what Velyr does.

Frequently asked questions

How do you calculate sample size for an A/B test?

Use n = 16 × p × (1−p) / δ² per variant, where p is the baseline conversion rate and δ is the minimum absolute lift you want to detect. The constant 16 encodes 95% confidence and roughly 80% power. Compute it before the test so you have a fixed, honest stopping point.

What is power in an A/B test?

Power is the probability of detecting a real effect of the size you care about. The standard target is 80%, meaning if the true lift is at least your minimum detectable effect, you'll catch it 80% of the time. Higher power needs a larger sample; the 16 in the formula assumes ~80%.

Should I calculate sample size before or after the test?

Before — always. The sample size defines when you stop and evaluate, which is what protects you from peeking and false positives. Calculating it afterward to justify a result you already saw defeats the purpose and invites self-deception.

