RelayMag
Analysis No. 60

The Truth About A/B Testing

RelayMag · June 2026 · 6 min read

A surprising number of marketing teams treat A/B testing as proof that they are scientific. They run a test, watch a dashboard, declare a winner, and ship it. The ritual feels rigorous. Much of the time it is closer to a séance, a confident reading of noise that happens to wear the costume of data. The uncomfortable truth is that most marketing A/B tests cannot answer the question they claim to answer, and the people running them rarely know it.

This is not an argument against testing. Testing done with discipline is one of the few honest tools marketers have for separating what works from what they wish worked. The argument is against the version most teams actually practice, which produces certainty without evidence.

Why so many tests lie to you

Start with the math nobody wants to do. To detect an effect, you need enough data to tell a real difference apart from random fluctuation. The smaller the effect, the more data you need, and not by a little. Halving the effect you want to catch roughly quadruples the sample size required. Most changes marketers make (a new button color, a tweaked headline, a different hero image) move conversion by a tiny amount, if they move it at all. Catching those tiny effects reliably takes enormous traffic. Most sites do not have it.
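To put rough numbers on that, here is a minimal sketch of the usual normal-approximation sample size formula for comparing two conversion rates. The 3% baseline, the 5% significance level, and the 80% power target are assumptions chosen for illustration, and the function name is ours rather than any particular tool's.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # threshold for significance
    z_power = NormalDist().inv_cdf(power)          # margin needed for the power target
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Assumed 3% baseline conversion rate; each step halves the lift we want to catch.
for lift in (0.20, 0.10, 0.05):
    n = sample_size_per_arm(0.03, lift)
    print(f"{lift:.0%} relative lift: {n:,} visitors per arm")
```

Under these assumptions, each halving of the target lift multiplies the required traffic by a little under four, and a 5% relative lift already demands a six-figure visitor count per variant.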

A test that lacks the traffic to detect the effect it is looking for is called underpowered. An underpowered test does not politely return no answer. It returns answers that are mostly garbage, sometimes showing a big lift, sometimes a big drop, swinging around because there is not enough signal to settle. Teams then pick the run that flattered the idea they already liked.
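A small simulation makes the swinging visible. The sketch below reruns the same underpowered test twenty times, assuming a 3% baseline, a true 5% relative lift, and 2,000 visitors per arm. All three numbers are illustrative assumptions, not figures from any real site.

```python
import random

def observed_relative_lift(visitors_per_arm, baseline, true_lift, rng):
    """Simulate one test and return the lift a dashboard would report."""
    variant_rate = baseline * (1 + true_lift)
    control = sum(rng.random() < baseline for _ in range(visitors_per_arm))
    variant = sum(rng.random() < variant_rate for _ in range(visitors_per_arm))
    # max(..., 1) guards against division by zero in very small samples.
    return (variant - control) / max(control, 1)

rng = random.Random(42)  # fixed seed so the run is repeatable
lifts = [observed_relative_lift(2000, 0.03, 0.05, rng) for _ in range(20)]

print(f"true lift: 5%, lowest observed: {min(lifts):+.0%}, highest observed: {max(lifts):+.0%}")
```

The true effect never changes between runs. Only the noise does, and at this sample size the noise is several times larger than the effect.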

Then there is peeking. The standard significance threshold assumes you look once, at the end, after collecting a predetermined amount of data. The moment you check the results every morning and stop as soon as the line crosses into significance, you have broken that assumption. Random noise wanders. Watch it long enough and check it often enough, and it will eventually cross any line by chance. Stopping the instant it does means you are harvesting flukes. A test designed for a 5% false positive rate can easily produce false positives several times that often once you let people peek and stop early.
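The inflation is easy to reproduce. The following sketch simulates A/A tests, where both arms are identical and any declared winner is a false positive, and compares checking daily with checking once at the end. The traffic level, conversion rate, and 20-day length are assumptions, and the pooled z-test is one common choice of test, not the only one.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided p-value from a pooled two-proportion z-test, n visitors per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # no variation yet, so nothing to compare
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(n_tests=500, days=20, daily_visitors=100,
                         rate=0.05, alpha=0.05, seed=1):
    """Return (rate when stopping at the first significant peek, rate when looking once)."""
    rng = random.Random(seed)
    peeking_wins = fixed_wins = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        for _day in range(days):
            # Both arms share the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            significant = p_value(conv_a, conv_b, n) < alpha
            crossed = crossed or significant
        peeking_wins += crossed      # would have stopped at some daily peek
        fixed_wins += significant    # looked only once, on the last day
    return peeking_wins / n_tests, fixed_wins / n_tests

peeking, fixed = false_positive_rates()
print(f"false positives with daily peeking: {peeking:.0%}")
print(f"false positives with one final look: {fixed:.0%}")
```

The single final look should land near the nominal 5%. The daily-peeking figure comes out well above it, even though nothing about the two variants differs.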

Duration is its own trap. Buying behavior, email engagement, and traffic mix all swing across the week. Weekday visitors are not weekend visitors. A test that runs Tuesday to Thursday measures a slice of your audience and pretends it measured all of them. New designs also enjoy a novelty bump that fades, so a short test can reward a change that will underperform within a month. A clean test needs to span whole weekly cycles, usually more than one, to average over that variation.
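One way to build that rule into planning is to round the run length up to whole weeks before the test starts. The helper below is a hypothetical sketch: the function name, the two-week minimum, and the example traffic numbers are all assumptions.

```python
from math import ceil

def planned_duration_days(required_per_arm, daily_visitors, arms=2, min_weeks=2):
    """Days to run, rounded up to whole weeks with an assumed two-week floor."""
    days_needed = ceil(required_per_arm * arms / daily_visitors)
    weeks = max(min_weeks, ceil(days_needed / 7))
    return weeks * 7

# Assumed inputs: 14,000 visitors needed per arm, 1,500 visitors a day in total.
print(planned_duration_days(14000, 1500))  # 19 days of traffic, run for 21
```

Rounding up rather than stopping the day the sample target is hit keeps every weekday represented equally in both arms.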

Finally, volume hides chance. Run twenty independent tests where nothing actually works and, on average, one will still cross a 5% significance threshold by pure luck. A team that runs dozens of tests a quarter and celebrates the handful that came back positive is often celebrating coin flips. The losers get forgotten, the lucky winners get a case study, and the org convinces itself it has a testing culture.
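The arithmetic behind that claim is short enough to check directly. This sketch assumes the tests are independent and that none of the changes has any real effect.

```python
def expected_false_winners(n_tests, alpha=0.05):
    """Average number of tests that cross the threshold by luck alone."""
    return n_tests * alpha

def chance_of_any_false_winner(n_tests, alpha=0.05):
    """Probability that at least one no-effect test looks significant."""
    return 1 - (1 - alpha) ** n_tests

print(expected_false_winners(20))                 # 1.0
print(round(chance_of_any_false_winner(20), 2))   # 0.64
```

So a batch of twenty dud tests yields one "winner" on average, and close to a two-in-three chance of at least one.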

The discipline that makes it real

None of this means testing is hopeless. It means testing has rules, and the rules are not optional. The teams that get real value from experimentation tend to do the same unglamorous things.

Test bigger, less often

Here is the sharp conclusion most teams resist. If you do not have the traffic to detect small effects, then testing small changes is a waste no matter how disciplined you are. The honest move is not to test more carefully at the same tiny scale. It is to stop testing trivia.

Low-traffic sites should test bold swings: a genuinely different page, a new offer, a restructured funnel, the kind of change that produces an effect large enough to actually detect. Big effects need far less data to confirm. You will run fewer tests, but the ones you run will mean something. The endless parade of button-color and microcopy experiments is the clearest tell of cargo-cult testing, motion that mimics science without its substance, because those changes are exactly the ones a normal site can never measure.
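To see what "large enough to detect" means for a given site, the sample size calculation can be turned around: fix the traffic and ask how big the effect has to be. The sketch below uses a simplified normal approximation that applies the baseline variance to both arms. The 3% baseline and 5,000 visitors per arm are assumed for illustration.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_relative_lift(visitors_per_arm, baseline, alpha=0.05, power=0.80):
    """Smallest relative lift a test of this size can reliably catch (approximate)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute_lift = z * sqrt(2 * baseline * (1 - baseline) / visitors_per_arm)
    return absolute_lift / baseline

# Assumed: 3% baseline conversion, 5,000 visitors available per arm.
print(f"{minimum_detectable_relative_lift(5000, 0.03):.0%}")
```

With those assumptions the answer is roughly a 32% relative lift. A button color will not deliver that. A new offer or a rebuilt funnel might.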

This is counterintuitive precisely because tiny tweaks feel safe and constant testing feels productive. It is the opposite. Frequent small tests on thin traffic generate a steady stream of confident, wrong conclusions, and each wrong conclusion gets baked into the next decision.

What this comes down to

A/B testing is genuinely valuable when it is real science and actively harmful when it is theater. Theater does not merely fail to inform; it manufactures false confidence that crowds out judgment. The dividing line is not effort or enthusiasm. It is whether you fixed the sample size and duration up front, held to one hypothesis, refused to peek, and were willing to accept a null result. If you did those things, trust the answer. If you did not, you do not have data. You have a story that happens to come with a chart.

The fix is mostly subtraction. Run fewer tests, make them bigger, plan them before you start, and be honest when they come back flat. A team that runs four well-powered tests a year learns more than one that runs forty underpowered ones, and it spends a lot less time being wrong with conviction.

RelayMag is an independent publication on marketing, search, and how companies get found.