A/B Testing

An A/B test (also called a split test or online controlled experiment) randomly assigns visitors to a control (A) or variant (B) experience and measures the difference in a target metric. The goal is to attribute observed differences to the change rather than to chance or external factors.
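
Random assignment is often implemented as deterministic hashing, so a returning visitor always sees the same experience. The sketch below is one possible approach, not any particular tool's method; the function name `assign_variant`, the SHA-256 choice, and the `experiment:user_id` key format are all assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (variant).

    Hypothetical helper: hashing the experiment name together with the
    user id keeps assignments stable per user and independent across
    experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "A" if bucket < split else "B"


# The same user always lands in the same group for a given experiment.
print(assign_variant("user-42", "checkout-test"))
```

Salting the hash with the experiment name is a design choice that avoids the same users always landing in the variant group across every test.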

Statistical frameworks

Frequentist (null hypothesis significance testing)

Sets a predetermined sample size based on statistical power calculations before the test runs
Reports a p-value: the probability of observing a result at least as extreme as the one seen, assuming no true effect exists
Standard threshold: p < 0.05 (commonly described as 95% confidence); p < 0.01 (99%) for high-stakes changes
Requires no peeking: checking results mid-test and stopping on the first significant reading inflates false positive rates dramatically
More defensible in organisations where results will be challenged by sceptics
Risk: "Peeking problem" — 43% of programmes end tests early, which invalidates results (Optimizely, 127k experiment study, as-of 2024)
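
The peeking problem can be illustrated with a small A/A simulation, in which both arms share the same true conversion rate, so every "significant" result is a false positive. This is a rough sketch under assumed parameters (10 interim looks, 200 visitors per arm per look, 10% conversion rate); the function names are hypothetical.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using the pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def simulate_peeking(n_experiments=500, looks=10, batch=200, rate=0.10, seed=1):
    """Return (false positive rate with peeking, rate at the final look only)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        crossed = False
        z = 0.0
        for _ in range(looks):
            # Both arms draw from the SAME rate: there is no true effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > 1.96:
                crossed = True  # a peeker would have stopped here
        any_look_hits += crossed
        final_hits += abs(z) > 1.96
    return any_look_hits / n_experiments, final_hits / n_experiments


peeking_rate, final_rate = simulate_peeking()
print(f"stop at any significant look: {peeking_rate:.1%}")
print(f"evaluate only at planned end: {final_rate:.1%}")
```

Under these assumptions the final-look rate should sit near the nominal 5%, while the stop-at-any-look rate is typically several times higher.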

Bayesian

Starts with a prior belief (e.g., historical conversion rate) and updates it as data arrives
Reports probability that variant beats control (e.g., "87% chance B is better") — more intuitive than p-values
Allows continuous monitoring without the same multiple-comparison inflation as frequentist peeking
Better suited to high-velocity environments or when data is limited and prior knowledge is useful
Industry is gradually shifting toward Bayesian frameworks for speed and flexibility (Convert.com, as-of 2025)
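
As a minimal sketch of the "probability that B beats A" figure, the code below uses a Beta-Binomial model with Monte Carlo sampling. The uniform Beta(1, 1) prior, the function name `prob_b_beats_a`, and the example counts are assumptions; commercial tools may use different priors and stopping rules.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by sampling.

    The posterior for each arm is Beta(prior_alpha + conversions,
    prior_beta + non-conversions). A historical conversion rate could be
    encoded by choosing a more informative prior.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += b > a
    return wins / draws


# Hypothetical counts: 200/10,000 conversions on A, 230/10,000 on B.
print(f"P(B > A) = {prob_b_beats_a(200, 10000, 230, 10000):.1%}")
```

Reading this probability repeatedly and stopping as soon as it crosses a threshold is still a stopping rule, so it does not by itself remove the risks discussed in the next section.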

Contradiction — "Bayesian imposter" risk

Eppo blog (2024): Some platforms marketed as Bayesian actually run frequentist analysis under the hood. When not implemented carefully, a Bayesian-labelled tool can give exactly the same false-positive risks as frequentist peeking.
CXL (Alex Birkett, updated 2025): For most ecommerce practitioners, "does it even matter?" — discipline (correct duration, no peeking, pre-registered hypothesis) matters more than the statistical framework chosen.
Source: Eppo — Beware of the Bayesian Imposter; CXL — Bayesian vs Frequentist

Sample size and traffic requirements (as-of 2025)

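Scenario	Traffic required
2% baseline CVR, detect 10% relative lift (95% confidence)	~23,200 visitors per variant
2% baseline CVR, detect 15% relative lift (95% confidence)	~50,000 visitors per variant
Rule of thumb for a reliable ecommerce test	≥30,000 visitors per variation
Minimum viable programme	≥10,000 monthly visitors or ≥200 monthly transactions

Note: the first two figures are reproduced as reported by the cited sources. A larger detectable lift should need fewer visitors, not more, so the two values may be transposed or may rest on different power assumptions; recompute for your own baseline before planning.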

Key levers:

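MDE (Minimum Detectable Effect): required sample grows with roughly the inverse square of the MDE, so halving the expected lift roughly quadruples the traffic needed
Baseline CVR: for the same relative lift, a lower baseline means a smaller absolute difference to detect, so more traffic is needed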
Fashion-specific: average CVR 2–5% (desktop and mobile); use the lower end for conservative planning
Small-traffic stores: only test big, bold changes with expected lift ≥30%; don't test button colours
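
The levers above can be made concrete with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: a two-sided test at alpha = 0.05 and 80% power (the power level is not given in the source), with a hypothetical function name. Calculators that assume different power or a one-sided test will return different numbers, so the output will not necessarily match the table above.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion z-test.

    baseline:      control conversion rate, e.g. 0.02 for 2%
    relative_lift: minimum detectable effect, e.g. 0.10 for +10% relative
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for lift in (0.05, 0.10, 0.15, 0.30):
    n = visitors_per_variant(0.02, lift)
    print(f"2% baseline, {lift:.0%} relative lift: {n:,} visitors per variant")
```

With these assumptions a 10% relative lift at a 2% baseline comes out on the order of 80,000 visitors per variant, and halving the lift to 5% roughly quadruples that, which is why small-traffic stores are advised to test only large changes.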

Significance thresholds:

95% (p < 0.05) — standard for most ecommerce tests
99% (p < 0.01) — high-stakes changes (pricing, payment methods, checkout restructures)

Sources: GuessTheTest, Nudgenow, IronLinx, Digital Authority Partners (as-of 2025)

Experiment maturity levels (as-of 2024)

From Optimizely's 127,000-experiment dataset:

Only 1 in 10 companies reach transformative experimentation maturity
33% of companies at beginner level have tested for less than a year
67% don't know how long they've been testing — experimentation hasn't been formalised
Companies with mature programmes generate 30–50% higher revenue growth than peers relying on traditional decision-making (as-of 2024)

Signs of a mature programme

Executive sponsorship and cross-team collaboration
Shared test results across functions (marketing, product, UX, engineering)
Bi-weekly or monthly experiment review meetings with mixed stakeholders
A documented test log and knowledge base
Pre-registration of hypotheses and duration before launch

Compounding wins > single breakthroughs

In one documented ecommerce case, no individual test produced more than a 14% lift, but 30+ incremental wins stacked to nearly 50% cumulative improvement. Velocity matters more than the size of any single bet. (attributed to Optimizely/ExperimentFlow data, as-of 2024)
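
The arithmetic behind compounding is multiplicative rather than additive. The sketch below assumes each win applies to the same metric and that lifts stack independently, which real programmes only approximate; the function names are hypothetical.

```python
def cumulative_lift(lifts):
    """Compound a sequence of relative lifts, e.g. [0.02, 0.01] -> 0.0302."""
    total = 1.0
    for lift in lifts:
        total *= 1.0 + lift
    return total - 1.0


def per_test_lift_needed(target_cumulative, n_tests):
    """Equal per-test lift that compounds to the target over n_tests wins."""
    return (1.0 + target_cumulative) ** (1.0 / n_tests) - 1.0


needed = per_test_lift_needed(0.50, 30)
print(f"average lift per win for +50% over 30 wins: {needed:.2%}")
print(f"check: {cumulative_lift([needed] * 30):.1%}")
```

Under these assumptions, thirty wins averaging well under 2% each are enough to reach a 50% cumulative improvement.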

Common programme failure: wrong tool order

The most common mistake is over-investing in an enterprise testing platform before building the testing muscle to use it effectively. Start with a tool matched to current volume and maturity; upgrade as the programme scales. (Zigpoll / Kameleoon, as-of 2025)

20 critical mistakes (consolidated)

Before the test

No hypothesis — testing without stating what, why, and what outcome is expected
Wrong page or feature — testing low-traffic pages that can never reach significance
Wrong metrics — measuring page views on a checkout test instead of cart completion
Ignoring segmentation — not splitting results by device, channel, or user tenure before concluding

During the test

Including unaffected users — dilutes results; filter ineligible users before assignment (PostHog)
Peeking / stopping early — the peeking problem inflates false positives; pre-set duration and don't check mid-run
Simpson's paradox — aggregate results can flip when broken into subgroups (e.g., mobile vs desktop). Always segment (PostHog)
No A/A test — validate the tool works before trusting experiment data
Ignoring mobile traffic: mobile is often the majority of traffic but gets collapsed into desktop results
External factors — holiday sales, campaigns, algorithm changes can masquerade as treatment effects
No predetermined duration — without it you can't distinguish intermediate from final results
Skipping counter metrics — sign-ups may go up while LTV goes down; monitor both (PostHog)
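
Simpson's paradox, mentioned in the list above, is easiest to see with numbers. The counts below are invented for illustration: variant B converts better on both mobile and desktop, yet loses in aggregate because the two arms have very different device mixes.

```python
# Hypothetical (conversions, visitors) per segment and arm.
data = {
    "mobile": {"A": (20, 1000), "B": (125, 5000)},
    "desktop": {"A": (250, 5000), "B": (55, 1000)},
}


def rate(conversions, visitors):
    return conversions / visitors


def segment_winners(data):
    """Return the arm with the higher conversion rate in each segment."""
    return {
        segment: max(arms, key=lambda arm: rate(*arms[arm]))
        for segment, arms in data.items()
    }


def aggregate_winner(data):
    """Return the arm with the higher conversion rate after pooling segments."""
    totals = {}
    for arms in data.values():
        for arm, (conv, n) in arms.items():
            c, v = totals.get(arm, (0, 0))
            totals[arm] = (c + conv, v + n)
    return max(totals, key=lambda arm: rate(*totals[arm]))


print(segment_winners(data))   # B wins on mobile and on desktop
print(aggregate_winner(data))  # A wins once the segments are pooled
```

In a properly randomised test the device mix should be similar in both arms, so a reversal like this is also a prompt to check that assignment is working as intended.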

After the test

Stopping at first significance signal — p-values fluctuate; wait for planned end date
Misinterpreting results — slight lift ≠ breakthrough; use confidence intervals, not just point estimates
Overestimating long-term impact — short-term CVR gains don't always sustain
Copying others' tests without adaptation — what worked for competitor A reflects their audience, not yours
Not documenting — without a test log, teams repeat mistakes and lose institutional knowledge
Neglecting qualitative data — A/B tells you what changed; user research tells you why
Relying too much on tests: not everything that matters can be measured; some UX improvements are worth shipping despite flat metrics (PostHog)
Not iterating — a winning test is a hypothesis for the next test, not a final answer
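
To illustrate why a point estimate alone can mislead, the sketch below computes a normal-approximation (Wald) confidence interval for the absolute difference in conversion rate. The function name and the example counts are assumptions for illustration.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (low, point_estimate, high) for rate_B - rate_A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff, diff + z * se


# Hypothetical test: 400/20,000 on control vs 440/20,000 on variant.
low, diff, high = lift_confidence_interval(400, 20000, 440, 20000)
print(f"absolute lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

In this example the point estimate is a 10% relative lift, but the interval spans zero, so the data are also compatible with no improvement at all.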

Sources: FigPii (20 mistakes, as-of 2024-04), PostHog (Lior Neu-ner, as-of 2024-08)

Ecommerce test ideas by funnel stage (as-of 2026)

Homepage / category

Hero message (value proposition vs. promotional offer)
Navigation labels and structure
Trust signals placement (reviews, security badges)

Product Detail Page (PDP)

Primary image format (lifestyle vs. product-only)
Social proof placement (above vs. below fold)
CTA copy and colour
Urgency signals (stock levels, delivery countdown)

Checkout

Guest checkout prominence (Baymard: 62% of sites bury this — high-impact test)
Form field count (fewer fields)
Trust badge placement
Progress indicator style
Error message specificity

Sources: Growth Engines (50+ test ideas, as-of 2026), Baymard Institute

Contradiction — should you test everything?

Leading experimentation voices (Booking.com model, Convert.com): test everything — volume and velocity compound into outsized gains.
Towards Data Science (2024): "Not A/B testing everything is fine" — tests take time, add complexity, and have false-positive costs; strategic testing of high-impact surfaces beats blanket experimentation.
No resolution; correct approach likely depends on traffic volume and organisational maturity.
Source: Towards Data Science article; Convert.com experimentation handbook

Tools landscape (not exhaustive)

Client-side: Optimizely, VWO, AB Tasty, FigPii, Dynamic Yield
Full-stack / product: PostHog, Statsig, Eppo, LaunchDarkly, GrowthBook (open source)
Platform-native: Shopify Experiments (via Shopify Markets/themes), Adobe Commerce Storefront Experimentation (as-of 2026)

Related concepts

Conversion Rate Optimisation — CRO is the overarching discipline; A/B testing is the primary evidence mechanism
Mobile Commerce — mobile vs desktop segment splits are essential; results can invert (Simpson's paradox)
Personalisation — beyond A/B: segment-specific experiences and recommendation-engine testing
Product Detail Page (PDP) — highest-ROI surface for ecommerce testing after checkout
Social Proof — commonly tested element on PDP and checkout
Guest Checkout — Baymard: 62% of sites get this wrong; a ready-made high-confidence hypothesis
