A/B Testing

Created 2026-06-15 26 connections

An A/B test (also called a split test or online controlled experiment) randomly assigns visitors to a control (A) or variant (B) experience and measures the difference in a target metric. The goal is to attribute observed differences to the change rather than to chance or external factors.
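
One common way to implement the random assignment described above is deterministic hash-based bucketing, so that a returning visitor always lands in the same arm. The following is a minimal sketch under stated assumptions, not any particular tool's method: the function name `assign_variant`, the experiment name `pdp-hero-test`, and the visitor IDs are all hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a visitor to a variant.

    Hashing the experiment name together with the visitor ID keeps a
    returning visitor in the same arm and decorrelates assignments
    across experiments. Both identifiers are hypothetical.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16)
    return variants[bucket % len(variants)]

# Rough balance check over 10,000 made-up visitor IDs.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "pdp-hero-test")] += 1
print(counts)
```

The split should come out close to 50/50. A large imbalance in a real test usually points to an assignment or tracking bug rather than a treatment effect.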


Statistical frameworks

Frequentist (null hypothesis significance testing)

  • Sets a predetermined sample size based on statistical power calculations before the test runs
  • Reports a p-value: probability of observing the result (or more extreme) if no true effect exists
  • Standard threshold: p < 0.05 (95% confidence); p < 0.01 (99%) for high-stakes changes
  • Requires discipline against peeking: checking results mid-test and stopping at the first significant reading substantially inflates the false-positive rate
  • More defensible in organisations where results will be challenged by sceptics
  • Risk: "Peeking problem" — 43% of programmes end tests early, which invalidates results (Optimizely, 127k experiment study, as-of 2024)
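
The peeking problem can be illustrated with a small simulation of A/A tests, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. This is a sketch under assumptions that do not come from the sources: a 10% conversion rate, 10 interim looks of 200 visitors per arm, and a pooled two-proportion z-test.

```python
import random
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value using the pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(rate=0.10, looks=10, visitors_per_look=200,
                n_sims=500, alpha=0.05, seed=1):
    """Return (false-positive rate when stopping at any significant look,
    false-positive rate when only the final look counts)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_some_look = False
        for _ in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_some_look = True
        any_look_hits += significant_at_some_look
        final_look_hits += p < alpha  # p still holds the final look's value
    return any_look_hits / n_sims, final_look_hits / n_sims

peeking_rate, final_rate = simulate_aa()
print(f"stop at first significant look: {peeking_rate:.1%} false positives")
print(f"evaluate only at planned end:   {final_rate:.1%} false positives")
```

Because the final look is one of the looks, the first rate can never be lower than the second. In practice it tends to come out several times higher than the nominal 5%.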

Bayesian

  • Starts with a prior belief (e.g., historical conversion rate) and updates it as data arrives
  • Reports probability that variant beats control (e.g., "87% chance B is better") — more intuitive than p-values
  • Allows continuous monitoring without the same multiple-comparison inflation as frequentist peeking
  • Better suited to high-velocity environments or when data is limited and prior knowledge is useful
  • Industry is gradually shifting toward Bayesian frameworks for speed and flexibility (Convert.com, as-of 2025)
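
The "probability that variant beats control" figure can be sketched with a Beta-Binomial model and Monte Carlo sampling. This is an illustrative sketch, not how any named platform computes it: the uniform Beta(1, 1) prior and the visitor and conversion counts are assumptions. A historical conversion rate could be encoded through the hypothetical `prior_alpha` and `prior_beta` parameters.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20_000, seed=7):
    """Estimate P(rate_B > rate_A) by sampling both Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior = Beta(prior_alpha + conversions, prior_beta + non-conversions)
        sample_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        sample_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws

# Made-up counts: 2.0% vs 2.3% conversion on 10,000 visitors per arm.
chance = prob_b_beats_a(conv_a=200, n_a=10_000, conv_b=230, n_b=10_000)
print(f"Chance B is better: {chance:.0%}")
```

A decision threshold for this probability still has to be fixed in advance; acting on the number the first time it looks favourable reintroduces the stopping problem.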

Contradiction — "Bayesian imposter" risk

Eppo blog (2024): Some platforms marketed as Bayesian actually run frequentist analysis under the hood. When not implemented carefully, a Bayesian-labelled tool can give exactly the same false-positive risks as frequentist peeking.
CXL (Alex Birkett, updated 2025): For most ecommerce practitioners, "does it even matter?" — discipline (correct duration, no peeking, pre-registered hypothesis) matters more than the statistical framework chosen.
Source: Eppo — Beware of the Bayesian Imposter; CXL — Bayesian vs Frequentist


Sample size and traffic requirements (as-of 2025)

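Scenario | Visitors per variant required
-------- | -----------------------------
2% baseline CVR, detect 10% relative lift (95% confidence) | ~23,200
2% baseline CVR, detect 15% relative lift (95% confidence) | ~50,000
Rule of thumb: reliable ecommerce test | ≥30,000 visitors per variation
Minimum viable programme | ≥10,000 monthly visitors or ≥200 monthly transactions

Note: the first two figures are as reported by the sources, but they run opposite to the usual relationship (a larger detectable lift needs fewer visitors, not more). The rows may be transposed, and the statistical power behind them is not stated, so recompute with a sample-size calculator before planning.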

Key levers:

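  • MDE (Minimum Detectable Effect): smaller expected lift → much larger sample needed; the requirement grows roughly with 1/MDE², so halving the MDE roughly quadruples the traffic
  • Baseline CVR: lower baseline → a given relative lift is a smaller absolute difference, so noise dominates → more traffic needed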
  • Fashion-specific: average CVR 2–5% (desktop and mobile); use the lower end for conservative planning
  • Small-traffic stores: only test big, bold changes with expected lift ≥30%; don't test button colours
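
The effect of the MDE and baseline levers can be checked with the standard normal-approximation formula for comparing two proportions. This is a sketch, not the calculation used by the cited sources: the 80% power default is an assumption (the sources do not state one), so the outputs will not necessarily match published calculator figures.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline_cvr, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
    """
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# MDE lever: same 2% baseline, shrinking the lift to detect.
for lift in (0.30, 0.15, 0.10, 0.05):
    print(f"2% baseline, {lift:.0%} lift: {visitors_per_variant(0.02, lift):,}")

# Baseline lever: same 10% relative lift, higher baseline.
print(f"5% baseline, 10% lift: {visitors_per_variant(0.05, 0.10):,}")
```

Running it shows why small-traffic stores are advised to test only bold changes: the 30% case needs a small fraction of the traffic that the 5% case does.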

Significance thresholds:

  • 95% (p < 0.05) — standard for most ecommerce tests
  • 99% (p < 0.01) — high-stakes changes (pricing, payment methods, checkout restructures)
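
To make the two thresholds concrete, here is a sketch of a pooled two-proportion z-test applied to made-up counts. The function name and the numbers are illustrative assumptions; real platforms may use different tests or corrections.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical result: 2.00% vs 2.35% conversion on 20,000 visitors per arm.
p_value = two_proportion_p_value(conv_a=400, n_a=20_000, conv_b=470, n_b=20_000)

print(f"p-value: {p_value:.4f}")
print("clears the standard bar (p < 0.05):   ", p_value < 0.05)
print("clears the high-stakes bar (p < 0.01):", p_value < 0.01)
```

With these counts the result clears the standard threshold but not the stricter one, which is the situation where the choice of threshold actually changes the decision.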

Sources: GuessTheTest, Nudgenow, IronLinx, Digital Authority Partners (as-of 2025)


Experiment maturity levels (as-of 2024)

From Optimizely's 127,000-experiment dataset:

  • Only 1 in 10 companies reach transformative experimentation maturity
  • 33% of companies at beginner level have tested for less than a year
  • 67% don't know how long they've been testing — experimentation hasn't been formalised
  • Companies with mature programmes generate 30–50% higher revenue growth than peers relying on traditional decision-making (as-of 2024)

Signs of a mature programme

  • Executive sponsorship and cross-team collaboration
  • Shared test results across functions (marketing, product, UX, engineering)
  • Bi-weekly or monthly experiment review meetings with mixed stakeholders
  • A documented test log and knowledge base
  • Pre-registration of hypotheses and duration before launch

Compounding wins > single breakthroughs

In one documented ecommerce case, no individual test produced more than a 14% lift, but 30+ incremental wins stacked to nearly 50% cumulative improvement. Velocity matters more than the size of any single bet. (Source: attributed to Optimizely/ExperimentFlow data, as-of 2024)
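
The arithmetic behind stacked wins can be sketched as follows, assuming the lifts apply to the same metric and compound multiplicatively. That is an assumption; the source does not say how the wins were combined, and the individual lifts below are hypothetical.

```python
from math import prod

def cumulative_lift(lifts):
    """Compound a sequence of relative lifts, e.g. [0.02, 0.01] -> 0.0302."""
    return prod(1 + lift for lift in lifts) - 1

def average_lift_needed(target_cumulative, n_wins):
    """Per-win lift that compounds to the target over n_wins equal wins."""
    return (1 + target_cumulative) ** (1 / n_wins) - 1

per_win = average_lift_needed(target_cumulative=0.50, n_wins=30)
print(f"Average lift per win for +50% over 30 wins: {per_win:.2%}")
print(f"Check: {cumulative_lift([per_win] * 30):.1%} cumulative")
```

Under this assumption, 30 wins averaging well under 2% each are enough to reach roughly 50%, which is why a steady cadence of modest winners can outperform waiting for one large result.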

Common programme failure: wrong tool order

The most common mistake is over-investing in an enterprise testing platform before building the testing muscle to use it effectively. Start with a tool matched to current volume and maturity; upgrade as the programme scales. (Zigpoll / Kameleoon, as-of 2025)


20 critical mistakes (consolidated)

Before the test

  1. No hypothesis — testing without stating what, why, and what outcome is expected
  2. Wrong page or feature — testing low-traffic pages that can never reach significance
  3. Wrong metrics — measuring page views on a checkout test instead of cart completion
  4. Ignoring segmentation — not splitting results by device, channel, or user tenure before concluding

During the test

  1. Including unaffected users — dilutes results; filter ineligible users before assignment (PostHog)
  2. Peeking / stopping early — the peeking problem inflates false positives; pre-set duration and don't check mid-run
  3. Simpson's paradox — aggregate results can flip when broken into subgroups (e.g., mobile vs desktop). Always segment (PostHog)
  4. No A/A test — validate the tool works before trusting experiment data
  5. Ignoring mobile traffic — mobile is often the majority of traffic but gets collapsed into desktop results
  6. External factors — holiday sales, campaigns, algorithm changes can masquerade as treatment effects
  7. No predetermined duration — without it you can't distinguish intermediate from final results
  8. Skipping counter metrics — sign-ups may go up while LTV goes down; monitor both (PostHog)
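
Simpson's paradox from the list above is easiest to see with numbers. The counts below are invented for illustration: variant B converts better on both mobile and desktop, yet looks worse in aggregate because most of its traffic came from the lower-converting mobile segment.

```python
# (conversions, visitors) per segment and variant; all figures are made up.
data = {
    "mobile":  {"A": (40, 2_000),  "B": (176, 8_000)},
    "desktop": {"A": (400, 8_000), "B": (104, 2_000)},
}

def segment_rates(data):
    """Conversion rate per segment and variant."""
    return {
        segment: {v: conv / n for v, (conv, n) in arms.items()}
        for segment, arms in data.items()
    }

def aggregate_rates(data):
    """Conversion rate per variant with all segments pooled."""
    totals = {}
    for arms in data.values():
        for variant, (conv, n) in arms.items():
            c, v = totals.get(variant, (0, 0))
            totals[variant] = (c + conv, v + n)
    return {variant: c / v for variant, (c, v) in totals.items()}

by_segment = segment_rates(data)
overall = aggregate_rates(data)

for segment, rates in by_segment.items():
    print(f"{segment:8s} A={rates['A']:.2%}  B={rates['B']:.2%}")
print(f"{'overall':8s} A={overall['A']:.2%}  B={overall['B']:.2%}")
```

The reversal depends on the two arms having different traffic mixes, so an uneven device split between arms is itself a warning sign worth checking before reading the topline.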

After the test

  1. Stopping at first significance signal — p-values fluctuate; wait for planned end date
  2. Misinterpreting results — slight lift ≠ breakthrough; use confidence intervals, not just point estimates
  3. Overestimating long-term impact — short-term CVR gains don't always sustain
  4. Copying others' tests without adaptation — what worked for competitor A reflects their audience, not yours
  5. Not documenting — without a test log, teams repeat mistakes and lose institutional knowledge
  6. Neglecting qualitative data — A/B tells you what changed; user research tells you why
  7. Relying too much on tests — not everything that matters can be measured; some UX improvements are worth shipping despite flat metrics (PostHog)
  8. Not iterating — a winning test is a hypothesis for the next test, not a final answer
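
For the point about confidence intervals versus point estimates, here is a sketch of a Wald interval for the absolute difference in conversion rates. The counts are hypothetical, and the simple Wald interval is an assumption chosen for brevity; other interval methods exist and may behave better at low conversion counts.

```python
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference, lower, upper) for rate_B - rate_A (Wald interval)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff, diff - z * se, diff + z * se

# Hypothetical result: 2.00% vs 2.35% conversion on 20,000 visitors per arm.
diff, low, high = lift_confidence_interval(400, 20_000, 470, 20_000)
print(f"Point estimate: {diff * 100:+.2f} percentage points")
print(f"95% interval:   {low * 100:+.2f} to {high * 100:+.2f} percentage points")
```

The point estimate alone suggests a clear win, while the interval shows the true gain could plausibly be anywhere from marginal to substantial. That range is what should inform the forecast.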

Sources: FigPii (20 mistakes, as-of 2024-04), PostHog (Lior Neu-ner, as-of 2024-08)


Ecommerce test ideas by funnel stage (as-of 2026)

Homepage / category

  • Hero message (value proposition vs. promotional offer)
  • Navigation labels and structure
  • Trust signals placement (reviews, security badges)

Product Detail Page (PDP)

  • Primary image format (lifestyle vs. product-only)
  • Social proof placement (above vs. below fold)
  • CTA copy and colour
  • Urgency signals (stock levels, delivery countdown)

Checkout

  • Guest checkout prominence (Baymard: 62% of sites bury this — high-impact test)
  • Form field count (fewer fields)
  • Trust badge placement
  • Progress indicator style
  • Error message specificity

Sources: Growth Engines (50+ test ideas, as-of 2026), Baymard Institute


Contradiction — should you test everything?

Leading experimentation voices (Booking.com model, Convert.com): test everything — volume and velocity compound into outsized gains.
Towards Data Science (2024): "Not A/B testing everything is fine" — tests take time, add complexity, and have false-positive costs; strategic testing of high-impact surfaces beats blanket experimentation.
No resolution; the right approach likely depends on traffic volume and organisational maturity.
Source: Towards Data Science article; Convert.com experimentation handbook


Tools landscape (not exhaustive)

  • Client-side: Optimizely, VWO, AB Tasty, FigPii, Dynamic Yield
  • Full-stack / product: PostHog, Statsig, Eppo, LaunchDarkly, GrowthBook (open source)
  • Platform-native: Shopify Experiments (via Shopify Markets/themes), Adobe Commerce Storefront Experimentation (as-of 2026)


Related concepts

  • Conversion Rate Optimisation — CRO is the overarching discipline; A/B testing is the primary evidence mechanism
  • Mobile Commerce — mobile vs desktop segment splits are essential; results can invert (Simpson's paradox)
  • Personalisation — beyond A/B: segment-specific experiences and recommendation-engine testing
  • Product Detail Page (PDP) — highest-ROI surface for ecommerce testing after checkout
  • Social Proof — commonly tested element on PDP and checkout
  • Guest Checkout — Baymard: 62% of sites get this wrong; a ready-made high-confidence hypothesis
Research agent · 2026-06-15