A/B Testing
An A/B test (also called a split test or online controlled experiment) randomly assigns visitors to a control (A) or variant (B) experience and measures the difference in a target metric. The goal is to attribute observed differences to the change rather than to chance or external factors.
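Random assignment is often implemented as deterministic hashing, so a returning visitor always sees the same experience. The following is a minimal sketch under that assumption, not a description of any particular tool; the function name `assign_variant` and the example identifiers are hypothetical.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministically map a visitor to one variant of one experiment."""
    # Salting with the experiment name keeps a visitor's bucket in one
    # experiment independent of their bucket in another.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The first 8 hex characters give a 32-bit integer; the modulo picks a bucket.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]

if __name__ == "__main__":
    for visitor in ("visitor-1", "visitor-2", "visitor-3"):
        print(visitor, assign_variant(visitor, "pdp-hero-image"))
```

Because the mapping depends only on the visitor ID and experiment name, no assignment table needs to be stored, and the split approaches 50/50 as traffic grows.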
Statistical frameworks
Frequentist (null hypothesis significance testing)
- Sets a predetermined sample size based on statistical power calculations before the test runs
- Reports a p-value: probability of observing the result (or more extreme) if no true effect exists
- Standard threshold: p < 0.05 (95% confidence); p < 0.01 (99%) for high-stakes changes
- Requires discipline against peeking: checking results mid-test and stopping on the first significant reading inflates false positive rates dramatically
- More defensible in organisations where results will be challenged by sceptics
- Risk: "Peeking problem" — 43% of programmes end tests early, which invalidates results (Optimizely, 127k experiment study, as-of 2024)
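The peeking problem can be illustrated with a small A/A simulation, in which both arms share the same true conversion rate, so every "significant" result is a false positive. This is a sketch under assumed parameters (10 equally spaced looks, 10% conversion rate, pooled two-proportion z-test); all function names and defaults are hypothetical.

```python
import random
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(sims=1000, looks=10, per_look=200,
                         cvr=0.10, alpha=0.05, seed=7):
    """Return (rate if any look may stop the test, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _ in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < cvr for _ in range(per_look))
            conv_b += sum(rng.random() < cvr for _ in range(per_look))
            n += per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p < alpha  # p here is from the final, planned look
    return peeking_hits / sims, fixed_hits / sims

if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"any-look false positive rate: {peeking:.1%}")
    print(f"final-look false positive rate: {fixed:.1%}")
```

Under these assumptions the final-look rate should sit near the nominal 5%, while the any-look rate typically lands well above it.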
Bayesian
- Starts with a prior belief (e.g., historical conversion rate) and updates it as data arrives
- Reports probability that variant beats control (e.g., "87% chance B is better") — more intuitive than p-values
- Allows continuous monitoring without the same multiple-comparison inflation as frequentist peeking
- Better suited to high-velocity environments or when data is limited and prior knowledge is useful
- Industry is gradually shifting toward Bayesian frameworks for speed and flexibility (Convert.com, as-of 2025)
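One common way to produce a "probability that B beats A" figure is a Beta-Binomial model with Monte Carlo sampling. The sketch below assumes a uniform Beta(1, 1) prior and independent arms; it is an illustration of the idea, not the method any specific platform uses, and the function name and inputs are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(prior_alpha + conversions,
        #                              prior_beta + non-conversions).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

if __name__ == "__main__":
    # Hypothetical counts: 200/10,000 conversions for A, 230/10,000 for B.
    print(f"P(B beats A) = {prob_b_beats_a(200, 10_000, 230, 10_000):.1%}")
```

A historical conversion rate can be encoded by raising `prior_alpha` and `prior_beta`; a stronger prior pulls early estimates toward that rate.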
Contradiction — "Bayesian imposter" risk
Eppo blog (2024): Some platforms marketed as Bayesian actually run frequentist analysis under the hood. When not implemented carefully, a Bayesian-labelled tool can give exactly the same false-positive risks as frequentist peeking.
CXL (Alex Birkett, updated 2025): For most ecommerce practitioners, "does it even matter?" — discipline (correct duration, no peeking, pre-registered hypothesis) matters more than the statistical framework chosen.
Source: Eppo — Beware of the Bayesian Imposter; CXL — Bayesian vs Frequentist
Sample size and traffic requirements (as-of 2025)
| Scenario | Visitors per variant required |
|---|---|
| 2% baseline CVR, detect 10% relative lift (95% confidence, 80% power, two-sided) | ~80,700 |
| 2% baseline CVR, detect 15% relative lift (95% confidence, 80% power, two-sided) | ~36,700 |
| Rule of thumb — reliable ecommerce test | ≥30,000 visitors per variation |
| Minimum viable programme | ≥10,000 monthly visitors or ≥200 monthly transactions |
Key levers:
- MDE (Minimum Detectable Effect): required sample grows roughly with the inverse square of the lift, so halving the MDE roughly quadruples the traffic needed
- Baseline CVR: lower baseline → noisier estimate of a given relative lift → more traffic needed
- Fashion-specific: average CVR 2–5% (desktop and mobile); use the lower end for conservative planning
- Small-traffic stores: only test big, bold changes with expected lift ≥30%; don't test button colours
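The levers above can be explored with the standard normal-approximation formula for comparing two proportions. This sketch assumes a two-sided test, 80% power by default, and unpooled variances; published calculators use slightly different assumptions, so their figures may differ. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_cvr, relative_lift,
                            alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift (two-sided test)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

if __name__ == "__main__":
    for lift in (0.05, 0.10, 0.15, 0.30):
        n = sample_size_per_variant(0.02, lift)
        print(f"2% baseline, {lift:.0%} relative lift: {n:,} per variant")
```

The squared difference in the denominator is why small lifts are so expensive to detect, and why low-traffic stores are better served by bold changes.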
Significance thresholds:
- 95% (p < 0.05) — standard for most ecommerce tests
- 99% (p < 0.01) — high-stakes changes (pricing, payment methods, checkout restructures)
Sources: GuessTheTest, Nudgenow, IronLinx, Digital Authority Partners (as-of 2025)
Experiment maturity levels (as-of 2024)
From Optimizely's 127,000-experiment dataset:
- Only 1 in 10 companies reach transformative experimentation maturity
- 33% of companies at beginner level have tested for less than a year
- 67% don't know how long they've been testing — experimentation hasn't been formalised
- Companies with mature programmes generate 30–50% higher revenue growth than peers relying on traditional decision-making (as-of 2024)
Signs of a mature programme
- Executive sponsorship and cross-team collaboration
- Shared test results across functions (marketing, product, UX, engineering)
- Bi-weekly or monthly experiment review meetings with mixed stakeholders
- A documented test log and knowledge base
- Pre-registration of hypotheses and duration before launch
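A pre-registration record can be as simple as an immutable data structure written before launch. The sketch below is one possible shape, not a standard schema; every field name is hypothetical, and rounding the duration up to whole weeks is an assumption made here so that each weekday is covered equally.

```python
from dataclasses import dataclass
from math import ceil
from typing import Tuple

@dataclass(frozen=True)  # frozen: the plan cannot be edited after creation
class ExperimentPlan:
    name: str
    hypothesis: str
    primary_metric: str
    counter_metrics: Tuple[str, ...]
    sample_size_per_variant: int
    variants: int
    eligible_daily_visitors: int

    def planned_days(self) -> int:
        """Days needed to reach the planned sample, rounded up to full weeks."""
        total_needed = self.sample_size_per_variant * self.variants
        raw_days = ceil(total_needed / self.eligible_daily_visitors)
        return ceil(raw_days / 7) * 7

if __name__ == "__main__":
    plan = ExperimentPlan(
        name="checkout-guest-prominence",
        hypothesis="Making guest checkout more prominent raises order completion",
        primary_metric="checkout_completion_rate",
        counter_metrics=("account_creation_rate", "refund_rate"),
        sample_size_per_variant=30_000,
        variants=2,
        eligible_daily_visitors=4_000,
    )
    print(plan.name, "runs for", plan.planned_days(), "days")
```

Storing these records in the test log gives review meetings a fixed reference for what was promised before any data arrived.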
Compounding wins > single breakthroughs
In one documented ecommerce case, no individual test produced more than a 14% lift, but 30+ incremental wins stacked to nearly 50% cumulative improvement. Velocity matters more than the size of any single bet. (attributed to Optimizely/ExperimentFlow data, as-of 2024)
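The arithmetic behind compounding is multiplicative: each win scales the result of the previous ones. The sketch below assumes the lifts are independent and fully sustained, which real programmes rarely achieve; the function names are hypothetical, and the inputs are illustrative rather than data from the case above.

```python
from math import prod

def cumulative_lift(lifts):
    """Compound a sequence of relative lifts, e.g. [0.02, 0.01] -> about 0.0302."""
    return prod(1 + lift for lift in lifts) - 1

def required_average_lift(total_lift, wins):
    """Average per-win lift needed for `wins` wins to compound to `total_lift`."""
    return (1 + total_lift) ** (1 / wins) - 1

if __name__ == "__main__":
    per_win = required_average_lift(0.50, 30)
    print(f"30 wins need about {per_win:.2%} each to reach +50%")
    print(f"check: {cumulative_lift([per_win] * 30):.1%}")
```

Under these assumptions, 30 wins averaging well under 2% each are enough to reach a 50% cumulative improvement, which is why test velocity carries so much weight.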
Common programme failure: wrong tool order
The most common mistake is over-investing in an enterprise testing platform before building the testing muscle to use it effectively. Start with a tool matched to current volume and maturity; upgrade as the programme scales. (Zigpoll / Kameleoon, as-of 2025)
20 critical mistakes (consolidated)
Before the test
- No hypothesis — testing without stating what, why, and what outcome is expected
- Wrong page or feature — testing low-traffic pages that can never reach significance
- Wrong metrics — measuring page views on a checkout test instead of cart completion
- Ignoring segmentation — not splitting results by device, channel, or user tenure before concluding
During the test
- Including unaffected users — dilutes results; filter ineligible users before assignment (PostHog)
- Peeking / stopping early — the peeking problem inflates false positives; pre-set duration and don't check mid-run
- Simpson's paradox — aggregate results can flip when broken into subgroups (e.g., mobile vs desktop). Always segment (PostHog)
- No A/A test — validate the tool works before trusting experiment data
- Ignoring mobile traffic: mobile is often the majority of traffic but gets collapsed into desktop results
- External factors — holiday sales, campaigns, algorithm changes can masquerade as treatment effects
- No predetermined duration — without it you can't distinguish intermediate from final results
- Skipping counter metrics — sign-ups may go up while LTV goes down; monitor both (PostHog)
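Simpson's paradox is easiest to see with numbers. The figures below are hypothetical and deliberately constructed so that the two arms received different device mixes; B wins within each device segment yet loses in the aggregate.

```python
# Hypothetical (conversions, visitors) per segment and arm.
segments = {
    "mobile":  {"A": (20, 1000),  "B": (100, 4000)},
    "desktop": {"A": (240, 4000), "B": (65, 1000)},
}

def rate(conversions, visitors):
    """Conversion rate as a fraction."""
    return conversions / visitors

def aggregate(arm):
    """Conversion rate for one arm with all segments pooled together."""
    conversions = sum(seg[arm][0] for seg in segments.values())
    visitors = sum(seg[arm][1] for seg in segments.values())
    return rate(conversions, visitors)

if __name__ == "__main__":
    for name, seg in segments.items():
        print(f"{name}: A={rate(*seg['A']):.1%}  B={rate(*seg['B']):.1%}")
    print(f"overall: A={aggregate('A'):.1%}  B={aggregate('B'):.1%}")
```

The reversal comes from B's traffic being weighted toward the lower-converting mobile segment, so checking that segment mix is balanced across arms is as important as reading the segment results themselves.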
After the test
- Stopping at first significance signal — p-values fluctuate; wait for planned end date
- Misinterpreting results — slight lift ≠ breakthrough; use confidence intervals, not just point estimates
- Overestimating long-term impact — short-term CVR gains don't always sustain
- Copying others' tests without adaptation — what worked for competitor A reflects their audience, not yours
- Not documenting — without a test log, teams repeat mistakes and lose institutional knowledge
- Neglecting qualitative data — A/B tells you what changed; user research tells you why
- Relying too much on tests: not everything that matters can be measured; some UX improvements are worth shipping despite flat metrics (PostHog)
- Not iterating — a winning test is a hypothesis for the next test, not a final answer
Sources: FigPii (20 mistakes, as-of 2024-04), PostHog (Lior Neu-ner, as-of 2024-08)
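To read a result as a range rather than a point estimate, a confidence interval for the difference in conversion rates can be computed. This sketch assumes the Wald (normal-approximation) interval with unpooled variances, which is reasonable at typical ecommerce sample sizes; the function name and the example counts are hypothetical.

```python
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference, lower, upper) for rate_B - rate_A in absolute terms."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Standard error of the difference between two independent proportions.
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

if __name__ == "__main__":
    # Hypothetical result: 2.0% vs 2.2% CVR with 20,000 visitors per arm.
    diff, low, high = lift_confidence_interval(400, 20_000, 440, 20_000)
    print(f"absolute lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

In this hypothetical case the point estimate is a 10% relative lift, yet the interval spans zero, so the data are also consistent with no improvement at all.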
Ecommerce test ideas by funnel stage (as-of 2026)
Homepage / category
- Hero message (value proposition vs. promotional offer)
- Navigation labels and structure
- Trust signals placement (reviews, security badges)
Product Detail Page (PDP)
- Primary image format (lifestyle vs. product-only)
- Social proof placement (above vs. below fold)
- CTA copy and colour
- Urgency signals (stock levels, delivery countdown)
Checkout
- Guest checkout prominence (Baymard: 62% of sites bury this — high-impact test)
- Form field count (fewer fields)
- Trust badge placement
- Progress indicator style
- Error message specificity
Sources: Growth Engines (50+ test ideas, as-of 2026), Baymard Institute
Contradiction — should you test everything?
Leading experimentation voices (Booking.com model, Convert.com): test everything — volume and velocity compound into outsized gains.
Towards Data Science (2024): "Not A/B testing everything is fine" — tests take time, add complexity, and have false-positive costs; strategic testing of high-impact surfaces beats blanket experimentation.
No resolution; the correct approach likely depends on traffic volume and organisational maturity.
Source: Towards Data Science article; Convert.com experimentation handbook
Tools landscape (not exhaustive)
- Client-side: Optimizely, VWO, AB Tasty, FigPii, Dynamic Yield
- Full-stack / product: PostHog, Statsig, Eppo, LaunchDarkly, GrowthBook (open source)
- Platform-native: Shopify Experiments (via Shopify Markets/themes), Adobe Commerce Storefront Experimentation (as-of 2026)
Related concepts
- Conversion Rate Optimisation — CRO is the overarching discipline; A/B testing is the primary evidence mechanism
- Mobile Commerce — mobile vs desktop segment splits are essential; results can invert (Simpson's paradox)
- Personalisation — beyond A/B: segment-specific experiences and recommendation-engine testing
- Product Detail Page (PDP) — highest-ROI surface for ecommerce testing after checkout
- Social Proof — commonly tested element on PDP and checkout
- Guest Checkout — Baymard: 62% of sites get this wrong; a ready-made high-confidence hypothesis