A/B Report Deep Dive: Reading AcelleMail Test Results

After your A/B test finishes, the A/B Report tab tells you the winner, but the real value is in the variant-level numbers. This guide walks through the report, the statistical-significance rules, and what to do for winner, inconclusive, loss, and surprise verdicts.

What this is for

Setting up the A/B test was the easy part (see A/B Test Campaigns). The hard part is reading the result correctly — knowing when "Variant A won by 23%" is signal vs noise, and what to test next based on the verdict.

This is the deep-dive on AcelleMail's A/B Report tab.

Where the report lives

After the test portion of the audience has been sent and the wait period (default 72h) has passed, AcelleMail auto-evaluates the winner and shows the A/B Report tab on the campaign overview.

[Screenshot, from the A/B Test Campaigns article: the A/B Report tab on a sent A/B test campaign. Winner banner: "Variant A (A) is the winner! Click rate improved by 23.2% over the runner-up. Winner sent to remaining 80% of the audience." Stat cards: 1,176 Delivered / 39.7% Avg. Open Rate / 4% Avg. Click Rate / 20 Bounces / 0 Unsubscribes. Delivery Funnel: Delivered 1,176 → Opens 467 (39.7%) → Clicks 47 (4%). End-to-end Conversion 2.46%, with Click/Open 10.06%, Bounces 1.67%, Unsubscribes 0%, Total Recipients 1,912.]

The report has 5 tabs of its own:

| Tab | What's there |
|---|---|
| A/B Report | Winner banner + headline stats + delivery funnel |
| All variants | Per-variant numbers, side by side (Variant A vs B) |
| Overall stats | Combined numbers across all variants |
| Insights | AcelleMail's plain-language read |
| Logs | Per-recipient detail |

The view that matters most for decisions is All variants, where you compare A vs B directly.

Reading the winner banner

The green winner banner is AcelleMail's verdict + the headline improvement. Three parts:

"Variant A (A) is the winner!
 Click rate improved by 23.2% over the runner-up.
 Winner sent to remaining 80% of the audience."

| Phrase | What to verify |
|---|---|
| "Variant A is the winner" | Match the variant label (A/B/C…) to what you actually tested. Notes in the setup tab tell you which variant had which subject/from-name. |
| "improved by 23.2% over the runner-up" | The relative improvement of the winner over the runner-up on the chosen metric. It is NOT the absolute rate, and not the gap in percentage points. |
| "Winner sent to remaining 80%" | Confirms the rollout happened automatically (or didn't, if "Manual selection" was picked). |

Is the delta statistically significant?

A "23.2% improvement" sounds great until you realize the test audience was 50 subscribers. Then it's noise.

Rough thresholds for trust (rules of thumb rather than a formal statistical test; rare events such as clicks need more volume than opens to reach the same confidence):

| Test audience size per variant | Trust delta if |
|---|---|
| <100 | Don't trust any delta. Need bigger test. |
| 100-500 | Delta must be >15% AND consistent direction across opens + clicks |
| 500-2,000 | Delta >7% likely real |
| 2,000-10,000 | Delta >3% probably real |
| >10,000 | Delta >1.5% reliably real |

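In the screenshot example: 1,176 delivered / 467 opens / 47 clicks. The test audience was ~600 per variant (1,176 total / 2 variants). By the table above, a 23.2% delta clears the 7% bar for that size. Treat it with caution, though: with only 47 clicks in total, a 23.2% relative gap comes down to about five clicks (assuming an even split), so confirm it with a calculator before acting on it.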
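The threshold table can be written as a small lookup, which makes it easy to apply the same rule to every test. This is a minimal sketch, not part of AcelleMail: the function name `trust_delta` and its parameters are hypothetical, and the handling of the exact boundary values (500, 2,000, 10,000) is an assumption, since the table's ranges overlap at their edges.

```python
def trust_delta(n_per_variant: int, delta_pct: float,
                same_direction: bool = True) -> bool:
    """Apply the rule-of-thumb table: is a relative delta worth trusting?

    n_per_variant  -- recipients per variant in the test portion
    delta_pct      -- relative delta in percent (sign is ignored)
    same_direction -- True if opens and clicks moved the same way
    """
    delta = abs(delta_pct)
    if n_per_variant < 100:
        return False  # too small: no delta is trustworthy
    if n_per_variant <= 500:
        return delta > 15 and same_direction
    if n_per_variant <= 2_000:
        return delta > 7
    if n_per_variant <= 10_000:
        return delta > 3
    return delta > 1.5


# The screenshot example: ~600 per variant, 23.2% delta.
print(trust_delta(600, 23.2))  # True
# The 50-subscriber case from the start of this section.
print(trust_delta(50, 23.2))   # False
```

A `True` here only means the delta passes the rule of thumb; it is not a substitute for a significance calculation.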

For a precise check, plug the numbers into any online A/B test calculator (search "A/B test significance calculator"). 95% confidence is the standard bar. Below 90% confidence, don't act on it.
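If you would rather compute the check than use an online calculator, one common choice is a pooled two-proportion z-test. The sketch below uses only the Python standard library; the function name and the sample counts are illustrative assumptions, and online calculators may use a different test or report one-sided confidence, so their figures can differ.

```python
from math import erf, sqrt


def two_proportion_confidence(clicks_a: int, n_a: int,
                              clicks_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided confidence) for a pooled two-proportion z-test."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (p_a - p_b) / se
    # erf(|z| / sqrt(2)) equals 1 minus the two-sided p-value.
    confidence = erf(abs(z) / sqrt(2))
    return z, confidence


# Hypothetical counts: 120 clicks of 2,000 delivered vs 90 of 2,000.
z, confidence = two_proportion_confidence(120, 2_000, 90, 2_000)
print(f"z = {z:.2f}, two-sided confidence = {confidence:.1%}")
```

Compare the returned confidence with the bars above: act at 95% or higher, and hold off below 90%.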

The "All variants" tab — the data that matters

This is where you compare variants side-by-side. For each variant, you see:

| Metric | What it means |
|---|---|
| Delivered | How many emails arrived (roughly equal across variants, since the split was random) |
| Open rate | % opened (caveat: Apple Mail Privacy Protection inflates opens) |
| Click rate | % clicked (the more reliable metric) |
| Click-to-open rate (CTOR) | % of openers who clicked; the best content-quality signal |
| Bounces | Should be similar across variants |
| Unsubscribes | If one variant has 2x the unsubs, that variant alienated subscribers |
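To recompute these rates yourself, for example from exported counts, something like the following works. It is a sketch under the assumption that open and click rates are taken over delivered emails, which matches the funnel figures in the screenshot; the class name `VariantCounts` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VariantCounts:
    delivered: int
    opens: int
    clicks: int
    bounces: int = 0
    unsubscribes: int = 0

    @property
    def open_rate(self) -> float:
        """Opens as a share of delivered emails."""
        return self.opens / self.delivered if self.delivered else 0.0

    @property
    def click_rate(self) -> float:
        """Clicks as a share of delivered emails."""
        return self.clicks / self.delivered if self.delivered else 0.0

    @property
    def ctor(self) -> float:
        """Click-to-open rate: share of openers who clicked."""
        return self.clicks / self.opens if self.opens else 0.0


# Combined totals from the screenshot example in this article.
overall = VariantCounts(delivered=1_176, opens=467, clicks=47, bounces=20)
print(f"open {overall.open_rate:.1%}, click {overall.click_rate:.1%}, "
      f"CTOR {overall.ctor:.2%}")
# open 39.7%, click 4.0%, CTOR 10.06%
```

Build one `VariantCounts` per variant to compare them on the same footing.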

The 4 patterns you'll see:

| Pattern | Reading |
|---|---|
| Variant A: higher opens AND higher clicks | Clear winner. Subject + content both worked. |
| Variant A: higher opens, similar clicks | Subject pulled more opens but content didn't capitalise. A is "better" if your KPI is opens; B is "as good" if your KPI is clicks. |
| Variant A: similar opens, higher clicks | Subjects tied; B's content/CTA underperformed. The subject didn't matter; the content did. |
| Variant A: lower opens, higher clicks | Counter-intuitive: A drew fewer but more engaged openers. Often happens with personalisation tests. |
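The four patterns amount to a lookup on the direction of the open and click deltas. The sketch below shows that mapping; the function name is hypothetical, and the 5% tolerance for calling two rates "similar" is an assumption borrowed from the inconclusive threshold used later in this guide, not an AcelleMail setting.

```python
def read_pattern(open_lift_pct: float, click_lift_pct: float,
                 tolerance_pct: float = 5.0) -> str:
    """Map A-vs-B relative lifts (in percent) onto the four patterns."""

    def direction(lift: float) -> str:
        if abs(lift) < tolerance_pct:
            return "similar"
        return "higher" if lift > 0 else "lower"

    readings = {
        ("higher", "higher"): "Clear winner: subject and content both worked.",
        ("higher", "similar"): "Subject drew opens; content did not convert them.",
        ("similar", "higher"): "Subjects tied; the content made the difference.",
        ("lower", "higher"): "Fewer but more engaged openers.",
    }
    key = (direction(open_lift_pct), direction(click_lift_pct))
    return readings.get(
        key, "Not one of the four patterns; read the numbers directly.")


# Hypothetical lifts: A opened 12% better and clicked 20% better than B.
print(read_pattern(12.0, 20.0))
```

Combinations outside the table (for example, A lower on both) fall through to the default message rather than being forced into a pattern.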

What to do per verdict

Winner (clear, statistically significant)

✅ Promote the winning pattern. Update your subject-line / CTA / from-name standards to match.

Next test: vary a DIFFERENT element. If you tested subject lines this time, test CTA copy or send time next.

Inconclusive (delta <5%, low confidence)

⚠️ Both variants performed similarly. Don't conclude anything — the test didn't have enough power.

Next test: increase the test audience size (from 20% to 40%), OR test a more dramatically different variation.

Loss (variant B significantly underperformed A)

✅ Also valuable! You learned what NOT to do. Cross-reference variant B's pattern against your "stuff that didn't work" notes for future reference.

Next test: test the next hypothesis — don't keep beating up variant B.

Surprise (a variant you thought would lose actually won)

⚠️ The most valuable result. Audit WHY:

  • Subject line surprised? Maybe length / formality / emoji preference shifted
  • From-name surprised? Maybe personal-name-vs-brand norms shifted
  • CTA surprised? Maybe your offer messaging needs an update

Surprises are where you learn the most about your audience.

Statistical-significance worked example

Setup:

  • Variant A subject: "5 things every marketer should do this quarter"
  • Variant B subject: "Q2 newsletter"
  • Audience: 10,000, 50/50 split
  • Metric: click rate

Result after 72h:

| Variant | Delivered | Opened | Clicked | CTR |
|---|---|---|---|---|
| Variant A | 4,820 | 1,690 (35.1%) | 145 | 3.01% |
| Variant B | 4,810 | 1,500 (31.2%) | 115 | 2.39% |

CTR delta: (3.01% − 2.39%) / 2.39% = +25.9% relative improvement for A.
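The same arithmetic in code, alongside the percentage-point gap it is often confused with. The function names are illustrative; the inputs are the rounded CTRs from the table above.

```python
def percentage_point_diff(winner_rate: float, runner_up_rate: float) -> float:
    """Absolute gap between two rates, in percentage points."""
    return (winner_rate - runner_up_rate) * 100


def relative_lift(winner_rate: float, runner_up_rate: float) -> float:
    """Relative improvement of the winner over the runner-up, in percent."""
    if runner_up_rate <= 0:
        raise ValueError("runner-up rate must be positive")
    return (winner_rate - runner_up_rate) / runner_up_rate * 100


ctr_a, ctr_b = 0.0301, 0.0239  # rounded CTRs for Variant A and Variant B
print(f"{percentage_point_diff(ctr_a, ctr_b):.2f} percentage points")
# 0.62 percentage points
print(f"{relative_lift(ctr_a, ctr_b):.1f}% relative lift")
# 25.9% relative lift
```

The winner banner reports the second number, which is why a gap of well under one percentage point can read as a 25.9% improvement.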

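At ~4,800 per variant with 145 vs 115 clicks, a pooled two-proportion z-test gives z ≈ 1.87: about 94% confidence two-sided, or about 97% one-sided. That clears the 90% floor and sits right at the 95% bar rather than well above it, so calculators can disagree depending on which test they run. It is enough to act on, but worth confirming in a later send.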
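For reference, here is the arithmetic under a pooled two-proportion z-test. The choice of test is an assumption: this guide does not say which test a given calculator runs, and results vary with the test and with one-sided versus two-sided reporting.

```latex
\begin{aligned}
\hat{p}_A &= \frac{145}{4820} \approx 0.03008, \qquad
\hat{p}_B = \frac{115}{4810} \approx 0.02391 \\
\hat{p} &= \frac{145 + 115}{4820 + 4810} = \frac{260}{9630} \approx 0.02700 \\
\mathrm{SE} &= \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{4820} + \frac{1}{4810}\right)} \approx 0.00330 \\
z &= \frac{\hat{p}_A - \hat{p}_B}{\mathrm{SE}} \approx \frac{0.00617}{0.00330} \approx 1.87
\end{aligned}
```

A z of about 1.87 corresponds to a two-sided p-value of about 0.06 (roughly 94% confidence) and a one-sided p-value of about 0.03 (roughly 97% confidence).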

Verdict: ship Variant A as the winning subject pattern. Next test: hold the subject style; vary the CTA copy.

Common pitfalls

| Pitfall | What's wrong | Fix |
|---|---|---|
| Calling a winner at 12 hours | Many opens land early, but late opens and clicks still shift CTR. | Wait the configured period (24-72h). Patience. The wait period is calibrated. |
| Testing 4 things at once | Subject + sender + content + send time: you don't know which moved the needle. | One variable at a time. |
| Re-testing the same hypothesis 3 times in a row | Diminishing returns; audience may also adapt. | Pick a different lever per quarter. |
| Ignoring unsubscribe rate | A variant with higher clicks AND higher unsubs is poisoning long-term LTV. | Treat unsubscribe rate as a guard metric: never let the winner have 2x the unsubs. |
| Manual-selection bias | When picking a winner manually, you tend to choose the variant you predicted, regardless of data. | If using manual, document the prediction BEFORE seeing data; compare. |
| Testing on a small list (<2,000) | No statistical power. Even big deltas are noise. | Pause testing on small lists; rely on industry benchmarks or test pooled across campaigns. |
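The unsubscribe guard from the table can be checked mechanically before a winner is promoted. This is a sketch only: the function name and the `max_ratio` parameter are hypothetical, and treating "runner-up had zero unsubscribes" as a case for manual review is an assumption, since the 2x rule is undefined there.

```python
def passes_unsub_guard(winner_unsubs: int, winner_delivered: int,
                       other_unsubs: int, other_delivered: int,
                       max_ratio: float = 2.0) -> bool:
    """True if the winner's unsubscribe rate stays under max_ratio times
    the other variant's rate."""
    winner_rate = winner_unsubs / winner_delivered
    other_rate = other_unsubs / other_delivered
    if other_rate == 0:
        # The ratio is undefined; pass only if the winner also had none,
        # otherwise flag the result for a manual look.
        return winner_unsubs == 0
    return winner_rate < max_ratio * other_rate


# Hypothetical counts: winner 5 unsubs of 4,820; other variant 4 of 4,810.
print(passes_unsub_guard(5, 4_820, 4, 4_810))   # True
# Winner with 10 unsubs against 4 is more than double the rate.
print(passes_unsub_guard(10, 4_820, 4, 4_810))  # False
```

With unsubscribe counts this small, a failed guard is a prompt to look closer, not proof that the variant is harmful.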

What to log per test (a running playbook)

Maintain a doc tracking every A/B test. Each row:

| Field | Example |
|---|---|
| Date | 2026-05-19 |
| Campaign | Spring Sale 2026 |
| Variable tested | Subject line length |
| Variant A | "5 things every marketer..." (46 chars) |
| Variant B | "Q2 newsletter" (13 chars) |
| Winner metric | Click rate |
| Winner | A (3.01% vs 2.39%, +25.9%) |
| Confidence | ~94% (two-sided z-test) |
| Decision applied | Use long-descriptive subjects going forward |
| Re-test in | Q3 2026 |
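If the playbook lives in a CSV file rather than a document, the fields above map directly onto a record type. The sketch below is one way to do it with the standard library; the class and function names are hypothetical, and the sample entry is a made-up placeholder, not a real test result.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ABTestLogEntry:
    date: str
    campaign: str
    variable_tested: str
    variant_a: str
    variant_b: str
    winner_metric: str
    winner: str
    confidence: str
    decision_applied: str
    retest_in: str


def write_entries(handle, entries: list[ABTestLogEntry],
                  write_header: bool = True) -> None:
    """Write log entries as CSV rows to any file-like object."""
    names = [f.name for f in fields(ABTestLogEntry)]
    writer = csv.DictWriter(handle, fieldnames=names)
    if write_header:
        writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


# Placeholder values for illustration only.
entry = ABTestLogEntry(
    date="2026-06-02", campaign="Example campaign",
    variable_tested="CTA copy", variant_a="Start free trial",
    variant_b="See pricing", winner_metric="Click rate", winner="B",
    confidence="96%", decision_applied="Use pricing-led CTAs",
    retest_in="Q4 2026",
)
buffer = io.StringIO()
write_entries(buffer, [entry])
print(buffer.getvalue())
```

To append to a real file, open it with `open(path, "a", newline="")` and pass `write_header=False` after the first write.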

After 10-20 tests, patterns emerge — "our audience consistently prefers question-format subjects on Tuesdays." That's the real long-term value.

9 comments

3 comments

  1. ravi.kumar.del…
    we had to add custom UTM parameters to get the cross-campaign attribution we wanted — the defaults weren't quite enough.
    1. admin (edited)
      thanks for sharing. the pattern you describe is exactly the use case we built that feature for — glad it landed for you
  2. joel.anders.se
    Can the analytics be exported to a data warehouse? We feed everything into BigQuery for cross-channel reporting
    1. admin
      Same answer as above for SaaS-tenant — works the same way per-tenant, with the caveat that the cron must be set per-customer (not just system-wide).
    2. admin (edited)
      Good question. The campaign:rerun audit writes to laravel.log only when the audit decides to force-resume — pure noop runs are silent. We'll add an info-level heartbeat in a future Acelle release to make it easier to monitor.
    3. admin (edited)
      There's no built-in way today. Two workarounds: (1) cron + custom script polling the API every N minutes, (2) webhook-driven if your event source supports it. Most operators go with #2.
  3. linhpm.devs
    The funnel-attribution model explanation is the clearest I've read. The 'last-touch vs first-touch' framing especially.
    1. admin (edited)
      Thanks. Pass it along if it helps your team.
    2. admin (edited)
      Thanks for the kind words. We try to keep these source-grounded so they age well.
