Analytics & Reporting

A/B Report Deep Dive: Reading AcelleMail Test Results

After your A/B test finishes, the A/B Report tab tells you the winner — but the real value is in the variant-level numbers. This guide walks through the report, the statistical-significance rules, and what to do for winner / inconclusive / loss verdicts.

December 4, 2025 6 min read Advanced

What this is for

Setting up the A/B test was the easy part (see A/B Test Campaigns). The hard part is reading the result correctly — knowing when "Variant A won by 23%" is signal vs noise, and what to test next based on the verdict.

This is the deep-dive on AcelleMail's A/B Report tab.

Where the report lives

After the test portion of the audience has been sent and the wait period (default 72h) has passed, AcelleMail automatically evaluates the winner and shows the A/B Report tab on the campaign overview.

(Screenshot from A/B Test Campaigns article.)

The report has 5 tabs of its own:

Tab	What's there
A/B Report	Winner banner + headline stats + delivery funnel
All variants	Per-variant numbers, side by side (Variant A vs B)
Overall stats	Combined numbers across all variants
Insights	AcelleMail's plain-language read
Logs	Per-recipient detail

The most useful view for decisions is All variants: that's where you compare A vs B directly.

Reading the winner banner

The green winner banner is AcelleMail's verdict + the headline improvement. Three parts:

"Variant A (A) is the winner!
 Click rate improved by 23.2% over the runner-up.
 Winner sent to remaining 80% of the audience."

Phrase	What to verify
"Variant A is the winner"	Match the variant label (A/B/C…) to what you actually tested. Notes in the setup tab tell you which variant had which subject/from-name.
"improved by 23.2% over the runner-up"	The relative improvement of the winner over the runner-up on the chosen metric: (winner rate − runner-up rate) / runner-up rate. It is NOT the gap in percentage points, and NOT the winner's absolute rate.
"Winner sent to remaining 80%"	Confirms the rollout happened automatically (or didn't, if "Manual selection" was picked).

Is the delta statistically significant?

A "23.2% improvement" sounds great until you realize the test audience was 50 subscribers. Then it's noise.

Rough thresholds for trust (no formal statistical test — just decision rules that work):

Test audience size per variant	Trust delta if
<100	Don't trust any delta. Need bigger test.
100-500	Delta must be >15% AND consistent direction across opens + clicks
500-2,000	Delta >7% likely real
2,000-10,000	Delta >3% probably real
>10,000	Delta >1.5% reliably real

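In the screenshot example: 1,176 delivered / 467 opens / 47 clicks. The test audience was ~600 per variant (1,176 total / 2 variants). By the table above, a 23.2% delta clears the >7% bar for that size. Note, though, that only 47 clicks were recorded in total (roughly 26 vs 21, assuming they split in line with the 23.2% delta). That is too few for a formal significance test to reach 95% confidence, so treat this result as suggestive rather than proven.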
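The rule-of-thumb table can be written as a small helper. This is only a sketch of the table in this guide, not anything AcelleMail runs internally. The function name `delta_is_trustworthy` is hypothetical, and because the table's brackets share their edges (500, 2,000, 10,000), the sketch assumes each shared edge belongs to the smaller bracket.

```python
def delta_is_trustworthy(per_variant_size: int, relative_delta: float,
                         consistent_direction: bool = False) -> bool:
    """Apply the rule-of-thumb thresholds from the table.

    relative_delta is a fraction (0.232 means 23.2%).
    consistent_direction: True if opens and clicks favour the same variant
    (only used in the 100-500 bracket).
    """
    delta = abs(relative_delta)
    if per_variant_size < 100:
        return False                      # too small: trust nothing
    if per_variant_size <= 500:           # assumption: 500 falls in this bracket
        return delta > 0.15 and consistent_direction
    if per_variant_size <= 2_000:
        return delta > 0.07
    if per_variant_size <= 10_000:
        return delta > 0.03
    return delta > 0.015


# Screenshot example: ~588 per variant, 23.2% relative delta.
print(delta_is_trustworthy(588, 0.232))   # True by the table's rule
# The 50-subscriber case from the start of this section.
print(delta_is_trustworthy(50, 0.232))    # False
```

Passing the rule of thumb is a first filter only. A formal check can still disagree when the raw click counts are small.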

For a precise check, plug the numbers into any online A/B test calculator (search "A/B test significance calculator"). 95% confidence is the standard bar. Below 90% confidence, don't act on it.

The "All variants" tab — the data that matters

This is where you compare variants side-by-side. For each variant, you see:

Metric	What it means
Delivered	How many emails arrived (same-ish across variants since split was random)
Open rate	% opened (caveat: Apple Mail Privacy Protection inflates this)
Click rate	% clicked (the reliable metric)
Click-to-open rate (CTOR)	% of openers who clicked — best content-quality signal
Bounces	Should be similar across variants
Unsubscribes	If one variant has 2x the unsubs, that variant alienated subscribers
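If you export the raw counts, the rates in this table are simple ratios. The sketch below assumes open rate and click rate are divided by delivered emails; this guide does not say which denominator AcelleMail uses, so check against the report before relying on it. The class name `VariantStats` is hypothetical. The sample numbers are the combined screenshot totals quoted earlier (1,176 delivered / 467 opens / 47 clicks), not per-variant figures.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    delivered: int
    opens: int
    clicks: int

    @property
    def open_rate(self) -> float:
        # Assumption: rates are computed against delivered emails.
        return self.opens / self.delivered

    @property
    def click_rate(self) -> float:
        return self.clicks / self.delivered

    @property
    def ctor(self) -> float:
        """Click-to-open rate: share of openers who clicked."""
        return self.clicks / self.opens if self.opens else 0.0


stats = VariantStats(delivered=1176, opens=467, clicks=47)
print(f"open rate  {stats.open_rate:.1%}")
print(f"click rate {stats.click_rate:.1%}")
print(f"CTOR       {stats.ctor:.1%}")
```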

The 4 patterns you'll see:

Pattern	Reading
Variant A: higher opens AND higher clicks	Clear winner. Subject + content both worked.
Variant A: higher opens, similar clicks	Subject pulled more opens but content didn't capitalise. A is "better" if your KPI is opens; B is "as good" if your KPI is clicks.
Variant A: similar opens, higher clicks	Subject was a tie; B's content/CTA underperformed. The subject didn't matter; the content did.
Variant A: lower opens, higher clicks	Counter-intuitive. A drew fewer openers, but they were more engaged. Often happens with personalisation tests.
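The four patterns can be turned into a lookup once you decide what counts as "similar". The sketch below assumes a 5% relative tolerance. That number is an illustrative choice, not an AcelleMail setting, and the names `compare` and `read_pattern` are hypothetical.

```python
def compare(a: float, b: float, tolerance: float = 0.05) -> str:
    """Classify rate a against rate b as 'higher', 'lower' or 'similar'."""
    if b == 0:
        return "similar" if a == 0 else "higher"
    relative = (a - b) / b
    if relative > tolerance:
        return "higher"
    if relative < -tolerance:
        return "lower"
    return "similar"


# Keys are (opens of A vs B, clicks of A vs B), as in the table.
READINGS = {
    ("higher", "higher"): "Clear winner: subject and content both worked.",
    ("higher", "similar"): "Subject pulled more opens; content did not capitalise.",
    ("similar", "higher"): "Subject was a tie; the content made the difference.",
    ("lower", "higher"): "Fewer but more engaged openers.",
}


def read_pattern(open_a, open_b, click_a, click_b, tolerance=0.05) -> str:
    key = (compare(open_a, open_b, tolerance),
           compare(click_a, click_b, tolerance))
    return READINGS.get(key, "Not one of the four patterns; read the numbers directly.")


# Rates from the worked example later in this guide.
print(read_pattern(0.351, 0.312, 0.0301, 0.0239))
```

The table is written from Variant A's point of view. To read it from B's side, swap the arguments.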

What to do per verdict

Winner (clear, statistically significant)

✅ Promote the winning pattern. Update your subject-line / CTA / from-name standards to match.

Next test: vary a DIFFERENT element. If you tested subject lines this time, test CTA copy or send time next.

Inconclusive (delta <5%, low confidence)

⚠️ Both variants performed similarly. Don't conclude anything — the test didn't have enough power.

Next test: increase the test audience size (from 20% to 40%), OR test a more dramatically different variation.

Loss (variant B significantly underperformed A)

✅ Also valuable! You learned what NOT to do. Cross-reference variant B's pattern against your "stuff that didn't work" notes for future reference.

Next test: test the next hypothesis — don't keep beating up variant B.

Surprise (a variant you thought would lose actually won)

⚠️ The most valuable result. Audit WHY:

Subject line surprised? Maybe length / formality / emoji preference shifted
From-name surprised? Maybe personal-name-vs-brand norms shifted
CTA surprised? Maybe your offer messaging needs an update

Surprises are where you learn the most about your audience.

Statistical-significance worked example

Setup:

Variant A subject: "5 things every marketer should do this quarter"
Variant B subject: "Q2 newsletter"
Audience: 10,000, 50/50 split
Metric: click rate

Result after 72h:

	Delivered	Opened (open rate)	Clicked	CTR
Variant A	4,820	1,690 (35.1%)	145	3.01%
Variant B	4,810	1,500 (31.2%)	115	2.39%

CTR delta: (3.01% − 2.39%) / 2.39% = +25.9% relative improvement for A.
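The same arithmetic as a short sketch, mainly to show how far apart the relative delta and the percentage-point gap are. The function names are hypothetical, and the inputs are the rounded click rates from the table (3.01% and 2.39%).

```python
def relative_lift(winner_rate: float, runner_up_rate: float) -> float:
    """Relative improvement of the winner over the runner-up, as a fraction."""
    if runner_up_rate <= 0:
        raise ValueError("runner-up rate must be positive")
    return (winner_rate - runner_up_rate) / runner_up_rate


def absolute_gap_points(winner_rate: float, runner_up_rate: float) -> float:
    """Difference between the two rates, in percentage points."""
    return (winner_rate - runner_up_rate) * 100


ctr_a, ctr_b = 0.0301, 0.0239
print(f"relative lift: {relative_lift(ctr_a, ctr_b):+.1%}")
print(f"absolute gap:  {absolute_gap_points(ctr_a, ctr_b):.2f} points")
```

A relative lift of about 26% here corresponds to a gap of only 0.62 percentage points, which is why the raw counts matter when judging significance.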

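At ~4,800 per variant with 145 vs 115 clicks, a two-proportion z-test gives z ≈ 1.87, which is roughly 94% confidence two-sided (about 97% one-sided). That clears the 90% floor and sits right at the 95% bar, so the result is actionable but not overwhelming. Calculators that report one-sided confidence will show the higher figure.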
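To check the significance yourself rather than rely on a calculator, one common method is a pooled two-proportion z-test. The sketch below uses only the Python standard library. It is one reasonable choice of test, not necessarily what any given online calculator or AcelleMail uses, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Return (z, two_sided_confidence) for the difference in click rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    two_sided_p = 2 * (1 - cdf)
    return z, 1 - two_sided_p


z, confidence = two_proportion_z_test(145, 4820, 115, 4810)
print(f"z = {z:.2f}, two-sided confidence = {confidence:.1%}")
```

For these counts the two-sided confidence comes out at roughly 94%. A one-sided reading (only asking "is A better than B?") is higher, which is one reason different calculators can report different numbers for the same data.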

Verdict: ship Variant A as the winning subject pattern. Next test: hold the subject style; vary the CTA copy.

Common pitfalls

Pitfall	What's wrong	Fix
Calling a winner at 12 hours	Most opens arrive early, but late opens and clicks can still shift CTR. Wait the configured period (24-72h).	Patience. Let the configured wait period run its course.
Testing 4 things at once	Subject + sender + content + send time — you don't know which moved the needle.	One variable at a time.
Re-testing the same hypothesis 3 times in a row	Diminishing returns; audience may also adapt.	Pick a different lever per quarter.
Ignoring unsubscribe rate	A variant with higher clicks AND higher unsubs is poisoning long-term LTV.	Treat unsubscribe rate as a guard metric — never let the winner have 2x the unsubs.
Manual-selection bias	When picking manual winner, you choose the variant you predicted, regardless of data.	If using manual, document the prediction BEFORE seeing data; compare.
Testing on a small list (<2,000)	Very little statistical power at typical test splits. Even big deltas are usually noise.	Pause testing on small lists; rely on industry benchmarks or test pooled across campaigns.
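The unsubscribe guard from the table is easy to automate before you promote a winner. This is a minimal sketch: the function name `passes_unsub_guard` is hypothetical, the 2x limit comes from the pitfall above, and the sample counts are made up for illustration.

```python
def passes_unsub_guard(winner_unsubs: int, winner_delivered: int,
                       other_unsubs: int, other_delivered: int,
                       max_ratio: float = 2.0) -> bool:
    """Return False if the winner's unsubscribe rate is max_ratio times
    (or more) the other variant's rate."""
    winner_rate = winner_unsubs / winner_delivered
    other_rate = other_unsubs / other_delivered
    if other_rate == 0:
        # No baseline to compare against: pass only if the winner also has none.
        return winner_rate == 0
    return winner_rate / other_rate < max_ratio


# Hypothetical counts: winner has 12 unsubs, the other variant 5.
print(passes_unsub_guard(12, 4820, 5, 4810))   # False: more than 2x the rate
print(passes_unsub_guard(6, 4820, 5, 4810))    # True
```

With very small unsubscribe counts the ratio is noisy, so treat a failed guard as a prompt to look at the variant, not as an automatic veto.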

What to log per test (a running playbook)

Maintain a doc tracking every A/B test. Each row:

Field	Example
Date	2026-05-19
Campaign	Spring Sale 2026
Variable tested	Subject line length
Variant A	"5 things every marketer..." (46 chars)
Variant B	"Q2 newsletter" (13 chars)
Winner metric	Click rate
Winner	A (3.01% vs 2.39%, +25.9%)
Confidence	~94% (two-sided)
Decision applied	Use long-descriptive subjects going forward
Re-test in	Q3 2026

After 10-20 tests, patterns emerge — "our audience consistently prefers question-format subjects on Tuesdays." That's the real long-term value.
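If you would rather keep the playbook as a CSV file than a doc, the fields above map directly onto a record type. The sketch below is one possible layout using the standard library; the names `ABTestLogRow` and `write_rows` are hypothetical, and the sample row reuses the example values from the table, with the confidence figure taken from a two-sided two-proportion z-test on the worked example's counts.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ABTestLogRow:
    date: str
    campaign: str
    variable_tested: str
    variant_a: str
    variant_b: str
    winner_metric: str
    winner: str
    confidence: str
    decision_applied: str
    retest_in: str


def write_rows(stream, rows, write_header=True):
    """Write log rows as CSV to any text stream (a file or StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(ABTestLogRow)])
    if write_header:
        writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


example_row = ABTestLogRow(
    date="2026-05-19",
    campaign="Spring Sale 2026",
    variable_tested="Subject line length",
    variant_a="5 things every marketer should do this quarter",
    variant_b="Q2 newsletter",
    winner_metric="Click rate",
    winner="A (3.01% vs 2.39%, +25.9%)",
    confidence="~94% (two-sided)",
    decision_applied="Use long-descriptive subjects going forward",
    retest_in="Q3 2026",
)

buffer = io.StringIO()
write_rows(buffer, [example_row])
print(buffer.getvalue())
```

To append to a real file, open it with `open(path, "a", newline="")` and pass `write_header=False` once the header row exists.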


