A deliverability incident is the worst category of operations problem because the damage compounds: every hour you continue sending into a reputation event extends the recovery period 1.5-3×. The senders who survive incidents are the ones with a clear runbook — pause first, diagnose second, fix third — not the ones who try to "send through" the problem.
This runbook covers the four most-common deliverability incidents, the AcelleMail commands + dashboard queries to use, and the post-mortem template.
## Severity classification

| Severity | Symptom | Time to respond |
|---|---|---|
| P0 — emergency | Spamhaus or Barracuda listing; full sending block from a major ISP | Within 30 minutes |
| P1 — high | Gmail "Bad" or Microsoft RED; sustained > 0.50% complaint rate | Within 2 hours |
| P2 — moderate | Single-ISP throttling (yellow); single-campaign complaint spike | Within 24 hours |
| P3 — low | Reputation drop one tier (High → Medium); rising bounce rate | Investigate next business day |
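The table's thresholds can be wired straight into an alerting script so classification is consistent across responders. A minimal sketch, assuming your monitoring can supply a complaint-rate percentage and a blocklist flag (the function name and exact cutoffs are illustrative):

```shell
#!/bin/sh
# Sketch: map the two numeric signals from the table above to a severity.
# Thresholds mirror the table; adjust to your own tolerances.
classify_severity() {
  # $1 = complaint rate in percent (e.g. "0.62"), $2 = 1 if blocklisted, else 0
  rate_bp=$(awk -v r="$1" 'BEGIN { printf "%d", r * 100 }')  # basis points
  if [ "$2" -eq 1 ]; then
    echo "P0"                       # any Spamhaus/Barracuda listing
  elif [ "$rate_bp" -ge 50 ]; then
    echo "P1"                       # sustained > 0.50% complaints
  elif [ "$rate_bp" -ge 10 ]; then
    echo "P2"                       # single-campaign spike territory
  else
    echo "P3"
  fi
}
```

Feeding this from a cron job means the "time to respond" clock starts from detection, not from whenever someone happens to look at a dashboard.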
## P0 — Spamhaus / Barracuda listing

### Triage (5 min)
```bash
# Confirm the listing for each of your sending IPs
for ip in YOUR.SENDING.IP.HERE; do
  reversed=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1}')
  for bl in zen.spamhaus.org b.barracudacentral.org; do
    echo -n "$bl: "; dig +short "$reversed.$bl"
  done
done
```
If zen.spamhaus.org returns 127.0.0.2, 127.0.0.3, or 127.0.0.4, you're listed. Action sequence:
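The other Zen return codes are worth decoding too, because the sub-list tells you which remediation path applies. A small sketch using the published Spamhaus Zen return-code ranges (verify current codes against Spamhaus documentation before relying on them):

```shell
# Sketch: decode a zen.spamhaus.org A-record answer into the sub-list it
# came from. Codes are the standard published Zen return values - verify.
decode_zen() {
  case "$1" in
    127.0.0.2)              echo "SBL (direct spam source)" ;;
    127.0.0.3)              echo "SBL CSS (snowshoe/compromised)" ;;
    127.0.0.4|127.0.0.5|127.0.0.6|127.0.0.7)
                            echo "XBL (exploited/infected host)" ;;
    127.0.0.10|127.0.0.11)  echo "PBL (policy block - dynamic space)" ;;
    *)                      echo "unknown code: $1" ;;
  esac
}
```

An XBL hit points at a compromised host, a PBL hit at sending from IP space your provider never authorized for mail; each implies a different fix than an SBL listing.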
### 1. Pause sending immediately (5 min)
Admin → Sending Servers → <listed IP> → Edit → Status: Paused
This stops new dispatches. In-flight messages will retry per Laravel's queue — they'll fail at SMTP because the listing means receivers reject. Don't drain the queue manually; let the queue fail naturally.
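If you prefer scripting the pause over clicking through the UI, the same effect can be had with a direct database update. A sketch only: the `sending_servers` table and `status` value here are assumptions about a typical AcelleMail schema, so verify against your installation and dry-run first:

```shell
# Sketch: emit the pause statement for a listed server.
# Table/column names assumed from a typical AcelleMail schema - verify.
pause_server_sql() {
  # $1 = sending server id
  echo "UPDATE sending_servers SET status = 'inactive' WHERE id = $1;"
}

# Dry run first; pipe to mysql only once the statement looks right:
#   pause_server_sql 3 | mysql -u acelle -p acelle_db
```

Emitting the SQL rather than executing it directly keeps the script safe to run during a stressful incident.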
### 2. Investigate cause (15-30 min)
Why are you listed? Probable causes (in frequency order):
- Spam-trap hit — a known-trap address was on your list. Check trap-hit data in SNDS and Spamhaus's listing details.
- Compromised account — someone gained access to your AcelleMail and sent spam. Audit access logs.
- Weak unsubscribe handling — repeat-recipients of unwanted mail report you, eventually triggering volumetric flags.
- Compromised customer (multi-tenant) — one of your customers was sending bad mail. Pause the customer pending audit.
Check Spamhaus's SBL Lookup for the specific listing reason — they sometimes provide details.
### 3. Apply fix (1-4 hours)
- Clean the data: suppress all addresses with bounce type `hard` from the last 90 days. Suppress all addresses on any external suppression list (NeverBounce or ZeroBounce check). Consider a list-cleaning campaign.
- Audit access: check the `users` table for unfamiliar accounts. Force a password reset on all admins.
- Strengthen unsubscribe: verify the unsubscribe link works and that requests are honored within 24 hours.
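The data-cleaning step can be expressed as a single suppression statement. A sketch only: the column names (`subscribers.status`, `bounce_logs.bounce_type`) are assumed from a typical AcelleMail schema, should be verified against yours, and you should run the equivalent SELECT first to see what you are about to suppress:

```shell
# Sketch: suppression statement for 90-day hard bounces (dry run).
# Schema assumed from AcelleMail's subscribers/tracking_logs/bounce_logs.
suppress_hard_bounces_sql() {
  cat <<'SQL'
UPDATE subscribers s
JOIN tracking_logs tl ON tl.subscriber_id = s.id
JOIN bounce_logs bl   ON bl.tracking_log_id = tl.id
SET s.status = 'unsubscribed'
WHERE bl.bounce_type = 'hard'
  AND bl.created_at > NOW() - INTERVAL 90 DAY;
SQL
}
```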
### 4. Submit delisting petition (15 min)
Spamhaus: removal.spamhaus.org — describe the corrective actions you took. Be honest; Spamhaus has seen every excuse.
Approval typically takes 6-72 hours. While waiting, don't send from the listed IP — even a tentative resume can reset the listing clock.
### 5. Resume cautiously (after delisting)
After delisting, resume at 5-10% of pre-incident volume. Watch SNDS + Postmaster daily. Ramp 25% per day. Full restoration: 7-14 days.
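The 10% start with 25% daily growth works out to roughly a two-week ramp, which you can precompute so the schedule is agreed before sending resumes. A small sketch (function name is illustrative):

```shell
# Sketch: post-delisting ramp plan - start at 10% of pre-incident volume,
# grow 25% per day, cap at 100%.
ramp_schedule() {
  # $1 = pre-incident daily volume
  pct=10
  day=1
  while [ "$pct" -le 100 ]; do
    echo "day $day: $(( $1 * pct / 100 )) messages ($pct%)"
    pct=$(( pct * 125 / 100 ))
    day=$(( day + 1 ))
  done
  echo "day $day: $1 messages (100%)"
}
```

Integer arithmetic keeps it POSIX-sh portable; the schedule lands at full volume around day 13, consistent with the 7-14 day restoration window above.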
## P1 — Gmail Bad reputation

### Triage
Open Postmaster Tools. If IP or Domain reputation shows "Bad":
### 1. Pause Gmail-segment sending (10 min)
If you have multi-server rotation, pause the Gmail-segment server. Otherwise, segment by recipient domain and exclude *@gmail.com from new sends:
```sql
-- Verify scope of "Gmail" recipients (informational)
SELECT COUNT(*)
FROM subscribers
WHERE email LIKE '%@gmail.com' OR email LIKE '%@googlemail.com';
```
### 2. Investigate (30-60 min)
- Postmaster's "Spam Rate" graph — likely > 0.30%.
- Postmaster's "Authentication" graph — verify SPF/DKIM/DMARC are 100% pass.
- AcelleMail's per-campaign complaint rate — find the spiking campaign:
```sql
SELECT c.id, c.name, c.subject,
       COUNT(DISTINCT tl.id) AS sent,
       SUM(CASE WHEN bl.bounce_type = 'unknown' THEN 1 ELSE 0 END) AS complaints,
       ROUND(SUM(CASE WHEN bl.bounce_type = 'unknown' THEN 1 ELSE 0 END) * 100.0
             / COUNT(DISTINCT tl.id), 3) AS compl_pct
FROM campaigns c
JOIN tracking_logs tl ON tl.campaign_id = c.id
LEFT JOIN bounce_logs bl ON bl.tracking_log_id = tl.id
WHERE c.created_at > NOW() - INTERVAL 14 DAY
GROUP BY c.id, c.name, c.subject
HAVING compl_pct > 0.10
ORDER BY compl_pct DESC;
```

(`COUNT(DISTINCT tl.id)` guards against double-counting a send when a tracking log has more than one bounce row.)
### 3. Apply fix (2-6 hours)
- Suppress complaining addresses + low-engagement addresses.
- Send only to engaged-30-day Gmail segment for 7 days.
- If a specific campaign's content was the cause, audit subject lines and unsubscribe handling.
### 4. Recovery
Reputation recovers in 7-21 days from Bad → Low → Medium → High. Don't rush volume increases — push past 50% of pre-incident volume only after Postmaster shows Medium for 5 consecutive days.
## P1 — Microsoft RED
Same pattern as Gmail Bad, but with SNDS as the diagnostic dashboard. The corrective is the same — pause Microsoft-segment sending, audit, suppress, recover. Microsoft's recovery is generally faster than Gmail's (a clean RED → GREEN often in 7-14 days).
## P2 — Complaint-rate spike (single campaign)
A single campaign with > 1% complaint rate is a content/audience problem, not a reputation event yet. Action:
- Suppress complainers — AcelleMail's FBL handler does this automatically if FBL is configured (per FBL setup).
- Audit the campaign — subject line, content, unsub-link visibility. Common cause: subject didn't match content (clickbait), or audience didn't expect this type of mail.
- Reduce frequency for that segment — don't send another campaign to the same audience for 7-14 days.
- Watch reputation — Postmaster + SNDS for 7 days. If reputation doesn't drop, the suppression worked.
This level of incident is recoverable without pause — but P1 protocols apply if reputation does drop in the following days.
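The 1% trigger is easy to check mechanically once you have per-campaign complaint and delivered counts, which makes it a good candidate for a post-send cron check. A sketch, using awk for the floating-point comparison (function name is illustrative):

```shell
# Sketch: compare a campaign's complaint rate against a percent threshold.
complaint_rate_exceeds() {
  # $1 = complaints, $2 = delivered, $3 = threshold percent (e.g. 1.0)
  awk -v c="$1" -v d="$2" -v t="$3" \
    'BEGIN { exit !(d > 0 && c * 100 / d > t) }'
}

# Example: 130 complaints on 12,000 delivered is 1.08%, over the line
if complaint_rate_exceeds 130 12000 1.0; then
  echo "P2: complaint rate above 1% - suppress and audit"
fi
```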
## Post-mortem template
After every P0 or P1 incident, write a brief post-mortem (preserve in docs/incidents/YYYY-MM-DD-summary.md):
```markdown
# Incident YYYY-MM-DD — <one-line summary>

## Severity
P0 / P1

## Timeline (UTC)
- HH:MM — first detection (alert source)
- HH:MM — pause issued
- HH:MM — root cause identified
- HH:MM — fix applied
- HH:MM — sending resumed at <X>% volume
- HH:MM — fully recovered

## Root cause
<what actually caused it — be specific>

## Impact
- Affected ISPs: <list>
- Estimated lost-deliverability volume: <number>
- Reputation recovery time: <days>

## What worked
<actions that helped>

## What didn't
<actions that didn't help or made it worse>

## Action items
- [ ] <preventive change 1>
- [ ] <preventive change 2>
- [ ] <monitoring/alerting addition>
```
These accumulate institutional knowledge. The fifth incident is dramatically easier than the first when there's a doc to refer to.
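Scaffolding the file at incident time lowers the friction of actually writing the post-mortem. A sketch that creates the skeleton under `docs/incidents/` (function name is illustrative):

```shell
# Sketch: scaffold today's incident post-mortem from the template above.
new_postmortem() {
  # $1 = repo/docs root, $2 = one-line summary slug (e.g. "spamhaus-listing")
  dir="$1/docs/incidents"
  mkdir -p "$dir"
  f="$dir/$(date -u +%Y-%m-%d)-$2.md"
  {
    echo "# Incident $(date -u +%Y-%m-%d) - $2"
    echo "## Severity"
    echo "## Timeline (UTC)"
    echo "## Root cause"
    echo "## Impact"
    echo "## What worked"
    echo "## What didn't"
    echo "## Action items"
  } > "$f"
  echo "$f"   # print the path so the caller can open it
}
```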
## FAQ

### Can I outsource this?
Some operators use deliverability consultants (e.g. SocketLabs, Mailgun's Inbox Pro) for incident response. Effective if you don't have someone on staff who can be on-call. Cost: $200-500/incident or retainer.
### What if I can't determine the root cause?
If 24 hours of investigation produces no clear cause, treat it as "audience cleaning" — suppress all subscribers with zero engagement in 90 days, send only to the engaged half for 14 days, then re-evaluate. Often the unknown cause was just accumulated list rot.
### How do I prevent recurrence?
Wire up the alerting in Sender Reputation Monitoring Stack so you catch the next incident at P3 instead of P0. Most P0 events were P3 events ignored 7-14 days earlier.
### What about communicating with affected customers?
If you're a multi-tenant AcelleMail provider, customers using the affected sending servers will see degraded deliverability. Email them within 4 hours of incident detection — saying "we're investigating" early is dramatically better than silence. Update at status-page cadence (every 2-4 hours during active incident).