
Deliverability Incident Runbook

A step-by-step runbook for the four most-common deliverability incidents — Spamhaus listing, Gmail Bad reputation, Microsoft RED, and complaint-rate spike. Triage, fix, and post-mortem.

A deliverability incident is the worst category of operations problem because the damage compounds: every hour you continue sending into a reputation event extends the recovery period 1.5-3×. The senders who survive incidents are the ones with a clear runbook — pause first, diagnose second, fix third — not the ones who try to "send through" the problem.

This runbook covers the four most-common deliverability incidents, the AcelleMail commands + dashboard queries to use, and the post-mortem template.

Severity classification

Severity       | Symptom                                                             | Time-to-respond
P0 — emergency | Spamhaus or Barracuda listing; full sending block from a major ISP | Within 30 minutes
P1 — high      | Gmail "Bad" or Microsoft RED; sustained > 0.50% complaint rate     | Within 2 hours
P2 — moderate  | Single-ISP throttling (yellow); single-campaign complaint spike    | Within 24 hours
P3 — low       | Reputation drop one tier (High → Medium); rising bounce rate       | Investigate next business day

P0 — Spamhaus / Barracuda listing

Triage (5 min)

# Confirm the listing (list every sending IP in the for-loop)
for ip in YOUR.SENDING.IP.HERE; do
  reversed=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1}')
  for bl in zen.spamhaus.org b.barracudacentral.org; do
    echo -n "$bl: "; dig +short "$reversed.$bl"
  done
done

If zen.spamhaus.org returns 127.0.0.2 or 127.0.0.3 (SBL/CSS) or anything in 127.0.0.4-127.0.0.7 (XBL), you're listed. Action sequence:

1. Pause sending immediately (5 min)

Admin → Sending Servers → <listed IP> → Edit → Status: Paused

This stops new dispatches. In-flight messages will retry per Laravel's queue and fail at SMTP, since receivers reject mail from a listed IP. Don't drain the queue manually; let it fail naturally.

2. Investigate cause (15-30 min)

Why are you listed? Probable causes (in frequency order):

  • Spam-trap hit — a known-trap address was on your list. Check trap-hit data in SNDS and Spamhaus's listing details.
  • Compromised account — someone gained access to your AcelleMail and sent spam. Audit access logs.
  • Weak unsubscribe handling — recipients who can't easily opt out keep reporting you as spam, eventually triggering volumetric flags.
  • Compromised customer (multi-tenant) — one of your customers was sending bad mail. Pause the customer pending audit.

Check Spamhaus's SBL Lookup for the specific listing reason — they sometimes provide details.
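
If you run multi-tenant, a quick way to spot the offending customer is to rank tenants by recent bounce volume. A minimal sketch, reusing the bounce_logs/tracking_logs join from the complaint query later in this runbook; the customer_id column on campaigns and the created_at timestamps are assumptions about a stock AcelleMail schema, so verify before relying on the output:

-- Rank customers by bounces generated in the last 48 hours.
-- ASSUMPTION: campaigns carry a customer_id column and tracking_logs
-- has a created_at timestamp; verify against your schema.
SELECT c.customer_id, COUNT(bl.id) AS bounces_48h
FROM campaigns c
JOIN tracking_logs tl ON tl.campaign_id = c.id
JOIN bounce_logs bl ON bl.tracking_log_id = tl.id
WHERE tl.created_at > NOW() - INTERVAL 48 HOUR
GROUP BY c.customer_id
ORDER BY bounces_48h DESC
LIMIT 10;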

3. Apply fix (1-4 hours)

  • Clean the data: suppress all addresses that hard-bounced in the last 90 days. Suppress all addresses on any external suppression list (e.g., a NeverBounce or ZeroBounce check). Consider a list-cleaning campaign. A query sketch follows this list.
  • Audit access: check users table for unfamiliar accounts. Force password reset on all admins.
  • Strengthen unsubscribe: verify the unsub link works and is honored within 24 hours.
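
Two read-only sketches for the first two items, reusing the subscribers/tracking_logs/bounce_logs tables that appear in the queries later in this runbook. The created_at columns and the users-table shape are assumptions (Laravel-style schema); verify names against your install before acting on the output:

-- Suppression candidates: addresses that hard-bounced in the last 90 days.
SELECT DISTINCT s.email
FROM subscribers s
JOIN tracking_logs tl ON tl.subscriber_id = s.id
JOIN bounce_logs bl ON bl.tracking_log_id = tl.id
WHERE bl.bounce_type = 'hard'
  AND bl.created_at > NOW() - INTERVAL 90 DAY;

-- Access audit: accounts created recently. Column names are assumptions
-- (Laravel-style users table); adjust to your schema.
SELECT id, email, created_at
FROM users
WHERE created_at > NOW() - INTERVAL 30 DAY
ORDER BY created_at DESC;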

4. Submit delisting petition (15 min)

Spamhaus: removal.spamhaus.org — describe the corrective actions you took. Be honest; Spamhaus has seen every excuse.

Approval: 6-72 hours typically. While waiting, don't send from the listed IP — even tentative resumes will reset the listing clock.

5. Resume cautiously (after delisting)

After delisting, resume at 5-10% of pre-incident volume. Watch SNDS + Postmaster daily. Ramp 25% per day. Full restoration: 7-14 days.
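
As a sanity check on that schedule: 10% compounding at 25% per day hits 100% on day 12, consistent with the 7-14 day window. A throwaway calculation (MySQL 8+ recursive CTE; @pre_incident_daily is a placeholder for your real volume):

-- Day-by-day sending caps for a 10%-start, 25%-per-day ramp.
SET @pre_incident_daily = 100000;  -- placeholder: your pre-incident daily volume
WITH RECURSIVE ramp (day_n, pct) AS (
  SELECT 1, CAST(10 AS DECIMAL(6,2))
  UNION ALL
  SELECT day_n + 1, LEAST(pct * 1.25, 100) FROM ramp WHERE pct < 100
)
SELECT day_n,
  pct AS pct_of_normal,
  ROUND(@pre_incident_daily * pct / 100) AS daily_cap
FROM ramp;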

P1 — Gmail Bad reputation

Triage

Open Postmaster Tools. If IP or Domain reputation shows "Bad":

1. Pause Gmail-segment sending (10 min)

If you have multi-server rotation, pause the Gmail-segment server. Otherwise, segment by recipient domain and exclude *@gmail.com from new sends:

-- Verify scope of "Gmail" recipients (informational)
SELECT COUNT(DISTINCT subscriber_id)
FROM subscribers
WHERE email LIKE '%@gmail.com' OR email LIKE '%@googlemail.com';

2. Investigate (30-60 min)

  • Postmaster's "Spam Rate" graph — likely > 0.30%.
  • Postmaster's "Authentication" graph — verify SPF/DKIM/DMARC are 100% pass.
  • AcelleMail's per-campaign complaint rate — find the spiking campaign:
-- Per-campaign complaint rate over the last 14 days. Note: this query
-- treats bounce_logs rows with bounce_type 'unknown' as complaints;
-- adjust the predicate if your FBL handler records complaints differently.
SELECT c.id, c.name, c.subject,
  COUNT(tl.id) as sent,
  SUM(CASE WHEN bl.bounce_type = 'unknown' THEN 1 ELSE 0 END) as complaints,
  ROUND(SUM(CASE WHEN bl.bounce_type = 'unknown' THEN 1 ELSE 0 END) * 100.0 / COUNT(tl.id), 3) as compl_pct
FROM campaigns c
JOIN tracking_logs tl ON tl.campaign_id = c.id
LEFT JOIN bounce_logs bl ON bl.tracking_log_id = tl.id
WHERE c.created_at > NOW() - INTERVAL 14 DAY
GROUP BY c.id
HAVING compl_pct > 0.10
ORDER BY compl_pct DESC;

3. Apply fix (2-6 hours)

  • Suppress complaining addresses + low-engagement addresses.
  • Send only to the engaged-30-day Gmail segment for 7 days (a query sketch follows this list).
  • If a specific campaign's content was the cause, audit subject lines and unsubscribe handling.
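
One way to materialize that engaged segment, as a hedged sketch: the open_logs table and its message_id join to tracking_logs are assumptions about a stock AcelleMail schema, so verify column names on your install:

-- Gmail addresses with at least one open in the last 30 days.
-- ASSUMPTION: open_logs keyed to tracking_logs by message_id.
SELECT DISTINCT s.id, s.email
FROM subscribers s
JOIN tracking_logs tl ON tl.subscriber_id = s.id
JOIN open_logs ol ON ol.message_id = tl.message_id
WHERE (s.email LIKE '%@gmail.com' OR s.email LIKE '%@googlemail.com')
  AND ol.created_at > NOW() - INTERVAL 30 DAY;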

4. Recovery

Reputation recovers in 7-21 days from Bad → Low → Medium → High. Don't rush volume increases — push past 50% of pre-incident volume only after Postmaster shows Medium for 5 consecutive days.

P1 — Microsoft RED

Same pattern as Gmail Bad, but with SNDS as the diagnostic dashboard. The corrective sequence is the same — pause Microsoft-segment sending, audit, suppress, recover. Microsoft's recovery is generally faster than Gmail's (a clean RED → GREEN transition often takes 7-14 days).

P2 — Complaint-rate spike (single campaign)

A single campaign with > 1% complaint rate is a content/audience problem, not a reputation event yet. Action:

  1. Suppress complainers — AcelleMail's FBL handler does this automatically if FBL is configured (per FBL setup); a manual fallback query follows this list.
  2. Audit the campaign — subject line, content, unsub-link visibility. Common cause: subject didn't match content (clickbait), or audience didn't expect this type of mail.
  3. Reduce frequency for that segment — don't send another campaign to the same audience for 7-14 days.
  4. Watch reputation — Postmaster + SNDS for 7 days. If reputation doesn't drop, the suppression worked.
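
If FBL isn't configured yet, a manual fallback sketch, reusing the bounce_type = 'unknown' complaint convention from the Gmail query above. <CAMPAIGN_ID> is a placeholder, and that convention is itself an assumption to verify:

-- Complainers on a single campaign, for manual suppression review.
SELECT DISTINCT s.id, s.email
FROM subscribers s
JOIN tracking_logs tl ON tl.subscriber_id = s.id
JOIN bounce_logs bl ON bl.tracking_log_id = tl.id
WHERE tl.campaign_id = <CAMPAIGN_ID>
  AND bl.bounce_type = 'unknown';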

This level of incident is recoverable without pause — but P1 protocols apply if reputation does drop in the following days.

Post-mortem template

After every P0 or P1 incident, write a brief post-mortem (preserve in docs/incidents/YYYY-MM-DD-summary.md):

# Incident YYYY-MM-DD — <one-line summary>

## Severity
P0 / P1

## Timeline (UTC)
- HH:MM — first detection (alert source)
- HH:MM — pause issued
- HH:MM — root cause identified
- HH:MM — fix applied
- HH:MM — sending resumed at <X>% volume
- HH:MM — fully recovered

## Root cause
<what actually caused it — be specific>

## Impact
- Affected ISPs: <list>
- Estimated lost-deliverability volume: <number>
- Reputation recovery time: <days>

## What worked
<actions that helped>

## What didn't
<actions that didn't help or made it worse>

## Action items
- [ ] <preventive change 1>
- [ ] <preventive change 2>
- [ ] <monitoring/alerting addition>

These accumulate institutional knowledge. The fifth incident is dramatically easier than the first when there's a doc to refer to.

FAQ

Can I outsource this?

Some operators use deliverability consultants (e.g. SocketLabs, Mailgun's Inbox Pro) for incident response. Effective if you don't have someone on staff who can be on-call. Cost: $200-500/incident or retainer.

What if I can't determine the root cause?

If 24 hours of investigation produces no clear cause, treat it as "audience cleaning" — suppress all subscribers with zero engagement in 90 days, send only to the engaged half for 14 days, then re-evaluate. Often the unknown cause was just accumulated list rot.
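
A sketch of that zero-engagement cut, inverting the engaged-segment query from the Gmail section with an anti-join (the same open_logs schema assumption applies):

-- Subscribers with no recorded open in the last 90 days.
SELECT s.id, s.email
FROM subscribers s
WHERE NOT EXISTS (
  SELECT 1
  FROM tracking_logs tl
  JOIN open_logs ol ON ol.message_id = tl.message_id
  WHERE tl.subscriber_id = s.id
    AND ol.created_at > NOW() - INTERVAL 90 DAY
);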

How do I prevent recurrence?

Wire up the alerting in Sender Reputation Monitoring Stack so you catch the next incident at P3 instead of P0. Most P0 events were P3 events ignored 7-14 days earlier.

What about communicating with affected customers?

If you're a multi-tenant AcelleMail provider, customers using the affected sending servers will see degraded deliverability. Email them within 4 hours of incident detection — saying "we're investigating" early is dramatically better than silence. Update at status-page cadence (every 2-4 hours during active incident).
