Deliverability Incident Runbook — From Alert to Recovery in AcelleMail

Bounce rate spiked. Complaint volume jumped. A campaign is queueing and not sending. This is the step-by-step runbook for AcelleMail operators when something breaks — UI checks first, server-side diagnostics last.

The 5-minute triage flow

When something looks broken (campaign stuck / bounce rate spike / customer complaint), don't jump to SSH. Walk this exact flow in the AcelleMail dashboard first. Most incidents resolve here.

Step 1: Read the tracking log

Campaign → Tracking log tab:

[Figure: tracking log, per-message audit]

What the rows tell you:

  • All recent timestamps + "Sent" status: AcelleMail handed the messages to the SMTP server successfully. The issue is recipient-side.
  • Recent timestamps + "Failed" + retry counter: receiving servers are rejecting. Open the bounce log next.
  • No new rows in the last 30 min: the queue worker is dead OR the campaign is paused. Check campaign state + worker (see Advanced).

Step 2: Read the bounce log

Where the bounce signals live

Open the campaign → Bounce log tab. Every failed delivery is row-listed with the recipient, bounce type (Hard/Soft chip), and the raw reason from the receiving server:

[Figure: campaign bounce log, hard + soft bounces with DSN reason]

The per-row reason text is the receiving server's verbatim response, including its machine-readable DSN status code (5.x.x for permanent failures, 4.x.x for transient ones). See Decoding bounce messages for the full code reference.

Pattern-match the bounces:

What you see                                             | Diagnosis
Mostly 5.1.1 "User unknown"                              | Stale list: addresses no longer exist. Normal at <5%, problem above.
Mostly 5.7.x "Delivery not authorized"                   | Authentication broken (SPF/DKIM/DMARC) OR IP blocklisted
Mostly 4.x.x "Mailbox busy / try again"                  | Receiving server overloaded. AcelleMail retries automatically.
ALL bounces to one specific domain                       | That receiver is blocking you specifically. Investigate FBL signals.
Mix of 5.x + tracking log shows "Pending" rows piling up | Queue worker not picking up jobs: worker process down
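
The pattern-matching above can be sketched as a small shell helper. This is a minimal illustration, not an AcelleMail feature: classify_dsn and dominant_diagnosis are hypothetical names, the diagnosis labels are made up for this example, and it assumes you have already pulled the status codes out of the bounce log, one per line.

```shell
# Hypothetical helper (not part of AcelleMail): map a DSN status code
# such as 5.1.1 to a diagnosis label following the table above.
classify_dsn() {
  case "$1" in
    5.1.1) echo "stale-list" ;;
    5.7.*) echo "auth-or-blocklist" ;;
    4.*.*) echo "transient-retry" ;;
    5.*.*) echo "other-permanent" ;;
    *)     echo "unknown" ;;
  esac
}

# Read one status code per line on stdin and print the most frequent
# diagnosis. Ties are broken arbitrarily by sort order.
dominant_diagnosis() {
  while IFS= read -r code; do
    [ -n "$code" ] && classify_dsn "$code"
  done | sort | uniq -c | sort -rn | awk 'NR==1 {print $2}'
}
```

For example, `printf '5.7.1\n5.7.26\n5.1.1\n' | dominant_diagnosis` prints auth-or-blocklist, which points at Step 4 (sending server health) rather than at list hygiene.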

Step 3: Check the feedback log (complaints)

Same campaign → Feedback log tab:

[Figure: campaign feedback log, complaints]

Complaint volume tells you:

  • Spike + concentrated on one segment — that segment's consent was unclear or content was off-target. Pause sending to it; audit.
  • Spread across the list — content / subject mismatch. Audit the campaign before next send.
  • Zero complaints despite bounce spike — pure deliverability issue (auth/IP), not content/list.

Step 4: Check sending server health

Open Settings → Sending servers → click your active server:

[Figure: sending server config with SPF/DKIM/DMARC chips]

Verify all three: SPF green, DKIM green, DMARC green. Any red chip means receiving servers may throttle or reject your mail. Click Verify domain to walk through the DNS records.

Step 5: Run the live diagnostic

Same sending server → toolbar Send test email:

[Figure: test send modal]

Send to your own personal Gmail, Outlook, Yahoo. Open each:

  • All Inbox → reputation is OK; original issue is content-side or list-side
  • Some spam folder → ISP-specific reputation hit (Gmail or Outlook only). Check that ISP's postmaster tool.
  • None arrive → most likely IP blocklisted. Use mxtoolbox.com to identify which blocklist, then follow its delisting process.

Triage decision tree

Signal                                       | Action
Tracking log empty + campaign running >30min | Queue worker dead (escalate to operator)
5.7.x bounces + DKIM red                     | Re-run Verify domain wizard at sending-server config
4.x.x bounces + nothing else off             | Wait. AcelleMail auto-retries within 24h.
5.1.1 bounces >10% of recipients             | List import error: pause; run email verification on full list
Complaint spike >0.3%                        | PAUSE remaining sends; audit consent + content; do NOT resume to full list
Test send arrives in Spam at Gmail           | Gmail-specific. Check Postmaster Tools.
Test send doesn't arrive at all              | Likely IP blocklisted. Check mxtoolbox.com
All to one domain fail (e.g. all yahoo.com)  | Domain-specific block: pause sending to that domain; investigate via Yahoo Mail postmaster signals

Recovery checklist

After identifying root cause:

  1. Pause active sending if customer-visible impact (campaign log not draining, complaints spike)
  2. Fix the source — DNS record, list cleanup, content edit
  3. Send a small re-engagement to your most-engaged 5% to validate (NOT full list)
  4. Monitor for 24h — bounce + complaint rates back to baseline?
  5. Resume normal sending if all green
  6. Post-mortem within 48h (template below)

Post-mortem template

For each incident, capture in a shared doc:

INCIDENT: [Date] — [Short description]
DETECTED VIA: [Customer complaint / auto-alert / routine check]
DETECTED AT: [Timestamp]
SEVERITY: [Customer-facing impact: stalled campaigns / damaged reputation / complaint spike]

ROOT CAUSE: [One sentence — what actually broke]

TIMELINE:
  T+0:   [What happened]
  T+5m:  [What we did]
  T+30m: [What we did]
  T+...: [Resolution]

CONTRIBUTING FACTORS: [What made detection slow / response slow / impact wider]

PREVENTION: [What we change to make this not happen again — explicit action items with owners + dates]

LEARNINGS: [Even if no action — what we now know that we didn't before]

Advanced: operator-side server diagnostics + reputation recovery cycles

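When the dashboard checks point at infrastructure, work through the operator's checklist below. Replace acelle@acellemail.com in the commands with your own SSH user and host.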

Worker process health:

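ssh acelle@acellemail.com "
  # Count running queue workers
  ps aux | grep -E 'queue:work' | grep -v grep | wc -l
  # Expect 1+ per --queue= pool defined in supervisor config

  # Run the artisan commands from the AcelleMail install directory
  php artisan queue:size
  # Should drop toward zero; if growing, workers are stuck

  php artisan queue:failed
  # Lists jobs that gave up. Read the listing instead of piping it to wc -l:
  # artisan prints a status line or table borders even when nothing failed,
  # so the line count is never 0.
"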

Restart workers (most common single fix):

ssh acelle@acellemail.com "sudo supervisorctl restart all"

Outbound SMTP smoke from the AcelleMail host:

ssh acelle@acellemail.com "nc -zv email-smtp.us-east-1.amazonaws.com 587"
# Expected: succeeded. Failure = firewall outbound block.

TLS handshake check:

ssh acelle@acellemail.com "
  # grep -i is needed: openssl capitalizes 'Verification' and 'Verify return code'
  openssl s_client -connect smtp.sendgrid.net:587 -starttls smtp -crlf < /dev/null 2>/dev/null \
    | grep -iE '(subject|issuer|verif)'
"

DNS check from the AcelleMail host:

ssh acelle@acellemail.com "
  dig TXT yourdomain.com +short | grep -i spf
  dig TXT default._domainkey.yourdomain.com +short
  dig TXT _dmarc.yourdomain.com +short
"

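If any of these returns empty, either the DNS change has not propagated yet or the record is misconfigured. The default._domainkey name assumes a DKIM selector of "default"; substitute the selector shown in your sending server's DKIM settings.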
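
An empty answer is not the only failure mode: a record can exist and still be malformed. As a rough sketch, the helper below inspects TXT text as printed by dig +short and reports ok, missing, or malformed. txt_record_status is a hypothetical name, and the checks are deliberately shallow (they look only for the version tag or a non-empty DKIM p= key), so treat a pass as "looks plausible", not as full validation.

```shell
# Hypothetical helper: classify TXT record text as ok / missing / malformed.
# Usage: txt_record_status <spf|dkim|dmarc> "<output of dig TXT ... +short>"
txt_record_status() {
  kind=$1
  # dig +short wraps each TXT string in double quotes; strip them first.
  txt=$(printf '%s' "$2" | tr -d '"')
  if [ -z "$txt" ]; then
    echo "missing"
    return 1
  fi
  case "$kind:$txt" in
    spf:*v=spf1*)    echo "ok" ;;   # SPF version tag present
    dkim:*p=?*)      echo "ok" ;;   # non-empty public key tag
    dmarc:v=DMARC1*) echo "ok" ;;   # DMARC record must start with its version tag
    *)               echo "malformed"; return 1 ;;
  esac
}
```

On the AcelleMail host this could be called as `txt_record_status dmarc "$(dig TXT _dmarc.yourdomain.com +short)"`. The non-zero return status on missing or malformed records makes it usable in a scripted check.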

Reputation recovery cycle (when reputation hit confirmed):

Day 0:  PAUSE all bulk sending. Re-verify domain. Confirm SPF/DKIM/DMARC green.
        Begin sending ONLY to engaged-last-7d segment (highest open rate).
Day 7:  If engagement-segment sends stayed clean, expand to engaged-last-30d.
Day 14: Expand to engaged-last-90d. Daily check of SNDS / Gmail Postmaster.
Day 30: Resume to full segment if reputation back to baseline.

Mild reputation dips recover in 1-2 weeks; severe ones (>1% sustained complaint rate) take 4-6 weeks or require a fresh IP plus a warm-up cycle.
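
If the recovery schedule is driven from a script, the day-to-segment mapping can be sketched as below. recovery_segment is a hypothetical helper, and the segment names are simply the labels used in the schedule above; it does not check whether the previous stage stayed clean, which remains a manual gate before each expansion.

```shell
# Hypothetical helper: given whole days since sending was paused (Day 0),
# print the widest segment the recovery schedule allows.
recovery_segment() {
  day=$1
  if   [ "$day" -lt 7 ];  then echo "engaged-last-7d"
  elif [ "$day" -lt 14 ]; then echo "engaged-last-30d"
  elif [ "$day" -lt 30 ]; then echo "engaged-last-90d"
  else                         echo "full-segment"
  fi
}
```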

Post-mortem automation — wire the incident-template to a shared Notion / Confluence page via webhook. Every incident auto-creates a new entry; team fills it in within 48h. Builds institutional knowledge over time.

Pre-incident detection: run a daily audit script that exits non-zero if any of the following is true:

  • Yesterday's bounce rate > 4%
  • Yesterday's complaint rate > 0.2%
  • Any sending server has DKIM/SPF red chip
  • Queue depth at 8am exceeds 10× the average backlog

Wire to PagerDuty / Slack. Catches creeping problems before customers do.
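
A minimal sketch of the threshold logic for such an audit script follows. audit_check is a hypothetical function, and how you obtain yesterday's rates, the red-chip count, and the queue figures depends on your installation, so they are passed in as arguments here. Rates are percentages.

```shell
# Hypothetical audit check; thresholds mirror the list above.
# Arguments: bounce_pct complaint_pct red_chip_count queue_depth avg_backlog
# Prints one line per violated threshold and returns non-zero if any fired.
audit_check() {
  awk -v b="$1" -v c="$2" -v red="$3" -v q="$4" -v avg="$5" 'BEGIN {
    fail = 0
    if (b > 4)        { print "bounce rate " b "% > 4%"; fail = 1 }
    if (c > 0.2)      { print "complaint rate " c "% > 0.2%"; fail = 1 }
    if (red > 0)      { print red " sending server(s) with a red DKIM/SPF chip"; fail = 1 }
    if (q > 10 * avg) { print "queue depth " q " > 10x average backlog " avg; fail = 1 }
    exit fail
  }'
}
```

For instance, `audit_check 1 0.3 0 10 30` prints "complaint rate 0.3% > 0.2%" and returns 1. The non-zero status is what a cron wrapper or alerting hook would key on, with the printed lines as the alert body.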
