Incident Response Playbook: Detection to Post-Mortem
Do You Have a Plan for When a Client Site Goes Down?
When a client's website stops responding, every minute matters. Without a documented response process, teams waste time figuring out who should do what instead of fixing the problem. Worse, delayed client communication compounds the damage.
This guide walks through a 5-step incident response framework: Detect, Assess, Communicate, Recover, and Report.
Step 1: Detect -- Catch the Problem Immediately
Manual checking does not scale. If you manage more than a handful of client sites, automated monitoring is non-negotiable.
Miterl monitors your sites continuously and sends alerts through Slack, email, or webhooks the moment something fails.
Alert contacts (notification destinations) are created in the dashboard and attached to each monitor. When you create a monitor via the API, pass the contact IDs returned by GET /alert-contacts in the alert_contact_ids field.
# List your alert contacts and their IDs
curl -s https://miterl.com/api/v1/alert-contacts \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | {id, type, name}'
Pass those IDs as alert_contact_ids when creating a monitor to wire up down and recovery notifications. With interval_seconds set to 60, the site is checked once a minute, so an outage is typically detected within about a minute of starting. The faster you know, the faster you respond.
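As a rough sketch of that wiring, the helper below assembles a JSON request body that carries interval_seconds and alert_contact_ids. Only those two field names come from this guide. The "url" field, the POST /monitors path in the commented request, the integer contact IDs, and the helper name monitor_payload are all assumptions for illustration, so confirm the real request shape in the API documentation before relying on it.

```shell
# monitor_payload URL INTERVAL_SECONDS [CONTACT_ID...]
# Prints a JSON body for a monitor-creation request.
# ASSUMPTIONS: "url" is a guessed field name and contact IDs are integers.
# No JSON escaping is done, so pass plain URLs only.
monitor_payload() {
  mp_url=$1
  mp_interval=$2
  shift 2
  mp_ids=""
  for mp_id in "$@"; do
    # Append with a comma only when the list is already non-empty.
    mp_ids="${mp_ids:+$mp_ids,}$mp_id"
  done
  printf '{"url":"%s","interval_seconds":%s,"alert_contact_ids":[%s]}\n' \
    "$mp_url" "$mp_interval" "$mp_ids"
}

monitor_payload "https://client.example.com" 60 12 34
# prints {"url":"https://client.example.com","interval_seconds":60,"alert_contact_ids":[12,34]}

# Hypothetical request (the /monitors path is an assumption):
# monitor_payload "https://client.example.com" 60 12 34 | \
#   curl -s -X POST https://miterl.com/api/v1/monitors \
#     -H "Authorization: Bearer YOUR_API_KEY" \
#     -H "Content-Type: application/json" -d @-
```

Building the body in one place keeps the interval and contact list consistent across every client site you register.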
Step 2: Assess -- Understand the Scope
When an alert fires, resist the urge to start fixing things immediately. Take 2-3 minutes to assess the situation first.
Run through this checklist:
[ ] Which site(s) are affected?
[ ] What type of failure? (HTTP error, SSL, DNS, timeout)
[ ] Were any recent changes deployed?
[ ] Is the issue isolated or affecting multiple clients?
[ ] Assign an incident owner
Check Miterl's dashboard for the affected monitor's status history and response logs. This data helps you narrow down the root cause quickly rather than guessing.
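If you want a second opinion from your own machine, one option is to probe the site with curl and map the result onto the failure types in the checklist. The sketch below relies on curl's documented exit codes (6 for a DNS resolution failure, 7 for a refused connection, 28 for a timeout, 35 and 60 for TLS problems). The function names classify_failure and probe and the category labels are illustrative, not part of Miterl.

```shell
# classify_failure CURL_EXIT HTTP_STATUS
# Maps a curl exit status and HTTP status code to a checklist category.
classify_failure() {
  case "$1" in
    0)
      # curl completed the request, so judge by the HTTP status.
      case "$2" in
        4??|5??) echo "http_error" ;;
        *)       echo "ok" ;;
      esac
      ;;
    6)     echo "dns" ;;         # could not resolve host
    7)     echo "connection" ;;  # failed to connect
    28)    echo "timeout" ;;     # operation timed out
    35|60) echo "ssl" ;;         # TLS handshake or certificate problem
    *)     echo "other" ;;
  esac
}

# probe URL: requests the URL with a 10-second limit and classifies the result.
probe() {
  pr_status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  classify_failure "$?" "$pr_status"
}

# Example: probe "https://client.example.com"
```

A result that disagrees with the alert (for example, "ok" from your network while the monitor reports a timeout) is itself a useful clue that the problem may be regional or network-specific.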
Step 3: Communicate -- Inform the Client Within 15 Minutes
Send the first client notification within 15 minutes of detection, even if you do not yet know the cause. Silence is worse than incomplete information.
Your first message should cover three points:
- You are aware of the issue
- Investigation is underway
- When the next update will be sent
Template:
We have detected an issue affecting your website and are currently investigating. We will provide an update within 30 minutes.
If you use status pages, update the status to "Investigating" immediately. This reduces inbound inquiries while you focus on resolution.
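The template and the 15-minute target are easy to encode so nobody has to improvise under pressure. The following is a minimal sketch: first_notice and notice_overdue are hypothetical helper names, and timestamps are assumed to be Unix epoch seconds.

```shell
# first_notice MINUTES: prints the initial client message with the
# promised time until the next update filled in.
first_notice() {
  printf 'We have detected an issue affecting your website and are currently investigating. We will provide an update within %s minutes.\n' "$1"
}

# notice_overdue DETECTED_EPOCH NOW_EPOCH
# Succeeds (exit status 0) when more than 15 minutes (900 seconds)
# have passed since detection.
notice_overdue() {
  [ $(( $2 - $1 )) -gt 900 ]
}

first_notice 30

# Example check against the current time:
# notice_overdue "$detected_at" "$(date +%s)" && echo "First notice is late"
```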
Step 4: Recover -- Fix the Problem
Apply the appropriate fix based on the failure type. Here are the most common patterns:
| Failure | Likely Cause | Action |
|---|---|---|
| HTTP 503 | Server overload | Restart services, scale resources |
| SSL error | Expired certificate | Renew or reissue the certificate |
| DNS failure | Misconfigured records | Correct DNS records, verify TTL |
| Timeout | Network issue | Contact hosting provider |
Continue sending client updates every 30 minutes during extended outages. Each update should state what you have tried and what you are doing next.
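If your team keeps runbooks as scripts, the table above can be encoded as a simple lookup so the on-call engineer sees the first action right away. This is only a sketch: the function name suggest_action and the failure labels are illustrative, and the fallback message for unlisted failures is an assumption rather than part of the table.

```shell
# suggest_action FAILURE
# Prints the action from the table for a given failure label.
suggest_action() {
  case "$1" in
    http_503) echo "Restart services, scale resources" ;;
    ssl)      echo "Renew or reissue the certificate" ;;
    dns)      echo "Correct DNS records, verify TTL" ;;
    timeout)  echo "Contact hosting provider" ;;
    # Fallback for anything the table does not cover (assumed wording).
    *)        echo "Escalate to the incident owner" ;;
  esac
}

suggest_action ssl
# prints Renew or reissue the certificate
```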
Step 5: Report -- Write the Post-Mortem
Within 24 hours of recovery, produce an incident report. This document serves two purposes: it reassures the client that you take reliability seriously, and it gives your team a reference to prevent recurrence.
Include these sections:
- Timeline: When the incident started, was detected, and was resolved
- Impact: Which sites and features were affected
- Root cause: What went wrong, explained clearly
- Response log: Chronological list of actions taken
- Prevention: Specific steps to avoid a repeat (e.g., add SSL expiry monitoring, increase server capacity)
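To get a head start on the report, a small script can print the section headings and fill in the two numbers clients usually ask about first: how long the outage went unnoticed and how long it lasted. The sketch below assumes you have the start, detection, and resolution times as Unix epoch seconds; minutes_between and report_skeleton are hypothetical names.

```shell
# minutes_between START_EPOCH END_EPOCH
# Prints whole minutes elapsed, rounded down.
minutes_between() {
  echo $(( ($2 - $1) / 60 ))
}

# report_skeleton STARTED DETECTED RESOLVED (all Unix epoch seconds)
# Prints the report headings with both durations measured from the start.
report_skeleton() {
  printf 'Timeline\n'
  printf '  Time to detect: %s min\n' "$(minutes_between "$1" "$2")"
  printf '  Time to resolve: %s min\n' "$(minutes_between "$1" "$3")"
  printf 'Impact\nRoot cause\nResponse log\nPrevention\n'
}

# Sample values: detected 60 seconds in, resolved after 45 minutes.
report_skeleton 1700000000 1700000060 1700002700
```

The remaining sections still need to be written by a person, but starting from a consistent skeleton makes reports easier to compare across incidents.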
Building the Playbook Into Your Team
A response plan only works if everyone knows it exists. Take these steps:
- Store the playbook in a shared location (wiki, Notion, or repo)
- Assign on-call rotations with clear escalation paths
- Run a tabletop drill quarterly to keep the process fresh
- Review post-mortems as a team to internalize lessons
To feed your internal tools or Slack bots, you can pull the currently open incidents from Miterl's API and surface them where your team works:
# List ongoing (unresolved) incidents to drive your own tooling
curl -s "https://miterl.com/api/v1/incidents?status=ongoing" \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | {id, monitor_id, severity, cause, started_at, is_acknowledged}'
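From there, turning each incident into a one-line summary for a chat channel takes only a few lines. The sketch below uses the id, monitor_id, severity, and started_at fields from the query above. The helper name format_incident, the message wording, and the sample values are assumptions, and the commented pipeline assumes jq 1.5 or later and that no field contains a "|" character.

```shell
# format_incident ID MONITOR_ID SEVERITY STARTED_AT
# Prints a one-line summary; an empty severity is shown as "unknown".
format_incident() {
  printf '[%s] incident %s on monitor %s, ongoing since %s\n' \
    "${3:-unknown}" "$1" "$2" "$4"
}

# Sample values for illustration only.
format_incident 101 7 critical 2024-05-01T10:00:00Z

# Possible wiring to the incidents endpoint:
# curl -s "https://miterl.com/api/v1/incidents?status=ongoing" \
#   -H "Authorization: Bearer YOUR_API_KEY" |
#   jq -r '.data[] | [.id, .monitor_id, .severity, .started_at] | join("|")' |
#   while IFS='|' read -r id monitor severity started; do
#     format_incident "$id" "$monitor" "$severity" "$started"
#   done
```

A "|" separator is used instead of tabs so that an empty field does not shift the remaining values when the shell splits the line.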
Summary
Incident response is preparation, not improvisation. A documented playbook reduces mean time to recovery and protects client trust during the moments that matter most. Miterl's automated detection and multi-channel alerting handle the first critical step so your team can jump straight to action.
Once you have a response process in place, Step 5 becomes much faster with a ready-made template. "Incident Report Template for Web Agencies" provides a copy-paste-ready post-mortem format you can hand directly to clients, covering timeline, root cause, and prevention steps.
Explore the documentation to configure alerting, test your setup by signing up for free, and read more operational guides on the blog.