Incident Response Playbook: Detection to Post-Mortem
Do You Have a Plan for When a Client Site Goes Down?
When a client's website stops responding, every minute matters. Without a documented response process, teams waste time figuring out who should do what instead of fixing the problem. Worse, delayed client communication compounds the damage.
This guide walks through a 5-step incident response framework: Detect, Assess, Communicate, Recover, and Report.
Step 1: Detect -- Catch the Problem Immediately
Manual checking does not scale. If you manage more than a handful of client sites, automated monitoring is non-negotiable.
Miterl monitors your sites continuously and sends alerts through Slack, email, or webhooks as soon as a check fails.
# Configure a webhook alert for downtime events
curl -X POST https://api.miterl.com/v1/alerts \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "monitor_id": "mon_abc123",
    "channel": "webhook",
    "endpoint": "https://your-app.example.com/webhook/incident",
    "events": ["monitor.down", "monitor.recovery"]
  }'
With 1-minute check intervals, an outage is picked up by the next scheduled check, so Miterl typically detects it within about 60 seconds of it starting. The faster you know, the faster you respond.
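On the receiving side, the webhook endpoint has to turn each delivery into an action. The following is a minimal sketch of that decision in Python. The event names are the ones subscribed to in the alert configuration above, but the payload field names (`event`, `monitor_id`) and the returned action strings are assumptions made for illustration, not a documented Miterl schema.

```python
import json


def handle_alert(raw_body: bytes) -> str:
    """Decide what to do with one webhook delivery.

    Assumes a JSON object with hypothetical "event" and "monitor_id" fields.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and bodies that are not valid UTF-8.
        return "reject"
    if not isinstance(payload, dict):
        return "reject"

    monitor = payload.get("monitor_id")
    if not isinstance(monitor, str) or not monitor:
        return "reject"

    event = payload.get("event")
    if event == "monitor.down":
        return f"open_incident:{monitor}"
    if event == "monitor.recovery":
        return f"resolve_incident:{monitor}"
    # Unknown event types are acknowledged but not acted on.
    return "ignore"
```

Keeping the decision in a pure function like this makes it easy to test without a running web server; the HTTP framework only needs to pass the request body in and act on the returned string.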
Step 2: Assess -- Understand the Scope
When an alert fires, resist the urge to start fixing things immediately. Take 2-3 minutes to assess the situation first.
Run through this checklist:
- [ ] Which site(s) are affected?
- [ ] What type of failure? (HTTP error, SSL, DNS, timeout)
- [ ] Were any recent changes deployed?
- [ ] Is the issue isolated or affecting multiple clients?
- [ ] Assign an incident owner
Check Miterl's dashboard for the affected monitor's status history and response logs. This data helps you narrow down the root cause quickly rather than guessing.
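If you want a second opinion on the "what type of failure?" question, a quick probe from your own machine can sort the error into the same four categories used in the checklist. This is a rough sketch using only the Python standard library; the function names and category labels are made up for this example, and a probe from one network location will not always agree with what a monitoring service sees.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify_failure(error: BaseException) -> str:
    """Map an exception raised by a probe to a checklist category."""
    # HTTPError (4xx/5xx responses) is a subclass of URLError, so check it first.
    if isinstance(error, urllib.error.HTTPError):
        return "http_error"
    # urllib wraps connection-level errors in URLError; unwrap the real cause.
    if isinstance(error, urllib.error.URLError) and isinstance(error.reason, BaseException):
        error = error.reason
    if isinstance(error, ssl.SSLError):
        return "ssl"
    if isinstance(error, socket.gaierror):
        return "dns"
    if isinstance(error, (TimeoutError, socket.timeout)):
        return "timeout"
    return "unknown"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL once and return "ok" or a failure category."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except OSError as error:
        # URLError, SSLError, gaierror, and timeouts all derive from OSError.
        return classify_failure(error)
```

For example, `probe("https://client-site.example.com")` would return `"dns"` if the hostname does not resolve, which points you at the DNS records rather than the web server.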
Step 3: Communicate -- Inform the Client Within 15 Minutes
Send the first client notification within 15 minutes of detection, even if you do not yet know the cause. Silence is worse than incomplete information.
Your first message should cover three points:
- You are aware of the issue
- Investigation is underway
- When the next update will be sent
Template:
We have detected an issue affecting your website and are currently investigating. We will provide an update within 30 minutes.
If you maintain a status page, set it to "Investigating" immediately. This reduces inbound inquiries while you focus on resolution.
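The two deadlines in this step (first notice within 15 minutes, next update within 30) are easy to lose track of under pressure, so it can help to compute them when the incident opens. Here is one possible sketch in Python. The function and field names are invented for this example, and it swaps the template's "within 30 minutes" for an absolute UTC time, which is a design choice rather than part of the playbook.

```python
from datetime import datetime, timedelta, timezone

FIRST_NOTICE_WITHIN = timedelta(minutes=15)  # playbook target for the first message
NEXT_UPDATE_WITHIN = timedelta(minutes=30)   # interval promised in the template

TEMPLATE = (
    "We have detected an issue affecting your website and are currently "
    "investigating. We will provide an update by {next_update} UTC."
)


def plan_first_notice(detected_at: datetime, now: datetime) -> dict:
    """Return the notice deadline, whether it has passed, and the message body."""
    send_by = detected_at + FIRST_NOTICE_WITHIN
    next_update_by = now + NEXT_UPDATE_WITHIN
    return {
        "send_by": send_by,
        "overdue": now > send_by,
        "next_update_by": next_update_by,
        "body": TEMPLATE.format(next_update=next_update_by.strftime("%H:%M")),
    }
```

Stating a clock time instead of a relative interval removes ambiguity if the client reads the message late. If your clients are in a single time zone, formatting the time in their local zone may be friendlier than UTC.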
Step 4: Recover -- Fix the Problem
Apply the appropriate fix based on the failure type. Here are the most common patterns:
| Failure | Likely Cause | Action |
|---|---|---|
| HTTP 503 | Server overload | Restart services, scale resources |
| SSL error | Expired certificate | Renew or reissue the certificate |
| DNS failure | Misconfigured records | Correct DNS records, verify TTL |
| Timeout | Network issue | Contact hosting provider |
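For the SSL row in particular, it is worth confirming the certificate's expiry date after you renew or reissue it. The sketch below uses Python's standard `ssl` module to read the `notAfter` field of the served certificate and convert it to days remaining. The function names are illustrative, and the hostname in the usage note is a placeholder.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string (negative if expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with certificate validation enabled and return the notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A call such as `days_until_expiry(fetch_not_after("client-site.example.com"), datetime.now(timezone.utc))` gives the remaining lifetime. Because the default context validates the certificate, a certificate that is still expired makes the handshake fail with an `ssl.SSLError` instead of returning a date, which is itself a useful signal that the renewal has not taken effect.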
Continue sending client updates every 30 minutes during extended outages. Each update should state what you have tried and what you are doing next.
Step 5: Report -- Write the Post-Mortem
Within 24 hours of recovery, produce an incident report. This document serves two purposes: it reassures the client that you take reliability seriously, and it gives your team a reference to prevent recurrence.
Include these sections:
- Timeline: When the incident started, was detected, and was resolved
- Impact: Which sites and features were affected
- Root cause: What went wrong, explained clearly
- Response log: Chronological list of actions taken
- Prevention: Specific steps to avoid a repeat (e.g., add SSL expiry monitoring, increase server capacity)
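The timeline section is easier to write, and easier to compare across incidents, if the key timestamps are recorded in one place and the durations are derived from them. The following is a small illustrative sketch in Python; the class and field names are assumptions for this example, not a prescribed report format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    started_at: datetime       # when the outage began (often estimated from logs)
    detected_at: datetime      # when the first alert fired
    first_notice_at: datetime  # when the client was first informed
    resolved_at: datetime      # when service was confirmed restored

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_notify(self) -> timedelta:
        return self.first_notice_at - self.detected_at

    def time_to_recover(self) -> timedelta:
        return self.resolved_at - self.started_at


def summary_lines(timeline: IncidentTimeline) -> list[str]:
    """Render the derived durations as whole minutes for the report."""
    def minutes(delta: timedelta) -> int:
        return int(delta.total_seconds() // 60)

    return [
        f"Time to detect: {minutes(timeline.time_to_detect())} min",
        f"Time to first client notice: {minutes(timeline.time_to_notify())} min",
        f"Time to recover: {minutes(timeline.time_to_recover())} min",
    ]
```

Measuring the notice delay from detection rather than from the start of the outage matches the 15-minute communication target, which also counts from detection.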
Building the Playbook Into Your Team
A response plan only works if everyone knows it exists. Take these steps:
- Store the playbook in a shared location (wiki, Notion, or repo)
- Assign on-call rotations with clear escalation paths
- Run a tabletop drill quarterly to keep the process fresh
- Review post-mortems as a team to internalize lessons
Summary
Incident response is preparation, not improvisation. A documented playbook reduces mean time to recovery and protects client trust during the moments that matter most. Miterl's automated detection and multi-channel alerting handle the first critical step so your team can jump straight to action.
Explore the documentation to configure alerting, test your setup in the Playground, and read more operational guides on the blog.