2026-04-01

Incident Response Playbook: Detection to Post-Mortem


Do You Have a Plan for When a Client Site Goes Down?

When a client's website stops responding, every minute matters. Without a documented response process, teams waste time figuring out who should do what instead of fixing the problem. Worse, delayed client communication compounds the damage.

This guide walks through a five-step incident response framework: Detect, Assess, Communicate, Recover, and Report.

Step 1: Detect -- Catch the Problem Immediately

Manual checking does not scale. If you manage more than a handful of client sites, automated monitoring is non-negotiable.

Miterl monitors your sites continuously and sends alerts through Slack, email, or webhooks as soon as a check fails.

Alert contacts (notification destinations) are created in the dashboard and attached to each monitor. When you create a monitor via the API, pass the contact IDs returned by GET /alert-contacts in the alert_contact_ids field.

# List your alert contacts and their IDs
curl -s https://miterl.com/api/v1/alert-contacts \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | {id, type, name}'

Attaching those IDs as alert_contact_ids wires up down and recovery notifications for the monitor. With interval_seconds set to 60, a check runs every minute, so an outage is normally detected within about a minute of starting. The faster you know, the faster you respond.
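
As a rough sketch of that wiring, the helper below builds a request body for a new monitor and pipes it to curl. The alert_contact_ids and interval_seconds fields come from the text above. Everything else is an assumption made for illustration: the url field name, numeric contact IDs, and the MITERL_MONITORS_URL and MITERL_API_KEY environment variables. Check the API documentation for the actual endpoint and field names before relying on it.

```shell
# Build the JSON body for a new monitor.
# Usage: build_monitor_payload SITE_URL INTERVAL_SECONDS CONTACT_ID [CONTACT_ID...]
# "alert_contact_ids" and "interval_seconds" are named in the guide;
# the "url" field name is an ASSUMPTION, not confirmed by the guide.
build_monitor_payload() {
  site_url=$1
  interval=$2
  shift 2
  ids=""
  for id in "$@"; do
    ids="${ids:+$ids,}$id"   # join the IDs with commas
  done
  printf '{"url":"%s","interval_seconds":%s,"alert_contact_ids":[%s]}\n' \
    "$site_url" "$interval" "$ids"
}

# Send the payload to the API.
# MITERL_MONITORS_URL is a placeholder: set it to the monitor-creation
# endpoint listed in the API documentation.
create_monitor() {
  build_monitor_payload "$@" | curl -s -X POST "$MITERL_MONITORS_URL" \
    -H "Authorization: Bearer $MITERL_API_KEY" \
    -H "Content-Type: application/json" \
    --data @-
}
```

For example, create_monitor https://client.example 60 12 34 would request a 60-second monitor that notifies contacts 12 and 34. The payload builder does not escape quotes, so it is only suitable for plain URLs.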

Step 2: Assess -- Understand the Scope

When an alert fires, resist the urge to start fixing things immediately. Take 2-3 minutes to assess the situation first.

Run through this checklist:

[ ] Which site(s) are affected?
[ ] What type of failure? (HTTP error, SSL, DNS, timeout)
[ ] Were any recent changes deployed?
[ ] Is the issue isolated or affecting multiple clients?
[ ] Assign an incident owner

Check Miterl's dashboard for the affected monitor's status history and response logs. This data helps you narrow down the root cause quickly instead of guessing.
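
To answer the "what type of failure?" question from a terminal, one option is to probe the site with curl and read its exit code. The sketch below is independent of Miterl and is not how the service itself classifies failures. It relies on curl's documented exit codes (6 for DNS resolution failure, 28 for timeout, 35 and 60 for TLS problems), and the function names are illustrative.

```shell
# Map a curl exit code and HTTP status to the failure types in the checklist.
# Usage: classify_failure CURL_EXIT_CODE HTTP_STATUS
classify_failure() {
  case "$1" in
    0)     if [ "$2" -ge 400 ]; then echo "http_error"; else echo "ok"; fi ;;
    6)     echo "dns" ;;      # could not resolve host
    28)    echo "timeout" ;;  # operation timed out
    35|60) echo "ssl" ;;      # TLS handshake or certificate verification failed
    *)     echo "other" ;;    # e.g. connection refused (exit code 7)
  esac
}

# Probe a URL once and print its classification.
# curl prints 000 as the status code when no HTTP response was received.
probe_site() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  classify_failure "$?" "$status"
}
```

Running probe_site against each affected URL gives a quick first answer, which you can then compare with the monitor's history in the dashboard.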

Step 3: Communicate -- Inform the Client Within 15 Minutes

Send the first client notification within 15 minutes of detection, even if you do not yet know the cause. Silence is worse than incomplete information.

Your first message should cover three points:

You are aware of the issue
Investigation is underway
When the next update will be sent

Template:

We have detected an issue affecting your website and are currently investigating. We will provide an update within 30 minutes.

If you use status pages, update the status to "Investigating" immediately. This reduces inbound inquiries while you focus on resolution.
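
The 15-minute rule is easy to lose track of mid-incident, so it can help to encode it. The following is a minimal sketch, assuming you record the detection time as epoch seconds. The function and variable names are hypothetical, and the message is the template above with the site name filled in.

```shell
FIRST_NOTICE_SECONDS=900   # 15 minutes, the first-notice window
UPDATE_MINUTES=30          # promised interval until the next update

# Succeeds (exit status 0) if the first-notice window has already passed.
# Usage: first_notice_overdue DETECTED_EPOCH NOW_EPOCH
first_notice_overdue() {
  [ $(( $2 - $1 )) -gt "$FIRST_NOTICE_SECONDS" ]
}

# Print the first client message for a given site name.
# Usage: render_first_notice SITE_NAME
render_first_notice() {
  printf 'We have detected an issue affecting %s and are currently investigating. We will provide an update within %s minutes.\n' \
    "$1" "$UPDATE_MINUTES"
}
```

A bot or cron job could call first_notice_overdue "$detected_at" "$(date +%s)" and remind the incident owner if no client message has gone out yet.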

Step 4: Recover -- Fix the Problem

Apply the appropriate fix based on the failure type. Here are the most common patterns:

Failure	Likely Cause	Action
HTTP 503	Server overload	Restart services, scale resources
SSL error	Expired certificate	Renew or reissue the certificate
DNS failure	Misconfigured records	Correct DNS records, verify TTL
Timeout	Network issue	Contact hosting provider

Continue sending client updates every 30 minutes during extended outages. Each update should state what you have tried and what you are doing next.
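
The table above can double as a small runbook script. This sketch maps each failure type to its first action and pairs it with a command for confirming that a fix took effect. The function names are illustrative, the checks use standard curl, dig, and openssl invocations, and the host argument is whatever domain you are recovering.

```shell
# Print the first action to try for a failure type from the table.
# Usage: suggest_action "FAILURE TYPE"
suggest_action() {
  case "$1" in
    "HTTP 503")    echo "Restart services, scale resources" ;;
    "SSL error")   echo "Renew or reissue the certificate" ;;
    "DNS failure") echo "Correct DNS records, verify TTL" ;;
    "Timeout")     echo "Contact hosting provider" ;;
    *)             echo "No runbook entry; escalate to the incident owner" ;;
  esac
}

# Run a quick check after applying the fix.
# Usage: verify_fix "FAILURE TYPE" HOSTNAME
verify_fix() {
  case "$1" in
    "SSL error")   # show the expiry date of the certificate now being served
      echo | openssl s_client -connect "$2:443" -servername "$2" 2>/dev/null |
        openssl x509 -noout -enddate ;;
    "DNS failure") # show the records the resolver currently returns
      dig +short "$2" ;;
    *)             # print the HTTP status code of the home page
      curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "https://$2" ;;
  esac
}
```

Including the output of verify_fix in the next client update gives the "what you have tried" part of the message some concrete evidence.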

Step 5: Report -- Write the Post-Mortem

Within 24 hours of recovery, produce an incident report. This document serves two purposes: it reassures the client that you take reliability seriously, and it gives your team a reference to prevent recurrence.

Include these sections:

Timeline: When the incident started, was detected, and was resolved
Impact: Which sites and features were affected
Root cause: What went wrong, explained clearly
Response log: Chronological list of actions taken
Prevention: Specific steps to avoid a repeat (e.g., add SSL expiry monitoring, increase server capacity)
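
To avoid starting each report from a blank page, the sections above can be generated as a skeleton with the timeline arithmetic filled in. This is only a sketch: it assumes the three timestamps are available as epoch seconds, and the function names are hypothetical.

```shell
# Whole minutes between two epoch timestamps, rounded down.
# Usage: minutes_between EARLIER_EPOCH LATER_EPOCH
minutes_between() {
  echo $(( ($2 - $1) / 60 ))
}

# Print a post-mortem skeleton with the five sections listed above.
# Usage: postmortem_skeleton STARTED_EPOCH DETECTED_EPOCH RESOLVED_EPOCH
postmortem_skeleton() {
  cat <<EOF
Timeline: started=$1 detected=$2 resolved=$3
  Time to detect: $(minutes_between "$1" "$2") min
  Time to resolve: $(minutes_between "$1" "$3") min
Impact:
Root cause:
Response log:
Prevention:
EOF
}
```

Redirecting the output to a file in your shared wiki or repo gives the incident owner a starting point while the details are still fresh.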

Building the Playbook Into Your Team

A response plan only works if everyone knows it exists. Take these steps:

Store the playbook in a shared location (wiki, Notion, or repo)
Assign on-call rotations with clear escalation paths
Run a tabletop drill quarterly to keep the process fresh
Review post-mortems as a team to internalize lessons

To feed your internal tools or Slack bots, you can pull the current open incidents from Miterl's API and surface them where your team works:

# List ongoing (unresolved) incidents to drive your own tooling
curl -s "https://miterl.com/api/v1/incidents?status=ongoing" \
  -H "Authorization: Bearer YOUR_API_KEY" | \
  jq '.data[] | {id, monitor_id, severity, cause, started_at, is_acknowledged}'
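
Building on that request, here is one way the response might be turned into a Slack message. It assumes the response shape shown in the jq filter above (a data array with monitor_id, severity, cause, and started_at) and a Slack incoming webhook whose address you supply in SLACK_WEBHOOK_URL. The function names are illustrative.

```shell
# Turn the incidents response (JSON on stdin) into one line per incident.
format_incidents() {
  jq -r '.data[] | "[\(.severity)] monitor \(.monitor_id): \(.cause) (since \(.started_at))"'
}

# Post a text message to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder for your own webhook address.
# jq builds the body so that quotes and newlines are escaped correctly.
post_to_slack() {
  jq -n --arg text "$1" '{text: $text}' |
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H "Content-Type: application/json" --data @-
}

# Fetch open incidents and post a summary only if there are any.
notify_open_incidents() {
  summary=$(curl -s "https://miterl.com/api/v1/incidents?status=ongoing" \
    -H "Authorization: Bearer $MITERL_API_KEY" | format_incidents)
  if [ -n "$summary" ]; then
    post_to_slack "$summary"
  fi
}
```

Running notify_open_incidents from a scheduled job keeps the channel quiet when nothing is open and posts a short list when something is.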

Summary

Incident response is preparation, not improvisation. A documented playbook reduces mean time to recovery and protects client trust during the moments that matter most. Miterl's automated detection and multi-channel alerting handle the first critical step so your team can jump straight to action.

Once you have a response process in place, Step 5 becomes much faster with a ready-made template. "Incident Report Template for Web Agencies" provides a copy-paste-ready post-mortem format you can hand directly to clients, covering timeline, root cause, and prevention steps.

Explore the documentation to configure alerting, test your setup by signing up for free, and read more operational guides on the blog.