Monitoring Alert Escalation for Agencies: Who and When
"What happens if an outage hits on my day off?"
Once you have monitoring set up and alerts firing, the next problem is who gets notified, when, and through which channel. For an agency where one engineer covers every client site, there is a constant worry: what happens if an outage hits while I'm asleep, driving, or on vacation?
This guide breaks down monitoring alert escalation for web agencies: the practice of automatically raising an alert to the next person or channel when no one responds. The goal is a setup where off-hours, weekend, and unattended outages never slip through. Configuring the notification channels themselves is covered in "Sending Uptime Alerts to Slack and Chatwork"; this article focuses on how to stage those notifications.
Why Escalation Design Matters
A setup with a single notification target has three holes:
| Hole | What goes wrong |
|---|---|
| The primary on-call misses it | Asleep, driving, or heads-down on other work and never sees the alert |
| Only one delivery channel | If Slack is down or the phone is out of signal, nothing arrives |
| No severity distinction | A one-second blip and a full outage make the same noise, so real incidents get buried |
Escalation design closes these holes with a simple rule: if there's no acknowledgment within a set time, automatically raise the alert to the next person or channel. For an agency managing 10 or 20 client sites, this is the lifeline that keeps you within SLA (for organizing alerts across many sites, see "Tips for Monitoring Multiple Sites Efficiently").
The Three Elements of Escalation
Escalation is built from three elements: who, when, and through which channel.
1. Who (the notification tiers)
- Tier 1: Primary engineer (the person who works on the site day to day)
- Tier 2: Team lead / backup engineer (when Tier 1 doesn't respond)
- Tier 3: Owner / director (when a major outage drags on)
2. When (time-based triggers)
- 0 minutes after detection: notify Tier 1 immediately
- 5–10 minutes with no response: escalate to Tier 2
- 30 minutes unresolved: escalate to Tier 3 (the decision-maker)
3. Through which channel (channel redundancy)
The higher the urgency, the harder the channel should be to ignore.
| Tier | Recommended channel | Why |
|---|---|---|
| Tier 1 | Slack / Chatwork | Most visible during working hours |
| Tier 2 | Email + Slack mention | Spreads delivery to prevent a miss |
| Tier 3 | Phone / SMS (manual) | The last resort that reliably wakes someone |
A Tiered Escalation Example
The following three-stage template is a realistic starting point for agency operations:
=== Escalation rule (example) ===
[Level 1] Outage detected -> immediate
  Notify: Primary engineer (Slack #alerts-clientA)
  Goal:   the primary engineer notices and starts triage
[Level 2] No response after 10 min -> escalate
  Notify: Team lead (email + Slack mention)
  Goal:   backup when Tier 1 can't respond
[Level 3] Unresolved after 30 min -> escalate
  Notify: Director (phone / SMS)
  Goal:   learn of a major outage before the client does and decide on first client contact
The full flow including post-recovery client communication is covered in "Website Incident Response for Agencies."
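The three-level rule above can be sketched as a small shell function. This is a minimal illustration, not a Miterl feature: the function name `escalation_level`, the `TIER2_AFTER_MIN` / `TIER3_AFTER_MIN` variables, and the "yes"/"no" flags are all hypothetical. It assumes Level 2 depends on nobody acknowledging the alert, while Level 3 depends only on the outage still being unresolved.

```shell
# Sketch: decide which escalation level applies right now.
# Thresholds default to the 10 / 30 minute values from the example rule.
TIER2_AFTER_MIN="${TIER2_AFTER_MIN:-10}"
TIER3_AFTER_MIN="${TIER3_AFTER_MIN:-30}"

# escalation_level ELAPSED_MIN ACKED RESOLVED
#   ELAPSED_MIN  whole minutes since the outage was detected
#   ACKED        "yes" if someone has acknowledged the alert, otherwise "no"
#   RESOLVED     "yes" if the site has recovered, otherwise "no"
# Prints 0 (nothing to do), 1, 2, or 3.
escalation_level() {
  local elapsed="$1" acked="$2" resolved="$3"

  if [ "$resolved" = "yes" ]; then
    echo 0   # recovered: stop escalating
  elif [ "$elapsed" -ge "$TIER3_AFTER_MIN" ]; then
    echo 3   # still down after the Tier 3 threshold, even if acknowledged
  elif [ "$elapsed" -ge "$TIER2_AFTER_MIN" ] && [ "$acked" = "no" ]; then
    echo 2   # nobody responded within the Tier 2 threshold
  else
    echo 1   # the primary engineer is the only tier notified so far
  fi
}

escalation_level 12 no no    # prints 2
escalation_level 12 yes no   # prints 1
escalation_level 35 yes no   # prints 3
```

Keeping the thresholds in variables makes it easy to try other values, such as a 5-minute Tier 2 delay, without touching the logic.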
Designing Off-Hours and Weekend Escalation
The hardest part of escalation design is nights and weekends, where two demands conflict:
- Don't miss anything: an overnight outage on an e-commerce or checkout site directly costs revenue
- Don't burn out: getting woken every night by minor blips destroys the team
You resolve this contradiction by filtering on severity before escalating.
| Site type | Off-hours policy |
|---|---|
| E-commerce / checkout / booking | Escalate immediately even at night (high revenue impact) |
| Corporate / blog sites | Hold alerts overnight, send a summary, and handle it the next morning |
| All types | Notify only after a retry confirms the failure, so one-second blips and false positives are excluded |
The concrete setup for keeping minor alerts quiet at night (quiet hours, retries, maintenance windows) is detailed in "Preventing Alert Fatigue in Monitoring." Escalation is the design for "reliably ringing what should ring," and alert-fatigue prevention is the design for "silencing what shouldn't." The two work as halves of the same system.
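The off-hours policy table can be expressed as a decision function. The sketch below is illustrative only: the function names, the site-type labels, and the 22:00 to 07:00 night window are assumptions chosen for the example, not values from Miterl. It treats weekends as off-hours all day.

```shell
# Hypothetical night window: 22:00 up to (not including) 07:00.
QUIET_START_HOUR="${QUIET_START_HOUR:-22}"
QUIET_END_HOUR="${QUIET_END_HOUR:-7}"

# is_off_hours HOUR WEEKDAY
#   HOUR     0-23
#   WEEKDAY  ISO weekday, 1 = Monday ... 7 = Sunday
# Succeeds (exit status 0) on weekends and inside the night window.
is_off_hours() {
  [ "$2" -ge 6 ] || [ "$1" -ge "$QUIET_START_HOUR" ] || [ "$1" -lt "$QUIET_END_HOUR" ]
}

# off_hours_action SITE_TYPE HOUR WEEKDAY CONFIRMED
#   CONFIRMED  "yes" once a retry has confirmed the failure
# Prints one of: wait, hold, escalate
off_hours_action() {
  local site_type="$1" hour="$2" weekday="$3" confirmed="$4"

  # All site types: do not notify until a retry confirms the failure.
  if [ "$confirmed" != "yes" ]; then
    echo wait
    return
  fi

  # During working hours every confirmed outage follows normal escalation.
  if ! is_off_hours "$hour" "$weekday"; then
    echo escalate
    return
  fi

  case "$site_type" in
    ecommerce|checkout|booking) echo escalate ;;  # revenue impact: ring at night
    *)                          echo hold ;;      # corporate / blog: morning summary
  esac
}

off_hours_action ecommerce 3 2 yes   # prints escalate
off_hours_action blog 3 2 yes        # prints hold
off_hours_action blog 3 2 no         # prints wait
```

In practice the hour and weekday would come from the clock, for example `date +%H` and `date +%u`.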
Implementing Escalation in Miterl
In Miterl you register multiple alert contacts and attach them to monitors to build the foundation for escalation.
Prepare alert contacts by tier
Create contacts like "Tier 1 (Slack)" and "Tier 2 (email)" in the dashboard, then list them and their IDs via the API.
# List registered alert contacts and their IDs
curl -s "https://miterl.com/api/v1/alert-contacts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  | jq -r '.data[] | "\(.id)\t\(.type)\t\(.name)"'
Example output:
1 slack Tier 1: primary Slack
2 email Tier 2: team lead email
3 slack Tier 3: director mention
Attach multiple contacts to a monitor
For high-severity monitors (e-commerce / checkout), attach all of the Tier 1–3 contacts to add redundancy.
# Attach Tier 1-3 contacts to an e-commerce monitor at once
curl -s -X PATCH "https://miterl.com/api/v1/monitors/1" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"alert_contact_ids": [1, 2, 3]}'
Suppress minor off-hours alerts
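Setting company-wide quiet hours means only major outages notify at night, while minor alerts wait until morning. This leaves only the "escalate immediately even at night" incidents on your off-hours channel. To confirm that the right monitors are wired to the right tiers, list each monitor with its attached contacts: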
# Review which monitors map to which contacts
curl -s "https://miterl.com/api/v1/monitors?per_page=100" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  | jq -r '.data[] | "\(.name)\t-> contacts: \(.alert_contact_ids // [])"'
You can't create the contacts themselves via the API, but using the IDs of contacts created in the dashboard, you can configure "which site goes to which tier" programmatically in bulk.
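A bulk assignment could look like the sketch below, which reuses the `PATCH /monitors/{id}` call shown earlier. The monitor IDs, the contact-ID lists, the `MITERL_API_KEY` environment variable, and the `DRY_RUN` switch are all hypothetical. The script prints requests instead of sending them unless `DRY_RUN=0` is set, so the mapping can be checked first. Whether the call replaces or extends a monitor's existing contact list is not stated here, so verify that on a single monitor before running it in bulk.

```shell
API_BASE="https://miterl.com/api/v1"

# attach_contacts MONITOR_ID CONTACT_IDS
#   CONTACT_IDS is a comma-separated list such as "1,2,3".
# Prints the request by default; sends it only when DRY_RUN=0.
attach_contacts() {
  local monitor_id="$1" contact_ids="$2"
  local body="{\"alert_contact_ids\": [${contact_ids}]}"

  if [ "${DRY_RUN:-1}" != "0" ]; then
    echo "PATCH ${API_BASE}/monitors/${monitor_id} ${body}"
    return
  fi

  curl -s -X PATCH "${API_BASE}/monitors/${monitor_id}" \
    -H "Authorization: Bearer ${MITERL_API_KEY:?MITERL_API_KEY is not set}" \
    -H "Content-Type: application/json" \
    -d "${body}"
}

# apply_mapping reads one "MONITOR_ID CONTACT_IDS" pair per line from stdin.
apply_mapping() {
  local monitor_id contact_ids
  while read -r monitor_id contact_ids; do
    [ -n "$monitor_id" ] || continue   # skip blank lines
    attach_contacts "$monitor_id" "$contact_ids"
  done
}

# Hypothetical mapping: e-commerce monitor 1 gets all tiers, the others fewer.
apply_mapping <<'EOF'
1 1,2,3
2 1,2
3 1
EOF
```

Keeping the mapping as plain text makes it easy to store alongside client records and to review in a diff when tiers change.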
Keeping Escalation From Going Stale
A rule that isn't maintained is meaningless. Reviewing these three points keeps escalation working:
- Contact audit: make sure no contacts left behind by departures or reassignments are still active
- Test notifications: once a quarter, verify that alerts actually reach Tier 2 and Tier 3
- Threshold tuning: confirm the 10-minute / 30-minute escalation timings match your team's real response speed
The escalation log (who was notified, and when) also serves as objective evidence for an incident report. For how to write one up, see the "Incident Report Template."
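For threshold tuning, one rough check is how often past alerts were acknowledged before the Tier 2 timer would have fired. The sketch below assumes you can extract acknowledgment times (minutes from alert to first response) from your own records, one number per line. The `ack_rate` function and the sample values are hypothetical, not Miterl output.

```shell
# ack_rate THRESHOLD_MIN
# Reads one acknowledgment time in minutes per line on stdin and prints
# the share of alerts acknowledged at or under THRESHOLD_MIN.
ack_rate() {
  awk -v limit="$1" '
    NF { total++; if ($1 + 0 <= limit + 0) within++ }
    END {
      if (total == 0) { print "no data"; exit }
      printf "%d%% acknowledged within %d min (%d of %d)\n",
             within * 100 / total, limit, within, total
    }'
}

# Hypothetical acknowledgment times from five past incidents
printf '%s\n' 3 7 12 4 25 | ack_rate 10
# prints: 60% acknowledged within 10 min (3 of 5)
```

If the share is low, Tier 2 is being paged for most incidents and either the threshold or the Tier 1 channel may need adjusting. If it is close to 100%, a shorter threshold might be workable.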
Summary
Monitoring alert escalation is built from three elements (who, when, and through which channel), with an off-hours policy layered on top:
- Who: build a Tier 1 (primary) → Tier 2 (lead) → Tier 3 (director) hierarchy
- When: escalate on time triggers — 0 min on detection, 10 min no-response, 30 min unresolved
- Through which channel: shift to harder-to-ignore channels (Slack → email → phone) as urgency rises
- Off-hours: filter on severity before escalating, and silence minor alerts with quiet hours
Even with a one-person team, escalation design ensures that "an outage on your day off still reaches someone." Sign up for free to register multiple contacts, and see the documentation for configuration details. Agency operations examples are in the use cases section, and common questions are answered in the FAQ.