How Internal IT Catches Business System Outages Before Anyone Reports Them
How a 2-person internal IT team uses Miterl to monitor business systems, SAML SSO, internal Slack, and overnight batch jobs — cutting employee-impacting downtime by 30% year over year.
The team and assumptions
A 400-person mid-sized company, "Company E," with a 2-person internal IT team responsible for both SaaS (Slack / Salesforce / Workday / Zendesk) and a portfolio of in-house business systems (HR, expense reimbursement, internal portal).
The goal of adopting Miterl was to stop being the team that finds out about outages from a flood of help-desk tickets. After a year, the result: total employee-impact time dropped 30% year over year.
Three monitoring needs unique to internal IT
1. "Quietly broken" business systems
If the expense-reimbursement system goes down mid-month, no one notices. It surfaces at month-end when everyone tries to file at once — by which point the system has often been down for several days to a week.
2. SAML SSO / IdP failures
A broken Okta or Azure AD config locks employees out of every SSO-integrated SaaS at once. The help desk receives "I can't log into Slack" / "Salesforce isn't working" tickets one at a time, but the root cause is the IdP — and you want to detect it there, not at the symptom.
3. Overnight batch sync failures
The HR-to-payroll nightly sync job: if it fails once, the next day's payroll is wrong. Catching it requires someone to actually look at the batch log — which, predictably, no one does until something breaks.
A three-layer internal monitoring setup
Layer 1: HTTP monitoring of business systems
Internal-only systems are monitored from Miterl probes by whitelisting the published probe IPs at the firewall; a sketch of the firewall rule follows the table.
| System | Monitor type | Interval |
|---|---|---|
| Internal portal | HTTP + keyword (login page) | 5 min |
| Expense reimbursement | HTTP (public login URL) | 5 min |
| File sharing | HTTP (login-page header keyword) | 10 min |
| HR system | HTTP (publicly-reachable SSO login) | 10 min |
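The firewall side is a one-time rule per probe address. Below is a minimal sketch using ufw; the probe IPs 192.0.2.10 and 192.0.2.11 are placeholders (the real list is published in the Miterl documentation), and the internal portal is assumed to be served over HTTPS.
#!/bin/bash
# Permit the Miterl probe IPs to reach the internal portal over HTTPS.
# Assumes a default-deny inbound policy; the IPs below are placeholders.
for ip in 192.0.2.10 192.0.2.11; do
  ufw allow proto tcp from "$ip" to any port 443 comment "Miterl probe"
done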
Layer 2: SAML SSO liveness
The Okta sign-in page (https://e-corp.okta.com) is watched as a keyword monitor. When Okta is down or misconfigured, the page either omits the expected text (Sign in to E-Corp) or returns an error page — both detectable in seconds.
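Under the hood this is just fetch-and-match. A minimal sketch of an equivalent check with curl and grep, using the sign-in URL and expected phrase above:
#!/bin/bash
# Exit non-zero if the Okta sign-in page is unreachable or no longer contains
# the expected text, mirroring what the keyword monitor looks for.
set -eu
page=$(curl -fsS -m 10 https://e-corp.okta.com)
grep -q "Sign in to E-Corp" <<<"$page"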
A monthly synthetic-login batch using a test account adds a deeper check, catching SAML certificate or claims-mapping breakage that the surface-level keyword check would miss.
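The synthetic login itself depends on the app, but a related check is easy to script: warning before the SAML signing certificate published in the IdP metadata expires. The sketch below is illustrative only: the metadata URL is a placeholder (the real one appears in the Okta admin console for the app), and it assumes the certificate sits in a single <ds:X509Certificate> element.
#!/bin/bash
# Fetch the IdP's SAML metadata, extract the signing certificate, and fail
# if it expires within 30 days. METADATA_URL is a placeholder value.
set -euo pipefail
METADATA_URL="https://e-corp.okta.com/app/exk_placeholder/sso/saml/metadata"
cert_b64=$(curl -fsS -m 10 "$METADATA_URL" | tr -d '\n' \
  | sed -e 's/.*<ds:X509Certificate>//' -e 's|</ds:X509Certificate>.*||')
{ echo "-----BEGIN CERTIFICATE-----"
  echo "$cert_b64" | tr -d ' ' | fold -w 64
  echo "-----END CERTIFICATE-----"
} | openssl x509 -noout -checkend $((30 * 24 * 3600))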
Layer 3: Overnight batch heartbeat
The HR-to-payroll sync emits a heartbeat ping on success.
#!/bin/bash
# /opt/jobs/payroll_sync.sh
# set -e aborts the script if the sync fails, so the heartbeat below is only
# sent after a successful run.
set -e
./run_payroll_sync.py
echo "Sync OK at $(date)"
# Heartbeat only on success
curl -fsS -m 10 --retry 3 \
  "https://miterl.com/heartbeat/$PAYROLL_TOKEN"
Miterl is configured to fire DOWN if no heartbeat arrives by 2 AM. The alert hits both IT engineers in Slack, plus the IT manager via email as a safety net.
Splitting alerts so 400 people don't get spammed
Pinging "outage detected" to all 400 employees creates more chaos than it solves. Per-monitor routing handles the segmentation:
| Channel | What it receives | Audience |
|---|---|---|
| Slack #it-internal | All alerts | 2 IT engineers |
| Slack #announce-it | Wide-impact outages (company-wide systems) | All employees (IT manually reposts after triage) |
| Email | Critical only | IT manager, CIO |
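The manual repost to #announce-it can be as simple as typing a message in the channel, but it is also easy to script. A minimal sketch using a Slack incoming webhook, with a placeholder webhook URL and example wording:
#!/bin/bash
# Post a human-written outage notice to #announce-it via a Slack incoming webhook.
# The webhook URL below is a placeholder; use the one generated for the channel.
curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"text": "The expense reimbursement system is currently down. IT is investigating; next update by 15:00."}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX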
"Should this go company-wide?" is always a human judgment call, never an automated broadcast — which has prevented every false-alarm-spawned company-wide panic.
Auto-suppressing the regular maintenance window
Sundays from 02:00 to 04:00 are the standard maintenance window. A cron job auto-fires a maintenance webhook to suppress monitoring during that period:
# crontab: every Sunday at 02:00, declare a 2-hour maintenance window
# Note: a crontab entry must be a single line, and TOKEN must be defined in the
# crontab itself (or inlined), since cron does not inherit shell variables.
0 2 * * 0 curl -fsS -X POST -H "Content-Type: application/json" -d '{"duration_hours": 2, "name": "Sunday maintenance window"}' https://miterl.com/api/v1/webhooks/maintenance/$TOKEN/start
This single setup eliminated all false alerts caused by scheduled maintenance.
Monthly executive reporting
Miterl's monthly report goes straight into the CIO update:
- Per-system uptime (expense reimbursement 99.95%, HR system 99.8%, etc.)
- Incident counts and mean time to recovery
- Batch failure counts and root causes
Numbers like "expense reimbursement dropped to 99.6% last month, approaching our 99.5% floor" make system-replacement prioritization a data conversation rather than a gut-feel one.
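The MTTR figure in that report is plain arithmetic (total restoration time divided by incident count), so it is easy to sanity-check against a raw export. A minimal sketch, assuming a hypothetical incidents.csv with one incident per line as start,end epoch seconds:
#!/bin/bash
# Compute mean time to recovery in minutes from a "start,end" epoch-seconds CSV.
# incidents.csv is a hypothetical export; adjust the columns to match your data.
awk -F, '{ total += $2 - $1; n++ } END { if (n) printf "MTTR: %.1f min across %d incidents\n", total / n / 60, n }' incidents.csv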
Related reading
- Miterl Documentation — probe locations and IP whitelists
- HMAC signature verification guide — webhook source verification
- Pricing — Pro plan SLA reporting
- All use cases — playbooks for other industries