How Internal IT Catches Business System Outages Before Anyone Reports Them
How a 2-person internal IT team uses Miterl to monitor business systems, SAML SSO, internal Slack, and overnight batch jobs — cutting employee-impacting downtime by 30% year over year.
The team and assumptions
A 400-person mid-sized company, "Company E," with a 2-person internal IT team responsible for both SaaS (Slack / Salesforce / Workday / Zendesk) and a portfolio of in-house business systems (HR, expense reimbursement, internal portal).
The goal of adopting Miterl was to stop being the team that finds out about outages from a flood of help-desk tickets. After a year, the result: total employee-impact time dropped 30% year over year.
Three monitoring needs unique to internal IT
1. "Quietly broken" business systems
If the expense-reimbursement system goes down mid-month, no one notices. It surfaces at month-end when everyone tries to file at once — by which point the system has often been down for several days to a week.
2. SAML SSO / IdP failures
A broken Okta or Azure AD config locks employees out of every SSO-integrated SaaS at once. The help desk receives "I can't log into Slack" / "Salesforce isn't working" tickets one at a time, but the root cause is the IdP — and you want to detect it there, not at the symptom.
3. Overnight batch sync failures
The HR-to-payroll nightly sync job: if it fails once, the next day's payroll is wrong. Catching it requires someone to actually look at the batch log — which, predictably, no one does until something breaks.
A three-layer internal monitoring setup
Layer 1: HTTP monitoring of business systems
Internal-only systems are monitored from Miterl probes by whitelisting the published probe IPs at the firewall; a sketch of the firewall rule follows the table.
| System | Monitor type | Interval |
|---|---|---|
| Internal portal | HTTP + keyword (login page) | 5 min |
| Expense reimbursement | HTTP (public login URL) | 5 min |
| File sharing | HTTP (login-page header keyword) | 10 min |
| HR system | HTTP (publicly-reachable SSO login) | 10 min |
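The firewall side is a one-time rule per probe address. Below is a minimal sketch using ufw; the probe IPs 192.0.2.10 and 192.0.2.11 are placeholders (the real list is published in the Miterl documentation), and the internal portal is assumed to be served over HTTPS.
#!/bin/bash
# Permit the Miterl probe IPs to reach the internal portal over HTTPS.
# Assumes a default-deny inbound policy; the IPs below are placeholders.
for ip in 192.0.2.10 192.0.2.11; do
  ufw allow proto tcp from "$ip" to any port 443 comment "Miterl probe"
done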
Layer 2: SAML SSO liveness
The Okta sign-in page (https://e-corp.okta.com) is watched as a keyword monitor. When Okta is down or misconfigured, the page either omits the expected text (Sign in to E-Corp) or returns an error page — both detectable in seconds.
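Under the hood this is just fetch-and-match. A minimal sketch of an equivalent check with curl and grep, using the sign-in URL and expected phrase above:
#!/bin/bash
# Exit non-zero if the Okta sign-in page is unreachable or no longer contains
# the expected text, mirroring what the keyword monitor looks for.
set -eu
page=$(curl -fsS -m 10 https://e-corp.okta.com)
grep -q "Sign in to E-Corp" <<<"$page"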
A monthly synthetic-login batch using a test account adds a deeper check, catching SAML certificate or claims-mapping breakage that the surface-level keyword check would miss.
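The synthetic login itself depends on the app, but a related check is easy to script: warning before the SAML signing certificate published in the IdP metadata expires. The sketch below is illustrative only: the metadata URL is a placeholder (the real one appears in the Okta admin console for the app), and it assumes the certificate sits in a single <ds:X509Certificate> element.
#!/bin/bash
# Fetch the IdP's SAML metadata, extract the signing certificate, and fail
# if it expires within 30 days. METADATA_URL is a placeholder value.
set -euo pipefail
METADATA_URL="https://e-corp.okta.com/app/exk_placeholder/sso/saml/metadata"
cert_b64=$(curl -fsS -m 10 "$METADATA_URL" | tr -d '\n' \
  | sed -e 's/.*<ds:X509Certificate>//' -e 's|</ds:X509Certificate>.*||')
{ echo "-----BEGIN CERTIFICATE-----"
  echo "$cert_b64" | tr -d ' ' | fold -w 64
  echo "-----END CERTIFICATE-----"
} | openssl x509 -noout -checkend $((30 * 24 * 3600))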
Layer 3: Overnight batch heartbeat
The HR-to-payroll sync emits a heartbeat ping on success.
#!/bin/bash
# /opt/jobs/payroll_sync.sh
# set -e aborts the script if the sync fails, so the heartbeat below is only
# sent after a successful run.
set -e
./run_payroll_sync.py
echo "Sync OK at $(date)"
# Heartbeat only on success
curl -fsS -m 10 --retry 3 \
  "https://miterl.com/heartbeat/$PAYROLL_TOKEN"
Miterl is configured to fire DOWN if no heartbeat arrives by 2 AM. The alert hits both IT engineers in Slack, plus the IT manager via email as a safety net.
Splitting alerts so 400 people don't get spammed
Pinging "outage detected" to all 400 employees creates more chaos than it solves. Per-monitor routing handles the segmentation:
| Channel | What it receives | Audience |
|---|---|---|
| Slack #it-internal | All alerts | 2 IT engineers |
| Slack #announce-it | Wide-impact outages (company-wide systems) | All employees (IT manually reposts after triage) |
| Email | Critical only | IT manager, CIO |
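The manual repost to #announce-it can be as simple as typing a message in the channel, but it is also easy to script. A minimal sketch using a Slack incoming webhook, with a placeholder webhook URL and example wording:
#!/bin/bash
# Post a human-written outage notice to #announce-it via a Slack incoming webhook.
# The webhook URL below is a placeholder; use the one generated for the channel.
curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"text": "The expense reimbursement system is currently down. IT is investigating; next update by 15:00."}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX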
"Should this go company-wide?" is always a human judgment call, never an automated broadcast — which has prevented every false-alarm-spawned company-wide panic.
Auto-suppressing the regular maintenance window
Sundays from 02:00 to 04:00 are the standard maintenance window. A cron job auto-fires a maintenance webhook to suppress monitoring during that period:
# crontab: every Sunday at 02:00, declare a 2-hour maintenance window
# Note: a crontab entry must be a single line, and TOKEN must be defined in the
# crontab itself (or inlined), since cron does not inherit shell variables.
0 2 * * 0 curl -fsS -X POST -H "Content-Type: application/json" -d '{"duration_hours": 2, "name": "Sunday maintenance window"}' https://miterl.com/api/v1/webhooks/maintenance/$TOKEN/start
This single setup eliminated all false alerts caused by scheduled maintenance.
Monthly executive reporting
Miterl's monthly report goes straight into the CIO update:
- Per-system uptime (expense reimbursement 99.95%, HR system 99.8%, etc.)
- Incident counts and mean time to recovery
- Batch failure counts and root causes
Numbers like "expense reimbursement dropped to 99.6% last month, approaching our 99.5% floor" make system-replacement prioritization a data conversation rather than a gut-feel one.
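The MTTR figure in that report is plain arithmetic (total restoration time divided by incident count), so it is easy to sanity-check against a raw export. A minimal sketch, assuming a hypothetical incidents.csv with one incident per line as start,end epoch seconds:
#!/bin/bash
# Compute mean time to recovery in minutes from a "start,end" epoch-seconds CSV.
# incidents.csv is a hypothetical export; adjust the columns to match your data.
awk -F, '{ total += $2 - $1; n++ } END { if (n) printf "MTTR: %.1f min across %d incidents\n", total / n / 60, n }' incidents.csv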
Related reading
- Miterl Documentation — probe locations and IP whitelists
- HMAC signature verification guide — webhook source verification
- Pricing — Pro plan SLA reporting
- All use cases — playbooks for other industries