🚨 IT Incident Response & Alerting System

Category: DevOps / IT | Difficulty: Advanced

Description: A two-part incident management system that handles the full lifecycle of a production alert — from detection to post-mortem. The first sub-workflow receives alerts from any monitoring tool via webhook, classifies severity, and for P1/critical incidents simultaneously creates a Jira incident ticket, broadcasts to the #incidents Slack channel, and pages the on-call engineer via PagerDuty — all three in parallel, within seconds of the alert firing. Non-critical alerts are routed to a standard Jira bug in the OPS project instead. The second sub-workflow triggers on incident resolution: it receives the incident details, generates a complete blameless post-mortem using GPT-4o — with summary, timeline, root cause, impact, and action items — and emails it to engineering leadership. Mean time to detect drops to under a minute. Mean time to respond drops because the right people are paged, the ticket exists, and the team is already in Slack before anyone manually opens a laptop.

The Problem

When production goes down, every second of confusion is a second of lost revenue, degraded user trust, and compounding damage. Most engineering teams without an incident response system spend the first 5–15 minutes of an outage doing exactly the wrong things — someone notices an error, pings a colleague, someone else tries to figure out who’s on call, a Jira ticket gets created manually, a Slack message gets typed, and nobody has the full picture yet. After the incident, post-mortems either don’t happen or get written days later from memory, missing critical details and producing vague action items nobody follows up on.

Key pain points:

Incidents reported through a WhatsApp group — engineers had to read the thread to understand what was down, who was handling it, and what the status was, often joining an already 20-minute-old conversation mid-way through
Average MTTD of 8 minutes — alerts went to one monitoring email inbox that no one was watching in real time, requiring a human to notice, read, and escalate manually
Jira tickets created during the incident by whoever had a browser open — 40% of P1 incidents in Q2 had no Jira ticket at all because everyone was focused on fixing the issue
PagerDuty existed but was triggered manually by the CTO — off-hours incidents went undetected until someone checked their phone at 7 AM
Post-mortems written for 2 of 9 P1 incidents in the 6 months before deployment — the other 7 were never documented, meaning the same root causes recurred across multiple incidents
P3 alerts (minor performance degradations) handled with the same urgency as P1 outages — senior engineers paged for issues that could wait until morning, causing alert fatigue and ignored pages

The Solution

A two-webhook incident management pipeline built on n8n. The first webhook — /incident-alert — accepts POST payloads from UptimeRobot (the team’s existing monitoring tool) and from a custom Grafana alerting rule. A JavaScript enrichment node normalizes the alert payload, maps severity strings and priority codes to a consistent internal format, generates a unique incident ID, and sets an is_critical boolean for P1/critical alerts. An IF node then routes: P1/critical triggers three simultaneous actions — a Jira Incident ticket in the INC project with severity-mapped priority and labels, a formatted Slack broadcast to #incidents with all incident context, and a PagerDuty trigger via the Events API v2 with a dedup_key matching the incident ID to prevent duplicate pages. Non-P1 alerts create a standard Bug in the OPS project instead. The webhook responds with {status: acknowledged, incident_id} confirming receipt.
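A minimal sketch of what the parse-and-enrich step might look like as a plain JavaScript function. Only the title fallback chain, the INC-${Date.now()} ID format, and the is_critical flag come from this write-up; the input field names (severity, priority, service, environment, host, description, message) and the SEVERITY_MAP table are assumptions for illustration, not the deployed code.

```javascript
// Assumed mapping from incoming severity strings and P-codes to internal levels.
const SEVERITY_MAP = {
  p1: 'P1', critical: 'P1',
  p2: 'P2', high: 'P2', error: 'P2',
  p3: 'P3', warning: 'P3', low: 'P3', info: 'P3',
};

// Normalize one raw alert payload into the internal incident shape.
// `now` is injectable so the generated ID is predictable in tests.
function normalizeAlert(a, now = Date.now()) {
  const raw = String(a.severity || a.priority || '').trim().toLowerCase();
  // Unknown severities fall back to P3 (assumption: never page on unknown input).
  const severity = SEVERITY_MAP[raw] || 'P3';
  return {
    incident_id: `INC-${now}`,
    title: a.title || a.monitor_name || 'Alert Triggered',
    severity,
    is_critical: severity === 'P1',
    service: a.service || 'unknown',
    environment: a.environment || 'production',
    host: a.host || 'unknown',
    description: a.description || a.message || '',
    detected_at: new Date(now).toISOString(),
  };
}
```

Inside an n8n Code node this could be called as return $input.all().map((item) => ({ json: normalizeAlert(item.json.body) })), assuming the Webhook node exposes the POST payload under body. Unrecognized severities fall through to P3 here so that an unknown value never pages anyone; a team that prefers to fail loudly could default to P1 instead.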

The second webhook — /incident-resolved — fires when the incident is closed. It receives the incident ID, title, service, duration, root cause, and impact, passes them to GPT-4o with a blameless SRE post-mortem system prompt at temperature 0.3, and emails the complete structured document to engineering and the CTO.
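The request sent to GPT-4o could be assembled roughly as follows. This is a sketch under assumptions: the system prompt wording and the incident field names (duration, root_cause, impact) are illustrative. Only the model, the temperature of 0.3, the blameless framing, and the five section names are taken from this write-up. The returned object follows the shape of an OpenAI Chat Completions request body.

```javascript
// Illustrative system prompt; the production wording is not shown in this write-up.
const POST_MORTEM_SYSTEM_PROMPT = [
  'You are an SRE writing a blameless post-mortem.',
  'Never attribute a failure to an individual; describe systems and processes.',
  'Use exactly these sections: Summary, Timeline, Root Cause Analysis,',
  'Impact Assessment, Action Items (each with an owner and a deadline).',
].join(' ');

// Build the JSON body for a chat completion request.
function buildPostMortemRequest(incident) {
  const userMessage = [
    `Incident ID: ${incident.incident_id}`,
    `Title: ${incident.title}`,
    `Service: ${incident.service}`,
    `Duration: ${incident.duration}`,
    `Root cause: ${incident.root_cause}`,
    `Impact: ${incident.impact}`,
  ].join('\n');

  return {
    model: 'gpt-4o',
    temperature: 0.3, // low temperature keeps the document close to the supplied facts
    messages: [
      { role: 'system', content: POST_MORTEM_SYSTEM_PROMPT },
      { role: 'user', content: userMessage },
    ],
  };
}
```

Keeping the facts in the user message and the rules in the system message makes it straightforward to tune the prompt against historical incidents without touching the data mapping.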

Who it was built for: A 4-person engineering team at a Philippine B2B SaaS company (fleet management software, 120 active clients) averaging 3–4 P1 incidents per month. The team managed incidents through WhatsApp, created Jira tickets by hand, and relied on the CTO to trigger PagerDuty pages manually. Post-mortems were written for fewer than 25% of incidents, so recurring root causes went unaddressed.

Results & Impact

Metric	Before	After
Mean time to detect (MTTD)	8 minutes average — monitoring email inbox, no real-time alerting	Under 45 seconds — UptimeRobot webhook fires instantly on detection
Mean time to respond (MTTR)	47 minutes average across Q2 P1 incidents	22 minutes average in the 3 months post-deployment — 53% reduction
Jira ticket creation	Manual during incident — 40% of P1s had no ticket in Q2	100% — ticket created automatically before the engineer reads the Slack message
On-call paging	Manual PagerDuty trigger by CTO — off-hours incidents undetected until morning	Automatic PagerDuty trigger with `dedup_key` — zero undetected off-hours incidents since deployment
Slack incident broadcast	WhatsApp thread with inconsistent information	Structured Slack message in #incidents with ID, severity, service, environment, and host — every time
P1 vs. non-P1 routing	All alerts treated the same — senior engineers paged for P3 performance blips	P1 gets full three-action response; P3 creates an OPS bug ticket, no page
Alert fatigue incidents	3 engineers reported ignoring pages in Q2 due to false urgency	Zero reported since severity-based routing eliminated P3 pages to on-call
Post-mortem completion rate	2 of 9 P1 incidents documented (22%) in the 6 months before	11 of 11 P1 incidents documented (100%) in 3 months post-deployment
Post-mortem time to produce	90 minutes average when written — usually skipped for lack of time	Under 40 seconds — GPT-4o generates on `/incident-resolved` trigger
Recurring root cause incidents	3 incidents in Q2 traced to the same database connection pool issue — no post-mortem existed from the first occurrence	Zero recurring root causes in Q3 — action items from post-mortems tracked in Jira and closed
Client impact notifications	Ad hoc — the CTO emailed affected clients manually when he remembered	Post-mortem action items now include a client communication step — 100% of P1s with client impact resulted in a formal communication
Engineering team confidence	“We were always in firefighting mode” (CTO, Q2 retrospective)	Structured response means engineers join an incident with full context already in Slack

Industry context: DevOps automation is one of the highest-value automation niches. This pipeline delivers the core functionality of a $50K/year enterprise incident management platform (Jira ticket creation, PagerDuty paging, Slack broadcast, and post-mortem generation) for a fraction of the cost, built on infrastructure the team already owns and understands.

Technical Details

Tech Stack: n8n · Webhook · Jira · Slack · PagerDuty (Events API v2) · OpenAI GPT-4o · Gmail · JavaScript

How each tool is used:

n8n — Two independent webhook-triggered sub-workflows: incident response and post-mortem generation
Webhook (/incident-alert) — Generic endpoint accepting POST from UptimeRobot (configured under Integrations → Webhook) and Grafana alerting rules — no monitoring tool change required, just a new notification URL
JavaScript (parse & enrich) — Normalizes alert payloads across UptimeRobot and Grafana formats using field fallback chains (a.title || a.monitor_name || 'Alert Triggered'), maps severity strings and P-codes to a consistent internal format, generates a timestamped incident ID (INC-${Date.now()}), sets is_critical boolean for routing
IF node — Routes is_critical: true (P1 or critical) to the full three-action response; all other severities route to the non-P1 standard bug path
Jira (P1 — Incident) — Creates an Incident issue type in the INC project with severity-mapped priority (Highest/High/Medium), incident description with all context fields, and auto-labels for incident, severity level, and service name
Slack — Broadcasts to #incidents with incident ID, severity, service, environment, host, timestamp, description, and confirmation that Jira and PagerDuty have fired — engineers join the channel with full context already visible
HTTP Request (PagerDuty) — Posts to PagerDuty Events API v2 /enqueue with event_action: trigger, dedup_key set to the incident ID (prevents duplicate pages during flapping alerts), severity, source, component, and group fields mapped correctly
Jira (non-P1 — Bug) — Creates a standard Bug in the OPS project with Medium priority and monitoring labels — P3 performance issues tracked without pulling the on-call rotation
Respond to Webhook — Returns {status: acknowledged, incident_id} to UptimeRobot/Grafana confirming receipt
Webhook (/incident-resolved) — Triggered via a Jira automation rule that fires when an INC ticket status moves to “Resolved” — fully automatic, no manual POST required
OpenAI GPT-4o — Blameless SRE post-mortem system prompt at temperature 0.3; generates a complete structured document: Summary, Timeline, Root Cause Analysis, Impact Assessment, and Action Items with owners and deadlines
Gmail — Emails the full post-mortem to engineering@ and cto@ with the incident title and duration in the subject line — arrives within 90 seconds of the Jira status change
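The two Jira paths in the list above can be sketched as a single function that returns the fields object for a Jira create-issue request. The INC and OPS project keys, the Incident and Bug issue types, the Highest/High/Medium priorities, and the label categories come from this write-up. The exact severity-to-priority pairing, the summary format, and the toLabel helper are assumptions. The description is a plain string, which fits Jira REST API v2; v3 expects Atlassian Document Format instead.

```javascript
// Assumed pairing of internal severity to Jira priority name.
const PRIORITY_BY_SEVERITY = { P1: 'Highest', P2: 'High', P3: 'Medium' };

// Jira labels cannot contain spaces, so collapse whitespace to hyphens.
function toLabel(value) {
  return String(value).trim().toLowerCase().replace(/\s+/g, '-');
}

// Build the request body for creating the incident ticket or the standard bug.
function buildJiraIssue(incident) {
  const critical = incident.is_critical === true;
  const priority = critical
    ? PRIORITY_BY_SEVERITY[incident.severity] || 'Medium'
    : 'Medium';
  const labels = critical
    ? ['incident', toLabel(incident.severity), toLabel(incident.service)]
    : ['monitoring'];

  return {
    fields: {
      project: { key: critical ? 'INC' : 'OPS' },
      issuetype: { name: critical ? 'Incident' : 'Bug' },
      summary: `[${incident.severity}] ${incident.title}`,
      priority: { name: priority },
      labels,
      description: [
        `Incident ID: ${incident.incident_id}`,
        `Service: ${incident.service}`,
        `Environment: ${incident.environment}`,
        `Host: ${incident.host}`,
        '',
        incident.description,
      ].join('\n'),
    },
  };
}
```

In n8n the Jira node fills these fields through its own parameters, so a function like this is mainly useful when the ticket is created through an HTTP Request node or when the mapping needs to be unit tested outside the workflow.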

Workflow architecture (two independent sub-workflows):

Sub-workflow 1 (incident response): Webhook (/incident-alert) → JS Parse & Enrich → IF P1 Critical → [True: Jira Incident + Slack Broadcast + PagerDuty Page in parallel] / [False: Jira Bug] → Respond Acknowledged
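Outside n8n, the same routing and fan-out could be expressed in plain JavaScript as below. This is an illustration of the branching logic rather than how n8n executes it internally. The action names, the handlers object, and the failed field in the response are hypothetical; only the is_critical branch and the {status, incident_id} acknowledgement are from this write-up.

```javascript
// Decide which actions an incident should trigger.
function planActions(incident) {
  return incident.is_critical
    ? ['jira_incident', 'slack_broadcast', 'pagerduty_page']
    : ['jira_bug'];
}

// Run every planned action concurrently and acknowledge the alert.
// `handlers` maps action names to functions that take the incident.
async function dispatch(incident, handlers) {
  const names = planActions(incident);
  // The async wrapper turns synchronous throws into rejections, so one
  // failing integration cannot prevent the others from running.
  const results = await Promise.allSettled(
    names.map(async (name) => handlers[name](incident))
  );
  const failed = names.filter((_, i) => results[i].status === 'rejected');
  return { status: 'acknowledged', incident_id: incident.incident_id, failed };
}
```

Using Promise.allSettled rather than Promise.all is a deliberate choice in this sketch: if Slack is unreachable, the Jira ticket and the PagerDuty page should still go out.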

Sub-workflow 2 (post-mortem): Webhook (/incident-resolved) → GPT-4o Generate Post-Mortem → Gmail Email to Engineering → Respond Post-Mortem Sent

Complexity highlights:

Three-way parallel P1 response — Jira, Slack, and PagerDuty all fire simultaneously, so the ticket exists, the team is notified, and the on-call engineer is paged within seconds of each other. The coordination that previously took 8 minutes over WhatsApp now completes in under 45 seconds
Cross-format alert normalization — the JavaScript node handles both UptimeRobot’s monitorFriendlyName field and Grafana’s ruleName field using the same fallback chain, making the webhook compatible with both monitoring sources without separate workflows
PagerDuty dedup_key — set to the internal incident ID, preventing PagerDuty from creating multiple incidents during the flapping events that were previously causing 3–4 duplicate pages per incident and further eroding engineer trust in the alerting system
Jira automation trigger on post-mortem — the /incident-resolved webhook is called automatically by a Jira automation rule (Status changed to “Resolved” → POST to webhook URL), removing the manual step entirely and achieving 100% post-mortem completion without requiring any engineer action beyond closing the ticket
Severity-based routing — P3 alerts that previously triggered the same all-hands response as P1 outages now create a silent OPS bug ticket with no page, directly addressing the alert fatigue problem that caused engineers to ignore legitimate P1 pages in Q2
Blameless post-mortem by prompt design — the GPT-4o system prompt explicitly prohibits attributing failures to individuals, producing SRE-standard blameless output that the engineering team actually reads and acts on rather than filing away
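The PagerDuty trigger with dedup_key can be sketched as a function that builds the Events API v2 body, plus a small sender. The endpoint, event_action, dedup_key, and the severity, source, component, and group fields are named in this write-up. Which internal field feeds each of them, and the P-code to PagerDuty severity pairing, are assumptions. The routing key is passed in as a parameter and is a placeholder here.

```javascript
const PAGERDUTY_EVENTS_URL = 'https://events.pagerduty.com/v2/enqueue';

// Events API v2 accepts critical, error, warning, or info. Pairing is assumed.
const PD_SEVERITY = { P1: 'critical', P2: 'error', P3: 'warning' };

// Build a trigger event whose dedup_key is the internal incident ID.
function buildPagerDutyEvent(incident, routingKey) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    dedup_key: incident.incident_id,
    payload: {
      summary: `[${incident.severity}] ${incident.title}`.slice(0, 1024),
      source: incident.host,          // assumed: host maps to source
      severity: PD_SEVERITY[incident.severity] || 'info',
      component: incident.service,    // assumed: service maps to component
      group: incident.environment,    // assumed: environment maps to group
      timestamp: incident.detected_at,
    },
  };
}

// POST the event. Requires a runtime with global fetch (Node 18 or later).
async function sendPagerDutyEvent(event) {
  const res = await fetch(PAGERDUTY_EVENTS_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event),
  });
  if (!res.ok) throw new Error(`PagerDuty responded with ${res.status}`);
  return res.json();
}
```

PagerDuty only collapses events that share a dedup_key, so suppressing duplicate pages during flapping presumes that repeated alerts for the same monitor arrive with, or are mapped to, the same incident ID.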

Context & Social Proof

Build timeline: 4 days — Day 1: UptimeRobot and Grafana webhook payload analysis and JavaScript normalization. Day 2: Jira dual-project configuration, PagerDuty Events API v2 integration, and Slack Block formatting. Day 3: GPT-4o post-mortem prompt engineering tested against 5 real historical incidents. Day 4: Jira automation rule for auto-triggering post-mortem, end-to-end live testing with a simulated P1, and CTO walkthrough
Role: Solo build — alert payload normalization across two monitoring tools, severity routing logic, Jira dual-project configuration (INC and OPS), PagerDuty Events API v2 with dedup, Slack incident broadcast formatting, GPT-4o blameless post-mortem prompt tuned against real incident data, Jira automation trigger for automatic post-mortem generation, and full lifecycle testing
Deployment: n8n cloud; two webhook URLs — one added to UptimeRobot notifications and one to Grafana alerting rules. PagerDuty routing key and Jira project keys are the only client-specific configuration. Zero changes to the engineering team’s existing tools beyond adding the webhook URLs
Client quote: “We were managing incidents through WhatsApp. Someone would post ‘site is down,’ and then 10 messages later we’d still be figuring out who was handling it. Now Slack has everything — the ticket number, what’s affected, who’s paged — before I’ve even opened my laptop. And we actually have post-mortems now. Real ones, with action items that get closed.” — CTO, fleet management SaaS, Philippines
Reusability: Severity mapping, Jira project keys, Slack channel, PagerDuty routing key, and post-mortem email recipients are the only parameters that change per client deployment. The Jira automation rule pattern for automatic post-mortem triggering works identically across any Jira Cloud instance

Use Cases & Ideal Buyer

Best fit for:

SaaS companies with a production service and an engineering team currently handling incidents through WhatsApp, email threads, or ad hoc Slack messages with no structured response process
CTOs who have personally triggered PagerDuty pages manually and know the system breaks the moment they’re unavailable
Engineering teams where post-mortems don’t happen because writing them takes too long after a draining incident
Startups that need enterprise-grade incident response but can’t justify a $50K/year platform for a 3–6 person engineering team

Can also be adapted for:

Multi-service routing — branch on the service field to page different on-call rotations for different microservices or product areas
Severity escalation — add a scheduled node that checks for P1 incidents unacknowledged in PagerDuty after 5 minutes and escalates to a secondary on-call engineer
Incident channel creation — add a Slack API call to create a dedicated #inc-{incident_id} channel for each P1 and invite the on-call team automatically
Bidirectional PagerDuty — the Jira automation rule already covers resolutions recorded in Jira; adding a PagerDuty resolution webhook as a third trigger would also start the post-mortem sub-workflow when an incident is resolved in PagerDuty, making resolution tracking fully bidirectional
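Two of the adaptations above, multi-service routing and severity escalation, reduce to small pure functions. The sketch below is hypothetical throughout: the service names, the routing key placeholders, and the page object shape (acknowledged, triggered_at_ms) are invented for illustration. Only the idea of branching on the service field and the 5-minute threshold come from this write-up.

```javascript
// Hypothetical table: service name -> PagerDuty routing key placeholder.
const ROUTING_KEYS = new Map([
  ['fleet-api', 'ROUTING_KEY_API'],
  ['tracking-ingest', 'ROUTING_KEY_DATA'],
]);
const DEFAULT_ROUTING_KEY = 'ROUTING_KEY_DEFAULT';

// Pick the on-call rotation for an incident based on its service field.
function routingKeyFor(incident) {
  const service = String(incident.service || '').trim().toLowerCase();
  return ROUTING_KEYS.get(service) ?? DEFAULT_ROUTING_KEY;
}

const ESCALATION_AFTER_MS = 5 * 60 * 1000; // 5 minutes

// True when a page is still unacknowledged after the threshold.
// A scheduled check could call this for each open P1.
function needsEscalation(page, now = Date.now()) {
  return page.acknowledged !== true
    && now - page.triggered_at_ms >= ESCALATION_AFTER_MS;
}
```

Keeping the table in one place means adding a new microservice is a one-line change, and unknown services still reach the default rotation instead of being dropped.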