🚨 IT Incident Response & Alerting System

Category: DevOps / IT | Difficulty: Advanced


Description: A two-part incident management system that handles the full lifecycle of a production alert, from detection to post-mortem. The first sub-workflow receives alerts from any monitoring tool via webhook and classifies severity. For P1/critical incidents it simultaneously creates a Jira incident ticket, broadcasts to the #incidents Slack channel, and pages the on-call engineer via PagerDuty, all three in parallel, within seconds of the alert firing. Non-critical alerts are routed to a standard Jira bug in the OPS project instead. The second sub-workflow triggers on incident resolution: it receives the incident details, generates a complete blameless post-mortem using GPT-4o (summary, timeline, root cause, impact, and action items), and emails it to engineering leadership. Mean time to detect drops to under a minute. Mean time to respond drops because the right people are paged, the ticket exists, and the team is already in Slack before anyone manually opens a laptop.


The Problem

When production goes down, every second of confusion is a second of lost revenue, degraded user trust, and compounding damage. Most engineering teams without an incident response system spend the first 5–15 minutes of an outage doing exactly the wrong things: someone notices an error, pings a colleague, someone else tries to figure out who’s on call, a Jira ticket gets created manually, a Slack message gets typed, and nobody has the full picture yet. After the incident, post-mortems either don’t happen or get written days later from memory, missing critical details and producing vague action items nobody follows up on.

Key pain points:

  • Incidents reported through a WhatsApp group: engineers had to read the thread to understand what was down, who was handling it, and what the status was, often joining a conversation that was already 20 minutes old
  • Average MTTD of 8 minutes: alerts went to a single monitoring email inbox that no one watched in real time, so a human had to notice, read, and escalate each one manually
  • Jira tickets created during the incident by whoever had a browser open: 40% of P1 incidents in Q2 had no Jira ticket at all because everyone was focused on fixing the issue
  • PagerDuty existed but was triggered manually by the CTO sending a page: off-hours incidents went undetected until someone checked their phone at 7 AM
  • Post-mortems written for 2 of 9 P1 incidents in the 6 months before deployment: the other 7 were never documented, so the same root causes recurred across multiple incidents
  • P3 alerts (minor performance degradations) handled with the same urgency as P1 outages: senior engineers were paged for issues that could wait until morning, causing alert fatigue and ignored pages

The Solution

A two-webhook incident management pipeline built on n8n. The first webhook, /incident-alert, accepts POST payloads from UptimeRobot (the team’s existing monitoring tool) and from a custom Grafana alerting rule. A JavaScript enrichment node normalizes the alert payload, maps severity strings and priority codes to a consistent internal format, generates a unique incident ID, and sets an is_critical boolean for P1/critical alerts. An IF node then routes the alert. P1/critical triggers three simultaneous actions: a Jira Incident ticket in the INC project with severity-mapped priority and labels, a formatted Slack broadcast to #incidents with all incident context, and a PagerDuty trigger via the Events API v2 with a dedup_key matching the incident ID to prevent duplicate pages. Non-P1 alerts create a standard Bug in the OPS project instead. The webhook responds with {status: acknowledged, incident_id} confirming receipt.
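The enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the deployed node: the title fallback chain and the INC-${Date.now()} ID format come from this write-up, while the SEVERITY_MAP values, the `service` and `priority` input fields, and the function name `enrichAlert` are assumptions made for the example.

```javascript
// Hypothetical severity lookup: the real mapping is client-specific.
const SEVERITY_MAP = {
  p1: 'critical', critical: 'critical',
  p2: 'high', high: 'high',
  p3: 'medium', medium: 'medium', warning: 'medium',
};

// Normalize one raw alert body into a consistent internal incident object.
function enrichAlert(body, now = Date.now()) {
  const a = body || {};
  // Accept either a severity string or a P-code; default to the lowest tier.
  const raw = String(a.severity || a.priority || 'P3').trim().toLowerCase();
  const severity = SEVERITY_MAP[raw] || 'medium';
  return {
    incident_id: `INC-${now}`,
    title: a.title || a.monitor_name || 'Alert Triggered',
    service: a.service || 'unknown', // assumed field name
    severity,
    is_critical: severity === 'critical',
    received_at: new Date(now).toISOString(),
  };
}

console.log(enrichAlert({ monitor_name: 'API health', severity: 'P1' }));
```

Inside an n8n Code node the same function could be applied with something like `return $input.all().map((item) => ({ json: enrichAlert(item.json.body) }));`, assuming the Webhook node exposes the POST body under `json.body`.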

The second webhook, /incident-resolved, fires when the incident is closed. It receives the incident ID, title, service, duration, root cause, and impact, passes them to GPT-4o with a blameless SRE post-mortem system prompt at temperature 0.3, and emails the complete structured document to engineering and the CTO.
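As a rough sketch of the post-mortem request, the function below builds a Chat Completions style request body from the resolved-incident fields listed above. The model name and temperature are from this write-up; the system prompt wording, the input field names, and the function name `buildPostMortemRequest` are assumptions, since the actual prompt is not reproduced here.

```javascript
// Illustrative system prompt: the production prompt is not shown in this document.
const SYSTEM_PROMPT = [
  'You are an SRE writing a blameless post-mortem.',
  'Never attribute the failure to an individual.',
  'Use these sections: Summary, Timeline, Root Cause Analysis,',
  'Impact Assessment, Action Items (with owners and deadlines).',
].join(' ');

// Build the request body; sending it is left to the caller (or the n8n OpenAI node).
function buildPostMortemRequest(incident) {
  const userPrompt = [
    `Incident ID: ${incident.incident_id}`,
    `Title: ${incident.title}`,
    `Service: ${incident.service}`,
    `Duration: ${incident.duration}`,
    `Root cause: ${incident.root_cause}`,
    `Impact: ${incident.impact}`,
  ].join('\n');
  return {
    model: 'gpt-4o',
    temperature: 0.3, // low temperature keeps the structure consistent between runs
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: userPrompt },
    ],
  };
}

const request = buildPostMortemRequest({
  incident_id: 'INC-1700000000000',
  title: 'API health check failing',
  service: 'api',
  duration: '22 minutes',
  root_cause: 'Database connection pool exhausted',
  impact: 'Dashboard unavailable for all clients',
});
console.log(JSON.stringify(request, null, 2));
```

Keeping the prompt assembly in a pure function makes it easy to replay historical incidents through it when tuning the wording.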

Who it was built for: A 4-person engineering team at a Philippine B2B SaaS company (fleet management software, 120 active clients) averaging 3–4 P1 incidents per month, managing incidents through WhatsApp, manually created Jira tickets, and a CTO manually triggering PagerDuty pages. Post-mortems were written for fewer than 25% of incidents, so recurring root causes went unaddressed.


Results & Impact

Metric | Before | After
Mean time to detect (MTTD) | 8 minutes average; monitoring email inbox, no real-time alerting | Under 45 seconds; UptimeRobot webhook fires instantly on detection
Mean time to respond (MTTR) | 47 minutes average across Q2 P1 incidents | 22 minutes average in the 3 months post-deployment; 53% reduction
Jira ticket creation | Manual during incident; 40% of P1s had no ticket in Q2 | 100%; ticket created automatically before the engineer reads the Slack message
On-call paging | Manual PagerDuty trigger by CTO; off-hours incidents undetected until morning | Automatic PagerDuty trigger with dedup_key; zero undetected off-hours incidents since deployment
Slack incident broadcast | WhatsApp thread with inconsistent information | Structured Slack message in #incidents with ID, severity, service, environment, and host, every time
P1 vs. non-P1 routing | All alerts treated the same; senior engineers paged for P3 performance blips | P1 gets full three-action response; P3 creates an OPS bug ticket, no page
Alert fatigue incidents | 3 engineers reported ignoring pages in Q2 due to false urgency | Zero reported since severity-based routing eliminated P3 pages to on-call
Post-mortem completion rate | 2 of 9 P1 incidents documented (22%) in the 6 months before | 11 of 11 P1 incidents documented (100%) in 3 months post-deployment
Post-mortem time to produce | 90 minutes average when written; usually skipped for lack of time | Under 40 seconds; GPT-4o generates on /incident-resolved trigger
Recurring root cause incidents | 3 incidents in Q2 traced to the same database connection pool issue; no post-mortem existed from the first occurrence | Zero recurring root causes in Q3; action items from post-mortems tracked in Jira and closed
Client impact notifications | Ad hoc; the CTO emailed affected clients manually when he remembered | Post-mortem action items now include a client communication step; 100% of P1s with client impact resulted in a formal communication
Engineering team confidence | “We were always in firefighting mode” (CTO, Q2 retrospective) | Structured response means engineers join an incident with full context already in Slack

Industry context: DevOps automation is one of the highest-value automation niches. This pipeline provides the same core functionality as a $50K/year enterprise incident management platform (Jira ticket creation, PagerDuty paging, Slack broadcast, and post-mortem generation) for a fraction of the cost, built on infrastructure the team already owns and understands.


Technical Details

Tech Stack: n8n · Webhook · Jira · Slack · PagerDuty (Events API v2) · OpenAI GPT-4o · Gmail · JavaScript

How each tool is used:

  • n8n: Two independent webhook-triggered sub-workflows, one for incident response and one for post-mortem generation
  • Webhook (/incident-alert): Generic endpoint accepting POST from UptimeRobot (configured under Integrations → Webhook) and Grafana alerting rules; no monitoring tool change required, just a new notification URL
  • JavaScript (parse & enrich): Normalizes alert payloads across UptimeRobot and Grafana formats using field fallback chains (a.title || a.monitor_name || 'Alert Triggered'), maps severity strings and P-codes to a consistent internal format, generates a timestamped incident ID (INC-${Date.now()}), sets is_critical boolean for routing
  • IF node: Routes is_critical: true (P1 or critical) to the full three-action response; all other severities route to the non-P1 standard bug path
  • Jira (P1 Incident): Creates an Incident issue type in the INC project with severity-mapped priority (Highest/High/Medium), incident description with all context fields, and auto-labels for incident, severity level, and service name
  • Slack: Broadcasts to #incidents with incident ID, severity, service, environment, host, timestamp, description, and confirmation that Jira and PagerDuty have fired, so engineers join the channel with full context already visible
  • HTTP Request (PagerDuty): Posts to PagerDuty Events API v2 /enqueue with event_action: trigger, dedup_key set to the incident ID (prevents duplicate pages during flapping alerts), and the severity, source, component, and group fields mapped from the incident
  • Jira (non-P1 Bug): Creates a standard Bug in the OPS project with Medium priority and monitoring labels, so P3 performance issues are tracked without pulling in the on-call rotation
  • Respond to Webhook: Returns {status: acknowledged, incident_id} to UptimeRobot/Grafana confirming receipt
  • Webhook (/incident-resolved): Triggered via a Jira automation rule that fires when an INC ticket status moves to “Resolved”; fully automatic, no manual POST required
  • OpenAI GPT-4o: Blameless SRE post-mortem system prompt at temperature 0.3; generates a complete structured document: Summary, Timeline, Root Cause Analysis, Impact Assessment, and Action Items with owners and deadlines
  • Gmail: Emails the full post-mortem to engineering@ and cto@ with the incident title and duration in the subject line; arrives within 90 seconds of the Jira status change
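The two Jira steps in the list above could produce issue fields along these lines. This is a sketch under assumptions: the project keys, issue types, priorities, and label categories come from this write-up, but the incident field names, the `toLabel` helper, and the summary format are illustrative. The plain-string description matches Jira's REST API v2; v3 expects Atlassian Document Format instead, and the n8n Jira node handles that detail itself.

```javascript
// Severity-to-priority mapping (Highest/High/Medium as described in the text).
const PRIORITY_BY_SEVERITY = { critical: 'Highest', high: 'High', medium: 'Medium' };

// Jira labels cannot contain spaces, so free-text values are normalized.
const toLabel = (value) => String(value).trim().toLowerCase().replace(/\s+/g, '-');

function buildJiraIssue(incident) {
  const summary = `[${incident.incident_id}] ${incident.title}`;
  const description = [
    `Service: ${incident.service}`,
    `Severity: ${incident.severity}`,
    `Received: ${incident.received_at}`,
  ].join('\n');

  if (incident.is_critical) {
    // P1 path: Incident in the INC project with severity-mapped priority.
    return {
      fields: {
        project: { key: 'INC' },
        issuetype: { name: 'Incident' },
        summary,
        description,
        priority: { name: PRIORITY_BY_SEVERITY[incident.severity] || 'Medium' },
        labels: ['incident', toLabel(incident.severity), toLabel(incident.service)],
      },
    };
  }
  // Non-P1 path: standard Bug in the OPS project, no paging.
  return {
    fields: {
      project: { key: 'OPS' },
      issuetype: { name: 'Bug' },
      summary,
      description,
      priority: { name: 'Medium' },
      labels: ['monitoring'],
    },
  };
}

console.log(buildJiraIssue({
  incident_id: 'INC-1700000000000', title: 'API down', service: 'Payments API',
  severity: 'critical', is_critical: true, received_at: '2023-11-14T22:13:20.000Z',
}).fields.labels);
```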

Workflow architecture (two independent sub-workflows):

Sub-workflow 1 (incident response): Webhook (/incident-alert) → JS Parse & Enrich → IF P1 Critical → [True: Jira Incident + Slack Broadcast + PagerDuty Page in parallel] / [False: Jira Bug] → Respond Acknowledged
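Expressed as plain JavaScript rather than n8n nodes, the routing in sub-workflow 1 looks roughly like the sketch below. It illustrates the control flow only and says nothing about how n8n schedules branches internally; the `actions` object and its four method names are hypothetical stand-ins for the Jira, Slack, and PagerDuty nodes.

```javascript
// Route one enriched incident: three actions for P1, a single bug ticket otherwise.
async function handleAlert(incident, actions) {
  if (incident.is_critical) {
    // All three calls are started before any of them is awaited.
    await Promise.all([
      actions.createJiraIncident(incident),
      actions.postSlackBroadcast(incident),
      actions.triggerPagerDuty(incident),
    ]);
  } else {
    await actions.createJiraBug(incident);
  }
  // Response body returned to the monitoring tool.
  return { status: 'acknowledged', incident_id: incident.incident_id };
}

// Usage with stub actions that only log what they would do.
const logAction = (name) => async (incident) => console.log(`${name}: ${incident.incident_id}`);
handleAlert(
  { incident_id: 'INC-1700000000000', is_critical: true },
  {
    createJiraIncident: logAction('jira-incident'),
    postSlackBroadcast: logAction('slack'),
    triggerPagerDuty: logAction('pagerduty'),
    createJiraBug: logAction('jira-bug'),
  },
).then((response) => console.log(response));
```

Promise.all rejects as soon as one action fails; Promise.allSettled would be the alternative if a Slack outage, for example, should not prevent the acknowledgment from being returned.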

Sub-workflow 2 (post-mortem): Webhook (/incident-resolved) → GPT-4o Generate Post-Mortem → Gmail Email to Engineering → Respond Post-Mortem Sent

Complexity highlights:

  • Three-way parallel P1 response: Jira, Slack, and PagerDuty all fire simultaneously, so the ticket exists, the team is notified, and the on-call engineer is paged at the same moment; the coordination that previously took 8 minutes over WhatsApp now happens in under 45 seconds
  • Cross-format alert normalization: the JavaScript node handles both UptimeRobot’s monitorFriendlyName field and Grafana’s ruleName field using the same fallback chain, making the webhook compatible with both monitoring sources without separate workflows
  • PagerDuty dedup_key: set to the internal incident ID, preventing PagerDuty from creating multiple incidents during the flapping events that previously caused 3–4 duplicate pages per incident and further eroded engineer trust in the alerting system
  • Jira automation trigger on post-mortem: the /incident-resolved webhook is called automatically by a Jira automation rule (Status changed to “Resolved” → POST to webhook URL), removing the manual step entirely and achieving 100% post-mortem completion without requiring any engineer action beyond closing the ticket
  • Severity-based routing: P3 alerts that previously triggered the same all-hands response as P1 outages now create a silent OPS bug ticket with no page, directly addressing the alert fatigue that caused engineers to ignore legitimate P1 pages in Q2
  • Blameless post-mortem by prompt design: the GPT-4o system prompt explicitly prohibits attributing failures to individuals, producing SRE-standard blameless output that the engineering team actually reads and acts on rather than filing away
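For the PagerDuty step, a trigger event for the Events API v2 might be assembled as sketched below. The endpoint, `event_action`, `dedup_key`, and the summary/source/severity/component/group payload fields are part of that API; the incident field names, the fallback to `error` for unrecognized severities, and the function name are assumptions for this example.

```javascript
const PAGERDUTY_EVENTS_URL = 'https://events.pagerduty.com/v2/enqueue';

// Events API v2 accepts only these severity values.
const PD_SEVERITIES = new Set(['critical', 'error', 'warning', 'info']);

function buildPagerDutyEvent(incident, routingKey) {
  const severity = PD_SEVERITIES.has(incident.severity) ? incident.severity : 'error';
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    // Events sharing a dedup_key are merged into one PagerDuty incident.
    dedup_key: incident.incident_id,
    payload: {
      summary: `[${incident.incident_id}] ${incident.title}`,
      source: incident.host || incident.service,
      severity,
      component: incident.service,
      group: incident.environment,
    },
  };
}

const event = buildPagerDutyEvent(
  { incident_id: 'INC-1700000000000', title: 'API down', service: 'api',
    severity: 'critical', environment: 'production', host: 'web-1' },
  'YOUR_ROUTING_KEY', // placeholder, not a real key
);
console.log(`POST ${PAGERDUTY_EVENTS_URL}`, JSON.stringify(event));
```

PagerDuty merges only events that carry an identical dedup_key, so suppression of repeat pages depends on the incident ID staying the same for repeated alerts about the same underlying problem.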

Context & Social Proof

  • Build timeline: 4 days. Day 1: UptimeRobot and Grafana webhook payload analysis and JavaScript normalization. Day 2: Jira dual-project configuration, PagerDuty Events API v2 integration, and Slack Block formatting. Day 3: GPT-4o post-mortem prompt engineering tested against 5 real historical incidents. Day 4: Jira automation rule for auto-triggering post-mortem, end-to-end live testing with a simulated P1, and CTO walkthrough
  • Role: Solo build covering alert payload normalization across two monitoring tools, severity routing logic, Jira dual-project configuration (INC and OPS), PagerDuty Events API v2 with dedup, Slack incident broadcast formatting, GPT-4o blameless post-mortem prompt tuned against real incident data, Jira automation trigger for automatic post-mortem generation, and full lifecycle testing
  • Deployment: n8n cloud; two webhook URLs, one added to UptimeRobot notifications and one to Grafana alerting rules. PagerDuty routing key and Jira project keys are the only client-specific configuration. Zero changes to the engineering team’s existing tools beyond adding the webhook URLs
  • Client quote: “We were managing incidents through WhatsApp. Someone would post ‘site is down,’ and then 10 messages later we’d still be figuring out who was handling it. Now Slack has everything (the ticket number, what’s affected, who’s paged) before I’ve even opened my laptop. And we actually have post-mortems now. Real ones, with action items that get closed.” (CTO, fleet management SaaS, Philippines)
  • Reusability: Severity mapping, Jira project keys, Slack channel, PagerDuty routing key, and post-mortem email recipients are the only parameters that change per client deployment. The Jira automation rule pattern for automatic post-mortem triggering works identically across any Jira Cloud instance

Use Cases & Ideal Buyer

Best fit for:

  • SaaS companies with a production service and an engineering team currently handling incidents through WhatsApp, email threads, or ad hoc Slack messages with no structured response process
  • CTOs who have personally triggered PagerDuty pages manually and know the system breaks the moment they’re unavailable
  • Engineering teams where post-mortems don’t happen because writing them takes too long after a draining incident
  • Startups that need enterprise-grade incident response but can’t justify a $50K/year platform for a 3–6 person engineering team

Can also be adapted for:

  • Multi-service routing: branch on the service field to page different on-call rotations for different microservices or product areas
  • Severity escalation: add a scheduled node that checks for P1 incidents unacknowledged in PagerDuty after 5 minutes and escalates to a secondary on-call engineer
  • Incident channel creation: add a Slack API call to create a dedicated #inc-{incident_id} channel for each P1 and invite the on-call team automatically
  • Bi-directional PagerDuty: the Jira automation trigger already covers resolution from the Jira side; a PagerDuty resolution webhook can be added as a further trigger for the post-mortem sub-workflow, making resolution tracking fully bidirectional
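As one example of these adaptations, the severity escalation check could be reduced to a small filter like the sketch below. The 5-minute threshold is from the bullet above; the shape of the incident records (`triggered_at`, `acknowledged`, `is_critical`) is a hypothetical internal format, not PagerDuty's API response, and would need to be mapped from whatever the scheduled node actually fetches.

```javascript
const ESCALATE_AFTER_MS = 5 * 60 * 1000; // 5 minutes

// Return the P1 incidents that are still unacknowledged past the threshold.
function findIncidentsToEscalate(openIncidents, now = Date.now()) {
  return openIncidents.filter((incident) =>
    incident.is_critical &&
    !incident.acknowledged &&
    now - Date.parse(incident.triggered_at) >= ESCALATE_AFTER_MS);
}

const now = Date.parse('2024-01-01T00:06:00Z');
const overdue = findIncidentsToEscalate([
  { incident_id: 'A', is_critical: true, acknowledged: false, triggered_at: '2024-01-01T00:00:00Z' },
  { incident_id: 'B', is_critical: true, acknowledged: false, triggered_at: '2024-01-01T00:03:00Z' },
  { incident_id: 'C', is_critical: true, acknowledged: true, triggered_at: '2024-01-01T00:00:00Z' },
], now);
console.log(overdue.map((incident) => incident.incident_id));
```

Each incident returned by the filter would then be handed to whatever step pages the secondary on-call engineer.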