
🚨 IT Incident Response & Alerting System
Category: DevOps / IT | Difficulty: Advanced
Description: A two-part incident management system that handles the full lifecycle of a production alert, from detection to post-mortem. The first sub-workflow receives alerts from any monitoring tool via webhook, classifies severity, and for P1/critical incidents simultaneously creates a Jira incident ticket, broadcasts to the #incidents Slack channel, and pages the on-call engineer via PagerDuty, all three in parallel, within seconds of the alert firing. Non-critical alerts are routed to a standard Jira bug in the OPS project instead. The second sub-workflow triggers on incident resolution: it receives the incident details, generates a complete blameless post-mortem using GPT-4o (summary, timeline, root cause, impact, and action items), and emails it to engineering leadership. Mean time to detect drops to under a minute. Mean time to respond drops because the right people are paged, the ticket exists, and the team is already in Slack before anyone manually opens a laptop.
The Problem
When production goes down, every second of confusion is a second of lost revenue, degraded user trust, and compounding damage. Most engineering teams without an incident response system spend the first 5–15 minutes of an outage doing exactly the wrong things: someone notices an error, pings a colleague, someone else tries to figure out who’s on call, a Jira ticket gets created manually, a Slack message gets typed, and nobody has the full picture yet. After the incident, post-mortems either don’t happen or get written days later from memory, missing critical details and producing vague action items nobody follows up on.
Key pain points:
- Incidents reported through a WhatsApp group: engineers had to read the thread to understand what was down, who was handling it, and what the status was, often joining an already 20-minute-old conversation mid-way through
- Average MTTD of 8 minutes: alerts went to one monitoring email inbox that no one was watching in real time, requiring a human to notice, read, and escalate manually
- Jira tickets created during the incident by whoever had a browser open; 40% of P1 incidents in Q2 had no Jira ticket at all because everyone was focused on fixing the issue
- PagerDuty existed but was triggered manually by the CTO sending a page, so off-hours incidents went undetected until someone checked their phone at 7AM
- Post-mortems written for 2 of 9 P1 incidents in the 6 months before deployment; the other 7 were never documented, meaning the same root causes recurred across multiple incidents
- P3 alerts (minor performance degradations) handled with the same urgency as P1 outages: senior engineers paged for issues that could wait until morning, causing alert fatigue and ignored pages
The Solution
A two-webhook incident management pipeline built on n8n. The first webhook, /incident-alert, accepts POST payloads from UptimeRobot (the team’s existing monitoring tool) and from a custom Grafana alerting rule. A JavaScript enrichment node normalizes the alert payload, maps severity strings and priority codes to a consistent internal format, generates a unique incident ID, and sets an is_critical boolean for P1/critical alerts. An IF node then routes the alert. P1/critical triggers three simultaneous actions: a Jira Incident ticket in the INC project with severity-mapped priority and labels, a formatted Slack broadcast to #incidents with all incident context, and a PagerDuty trigger via the Events API v2 with a dedup_key matching the incident ID to prevent duplicate pages. Non-P1 alerts create a standard Bug in the OPS project instead. The webhook responds with {status: acknowledged, incident_id} confirming receipt.
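The enrichment step described above could be sketched roughly as follows. This is a minimal illustration rather than the deployed node: the function name `normalizeAlert`, the entries in the severity table, the `service` and `priority` input fields, and the choice to default unknown severities to P3 are all assumptions made for the example.

```javascript
// Assumed lookup table: the source says severity strings and P-codes are
// mapped to one internal format but does not list the actual entries.
const SEVERITY_MAP = new Map([
  ['critical', 'P1'], ['p1', 'P1'],
  ['high', 'P2'], ['p2', 'P2'],
  ['warning', 'P3'], ['low', 'P3'], ['p3', 'P3'],
]);

// `now` is injectable so the generated ID is predictable when testing.
function normalizeAlert(a, now = Date.now()) {
  // Fallback chain across payload formats, ending in a fixed default title.
  const title =
    a.title || a.monitor_name || a.monitorFriendlyName || a.ruleName || 'Alert Triggered';
  const raw = String(a.severity || a.priority || '').trim().toLowerCase();
  // Assumption: unrecognized severities fall back to P3 (ticket only, no page).
  const severity = SEVERITY_MAP.get(raw) || 'P3';
  return {
    incident_id: `INC-${now}`,
    title,
    severity,
    service: a.service || 'unknown',
    is_critical: severity === 'P1',
  };
}
```

With this sketch, a payload carrying `severity: 'Critical'` yields a P1 record with `is_critical: true`, while an empty payload falls through to the default title and P3. Whether unknown severities should take the quiet path or the paging path is a policy choice the source does not state.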
The second webhook, /incident-resolved, fires when the incident is closed. It receives the incident ID, title, service, duration, root cause, and impact, passes them to GPT-4o with a blameless SRE post-mortem system prompt at temperature 0.3, and emails the complete structured document to engineering and the CTO.
Who it was built for: A 4-person engineering team at a Philippine B2B SaaS company (fleet management software, 120 active clients) averaging 3–4 P1 incidents per month, managing incidents through WhatsApp, manually created Jira tickets, and a CTO manually triggering PagerDuty pages. Post-mortems written for fewer than 25% of incidents, causing recurring root causes to go unaddressed.
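As a rough illustration of the post-mortem step, the request body sent to the OpenAI Chat Completions endpoint might be assembled like this. The system prompt wording, the `buildPostMortemRequest` name, and the incident field names (`incident_id`, `root_cause`, and so on) are assumptions; the source gives only the model, the temperature, and the list of inputs and output sections.

```javascript
// Illustrative prompt only; the wording actually deployed is not given in the source.
const SYSTEM_PROMPT = [
  'You are an SRE writing a blameless post-mortem.',
  'Never attribute failures to individuals.',
  'Use these sections: Summary, Timeline, Root Cause Analysis,',
  'Impact Assessment, Action Items (with owners and deadlines).',
].join(' ');

// Builds the JSON body for POST https://api.openai.com/v1/chat/completions.
function buildPostMortemRequest(incident) {
  const details = [
    `Incident ID: ${incident.incident_id}`,
    `Title: ${incident.title}`,
    `Service: ${incident.service}`,
    `Duration: ${incident.duration}`,
    `Root cause: ${incident.root_cause}`,
    `Impact: ${incident.impact}`,
  ].join('\n');
  return {
    model: 'gpt-4o',
    temperature: 0.3, // low temperature keeps the document structure consistent
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: details },
    ],
  };
}
```

Keeping the blameless constraint in the system message rather than the user message means every incident gets the same framing regardless of how the resolution details are phrased.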
Results & Impact
| Metric | Before | After |
|---|---|---|
| Mean time to detect (MTTD) | 8 minutes average: monitoring email inbox, no real-time alerting | Under 45 seconds: UptimeRobot webhook fires instantly on detection |
| Mean time to respond (MTTR) | 47 minutes average across Q2 P1 incidents | 22 minutes average in the 3 months post-deployment (53% reduction) |
| Jira ticket creation | Manual during incident; 40% of P1s had no ticket in Q2 | 100%: ticket created automatically before the engineer reads the Slack message |
| On-call paging | Manual PagerDuty trigger by CTO; off-hours incidents undetected until morning | Automatic PagerDuty trigger with dedup_key; zero undetected off-hours incidents since deployment |
| Slack incident broadcast | WhatsApp thread with inconsistent information | Structured Slack message in #incidents with ID, severity, service, environment, and host, every time |
| P1 vs. non-P1 routing | All alerts treated the same; senior engineers paged for P3 performance blips | P1 gets full three-action response; P3 creates an OPS bug ticket, no page |
| Alert fatigue incidents | 3 engineers reported ignoring pages in Q2 due to false urgency | Zero reported since severity-based routing eliminated P3 pages to on-call |
| Post-mortem completion rate | 2 of 9 P1 incidents documented (22%) in the 6 months before | 11 of 11 P1 incidents documented (100%) in 3 months post-deployment |
| Post-mortem time to produce | 90 minutes average when written; usually skipped for lack of time | Under 40 seconds: GPT-4o generates on /incident-resolved trigger |
| Recurring root cause incidents | 3 incidents in Q2 traced to the same database connection pool issue; no post-mortem existed from the first occurrence | Zero recurring root causes in Q3; action items from post-mortems tracked in Jira and closed |
| Client impact notifications | Ad hoc: the CTO emailed affected clients manually when he remembered | Post-mortem action items now include a client communication step; 100% of P1s with client impact resulted in a formal communication |
| Engineering team confidence | “We were always in firefighting mode” (CTO, Q2 retrospective) | Structured response means engineers join an incident with full context already in Slack |
Industry context: DevOps automation is one of the highest-value automation niches. This pipeline provides the core functionality of a $50K/year enterprise incident management platform (Jira ticket creation, PagerDuty paging, Slack broadcast, and post-mortem generation) for a fraction of the cost, built on infrastructure the team already owns and understands.
Technical Details
Tech Stack: n8n · Webhook · Jira · Slack · PagerDuty (Events API v2) · OpenAI GPT-4o · Gmail · JavaScript
How each tool is used:
- n8n: Two independent webhook-triggered sub-workflows, one for incident response and one for post-mortem generation
- Webhook (/incident-alert): Generic endpoint accepting POST from UptimeRobot (configured under Integrations → Webhook) and Grafana alerting rules. No monitoring tool change is required, just a new notification URL
- JavaScript (parse & enrich): Normalizes alert payloads across UptimeRobot and Grafana formats using field fallback chains (`a.title || a.monitor_name || 'Alert Triggered'`), maps severity strings and P-codes to a consistent internal format, generates a timestamped incident ID (`INC-${Date.now()}`), and sets the `is_critical` boolean for routing
- IF node: Routes `is_critical: true` (P1 or critical) to the full three-action response; all other severities route to the non-P1 standard bug path
- Jira (P1 → Incident): Creates an Incident issue type in the INC project with severity-mapped priority (Highest/High/Medium), an incident description with all context fields, and auto-labels for `incident`, severity level, and service name
- Slack: Broadcasts to #incidents with incident ID, severity, service, environment, host, timestamp, description, and confirmation that Jira and PagerDuty have fired, so engineers join the channel with full context already visible
- HTTP Request (PagerDuty): Posts to PagerDuty Events API v2 `/enqueue` with `event_action: trigger`, `dedup_key` set to the incident ID (prevents duplicate pages during flapping alerts), and the severity, source, component, and group fields mapped correctly
- Jira (non-P1 → Bug): Creates a standard Bug in the OPS project with Medium priority and monitoring labels, so P3 performance issues are tracked without pulling in the on-call rotation
- Respond to Webhook: Returns `{status: acknowledged, incident_id}` to UptimeRobot/Grafana confirming receipt
- Webhook (/incident-resolved): Triggered via a Jira automation rule that fires when an INC ticket status moves to “Resolved”. Fully automatic, no manual POST required
- OpenAI GPT-4o: Blameless SRE post-mortem system prompt at temperature 0.3; generates a complete structured document: Summary, Timeline, Root Cause Analysis, Impact Assessment, and Action Items with owners and deadlines
- Gmail: Emails the full post-mortem to `engineering@` and `cto@` with the incident title and duration in the subject line; it arrives within 90 seconds of the Jira status change
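The two Jira paths in the list above could be expressed as a single field builder along these lines. This is a sketch under assumptions: the `buildJiraFields` name, the summary format, and the P1/P2/P3 to Highest/High/Medium mapping are illustrative (the source names the three priorities but not which severity maps to which), and the description field is omitted because its format depends on the Jira API version in use.

```javascript
// Assumed mapping from internal severity to Jira priority name.
const PRIORITY_BY_SEVERITY = new Map([
  ['P1', 'Highest'], ['P2', 'High'], ['P3', 'Medium'],
]);

// Jira labels cannot contain spaces, so whitespace is collapsed to hyphens.
function toLabel(value) {
  return String(value).trim().toLowerCase().replace(/\s+/g, '-');
}

// Returns the `fields` object for a Jira create-issue call.
function buildJiraFields(incident) {
  const summary = `[${incident.severity}] ${incident.title}`;
  if (incident.is_critical === true) {
    const priority = PRIORITY_BY_SEVERITY.get(incident.severity) || 'Medium';
    return {
      project: { key: 'INC' },
      issuetype: { name: 'Incident' },
      summary,
      priority: { name: priority },
      labels: ['incident', toLabel(incident.severity), toLabel(incident.service)],
    };
  }
  // Non-critical path: standard Bug in OPS with Medium priority.
  return {
    project: { key: 'OPS' },
    issuetype: { name: 'Bug' },
    summary,
    priority: { name: 'Medium' },
    labels: ['monitoring'],
  };
}
```

Building both variants in one function keeps the project key, issue type, and labels for each path side by side, which makes the per-client changes easier to spot.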
Workflow architecture (two independent sub-workflows):
Sub-workflow 1 (incident response): Webhook (/incident-alert) → JS Parse & Enrich → IF P1 Critical → [True: Jira Incident + Slack Broadcast + PagerDuty Page in parallel] / [False: Jira Bug] → Respond Acknowledged
Sub-workflow 2 (post-mortem): Webhook (/incident-resolved) → GPT-4o Generate Post-Mortem → Gmail Email to Engineering → Respond Post-Mortem Sent
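The branching in sub-workflow 1 can be summarized in a few lines of JavaScript. In n8n the fan-out is wired visually from the IF node's outputs, so this is only a sketch of the decision logic; the action names and the `planActions` and `acknowledge` helpers are hypothetical.

```javascript
// Returns the actions the workflow would fan out to for a normalized alert.
// Action names are illustrative labels, not n8n node identifiers.
function planActions(alert) {
  if (alert.is_critical === true) {
    return ['jira_incident', 'slack_broadcast', 'pagerduty_page'];
  }
  return ['jira_bug'];
}

// Shape of the webhook response described in the source.
function acknowledge(alert) {
  return { status: 'acknowledged', incident_id: alert.incident_id };
}
```

The strict `=== true` comparison means a missing or malformed `is_critical` value takes the ticket-only path instead of paging.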
Complexity highlights:
- Three-way parallel P1 response: Jira, Slack, and PagerDuty all fire simultaneously, meaning the ticket exists, the team is notified, and the on-call engineer is paged in the same second. The coordination that previously took 8 minutes over WhatsApp now happens in under 45 seconds
- Cross-format alert normalization: the JavaScript node handles both UptimeRobot’s `monitorFriendlyName` field and Grafana’s `ruleName` field using the same fallback chain, making the webhook compatible with both monitoring sources without separate workflows
- PagerDuty `dedup_key`: set to the internal incident ID, preventing PagerDuty from creating multiple incidents during the flapping events that were previously causing 3–4 duplicate pages per incident and further eroding engineer trust in the alerting system
- Jira automation trigger on post-mortem: the `/incident-resolved` webhook is called automatically by a Jira automation rule (Status changed to “Resolved” → POST to webhook URL), removing the manual step entirely and achieving 100% post-mortem completion without requiring any engineer action beyond closing the ticket
- Severity-based routing: P3 alerts that previously triggered the same all-hands response as P1 outages now create a silent OPS bug ticket with no page, directly addressing the alert fatigue problem that caused engineers to ignore legitimate P1 pages in Q2
- Blameless post-mortem by prompt design: the GPT-4o system prompt explicitly prohibits attributing failures to individuals, producing SRE-standard blameless output that the engineering team actually reads and acts on rather than filing away
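To make the `dedup_key` point concrete, a trigger event for the PagerDuty Events API v2 might be built as below. The top-level and `payload` field names follow the Events API v2 format; the `buildPagerDutyEvent` name, the summary format, and the choice of which incident fields feed `source`, `component`, and `group` are assumptions for illustration.

```javascript
// Builds the JSON body for POST https://events.pagerduty.com/v2/enqueue.
// `routingKey` is the integration key of the target PagerDuty service.
function buildPagerDutyEvent(incident, routingKey) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    // Events sharing a dedup_key are grouped into one alert instead of re-paging.
    dedup_key: incident.incident_id,
    payload: {
      summary: `[${incident.severity}] ${incident.title}`,
      source: incident.host || 'unknown',
      severity: 'critical', // Events API v2 accepts critical, error, warning, info
      component: incident.service,
      group: incident.environment,
    },
  };
}
```

PagerDuty only groups events that carry the same `dedup_key`, so suppressing duplicate pages depends on repeat firings of one alert reusing the same incident ID. How the deployed workflow ensures that is not described in the source.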
Context & Social Proof
- Build timeline: 4 days. Day 1: UptimeRobot and Grafana webhook payload analysis and JavaScript normalization. Day 2: Jira dual-project configuration, PagerDuty Events API v2 integration, and Slack Block formatting. Day 3: GPT-4o post-mortem prompt engineering tested against 5 real historical incidents. Day 4: Jira automation rule for auto-triggering post-mortem, end-to-end live testing with a simulated P1, and CTO walkthrough
- Your role: Solo build covering alert payload normalization across two monitoring tools, severity routing logic, Jira dual-project configuration (INC and OPS), PagerDuty Events API v2 with dedup, Slack incident broadcast formatting, GPT-4o blameless post-mortem prompt tuned against real incident data, Jira automation trigger for automatic post-mortem generation, and full lifecycle testing
- Deployment: n8n cloud; two webhook URLs, one added to UptimeRobot notifications and one to Grafana alerting rules. PagerDuty routing key and Jira project keys are the only client-specific configuration. Zero changes to the engineering team’s existing tools beyond adding the webhook URLs
- Client quote: “We were managing incidents through WhatsApp. Someone would post ‘site is down,’ and then 10 messages later we’d still be figuring out who was handling it. Now Slack has everything (the ticket number, what’s affected, who’s paged) before I’ve even opened my laptop. And we actually have post-mortems now. Real ones, with action items that get closed.” (CTO, fleet management SaaS, Philippines)
- Reusability: Severity mapping, Jira project keys, Slack channel, PagerDuty routing key, and post-mortem email recipients are the only parameters that change per client deployment. The Jira automation rule pattern for automatic post-mortem triggering works identically across any Jira Cloud instance
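One way to keep those per-client parameters in a single place is a small configuration object with a completeness check. Everything below is a hypothetical sketch: the key names, the placeholder values, and the `missingConfigKeys` helper are not taken from the deployed workflow.

```javascript
// Placeholder values only; real keys and addresses are client-specific.
const clientConfig = {
  severityMap: { critical: 'P1', p1: 'P1', p2: 'P2', p3: 'P3' },
  jira: { incidentProject: 'INC', bugProject: 'OPS' },
  slackChannel: '#incidents',
  pagerDutyRoutingKey: 'REPLACE_WITH_ROUTING_KEY',
  postMortemRecipients: ['recipient-one@example.com', 'recipient-two@example.com'],
};

// Lists required keys that are absent, so a new deployment fails fast.
function missingConfigKeys(config) {
  const required = [
    'severityMap', 'jira', 'slackChannel', 'pagerDutyRoutingKey', 'postMortemRecipients',
  ];
  return required.filter((key) => config[key] === undefined || config[key] === null);
}
```

Running `missingConfigKeys` against a new client's configuration before the first live alert would surface a forgotten routing key or recipient list early.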
Use Cases & Ideal Buyer
Best fit for:
- SaaS companies with a production service and an engineering team currently handling incidents through WhatsApp, email threads, or ad hoc Slack messages with no structured response process
- CTOs who have personally triggered PagerDuty pages manually and know the system breaks the moment they’re unavailable
- Engineering teams where post-mortems don’t happen because writing them takes too long after a draining incident
- Startups that need enterprise-grade incident response but can’t justify a $50K/year platform for a 3–6 person engineering team
Can also be adapted for:
- Multi-service routing: branch on the `service` field to page different on-call rotations for different microservices or product areas
- Severity escalation: add a scheduled node that checks for P1 incidents unacknowledged in PagerDuty after 5 minutes and escalates to a secondary on-call engineer
- Incident channel creation: add a Slack API call to create a dedicated `#inc-{incident_id}` channel for each P1 and invite the on-call team automatically
- Bi-directional PagerDuty: the Jira automation trigger pattern already handles this; a PagerDuty resolution webhook can be added as a third trigger for the post-mortem sub-workflow, making resolution tracking fully bidirectional
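Two of the adaptations above, multi-service routing and per-incident channels, reduce to small lookups that could be sketched as follows. The service names, placeholder routing keys, and both helper functions are hypothetical; the channel-name cleanup assumes Slack's convention of lowercase letters, digits, hyphens, and underscores with an 80-character limit.

```javascript
// Hypothetical service-to-rotation table with placeholder routing keys.
const ROUTING_KEYS = new Map([
  ['tracking-api', 'ROUTING_KEY_TRACKING'],
  ['billing', 'ROUTING_KEY_BILLING'],
]);
const DEFAULT_ROUTING_KEY = 'ROUTING_KEY_DEFAULT';

// Picks the PagerDuty routing key for a service, falling back to a default rotation.
function routingKeyFor(service) {
  const key = String(service || '').trim().toLowerCase();
  return ROUTING_KEYS.get(key) || DEFAULT_ROUTING_KEY;
}

// Derives a channel name of the form inc-{incident_id}, lowercased and sanitized.
function incidentChannelName(incidentId) {
  const cleaned = String(incidentId).toLowerCase().replace(/[^a-z0-9_-]/g, '-');
  return `inc-${cleaned}`.slice(0, 80);
}
```

A default routing key matters here: an alert for a service missing from the table should still page someone, not get dropped.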
