AI Incident Response

AI-assisted incident response with dynamic runbook generation, root cause analysis, and automated remediation suggestions.

Overview

The AI incident response module helps operations teams respond faster to infrastructure incidents. When an incident is triggered (via PagerDuty, a webhook, or manual creation), Knowledge Tree generates a context-specific runbook, performs root cause analysis using graph data, and suggests remediation steps based on historical patterns.

Runbook generation

Instead of relying on static runbooks, Knowledge Tree generates dynamic runbooks tailored to the specific incident context. Each runbook includes:

Resource context -- details about the affected resource, taken from the knowledge graph
Dependency information -- what else is affected and what to check
Recent changes -- what changed on the resource in the last 24 hours
Checklist -- step-by-step investigation and mitigation steps
Escalation path -- who to contact based on resource ownership

# Example: dynamic runbook for an RDS latency incident
Incident: RDS instance "prod-db-1" is experiencing elevated latency

1. 🔍 Verify the current state
   - Check RDS metrics in Datadog
   - Review recent CloudWatch alarms
   - Confirm the resource is still in the knowledge graph

2. 🔄 Check recent changes
   - Security group changes in last 24h: none
   - Parameter group changes in last 24h: none
   - Schema migration in last 24h: 1 migration (2h ago)

3. 📊 Review dependencies
   - Connected to: payment-api, user-service, order-service
   - Blast radius: 3 critical services

4. 🛠️ Remediation steps
   - Scale up instance class (runs: modify-db-instance)
   - Increase allocated storage (runs: modify-db-storage)
   - Failover to replica (runs: failover-db-instance)

5. 📋 Post-mitigation
   - Verify all dependent services recovered
   - Update incident timeline in PagerDuty
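
The runbook above could be assembled roughly as follows. This is a minimal sketch, not Knowledge Tree's implementation: the `Resource` and `Change` classes, the `build_runbook` function, the owner value, and the fixed timestamp are all hypothetical stand-ins for whatever the knowledge graph actually returns. It only shows how the 24-hour change window, the dependency list, and the escalation owner can be combined into one document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Change:
    kind: str        # e.g. "schema migration"
    at: datetime     # when the change was applied


@dataclass
class Resource:
    name: str
    owner: str                                          # used for the escalation path
    dependents: list[str] = field(default_factory=list)
    changes: list[Change] = field(default_factory=list)


def build_runbook(resource: Resource, symptom: str, now: datetime) -> str:
    """Render a plain-text runbook from the resource's (hypothetical) graph context."""
    cutoff = now - timedelta(hours=24)
    recent = [c for c in resource.changes if c.at >= cutoff]

    lines = [f'Incident: "{resource.name}" is experiencing {symptom}', ""]
    lines.append("Recent changes (last 24h):")
    if recent:
        for c in recent:
            hours_ago = int((now - c.at).total_seconds() // 3600)
            lines.append(f"  - {c.kind} ({hours_ago}h ago)")
    else:
        lines.append("  - none")
    lines.append(f"Dependencies: {', '.join(resource.dependents) or 'none'}")
    lines.append(f"Blast radius: {len(resource.dependents)} services")
    lines.append(f"Escalation: {resource.owner}")
    return "\n".join(lines)


# Arbitrary timestamp and sample data, chosen only to make the example reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
db = Resource(
    name="prod-db-1",
    owner="platform team",
    dependents=["payment-api", "user-service", "order-service"],
    changes=[
        Change("schema migration", now - timedelta(hours=2)),
        Change("parameter group change", now - timedelta(days=3)),
    ],
)
runbook = build_runbook(db, "elevated latency", now)
print(runbook)
```

Only the migration from two hours ago falls inside the window, so the three-day-old parameter group change is left out of the rendered runbook.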

Root cause analysis

The RCA engine combines graph data with LLM reasoning to identify probable root causes:

Change correlation -- correlate the incident with recent changes on the resource or its dependencies
Anomaly context -- check if any anomalies were detected on the resource prior to the incident
Dependency cascade -- trace whether the incident originated upstream or downstream
Similar incidents -- compare with historical incidents on similar resources
Confidence scoring -- each potential root cause is assigned a confidence score
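
To make the signals above concrete, here is a small sketch of how they might be folded into a single confidence score. It is a hand-written heuristic used as a stand-in: the source does not describe how the RCA engine computes confidence, and the `Candidate` fields, the weights, and the 24-hour window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    description: str
    minutes_before_incident: float   # how long before the incident the change happened
    on_dependency: bool              # True if the change was on a dependency, not the resource
    prior_anomaly: bool              # True if an anomaly was detected before the incident
    similar_incidents: int           # historical incidents attributed to the same kind of cause


def confidence(c: Candidate, window_minutes: float = 24 * 60) -> float:
    """Combine the signals into a score in [0, 1]. All weights are illustrative."""
    if c.minutes_before_incident < 0 or c.minutes_before_incident > window_minutes:
        recency = 0.0                                    # outside the correlation window
    else:
        recency = 1.0 - c.minutes_before_incident / window_minutes
    locality = 0.6 if c.on_dependency else 1.0           # discount indirect changes
    history = min(c.similar_incidents, 5) / 5            # cap the historical signal
    score = 0.5 * recency * locality + 0.2 * float(c.prior_anomaly) + 0.3 * history
    return round(score, 3)


def rank(candidates: list[Candidate]) -> list[tuple[float, str]]:
    """Return (score, description) pairs, most likely cause first."""
    return sorted(((confidence(c), c.description) for c in candidates), reverse=True)


candidates = [
    Candidate("config change on a dependency", 600, True, False, 0),
    Candidate("schema migration on prod-db-1", 120, False, True, 2),
]
ranked = rank(candidates)
for score, description in ranked:
    print(f"{score:.3f}  {description}")
```

The recent change on the resource itself, backed by a prior anomaly and two similar incidents, ranks well above the older change on a dependency.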

Remediation suggestions

Based on the root cause analysis and historical patterns, the system suggests remediation actions:

Action type	Example
Automated fix	Execute a known remediation script via webhook
Manual step	Scale up the instance through the cloud console
Rollback	Roll back the last change detected on the resource
Workaround	Fail over to the standby replica while investigating
Escalation	Route to the platform team for further investigation

Learning from incidents

Each incident and its resolution are stored in the knowledge graph. Over time, the system learns which remediation actions are most effective for different incident patterns.
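
One simple way to picture this kind of learning is a per-pattern success rate for each remediation action. The sketch below is an assumption about how such tracking could work, not a description of the product's internals; the `RemediationStats` class and the `rds-latency` pattern name are hypothetical, and the smoothing is a standard add-one (Laplace) estimate so that an action tried once does not look perfectly reliable.

```python
from collections import defaultdict


class RemediationStats:
    """Track how often each action resolved incidents of a given pattern."""

    def __init__(self) -> None:
        # (pattern, action) -> [successes, attempts]
        self._counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, pattern: str, action: str, resolved: bool) -> None:
        entry = self._counts[(pattern, action)]
        entry[0] += int(resolved)
        entry[1] += 1

    def success_rate(self, pattern: str, action: str) -> float:
        # Add-one smoothing: unseen actions start at 0.5 rather than 0 or 1.
        successes, attempts = self._counts.get((pattern, action), (0, 0))
        return (successes + 1) / (attempts + 2)

    def best_actions(self, pattern: str) -> list[tuple[str, float]]:
        """Return the actions recorded for a pattern, highest success rate first."""
        rated = [(a, self.success_rate(p, a)) for (p, a) in self._counts if p == pattern]
        return sorted(rated, key=lambda item: item[1], reverse=True)


stats = RemediationStats()
for _ in range(3):
    stats.record("rds-latency", "failover-db-instance", resolved=True)
stats.record("rds-latency", "modify-db-instance", resolved=True)
stats.record("rds-latency", "modify-db-instance", resolved=False)
stats.record("rds-latency", "modify-db-instance", resolved=False)

best = stats.best_actions("rds-latency")
print(best)
```

With this sample history, the failover action is suggested first for the `rds-latency` pattern because it resolved every recorded incident.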

Incident timeline

The incident timeline automatically reconstructs the sequence of events leading up to and following an incident:

Pre-incident -- resource changes, anomaly detections, and metric deviations
Incident detection -- alert source, first detection time, severity
Response actions -- who was paged, when they acknowledged, what actions were taken
Resolution -- what fixed the issue, when normal operation resumed
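
The four phases above can be derived from timestamps alone, as in the sketch below. It assumes events arrive as unordered (time, description) records and that the detection and resolution times are known; the `Event` class, the `build_timeline` function, and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Event:
    at: datetime
    description: str


def build_timeline(
    events: list[Event], detected_at: datetime, resolved_at: datetime
) -> list[tuple[str, Event]]:
    """Sort events chronologically and label each one with its phase."""
    timeline = []
    for e in sorted(events, key=lambda ev: ev.at):
        if e.at < detected_at:
            phase = "pre-incident"
        elif e.at == detected_at:
            phase = "detection"       # the alert that opened the incident
        elif e.at < resolved_at:
            phase = "response"
        else:
            phase = "resolution"
        timeline.append((phase, e))
    return timeline


def t(hour: int, minute: int) -> datetime:
    # Arbitrary date, used only to build sample timestamps.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    Event(t(12, 20), "failover to replica started"),
    Event(t(10, 0), "schema migration applied"),
    Event(t(12, 0), "latency alert fired"),
    Event(t(12, 30), "latency back to normal"),
    Event(t(11, 40), "anomaly detected on prod-db-1"),
    Event(t(12, 5), "on-call engineer acknowledged page"),
]
timeline = build_timeline(events, detected_at=t(12, 0), resolved_at=t(12, 30))
for phase, event in timeline:
    print(f"{event.at:%H:%M}  {phase:<12}  {event.description}")
```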

Postmortem

After an incident is resolved, Knowledge Tree can generate a postmortem document:

Timeline summary -- key events with timestamps
Root cause -- identified root cause with supporting evidence
Impact assessment -- blast radius, affected users, duration
Action items -- preventive measures to avoid recurrence
Follow-up tickets -- auto-created Jira tickets for each action item
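
As a rough illustration of how these sections fit together, the sketch below renders a plain-text postmortem and prepares one ticket payload per action item. Everything here is assumed for the example: the function names, the `OPS` project key, and the sample incident data are hypothetical. The payloads follow the general `fields` shape that Jira's create-issue REST endpoint accepts, but nothing is sent, and how Knowledge Tree actually creates tickets is not described in the source.

```python
from datetime import datetime, timezone


def render_postmortem(title, detected_at, resolved_at, root_cause, affected, action_items):
    """Render the postmortem sections as plain text."""
    duration_min = int((resolved_at - detected_at).total_seconds() // 60)
    lines = [
        f"Postmortem: {title}",
        f"Duration: {duration_min} minutes",
        f"Root cause: {root_cause}",
        f"Impact: {len(affected)} services ({', '.join(affected)})",
        "Action items:",
    ]
    lines += [f"  {i}. {item}" for i, item in enumerate(action_items, start=1)]
    return "\n".join(lines)


def ticket_payloads(action_items, project_key):
    """Build one issue payload per action item (hypothetical project key)."""
    return [
        {
            "fields": {
                "project": {"key": project_key},
                "summary": item,
                "issuetype": {"name": "Task"},
            }
        }
        for item in action_items
    ]


# Sample data, invented for illustration only.
actions = ["Add a latency check to the migration pipeline", "Document the failover procedure"]
doc = render_postmortem(
    title="Elevated latency on prod-db-1",
    detected_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
    root_cause="schema migration applied 2h before the incident",
    affected=["payment-api", "user-service", "order-service"],
    action_items=actions,
)
tickets = ticket_payloads(actions, project_key="OPS")
print(doc)
```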