AI Incident Response

AI-assisted incident response with dynamic runbook generation, root cause analysis, and automated remediation suggestions.

Overview

The AI incident response module helps operations teams respond faster to infrastructure incidents. When an incident is triggered (via PagerDuty, webhook, or manual creation), Knowledge Tree generates context-specific runbooks, performs root cause analysis using graph data, and suggests remediation steps based on historical patterns.

Runbook generation

Unlike static runbooks, Knowledge Tree generates dynamic runbooks tailored to the specific incident context:

  • Resource context -- runbook includes details about the affected resource from the graph
  • Dependency information -- what else is affected and what to check
  • Recent changes -- what changed on the resource in the last 24 hours
  • Checklist -- step-by-step investigation and mitigation steps
  • Escalation path -- who to contact based on resource ownership

# Example: dynamic runbook for an RDS latency incident
Incident: RDS instance "prod-db-1" is experiencing elevated latency

1. 🔍 Verify the current state
   - Check RDS metrics in Datadog
   - Review recent CloudWatch alarms
   - Confirm the resource is still in the knowledge graph

2. 🔄 Check recent changes
   - Security group changes in last 24h: none
   - Parameter group changes in last 24h: none
   - Schema migration in last 24h: 1 migration (2h ago)

3. 📊 Review dependencies
   - Connected to: payment-api, user-service, order-service
   - Blast radius: 3 critical services

4. 🛠️ Remediation steps
   - Scale up instance class (runs: modify-db-instance)
   - Increase allocated storage (runs: modify-db-storage)
   - Failover to replica (runs: failover-db-instance)

5. 📋 Post-mitigation
   - Verify all dependent services recovered
   - Update incident timeline in PagerDuty
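
The assembly of such a runbook can be sketched in a few lines. This is a minimal illustration, not Knowledge Tree's implementation: the names `Change`, `recent_changes`, and `build_runbook` are hypothetical, the sample data is made up, and the 24-hour window is the only detail taken from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Change:
    kind: str      # hypothetical label, e.g. "schema_migration"
    at: datetime   # when the change was applied


def recent_changes(changes, now, window=timedelta(hours=24)):
    """Return changes inside the look-back window, newest first."""
    cutoff = now - window
    recent = [c for c in changes if cutoff <= c.at <= now]
    return sorted(recent, key=lambda c: c.at, reverse=True)


def build_runbook(resource, changes, dependents, owner, now):
    """Assemble the runbook sections as plain text (assumed layout)."""
    lines = [f"Incident runbook for {resource}", "", "Recent changes (last 24h):"]
    recent = recent_changes(changes, now)
    if recent:
        for c in recent:
            hours = int((now - c.at).total_seconds() // 3600)
            lines.append(f"  - {c.kind} ({hours}h ago)")
    else:
        lines.append("  - none")
    lines += ["", "Dependencies:"]
    lines.append(f"  - Connected to: {', '.join(dependents)}")
    lines.append(f"  - Blast radius: {len(dependents)} services")
    lines += ["", f"Escalation: {owner}"]
    return "\n".join(lines)


# Sample data only; the date and owner are arbitrary.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
changes = [
    Change("schema_migration", now - timedelta(hours=2)),
    Change("parameter_group", now - timedelta(days=3)),
]
runbook = build_runbook(
    "prod-db-1",
    changes,
    ["payment-api", "user-service", "order-service"],
    "platform-team",
    now,
)
print(runbook)
```

The three-day-old parameter group change falls outside the window, so only the migration appears under recent changes.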

Root cause analysis

The RCA engine combines graph data with LLM reasoning to identify probable root causes:

  • Change correlation -- correlate the incident with recent changes on the resource or its dependencies
  • Anomaly context -- check if any anomalies were detected on the resource prior to the incident
  • Dependency cascade -- trace whether the incident originated upstream or downstream
  • Similar incidents -- compare with historical incidents on similar resources
  • Confidence scoring -- each potential root cause is assigned a confidence score
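
One way to picture confidence scoring is as a weighted combination of the signals listed above. The sketch below is an assumption made for illustration: the `Candidate` fields, the weights, and the linear recency decay are hypothetical, and the LLM reasoning step is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    cause: str
    hours_before_incident: float   # gap between the change and the incident
    anomaly_detected: bool         # was an anomaly flagged beforehand?
    similar_incident_rate: float   # share of similar past incidents with this cause, 0..1


# Hypothetical weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"recency": 0.5, "anomaly": 0.2, "history": 0.3}


def confidence(c, window_hours=24.0):
    """Combine the signals into a single score between 0 and 1."""
    # Linear decay: a change right before the incident scores 1,
    # a change at the edge of the window (or older) scores 0.
    recency = max(0.0, 1.0 - c.hours_before_incident / window_hours)
    anomaly = 1.0 if c.anomaly_detected else 0.0
    score = (
        WEIGHTS["recency"] * recency
        + WEIGHTS["anomaly"] * anomaly
        + WEIGHTS["history"] * c.similar_incident_rate
    )
    return round(score, 2)


def rank(candidates):
    """Order candidate root causes from most to least likely."""
    return sorted(candidates, key=confidence, reverse=True)


# Sample candidates with made-up values.
candidates = [
    Candidate("upstream dependency outage", 12.0, False, 0.2),
    Candidate("schema migration", 2.0, True, 0.6),
]
ranked = rank(candidates)
for c in ranked:
    print(f"{confidence(c):.2f}  {c.cause}")
```

A recent change that coincides with a prior anomaly and matches past incidents rises to the top; an older change with little supporting evidence scores lower.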

Remediation suggestions

Based on the root cause analysis and historical patterns, the system suggests remediation actions:

  Action type     Example
  Automated fix   Execute a known remediation script via webhook
  Manual step     Scale up the instance through the cloud console
  Rollback        Roll back the last change detected on the resource
  Workaround      Fail over to the standby replica while investigating
  Escalation      Route to the platform team for further investigation

Learning from incidents

Each incident and its resolution are stored in the knowledge graph. Over time, the system learns which remediation actions are most effective for different incident patterns.
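
A simple way to think about this learning loop is a success-rate tally per incident pattern and action. The sketch below assumes that model; the `RemediationStats` class, the "rds-latency" pattern name, and the smoothing choice are hypothetical and say nothing about how Knowledge Tree stores or weighs outcomes.

```python
from collections import defaultdict


class RemediationStats:
    """Track how often each action resolved incidents of a given pattern."""

    def __init__(self):
        # (pattern, action) -> [successes, attempts]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, pattern, action, resolved):
        entry = self._counts[(pattern, action)]
        if resolved:
            entry[0] += 1
        entry[1] += 1

    def success_rate(self, pattern, action):
        successes, attempts = self._counts.get((pattern, action), (0, 0))
        # Laplace smoothing keeps an action tried once from looking perfect.
        return (successes + 1) / (attempts + 2)

    def best_actions(self, pattern):
        """Return the actions seen for this pattern, most effective first."""
        actions = [a for (p, a) in self._counts if p == pattern]
        return sorted(
            actions,
            key=lambda a: self.success_rate(pattern, a),
            reverse=True,
        )


stats = RemediationStats()
# Made-up history: failover resolved 3 of 3 incidents, scaling up 1 of 3.
for resolved in (True, True, True):
    stats.record("rds-latency", "failover-db-instance", resolved)
for resolved in (True, False, False):
    stats.record("rds-latency", "modify-db-instance", resolved)

best = stats.best_actions("rds-latency")
print(best)
```

With smoothing, an action that has never been tried gets a neutral 0.5 rather than 0 or 1, which avoids over-trusting a single lucky outcome.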

Incident timeline

The incident timeline automatically reconstructs the sequence of events leading up to and following an incident:

  • Pre-incident -- resource changes, anomaly detections, and metric deviations
  • Incident detection -- alert source, first detection time, severity
  • Response actions -- who was paged, when they acknowledged, what actions were taken
  • Resolution -- what fixed the issue, when normal operation resumed
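
The reconstruction can be illustrated as sorting events and tagging each one with a phase relative to the detection and resolution times. This is a sketch under that assumption; `phase`, `build_timeline`, and the sample events are hypothetical.

```python
from datetime import datetime, timezone


def phase(at, detected_at, resolved_at):
    """Classify a timestamp into one of the four timeline phases."""
    if at < detected_at:
        return "pre-incident"
    if at == detected_at:
        return "detection"
    if at < resolved_at:
        return "response"
    return "resolution"


def build_timeline(events, detected_at, resolved_at):
    """Sort (timestamp, description) pairs and tag each with its phase."""
    return [
        (at, phase(at, detected_at, resolved_at), description)
        for at, description in sorted(events)
    ]


def ts(hour, minute):
    # Helper for the sample data below; the date is arbitrary.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Events arrive out of order, as they might from different sources.
events = [
    (ts(10, 5), "on-call engineer acknowledged the page"),
    (ts(8, 0), "schema migration applied"),
    (ts(10, 0), "latency alert fired"),
    (ts(10, 40), "failover to replica completed"),
]
timeline = build_timeline(events, detected_at=ts(10, 0), resolved_at=ts(10, 40))
for at, label, description in timeline:
    print(f"{at:%H:%M}  {label:<12}  {description}")
```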

Postmortem

After an incident is resolved, Knowledge Tree can generate a postmortem document:

  • Timeline summary -- key events with timestamps
  • Root cause -- identified root cause with supporting evidence
  • Impact assessment -- blast radius, affected users, duration
  • Action items -- preventive measures to avoid recurrence
  • Follow-up tickets -- auto-created Jira tickets for each action item
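
As a rough illustration of how these sections could be rendered into a document, the sketch below formats a plain-text postmortem from already-collected fields. The function names, the layout, and all sample values are assumptions; ticket creation is left out because it depends on the Jira integration.

```python
from datetime import datetime, timezone


def format_duration(start, end):
    """Render the time between two datetimes as hours and minutes."""
    minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m"


def render_postmortem(title, detected_at, resolved_at, root_cause, affected, action_items):
    """Build a plain-text postmortem (assumed layout, for illustration)."""
    lines = [
        f"Postmortem: {title}",
        f"Duration: {format_duration(detected_at, resolved_at)}",
        f"Root cause: {root_cause}",
        f"Impact: {len(affected)} services ({', '.join(affected)})",
        "Action items:",
    ]
    lines += [f"  {i}. {item}" for i, item in enumerate(action_items, start=1)]
    return "\n".join(lines)


# Sample values only.
doc = render_postmortem(
    title="Elevated latency on prod-db-1",
    detected_at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 1, 11, 35, tzinfo=timezone.utc),
    root_cause="schema migration (sample value)",
    affected=["payment-api", "user-service", "order-service"],
    action_items=["Review migration rollout checklist", "Add a latency alert on replicas"],
)
print(doc)
```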