AI Incident Response
AI-assisted incident response with dynamic runbook generation, root cause analysis, and automated remediation suggestions.
Overview
The AI incident response module helps operations teams respond faster to infrastructure incidents. When an incident is triggered (via PagerDuty, webhook, or manual creation), Knowledge Tree generates context-specific runbooks, performs root cause analysis using graph data, and suggests remediation steps based on historical patterns.
Runbook generation
Instead of relying on static runbooks, Knowledge Tree generates dynamic runbooks tailored to the specific incident context:
- Resource context -- runbook includes details about the affected resource from the graph
- Dependency information -- what else is affected and what to check
- Recent changes -- what changed on the resource in the last 24 hours
- Checklist -- step-by-step investigation and mitigation steps
- Escalation path -- who to contact based on resource ownership
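The components listed above can be pictured as fields on a single runbook record. The following is a minimal sketch, not Knowledge Tree's actual data model: the `Change` and `Runbook` classes, the `build_runbook` helper, and the `platform-team` owner are all hypothetical names chosen for illustration. It assumes changes arrive as timestamped records and keeps only those inside the 24-hour lookback window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Change:
    kind: str      # e.g. "schema_migration" (hypothetical label)
    at: datetime   # when the change was applied


@dataclass
class Runbook:
    resource: str               # resource context
    dependencies: list[str]     # dependency information
    recent_changes: list[Change]
    checklist: list[str]
    escalation_contact: str     # escalation path, derived from ownership


def build_runbook(resource, dependencies, changes, owner, now,
                  window=timedelta(hours=24)):
    """Assemble a runbook, keeping only changes inside the lookback window."""
    recent = [c for c in changes if timedelta(0) <= now - c.at <= window]
    checklist = [f"Verify current state of {resource}"]
    if recent:
        checklist.append(f"Review {len(recent)} recent change(s)")
    if dependencies:
        checklist.append(f"Check {len(dependencies)} dependent service(s)")
    return Runbook(resource, list(dependencies), recent, checklist, owner)


# Illustrative input only; timestamps are made up for the example.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
changes = [
    Change("schema_migration", now - timedelta(hours=2)),
    Change("parameter_group", now - timedelta(days=3)),
]
runbook = build_runbook(
    "prod-db-1",
    ["payment-api", "user-service", "order-service"],
    changes,
    "platform-team",
    now,
)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the window filter deterministic and easy to test.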
# Example: dynamic runbook for an RDS failure incident
Incident: RDS instance "prod-db-1" is experiencing elevated latency
1. 🔍 Verify the current state
- Check RDS metrics in Datadog
- Review recent CloudWatch alarms
- Confirm the resource is still in the knowledge graph
2. 🔄 Check recent changes
- Security group changes in last 24h: none
- Parameter group changes in last 24h: none
- Schema migration in last 24h: 1 migration (2h ago)
3. 📊 Review dependencies
- Connected to: payment-api, user-service, order-service
- Blast radius: 3 critical services
4. 🛠️ Remediation steps
- Scale up instance class (runs: modify-db-instance)
- Increase allocated storage (runs: modify-db-storage)
- Failover to replica (runs: failover-db-instance)
5. 📋 Post-mitigation
- Verify all dependent services recovered
- Update incident timeline in PagerDuty
Root cause analysis
The RCA engine combines graph data with LLM reasoning to identify probable root causes:
- Change correlation -- correlate the incident with recent changes on the resource or its dependencies
- Anomaly context -- check if any anomalies were detected on the resource prior to the incident
- Dependency cascade -- trace whether the incident originated upstream or downstream
- Similar incidents -- compare with historical incidents on similar resources
- Confidence scoring -- each potential root cause is assigned a confidence score
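To make the idea of confidence scoring concrete, here is a minimal sketch that turns the four signals above into a single score per candidate cause. It is a plain weighted heuristic standing in for the real engine, which also involves LLM reasoning not modelled here. The `Candidate` class, the `WEIGHTS` values, and the saturation at five similar incidents are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    cause: str
    recent_change: bool      # change correlation
    prior_anomaly: bool      # anomaly context
    upstream_failure: bool   # dependency cascade
    similar_incidents: int   # number of matching historical incidents


# Hypothetical weights; they sum to 1.0 so scores stay within [0, 1].
WEIGHTS = {"change": 0.4, "anomaly": 0.2, "cascade": 0.2, "history": 0.2}


def confidence(c: Candidate) -> float:
    """Combine the four signals into one score, rounded to two decimals."""
    history = min(c.similar_incidents, 5) / 5  # saturate at five matches
    score = (
        WEIGHTS["change"] * c.recent_change
        + WEIGHTS["anomaly"] * c.prior_anomaly
        + WEIGHTS["cascade"] * c.upstream_failure
        + WEIGHTS["history"] * history
    )
    return round(score, 2)


def rank(candidates):
    """Order candidate causes from most to least likely."""
    return sorted(candidates, key=confidence, reverse=True)


# Illustrative candidates, not real incident data.
candidates = [
    Candidate("upstream network fault", False, False, True, 0),
    Candidate("schema migration", True, True, False, 2),
]
ranked = rank(candidates)
```

In this sketch a cause backed by a recent change and a prior anomaly outranks one supported only by an upstream failure, which mirrors the intent of weighting change correlation most heavily.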
Remediation suggestions
Based on the root cause analysis and historical patterns, the system suggests remediation actions:
| Action type | Example |
|---|---|
| Automated fix | Execute a known remediation script via webhook |
| Manual step | Scale up the instance through the cloud console |
| Rollback | Roll back the last change detected on the resource |
| Workaround | Fail over to the standby replica while investigating |
| Escalation | Route to the platform team for further investigation |
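One way to picture how the action types in the table might be chosen is a small rule function. This is a sketch under stated assumptions, not the product's selection logic: the `suggest_actions` function, its parameters, and the 0.8 and 0.5 confidence thresholds are hypothetical.

```python
from enum import Enum


class ActionType(Enum):
    AUTOMATED_FIX = "Automated fix"
    MANUAL_STEP = "Manual step"
    ROLLBACK = "Rollback"
    WORKAROUND = "Workaround"
    ESCALATION = "Escalation"


def suggest_actions(confidence, has_known_script, recent_change,
                    has_standby_replica):
    """Return candidate action types in the order they would be proposed."""
    actions = []
    # Only run a script automatically when the root cause is well supported.
    if has_known_script and confidence >= 0.8:
        actions.append(ActionType.AUTOMATED_FIX)
    if recent_change:
        actions.append(ActionType.ROLLBACK)
    if has_standby_replica:
        actions.append(ActionType.WORKAROUND)
    # Fall back to a manual step when nothing more specific applies.
    if not actions:
        actions.append(ActionType.MANUAL_STEP)
    # Low confidence always adds an escalation on top of other suggestions.
    if confidence < 0.5:
        actions.append(ActionType.ESCALATION)
    return actions


suggested = suggest_actions(
    confidence=0.68,
    has_known_script=False,
    recent_change=True,
    has_standby_replica=True,
)
```

Returning an ordered list rather than a single action leaves the final choice with the responder, in line with these being suggestions.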
Incident timeline
The incident timeline automatically reconstructs the sequence of events leading up to and following an incident:
- Pre-incident -- resource changes, anomaly detections, and metric deviations
- Incident detection -- alert source, first detection time, severity
- Response actions -- who was paged, when they acknowledged, what actions were taken
- Resolution -- what fixed the issue, when normal operation resumed
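The four phases above amount to bucketing timestamped events relative to the detection and resolution times. The sketch below illustrates that idea; the `build_timeline` function, the event tuples, and the phase keys are hypothetical, and the sample events are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def build_timeline(events, detected_at, resolved_at):
    """Group (timestamp, description) events into the four timeline phases."""
    timeline = {
        "pre-incident": [],
        "detection": [],
        "response": [],
        "resolution": [],
    }
    # Sort by timestamp so each phase lists its events chronologically.
    for ts, description in sorted(events, key=lambda e: e[0]):
        if ts < detected_at:
            phase = "pre-incident"
        elif ts == detected_at:
            phase = "detection"
        elif ts < resolved_at:
            phase = "response"
        else:
            phase = "resolution"
        timeline[phase].append((ts, description))
    return timeline


# Illustrative events only.
detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
resolved = detected + timedelta(minutes=45)
events = [
    (detected + timedelta(minutes=5), "on-call acknowledged page"),
    (detected - timedelta(hours=2), "schema migration applied"),
    (detected, "latency alert fired"),
    (resolved, "failover to replica completed"),
]
timeline = build_timeline(events, detected, resolved)
```

Events can arrive in any order, for example from different alert sources, which is why the function sorts before bucketing.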
Postmortem
After an incident is resolved, Knowledge Tree can generate a postmortem document containing:
- Timeline summary -- key events with timestamps
- Root cause -- identified root cause with supporting evidence
- Impact assessment -- blast radius, affected users, duration
- Action items -- preventive measures to avoid recurrence
- Follow-up tickets -- auto-created Jira tickets for each action item