# Incident Response

## Severity Levels

| Level | Description | Response Time | Example |
|---|---|---|---|
| SEV1 | Platform down, all customers affected | Immediate | Database outage, control plane crash |
| SEV2 | Major degradation, many customers affected | 15 min | Analysis engine failures, agent disconnects at scale |
| SEV3 | Partial degradation, some customers affected | 1 hour | Single-tenant issues, intermittent errors |
| SEV4 | Minor issue, workaround available | Next business day | UI glitches, non-critical feature broken |
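When declaring a severity, the table above can be encoded as a small triage helper. A minimal sketch, not official policy: the 30% cutoff for SEV2 is an assumption for illustration, since the table only says "many customers".

```sh
#!/bin/sh
# Hypothetical triage helper; thresholds are illustrative, not stated policy.
severity() {
  pct_affected="$1"   # percentage of customers affected (0-100)
  workaround="$2"     # "yes" if a workaround exists
  if [ "$pct_affected" -ge 100 ]; then
    echo "SEV1"       # platform down, all customers affected
  elif [ "$pct_affected" -ge 30 ]; then
    echo "SEV2"       # major degradation, many customers affected
  elif [ "$pct_affected" -gt 0 ] && [ "$workaround" != "yes" ]; then
    echo "SEV3"       # partial degradation, some customers affected
  else
    echo "SEV4"       # minor issue, workaround available
  fi
}

severity 100 no   # prints SEV1
severity 40 no    # prints SEV2
severity 5 yes    # prints SEV4
```

Encoding the decision removes judgment calls at 3 a.m.; adjust the cutoffs to whatever your on-call policy actually specifies.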
## Detection

### Automated

- Grafana alerts for error rate spikes, latency increases, pod restarts
- Health endpoint monitoring
- Agent connectivity SLO tracking
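Health endpoint monitoring can be as simple as a scripted HTTP probe. A hedged sketch: the `/healthz` URL and the 5-second timeout are assumptions, not the actual PulseStream monitoring configuration.

```sh
#!/bin/sh
# Sketch of a health-endpoint check; URL and timeout are assumed values.
probe() {
  url="$1"
  # Treat anything other than HTTP 200 within 5 seconds as unhealthy.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  [ "$code" = "200" ]
}

if probe "https://pulsestream.example.com/healthz"; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

In practice this check would run from your monitoring system on a short interval, with an alert on consecutive failures rather than a single miss.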
### Manual

- Customer support tickets
- MCP investigation finding systemic issues:
  - MCP: `get_system_health`
  - MCP: `get_fleet_summary`
## Triage (First 5 Minutes)

1. Assess scope: How many customers are affected?

   MCP: `get_fleet_summary`

2. Identify component: Which service is degraded?

   MCP: `get_system_health`

3. Check recent changes: Was anything deployed recently?

   ```sh
   # Check deploy history
   helm history pulsestream-core -n pulsestream-core-prod
   helm history pulsestream-agent -n pulsestream-agent-prod
   ```

4. Declare severity and notify.
## Mitigation

### Quick Mitigation Options

| Action | Command | When |
|---|---|---|
| Rollback deploy | `helm rollback <release> <rev> -n <ns>` | Bad deploy caused the issue |
| Scale down | `pulsar scale down --env prod` | Need to stop the bleeding |
| Restart service | `kubectl rollout restart deployment/<svc> -n <ns>` | Stuck process / memory leak |
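The rollback row can be wrapped with a small guard so an on-call engineer never tries to roll back past revision 1. The `prev_rev` helper below is hypothetical, not an existing PulseStream tool.

```sh
#!/bin/sh
# Hypothetical guard: compute the revision just before the current one.
prev_rev() {
  cur="$1"
  if [ "$cur" -gt 1 ]; then
    echo $((cur - 1))
  else
    echo "no earlier revision to roll back to" >&2
    return 1
  fi
}

# Usage against a real cluster (not run here); release/namespace from this runbook:
#   helm rollback pulsestream-core "$(prev_rev 7)" -n pulsestream-core-prod
```

Note that `helm rollback <release>` with no revision already targets the previous one; the helper is useful when you want to log or confirm the target revision before acting.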
## LLM Provider Outage

- Check the provider status page
- The analysis engine queues work and retries with exponential backoff
- Customer-facing: analyses complete once the provider recovers
- No PulseStream action is needed unless the outage exceeds 1 hour
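The backoff behavior described above can be sketched as a generic retry wrapper. The attempt count and delay schedule here are illustrative, not the analysis engine's actual configuration.

```sh
#!/bin/sh
# Sketch of exponential backoff: wait base, 2*base, 4*base, ... between attempts.
retry() {
  max="$1"; base="$2"; shift 2
  delay="$base"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    "$@" && return 0          # command succeeded; stop retrying
    sleep "$delay"
    delay=$((delay * 2))      # double the wait after each failure
    attempt=$((attempt + 1))
  done
  return 1                    # still failing after all attempts
}

# Example (provider URL is a placeholder):
#   retry 5 1 curl -fsS https://api.provider.example/v1/health
```

Doubling the delay keeps retry traffic low during a long provider outage, which is why no manual intervention is needed for short incidents.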
## Post-Mortem

After every SEV1/SEV2:
- Document the timeline (when detected, when mitigated, when resolved)
- Perform a root cause analysis
- File action items to prevent recurrence
- Update runbooks if a new failure mode was discovered