# Incident Response

## Severity Levels

| Level | Description | Response Time | Example |
|---|---|---|---|
| SEV1 | Platform down, all customers affected | Immediate | Database outage, control plane crash |
| SEV2 | Major degradation, many customers affected | 15 min | Analysis engine failures, agent disconnects at scale |
| SEV3 | Partial degradation, some customers affected | 1 hour | Single-tenant issues, intermittent errors |
| SEV4 | Minor issue, workaround available | Next business day | UI glitches, non-critical feature broken |
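When declaring a severity, the table above can be encoded as a small triage helper. A minimal sketch, not official policy: the 30% cutoff for SEV2 is an assumption for illustration, since the table only says "many customers".

```sh
#!/bin/sh
# Hypothetical triage helper; thresholds are illustrative, not stated policy.
severity() {
  pct_affected="$1"   # percentage of customers affected (0-100)
  workaround="$2"     # "yes" if a workaround exists
  if [ "$pct_affected" -ge 100 ]; then
    echo "SEV1"       # platform down, all customers affected
  elif [ "$pct_affected" -ge 30 ]; then
    echo "SEV2"       # major degradation, many customers affected
  elif [ "$pct_affected" -gt 0 ] && [ "$workaround" != "yes" ]; then
    echo "SEV3"       # partial degradation, some customers affected
  else
    echo "SEV4"       # minor issue, workaround available
  fi
}

severity 100 no   # prints SEV1
severity 40 no    # prints SEV2
severity 5 yes    # prints SEV4
```

Encoding the decision removes judgment calls at 3 a.m.; adjust the cutoffs to whatever your on-call policy actually specifies.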
## Detection

### Automated

- Grafana alerts for error rate spikes, latency increases, pod restarts
- Health endpoint monitoring
- Agent connectivity SLO tracking
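Health endpoint monitoring can be as simple as a scripted HTTP probe. A hedged sketch: the `/healthz` URL and the 5-second timeout are assumptions, not the actual PulseStream monitoring configuration.

```sh
#!/bin/sh
# Sketch of a health-endpoint check; URL and timeout are assumed values.
probe() {
  url="$1"
  # Treat anything other than HTTP 200 within 5 seconds as unhealthy.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  [ "$code" = "200" ]
}

if probe "https://pulsestream.example.com/healthz"; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

In practice this check would run from your monitoring system on a short interval, with an alert on consecutive failures rather than a single miss.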
### Manual

- Customer support tickets
- MCP investigation finding systemic issues:
  - MCP: `get_system_health`
  - MCP: `get_fleet_summary`
## Triage (First 5 Minutes)

1. Assess scope: How many customers are affected?

   MCP: `get_fleet_summary`

2. Identify component: Which service is degraded?

   MCP: `get_system_health`

3. Check recent changes: Was anything deployed recently?

   ```sh
   # Check deploy history
   helm history pulsestream-core -n pulsestream-core-prod
   helm history pulsestream-agent -n pulsestream-agent-prod
   ```

4. Declare severity and notify.
## Mitigation

### Quick Mitigation Options

| Action | Command | When |
|---|---|---|
| Rollback deploy | `helm rollback <release> <rev> -n <ns>` | Bad deploy caused the issue |
| Scale down | `pulsar scale down --env prod` | Need to stop the bleeding |
| Restart service | `kubectl rollout restart deployment/<svc> -n <ns>` | Stuck process / memory leak |
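The rollback row can be wrapped with a small guard so an on-call engineer never tries to roll back past revision 1. The `prev_rev` helper below is hypothetical, not an existing PulseStream tool.

```sh
#!/bin/sh
# Hypothetical guard: compute the revision just before the current one.
prev_rev() {
  cur="$1"
  if [ "$cur" -gt 1 ]; then
    echo $((cur - 1))
  else
    echo "no earlier revision to roll back to" >&2
    return 1
  fi
}

# Usage against a real cluster (not run here); release/namespace from this runbook:
#   helm rollback pulsestream-core "$(prev_rev 7)" -n pulsestream-core-prod
```

Note that `helm rollback <release>` with no revision already targets the previous one; the helper is useful when you want to log or confirm the target revision before acting.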
## LLM Provider Outage

- Check the provider status page
- The analysis engine queues work and retries with exponential backoff
- Customer-facing: analyses complete once the provider recovers
- No PulseStream action is needed unless the outage exceeds 1 hour
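The backoff behavior described above can be sketched as a generic retry wrapper. The attempt count and delay schedule here are illustrative, not the analysis engine's actual configuration.

```sh
#!/bin/sh
# Sketch of exponential backoff: wait base, 2*base, 4*base, ... between attempts.
retry() {
  max="$1"; base="$2"; shift 2
  delay="$base"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    "$@" && return 0          # command succeeded; stop retrying
    sleep "$delay"
    delay=$((delay * 2))      # double the wait after each failure
    attempt=$((attempt + 1))
  done
  return 1                    # still failing after all attempts
}

# Example (provider URL is a placeholder):
#   retry 5 1 curl -fsS https://api.provider.example/v1/health
```

Doubling the delay keeps retry traffic low during a long provider outage, which is why no manual intervention is needed for short incidents.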
## Post-Mortem

After every SEV1/SEV2:
- Document the timeline (when detected, when mitigated, when resolved)
- Perform a root cause analysis
- File action items to prevent recurrence
- Update runbooks if a new failure mode was discovered