
# Incident Response

## Severity levels

| Level | Description | Response Time | Example |
| --- | --- | --- | --- |
| SEV1 | Platform down, all customers affected | Immediate | Database outage, control plane crash |
| SEV2 | Major degradation, many customers affected | 15 min | Analysis engine failures, agent disconnects at scale |
| SEV3 | Partial degradation, some customers affected | 1 hour | Single-tenant issues, intermittent errors |
| SEV4 | Minor issue, workaround available | Next business day | UI glitches, non-critical feature broken |
## Detection

Incidents are detected through:

- Grafana alerts for error-rate spikes, latency increases, and pod restarts
- Health endpoint monitoring
- Agent connectivity SLO tracking
- Customer support tickets
- MCP investigations that surface systemic issues:
  - `MCP: get_system_health`
  - `MCP: get_fleet_summary`
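The health-endpoint check above can be sketched as a simple probe loop. The endpoint URL and the three-strikes alert threshold here are assumptions for illustration, not the actual monitoring configuration.

```sh
#!/bin/sh
# Health-probe sketch. The URL is a placeholder; a real deployment would
# point this at whatever /health route the service exposes.
probe() {
  # in production: curl -sf --max-time 5 "https://pulsestream.example/health"
  true  # stubbed so the sketch runs standalone
}

fails=0
for i in 1 2 3; do
  probe || fails=$((fails + 1))
done

if [ "$fails" -ge 3 ]; then
  echo "ALERT: health endpoint down"
else
  echo "healthy ($fails/3 probes failed)"
fi
```

In practice the alerting side of this (paging on repeated failures) would live in the monitoring stack rather than a script.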
## Triage

1. Assess scope: How many customers are affected?

   `MCP: get_fleet_summary`

2. Identify component: Which service is degraded?

   `MCP: get_system_health`

3. Check recent changes: Was anything deployed recently?

   ```sh
   # Check deploy history
   helm history pulsestream-core -n pulsestream-core-prod
   helm history pulsestream-agent -n pulsestream-agent-prod
   ```

4. Declare severity and notify.

## Mitigation

| Action | Command | When |
| --- | --- | --- |
| Rollback deploy | `helm rollback <release> <rev> -n <ns>` | Bad deploy caused the issue |
| Scale down | `pulsar scale down --env prod` | Need to stop the bleeding |
| Restart service | `kubectl rollout restart deployment/<svc> -n <ns>` | Stuck process / memory leak |
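A thin wrapper around the rollback command from the table can guard against a missing revision argument. This is a sketch: the `HELM=echo helm` default makes it a dry run that only prints the command, and the revision number in the example is hypothetical.

```sh
#!/bin/sh
# Rollback wrapper sketch. Defaults to a dry run that echoes the helm
# command; set HELM=helm to execute for real.
HELM="${HELM:-echo helm}"

rollback() {
  release="$1"; ns="$2"; rev="$3"
  if [ -z "$rev" ]; then
    echo "usage: rollback <release> <namespace> <revision>" >&2
    return 1
  fi
  $HELM rollback "$release" "$rev" -n "$ns"
}

# Dry-run example (revision 41 is hypothetical; find the real one
# with `helm history <release> -n <namespace>`):
rollback pulsestream-core pulsestream-core-prod 41
```

Requiring an explicit revision avoids accidentally rolling back to "previous", which may itself be a bad deploy during a fast-moving incident.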
## Provider outage

If an upstream provider is down:

1. Check the provider's status page.
2. The analysis engine will queue work with exponential backoff.
3. Customer-facing impact: analyses will complete once the provider recovers.
4. No PulseStream action is needed unless the outage exceeds 1 hour.
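The backoff behavior in step 2 can be illustrated with a short loop. The base delay, doubling factor, and 60-second cap here are illustrative assumptions, not the engine's actual tuning.

```sh
#!/bin/sh
# Exponential-backoff schedule sketch: the retry delay doubles on each
# attempt and is capped at $cap seconds. Values are illustrative only.
delay=1
cap=60
for attempt in 1 2 3 4 5 6 7; do
  echo "attempt $attempt: retry in ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt "$cap" ] && delay=$cap
done
```

The cap keeps a long provider outage from pushing retries out indefinitely, so queued analyses resume within about a minute of recovery.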

## Postmortem

After every SEV1/SEV2:

1. Document the timeline (when detected, when mitigated, when resolved).
2. Perform a root cause analysis.
3. Record action items to prevent recurrence.
4. Update runbooks if a new failure mode was discovered.
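The four items above can be scaffolded into a starting template. The section layout is a suggested convention, not an established PulseStream format.

```sh
#!/bin/sh
# Emit a postmortem skeleton covering the timeline, root cause,
# action items, and runbook updates listed above.
postmortem_template() {
  cat <<'EOF'
# Postmortem: <incident title>

## Timeline
- Detected:
- Mitigated:
- Resolved:

## Root cause

## Action items

## Runbook updates
EOF
}

postmortem_template
```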