Severity levels
| Level | Examples | Response time |
|---|---|---|
| P0 — Critical | Node down, tombstoned, signing failure | Immediate |
| P1 — High | Jailed, catching up, low peers | Within 1 hour |
| P2 — Medium | High memory/CPU, missed blocks trending up | Within 4 hours |
| P3 — Low | Non-critical log errors, configuration drift | Next maintenance window |
P0: Node not signing blocks
Detect:
- Check if the service is running:
  sudo systemctl status autheod
- If stopped, restart:
  sudo systemctl start autheod
- Check logs:
  sudo journalctl -u autheod -n 200 --no-pager
- Check sync status:
  autheod status | jq '.SyncInfo'
- If out of sync, restore from snapshot (see Backups and restore)
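The sync check in the last two steps can be scripted. The following is a minimal sketch, assuming the status output is JSON containing a boolean `catching_up` field (the field named in the "Node not syncing" section). The helper name `is_catching_up` is hypothetical, and it uses `grep` rather than `jq` only so that it runs without extra tools.

```shell
#!/bin/sh
# Sketch: decide from status JSON whether the node is still catching up.
# Assumption: the JSON contains a boolean "catching_up" field.
# is_catching_up is a hypothetical helper name, not part of autheod.
is_catching_up() {
    # Reads JSON on stdin; exit status 0 means "catching_up": true was found.
    grep -Eq '"catching_up"[[:space:]]*:[[:space:]]*true'
}

# Example use on a live node (not executed here):
#   if autheod status | is_catching_up; then
#       echo "node is out of sync; consider restoring from snapshot"
#   fi
```

The function only inspects text, so it can be exercised against a saved status dump before being wired into an alert.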
P0: Validator tombstoned
Detect: the validator's signing info shows tombstoned: true if the validator is tombstoned.
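The tombstone check can be scripted along these lines. This is a sketch under assumptions: the command in the comment follows the usual Cosmos SDK slashing query and is not taken from this runbook, so verify it against your binary's help output. `is_tombstoned` is a hypothetical helper that only inspects text on stdin.

```shell
#!/bin/sh
# is_tombstoned: hypothetical helper; reads signing info on stdin.
# Matches both YAML (tombstoned: true) and JSON ("tombstoned": true) output.
is_tombstoned() {
    grep -Eq '"?tombstoned"?[[:space:]]*:[[:space:]]*true'
}

# Example use (ASSUMED command, check `autheod query slashing --help` first):
#   autheod query slashing signing-info "$(autheod tendermint show-validator)" \
#       | is_tombstoned && echo "validator is tombstoned"
```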
Response:
Tombstoning is permanent — it results from double-signing and cannot be undone.
- Stop the tombstoned node immediately
- Do not attempt unjail — it will fail
- Commission a new server
- Generate a new consensus key:
  autheod init new-validator --chain-id autheo_2127-1
- Register a new validator with MsgCreateValidator and a new consensus key
- Bind your Sovereign license to the new validator address
P1: Validator jailed (liveness)
Detect:
Diagnose the cause
Check why blocks were missed: look for crashes, restarts, or network interruptions in the logs.
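A minimal sketch of such a log scan follows. The helper names `classify_log_line` and `scan_logs` are hypothetical, and the match patterns are illustrative guesses at common failure messages (disk full, out of memory, Go panic), not an official or exhaustive list. Adjust them to what your node actually logs.

```shell
#!/bin/sh
# classify_log_line: hypothetical helper mapping one log line to a likely cause.
# The patterns below are illustrative assumptions, not an exhaustive list.
classify_log_line() {
    case "$1" in
        *"No space left on device"*|*"no space left on device"*) echo "disk-full" ;;
        *"Out of memory"*|*"out of memory"*|*"oom-kill"*) echo "oom" ;;
        *"panic:"*) echo "crash" ;;
        *) echo "unknown" ;;
    esac
}

# scan_logs: reads log lines on stdin and prints a count per suspected cause.
scan_logs() {
    while IFS= read -r line; do
        classify_log_line "$line"
    done | grep -v '^unknown$' | sort | uniq -c
}

# Example use with the log command from this runbook (not executed here):
#   sudo journalctl -u autheod -n 200 --no-pager | scan_logs
```

An empty result does not prove the node was healthy; it only means none of the assumed patterns matched, so read the raw logs as well.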
Fix the root cause
Resolve the underlying issue before unjailing: disk full, OOM, misconfiguration, etc.
P1: Node not syncing (catching_up: true)
Detect:
- Check peer count:
  curl -s localhost:26657/net_info | jq '.result.n_peers'
- If peers < 3, add persistent peers in config/config.toml
- If syncing is extremely slow (hours behind), restore from snapshot (see Backups and restore)
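The peer threshold from the steps above can be turned into a small check. This is a sketch only: `peers_too_low` and `MIN_PEERS` are hypothetical names. The example in the comments assumes `jq -r` is used so that the peer count arrives without surrounding quotes.

```shell
#!/bin/sh
# MIN_PEERS mirrors the "peers < 3" rule from the steps above.
MIN_PEERS=3

# peers_too_low: hypothetical helper.
# Exit status 0 when the count is below MIN_PEERS, 1 when it is not,
# and 2 when the argument is not a non-negative integer.
peers_too_low() {
    count=$1
    case "$count" in
        ''|*[!0-9]*) echo "not a peer count: '$count'" >&2; return 2 ;;
    esac
    [ "$count" -lt "$MIN_PEERS" ]
}

# Example use on a live node (not executed here):
#   n=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers')
#   if peers_too_low "$n"; then
#       echo "only $n peers; add persistent peers in config/config.toml"
#   fi
```

Rejecting non-numeric input explicitly keeps an unreachable RPC endpoint (empty output) from being mistaken for a healthy peer count.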
P1: Hardware failure — migrate to new host
See Runbook B: Full hardware failure. The critical rule: confirm the old host is completely powered off before starting the new host with the same consensus key. Two hosts signing with the same key risk double-signing, which results in permanent tombstoning.
Post-incident review
After every P0 or P1 incident:
- Document the timeline (when detected, when resolved)
- Identify the root cause
- Update monitoring thresholds if the incident was not caught early enough
- Review the monitoring checklist
- Update runbooks if the incident revealed a gap