SPEC-0016: Session-Based Escalation with Structured Handoff
Overview
Replace in-process Task tool escalation with a file-based handoff mechanism where each tier runs as a separate Claude CLI process with its own session record. The Go supervisor controls escalation decisions, reads structured handoff files written by exiting tiers, and spawns the next tier as a new CLI process with handoff context injected via --append-system-prompt. See ADR-0016.
Requirements
Requirement: Handoff File Format
The handoff file MUST be a JSON file written to $CLAUDEOPS_STATE_DIR/handoff.json. The file SHALL conform to a versioned schema so that future changes can be detected and handled. The supervisor MUST reject handoff files with an unrecognized schema version.
Scenario: Tier 1 writes handoff for Tier 2
- WHEN Tier 1 detects one or more unhealthy services during observation
- THEN it SHALL write a JSON file to $CLAUDEOPS_STATE_DIR/handoff.json containing:
  - schema_version set to 1
  - recommended_tier set to 2
  - services_affected as a non-empty array of service name strings
  - check_results as a non-empty array of check result objects
  - cooldown_state as an object with a snapshot of relevant cooldown data
- THEN it SHALL exit with code 0
Scenario: Tier 2 writes handoff for Tier 3
- WHEN Tier 2 cannot resolve the issue with safe remediation
- THEN it SHALL write a JSON file to $CLAUDEOPS_STATE_DIR/handoff.json containing:
  - schema_version set to 1
  - recommended_tier set to 3
  - services_affected as a non-empty array of service name strings
  - check_results from the original Tier 1 handoff
  - investigation_findings as a non-empty string describing root cause analysis
  - remediation_attempted as a non-empty string describing what was tried and why it failed
  - cooldown_state as an object with updated cooldown data
- THEN it SHALL exit with code 0
Scenario: Handoff file not written when all services healthy
- WHEN Tier 1 completes observation and all services are healthy
- THEN it MUST NOT write a handoff file
- THEN it SHALL exit with code 0
Scenario: Handoff file not written when Tier 2 resolves the issue
- WHEN Tier 2 successfully remediates all affected services
- THEN it MUST NOT write a handoff file
- THEN it SHALL exit with code 0
Scenario: Check result object structure
- WHEN a check result is included in the handoff file
- THEN it MUST contain the fields: service (string), check_type (string, one of http, dns, container, database, service), status (string, one of healthy, degraded, down), and error (string)
- THEN it MAY contain the OPTIONAL field response_time_ms (integer)
Scenario: Invalid handoff file rejected
- WHEN the supervisor reads a handoff file that is missing REQUIRED fields or has an unrecognized schema_version
- THEN it MUST log a warning, delete the file, and NOT spawn a next-tier process
- THEN it MUST record the parse failure as an event with level critical
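The format scenarios above can be sketched as Go types plus a validator. This is a minimal sketch, not the supervisor's actual code: ValidateHandoff is named elsewhere in this spec but its signature is not, so the signature here, the ParseHandoff helper, and the Go field names are all assumptions. The shape of cooldown_state is not defined in this section, so it is kept as an untyped object.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CheckResult mirrors one entry of check_results.
type CheckResult struct {
	Service        string `json:"service"`
	CheckType      string `json:"check_type"`
	Status         string `json:"status"`
	Error          string `json:"error"`
	ResponseTimeMs *int   `json:"response_time_ms,omitempty"` // OPTIONAL field
}

// Handoff mirrors handoff.json. The two investigation fields stay empty in a
// Tier 1 handoff and are required only when recommended_tier is 3.
type Handoff struct {
	SchemaVersion         int                    `json:"schema_version"`
	RecommendedTier       int                    `json:"recommended_tier"`
	ServicesAffected      []string               `json:"services_affected"`
	CheckResults          []CheckResult          `json:"check_results"`
	InvestigationFindings string                 `json:"investigation_findings,omitempty"`
	RemediationAttempted  string                 `json:"remediation_attempted,omitempty"`
	CooldownState         map[string]interface{} `json:"cooldown_state"`
}

var (
	checkTypes = map[string]bool{"http": true, "dns": true, "container": true, "database": true, "service": true}
	statuses   = map[string]bool{"healthy": true, "degraded": true, "down": true}
)

// ValidateHandoff (assumed signature) rejects unknown schema versions and
// missing required fields.
func ValidateHandoff(h *Handoff) error {
	if h.SchemaVersion != 1 {
		return fmt.Errorf("unrecognized schema_version %d", h.SchemaVersion)
	}
	if h.RecommendedTier != 2 && h.RecommendedTier != 3 {
		return fmt.Errorf("recommended_tier must be 2 or 3, got %d", h.RecommendedTier)
	}
	if len(h.ServicesAffected) == 0 {
		return errors.New("services_affected must be a non-empty array")
	}
	if len(h.CheckResults) == 0 {
		return errors.New("check_results must be a non-empty array")
	}
	for i, c := range h.CheckResults {
		if c.Service == "" || !checkTypes[c.CheckType] || !statuses[c.Status] {
			return fmt.Errorf("check_results[%d] is malformed", i)
		}
	}
	if h.CooldownState == nil {
		return errors.New("cooldown_state is missing")
	}
	if h.RecommendedTier == 3 && (h.InvestigationFindings == "" || h.RemediationAttempted == "") {
		return errors.New("tier 3 handoff needs investigation_findings and remediation_attempted")
	}
	return nil
}

// ParseHandoff (hypothetical helper) decodes raw file contents and validates them.
func ParseHandoff(data []byte) (*Handoff, error) {
	var h Handoff
	if err := json.Unmarshal(data, &h); err != nil {
		return nil, fmt.Errorf("decode handoff: %w", err)
	}
	if err := ValidateHandoff(&h); err != nil {
		return nil, err
	}
	return &h, nil
}

func main() {
	// Illustrative Tier 1 handoff; the service name and error text are made up.
	sample := []byte(`{
	  "schema_version": 1,
	  "recommended_tier": 2,
	  "services_affected": ["api"],
	  "check_results": [
	    {"service": "api", "check_type": "http", "status": "down", "error": "timeout"}
	  ],
	  "cooldown_state": {}
	}`)
	h, err := ParseHandoff(sample)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("escalate to tier", h.RecommendedTier, "for", h.ServicesAffected)
}
```

Because Go's JSON decoder cannot distinguish an absent string from an empty one, this sketch does not check that error is present on every check result; a stricter validator would need pointer fields or a raw-map pass.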
Requirement: Handoff File Lifecycle
The handoff file MUST have a well-defined lifecycle: written by one tier, read and deleted by the supervisor. The file MUST NOT persist across monitoring cycles.
Scenario: Supervisor reads and deletes handoff after Tier 1
- WHEN the Tier 1 CLI process exits with code 0
- THEN the supervisor SHALL check for the existence of $CLAUDEOPS_STATE_DIR/handoff.json
- THEN if the file exists, the supervisor SHALL read its contents, validate the schema, and delete the file before spawning the next tier
Scenario: Supervisor reads and deletes handoff after Tier 2
- WHEN the Tier 2 CLI process exits with code 0
- THEN the supervisor SHALL check for the existence of $CLAUDEOPS_STATE_DIR/handoff.json
- THEN if the file exists and recommended_tier is 3, the supervisor SHALL read, validate, and delete the file before spawning Tier 3
Scenario: Stale handoff file from previous cycle
- WHEN the supervisor starts a new monitoring cycle and a handoff file already exists from a previous cycle (e.g., supervisor crashed between read and delete)
- THEN the Tier 1 process SHALL overwrite the stale file if it needs to escalate, or the supervisor MAY delete any pre-existing handoff file before starting Tier 1
Scenario: Handoff file deleted on supervisor crash recovery
- WHEN the supervisor restarts after an unexpected termination
- THEN it SHOULD delete any existing handoff file at $CLAUDEOPS_STATE_DIR/handoff.json before beginning its first monitoring cycle to avoid processing stale escalation data
Scenario: CLI process exits with non-zero code
- WHEN a tier's CLI process exits with a non-zero exit code
- THEN the supervisor MUST NOT read or act on any handoff file that may exist
- THEN the supervisor SHALL record the session as failed and proceed to the next scheduled cycle
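The lifecycle rules above (ignore the file on a non-zero exit, read then delete on exit 0, clear stale files before a cycle) might look roughly like this in the supervisor. The function names are hypothetical, and the sketch deletes the file immediately after reading it, leaving validation to run on the returned bytes; both steps still happen before any next tier is spawned.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const handoffName = "handoff.json"

// clearStaleHandoff removes a leftover file, e.g. on crash recovery or before
// Tier 1 starts. A missing file is not an error.
func clearStaleHandoff(stateDir string) error {
	err := os.Remove(filepath.Join(stateDir, handoffName))
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("clear stale handoff: %w", err)
	}
	return nil
}

// consumeHandoff returns the raw handoff bytes and deletes the file.
// It returns (nil, nil) when there is nothing to act on.
func consumeHandoff(stateDir string, exitCode int) ([]byte, error) {
	if exitCode != 0 {
		// Failed tier: never read or act on the file. It is left in place
		// here and removed by clearStaleHandoff before the next cycle.
		return nil, nil
	}
	path := filepath.Join(stateDir, handoffName)
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // tier finished without requesting escalation
	}
	if err != nil {
		return nil, fmt.Errorf("read handoff: %w", err)
	}
	if err := os.Remove(path); err != nil {
		return nil, fmt.Errorf("delete handoff: %w", err)
	}
	return data, nil
}

// demoLifecycle exercises both helpers in a throwaway directory.
func demoLifecycle() error {
	dir, err := os.MkdirTemp("", "claudeops-state-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, handoffName)
	if err := os.WriteFile(path, []byte(`{"schema_version":1}`), 0o600); err != nil {
		return err
	}
	if data, err := consumeHandoff(dir, 1); err != nil || data != nil {
		return errors.New("handoff was read after a failed tier")
	}
	data, err := consumeHandoff(dir, 0)
	if err != nil {
		return err
	}
	if data == nil {
		return errors.New("handoff was not read after a successful tier")
	}
	if _, statErr := os.Stat(path); !errors.Is(statErr, fs.ErrNotExist) {
		return errors.New("handoff was not deleted after reading")
	}
	return clearStaleHandoff(dir) // nothing left to clear; still succeeds
}

func main() {
	if err := demoLifecycle(); err != nil {
		fmt.Println("lifecycle check failed:", err)
		os.Exit(1)
	}
	fmt.Println("lifecycle ok")
}
```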
Requirement: Supervisor Escalation Logic
The Go supervisor MUST control all escalation decisions. The LLM tiers MUST NOT use the Task tool for escalation. The supervisor SHALL enforce policy checks before spawning each tier.
Scenario: Supervisor spawns Tier 2 from handoff
- WHEN the supervisor reads a valid handoff file with recommended_tier equal to 2 after Tier 1 exits
- THEN it SHALL create a new session record in the database with tier=2, the configured Tier 2 model, and parent_session_id set to the Tier 1 session ID
- THEN it SHALL spawn a new claude CLI process with --model set to the Tier 2 model, -p set to the Tier 2 prompt file content, and --append-system-prompt containing the serialized handoff context
Scenario: Supervisor spawns Tier 3 from handoff
- WHEN the supervisor reads a valid handoff file with recommended_tier equal to 3 after Tier 2 exits
- THEN it SHALL create a new session record in the database with tier=3, the configured Tier 3 model, and parent_session_id set to the Tier 2 session ID
- THEN it SHALL spawn a new claude CLI process with --model set to the Tier 3 model, -p set to the Tier 3 prompt file content, and --append-system-prompt containing the serialized handoff context
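As an illustration of the two spawn scenarios, the argument list for a next-tier process could be assembled as below. The three flags come from this spec; TierConfig, buildTierArgs, tierCommand, and the placeholder model name are hypothetical. The command is only constructed here, not started.

```go
package main

import (
	"fmt"
	"os/exec"
)

// TierConfig holds per-tier settings; the field names are illustrative.
type TierConfig struct {
	Tier   int
	Model  string
	Prompt string // content of the tier's prompt file
}

// buildTierArgs returns the CLI arguments for one tier. Tier 1 has no
// handoff, so an empty context omits --append-system-prompt.
func buildTierArgs(cfg TierConfig, handoffContext string) ([]string, error) {
	if cfg.Model == "" || cfg.Prompt == "" {
		return nil, fmt.Errorf("tier %d is missing a model or prompt", cfg.Tier)
	}
	args := []string{"--model", cfg.Model, "-p", cfg.Prompt}
	if handoffContext != "" {
		args = append(args, "--append-system-prompt", handoffContext)
	}
	return args, nil
}

// tierCommand wraps the arguments in an exec.Cmd for the claude binary.
func tierCommand(cfg TierConfig, handoffContext string) (*exec.Cmd, error) {
	args, err := buildTierArgs(cfg, handoffContext)
	if err != nil {
		return nil, err
	}
	return exec.Command("claude", args...), nil
}

func main() {
	cfg := TierConfig{Tier: 2, Model: "tier2-model-placeholder", Prompt: "contents of the Tier 2 prompt file"}
	cmd, err := tierCommand(cfg, "## Escalation Context\n...")
	if err != nil {
		fmt.Println("cannot build command:", err)
		return
	}
	fmt.Println(cmd.Args) // built but not started
}
```

Passing the context as a separate argv element, rather than through a shell string, avoids quoting problems when the handoff contains newlines or quotes.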
Scenario: Dry-run mode prevents escalation
- WHEN CLAUDEOPS_DRY_RUN is true and a handoff file requests escalation to Tier 2 or Tier 3
- THEN the supervisor MUST NOT spawn a next-tier process
- THEN the supervisor SHALL log that escalation was suppressed due to dry-run mode
- THEN the supervisor SHALL delete the handoff file
Scenario: Maximum tier limit enforced
- WHEN a handoff file requests escalation to a tier higher than the configured maximum tier (e.g., recommended_tier=3 but max tier is 2)
- THEN the supervisor MUST NOT spawn the requested tier
- THEN the supervisor SHALL log that escalation was blocked by the tier limit
- THEN the supervisor SHALL send a notification via Apprise indicating the issue requires human attention
Scenario: No handoff file after session exit
- WHEN a tier's CLI process exits with code 0 and no handoff file exists
- THEN the supervisor SHALL finalize the session record and return to the normal scheduling loop
- THEN no further escalation SHALL occur for this monitoring cycle
Scenario: Escalation chain terminates at Tier 3
- WHEN Tier 3 exits (regardless of whether it wrote a handoff file)
- THEN the supervisor MUST NOT spawn any further tiers
- THEN if Tier 3 wrote a handoff file, the supervisor SHALL log a warning, delete the file, and treat it as an unresolvable issue requiring human attention
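The policy checks in this requirement can be condensed into one pure decision function, sketched below. The Action names and the decide signature are hypothetical, and the spec does not say whether the dry-run check or the tier-limit check takes precedence when both apply; this sketch checks dry-run first. It also assumes the handoff has already passed validation.

```go
package main

import "fmt"

// Action is what the supervisor does after a tier exits with code 0.
type Action int

const (
	Finalize       Action = iota // no handoff: finalize session, back to scheduling
	SpawnNextTier                // create child session and spawn the next tier
	SuppressDryRun               // log suppression, delete handoff
	BlockTierLimit               // log, notify a human via Apprise
	ChainExhausted               // Tier 3 wrote a handoff: warn, needs a human
)

func (a Action) String() string {
	names := []string{"finalize", "spawn-next-tier", "suppress-dry-run", "block-tier-limit", "chain-exhausted"}
	if int(a) < 0 || int(a) >= len(names) {
		return fmt.Sprintf("Action(%d)", int(a))
	}
	return names[a]
}

// decide maps the state after a successful tier exit to an Action.
func decide(currentTier int, hasHandoff bool, recommendedTier, maxTier int, dryRun bool) Action {
	switch {
	case !hasHandoff:
		return Finalize
	case currentTier >= 3:
		return ChainExhausted // the chain always terminates at Tier 3
	case dryRun:
		return SuppressDryRun
	case recommendedTier > maxTier:
		return BlockTierLimit
	default:
		return SpawnNextTier
	}
}

func main() {
	fmt.Println(decide(1, true, 2, 3, false)) // normal Tier 1 to Tier 2 escalation
	fmt.Println(decide(2, true, 3, 2, false)) // Tier 3 requested, max tier is 2
	fmt.Println(decide(3, true, 3, 3, false)) // Tier 3 wrote a handoff
}
```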
Requirement: Database Schema for Escalation Chains
The sessions table MUST include a parent_session_id column to link escalated sessions to their parent. This enables the dashboard to query and display escalation chains.
Scenario: Parent session ID column added
- WHEN the database migration for this feature runs
- THEN a new column parent_session_id INTEGER REFERENCES sessions(id) SHALL be added to the sessions table
- THEN the column MUST allow NULL values (Tier 1 sessions have no parent)
Scenario: Tier 1 session has no parent
- WHEN the supervisor creates a session record for a Tier 1 run
- THEN the parent_session_id column SHALL be NULL
Scenario: Tier 2 session links to Tier 1 parent
- WHEN the supervisor creates a session record for a Tier 2 escalation
- THEN the parent_session_id column SHALL be set to the ID of the Tier 1 session that produced the handoff file
Scenario: Tier 3 session links to Tier 2 parent
- WHEN the supervisor creates a session record for a Tier 3 escalation
- THEN the parent_session_id column SHALL be set to the ID of the Tier 2 session that produced the handoff file
Scenario: Full escalation chain queryable
- WHEN a monitoring cycle involves all three tiers (Session #18 Tier 1, Session #19 Tier 2, Session #20 Tier 3)
- THEN Session #19 parent_session_id SHALL equal Session #18 id
- THEN Session #20 parent_session_id SHALL equal Session #19 id
- THEN querying the chain from any session in the chain SHALL return all linked sessions
Scenario: Index on parent_session_id
- WHEN the migration runs
- THEN an index SHOULD be created on parent_session_id to support efficient chain lookups
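To make the chain semantics concrete, here is an in-memory sketch of "query the chain from any session": walk parent links up to the root, then follow child links down. It models rows as Go structs instead of querying a database, assumes each session has at most one child, and the migration string is a SQLite-style guess (the index name is hypothetical; the spec only fixes the column definition).

```go
package main

import "fmt"

// migrationSQL is an assumed SQLite-flavoured form of the migration.
const migrationSQL = `ALTER TABLE sessions ADD COLUMN parent_session_id INTEGER REFERENCES sessions(id);
CREATE INDEX IF NOT EXISTS idx_sessions_parent_session_id ON sessions(parent_session_id);`

// Session is a reduced view of one row of the sessions table.
type Session struct {
	ID       int64
	Tier     int
	ParentID *int64 // nil for Tier 1 sessions
}

// chainFor returns the full escalation chain, root first, that contains id.
func chainFor(sessions []Session, id int64) ([]Session, error) {
	byID := make(map[int64]Session, len(sessions))
	childOf := make(map[int64]int64) // parent ID -> child ID
	for _, s := range sessions {
		byID[s.ID] = s
		if s.ParentID != nil {
			childOf[*s.ParentID] = s.ID
		}
	}
	cur, ok := byID[id]
	if !ok {
		return nil, fmt.Errorf("session %d not found", id)
	}
	// Walk up to the root; the step counter guards against cyclic data.
	for steps := 0; cur.ParentID != nil; steps++ {
		parent, found := byID[*cur.ParentID]
		if !found || steps > len(sessions) {
			return nil, fmt.Errorf("broken chain at session %d", cur.ID)
		}
		cur = parent
	}
	// Walk down from the root, one child at a time.
	chain := []Session{cur}
	for len(chain) <= len(sessions) {
		childID, found := childOf[chain[len(chain)-1].ID]
		if !found {
			break
		}
		chain = append(chain, byID[childID])
	}
	return chain, nil
}

func main() {
	p18, p19 := int64(18), int64(19)
	sessions := []Session{
		{ID: 18, Tier: 1},
		{ID: 19, Tier: 2, ParentID: &p18},
		{ID: 20, Tier: 3, ParentID: &p19},
	}
	chain, err := chainFor(sessions, 19)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, s := range chain {
		fmt.Printf("session #%d (tier %d)\n", s.ID, s.Tier)
	}
	fmt.Println(migrationSQL)
}
```

In a real database the same traversal would more likely be a recursive query; the index on parent_session_id is what keeps the downward step cheap.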
Requirement: Tier Prompt Changes
The Tier 1 and Tier 2 prompt files MUST be updated to instruct the agent to write handoff files instead of using the Task tool for escalation. The Task tool MUST be removed from the allowed tools list for Tier 1 and Tier 2.
Scenario: Tier 1 prompt writes handoff instead of spawning Task
- WHEN the Tier 1 agent detects unhealthy services
- THEN the prompt SHALL instruct it to write a handoff JSON file to $CLAUDEOPS_STATE_DIR/handoff.json with the required fields and exit
- THEN the prompt MUST NOT contain instructions to use the Task tool for escalation
Scenario: Tier 2 prompt writes handoff instead of spawning Task
- WHEN the Tier 2 agent determines it cannot resolve the issue
- THEN the prompt SHALL instruct it to write a handoff JSON file to $CLAUDEOPS_STATE_DIR/handoff.json with investigation findings and exit
- THEN the prompt MUST NOT contain instructions to use the Task tool for escalation
Scenario: Tier 2 receives handoff context via system prompt
- WHEN the supervisor spawns a Tier 2 CLI process
- THEN the handoff context (check results, affected services, cooldown state) SHALL be injected via the --append-system-prompt flag
- THEN the Tier 2 agent SHALL be able to parse and act on the handoff context without re-running health checks
Scenario: Tier 3 receives handoff context via system prompt
- WHEN the supervisor spawns a Tier 3 CLI process
- THEN the handoff context (check results, investigation findings, remediation attempted, cooldown state) SHALL be injected via the --append-system-prompt flag
Scenario: Task tool removed from allowed tools
- WHEN the supervisor builds the --allowedTools flag for Tier 1 or Tier 2
- THEN the Task tool MUST NOT be included in the allowed tools list
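One way the supervisor might enforce the last scenario is to filter the configured tool list when it builds the flag. The function name is hypothetical, the tool names other than Task are placeholders, and joining the list with commas is an assumption about how the value is passed. The spec only covers Tier 1 and Tier 2, so this sketch leaves a Tier 3 list untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedToolsArgs returns the --allowedTools flag and its value for a tier,
// dropping Task for Tier 1 and Tier 2.
func allowedToolsArgs(tier int, configured []string) ([]string, error) {
	if tier < 1 || tier > 3 {
		return nil, fmt.Errorf("unknown tier %d", tier)
	}
	tools := make([]string, 0, len(configured))
	for _, t := range configured {
		if t == "Task" && tier < 3 {
			continue // escalation is the supervisor's job, not the agent's
		}
		tools = append(tools, t)
	}
	return []string{"--allowedTools", strings.Join(tools, ",")}, nil
}

func main() {
	args, err := allowedToolsArgs(1, []string{"Bash", "Task", "Read"})
	if err != nil {
		fmt.Println("cannot build flag:", err)
		return
	}
	fmt.Println(args)
}
```

Filtering at flag-build time means a stale configuration that still lists Task cannot reintroduce in-process escalation.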
Requirement: Dashboard Escalation Chain Display
The dashboard MUST display escalation chains so the operator can see the relationship between sessions in a monitoring cycle that involved multiple tiers.
Scenario: Session detail shows parent link
- WHEN the operator views a Tier 2 or Tier 3 session at /sessions/{id}
- THEN the session detail page SHALL display a link to the parent session labeled "Escalated from Session #{parent_id} (Tier {parent_tier})"
Scenario: Session detail shows child link
- WHEN the operator views a Tier 1 or Tier 2 session that triggered escalation
- THEN the session detail page SHALL display a link to the child session labeled "Escalated to Session #{child_id} (Tier {child_tier})"
Scenario: Sessions list shows escalation indicator
- WHEN the operator views the sessions list at /sessions
- THEN sessions that are part of an escalation chain SHALL display a visual indicator (e.g., chain icon or indentation) showing their relationship
Scenario: Escalation chain cost rollup
- WHEN the operator views a session that is the root of an escalation chain
- THEN the session detail page SHOULD display both the individual session cost and the total chain cost (sum of all sessions in the chain)
Requirement: Per-Tier Cost Attribution
Each tier MUST have its own session record with accurate cost, duration, and turn count metrics. The dashboard MUST display per-tier costs both individually and as part of the chain.
Scenario: Tier 1 cost recorded independently
- WHEN a Tier 1 session completes
- THEN the session record SHALL contain the cost_usd, num_turns, and duration_ms values from the Tier 1 CLI process result event only
Scenario: Tier 2 cost recorded independently
- WHEN a Tier 2 session completes
- THEN the session record SHALL contain the cost_usd, num_turns, and duration_ms values from the Tier 2 CLI process result event only, not including Tier 1 costs
Scenario: Chain cost breakdown visible
- WHEN the operator views an escalation chain in the dashboard
- THEN each tier's cost, duration, and turn count SHALL be displayed individually
- THEN the total chain cost (sum of all tiers) SHALL be displayed as a summary
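Since every tier keeps its own record, the chain total is a plain sum over the linked sessions. A small sketch, with hypothetical type and function names and made-up figures:

```go
package main

import "fmt"

// SessionMetrics holds the per-session values taken from one tier's result event.
type SessionMetrics struct {
	SessionID  int64
	Tier       int
	CostUSD    float64
	NumTurns   int
	DurationMs int64
}

// chainCostUSD sums the cost of every session in the chain.
func chainCostUSD(chain []SessionMetrics) float64 {
	total := 0.0
	for _, s := range chain {
		total += s.CostUSD
	}
	return total
}

// breakdown renders one line per tier followed by the chain total.
func breakdown(chain []SessionMetrics) []string {
	lines := make([]string, 0, len(chain)+1)
	for _, s := range chain {
		lines = append(lines, fmt.Sprintf("Tier %d (session #%d): $%.4f, %d turns, %d ms",
			s.Tier, s.SessionID, s.CostUSD, s.NumTurns, s.DurationMs))
	}
	lines = append(lines, fmt.Sprintf("Chain total: $%.4f", chainCostUSD(chain)))
	return lines
}

func main() {
	chain := []SessionMetrics{
		{SessionID: 18, Tier: 1, CostUSD: 0.25, NumTurns: 6, DurationMs: 42000},
		{SessionID: 19, Tier: 2, CostUSD: 0.5, NumTurns: 14, DurationMs: 95000},
		{SessionID: 20, Tier: 3, CostUSD: 1.25, NumTurns: 21, DurationMs: 180000},
	}
	for _, line := range breakdown(chain) {
		fmt.Println(line)
	}
}
```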
Requirement: Handoff Validation Events
The supervisor MUST emit a database event linked to the completed tier's session ID whenever the escalation chain is interrupted. This makes silent failures visible in the dashboard activity feed.
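The scenarios under this requirement fix a level and a message format for each kind of interruption. A minimal sketch of helpers that produce those events follows; the Event type and function names are hypothetical, and only the level values and message formats are taken from this spec.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a reduced view of a database event row.
type Event struct {
	SessionID int64
	Level     string
	Message   string
}

// readFailureEvent covers a handoff file that could not be read or parsed.
func readFailureEvent(sessionID int64, tier int, reason string) Event {
	return Event{
		SessionID: sessionID,
		Level:     "critical",
		Message:   fmt.Sprintf("Escalation blocked: could not read handoff from tier %d — %s", tier, reason),
	}
}

// validationFailureEvent covers a payload rejected by validation.
func validationFailureEvent(sessionID int64, tier int, reason string) Event {
	return Event{
		SessionID: sessionID,
		Level:     "critical",
		Message:   fmt.Sprintf("Escalation blocked: invalid handoff from tier %d — %s", tier, reason),
	}
}

// dryRunEvent covers an escalation suppressed by dry-run mode.
// targetTier is the tier that would have been spawned.
func dryRunEvent(sessionID int64, targetTier int, services []string) Event {
	return Event{
		SessionID: sessionID,
		Level:     "info",
		Message: fmt.Sprintf("Escalation suppressed (dry run): would have escalated to tier %d for: %s",
			targetTier, strings.Join(services, ", ")),
	}
}

func main() {
	fmt.Println(readFailureEvent(18, 1, "unexpected end of JSON input").Message)
	fmt.Println(validationFailureEvent(18, 1, "unrecognized schema_version 99").Message)
	fmt.Println(dryRunEvent(18, 2, []string{"api", "db"}).Message)
}
```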
Scenario: Critical event on handoff read failure
- WHEN the supervisor cannot read or parse the handoff file after a tier exits
- THEN it MUST emit an event with level='critical' linked to that tier's session ID
- THEN the event message MUST describe the reason using the format "Escalation blocked: could not read handoff from tier N — <error>"
Scenario: Critical event on handoff validation failure
- WHEN ValidateHandoff rejects the handoff payload (e.g., unrecognized schema_version, missing required fields)
- THEN the supervisor MUST emit an event with level='critical' linked to that tier's session ID
- THEN the event message MUST describe the reason using the format "Escalation blocked: invalid handoff from tier N — <error>"
Scenario: Info event when dry-run suppresses escalation
- WHEN CLAUDEOPS_DRY_RUN is true and a valid handoff requests escalation to Tier 2 or higher
- THEN the supervisor SHOULD emit an event with level='info' linked to the current tier's session ID
- THEN the event message SHOULD use the format "Escalation suppressed (dry run): would have escalated to tier N for: service1, service2"
Requirement: Handoff Context Serialization
The handoff context injected into --append-system-prompt MUST preserve all information needed for the receiving tier to operate without re-running prior checks or investigations.
Scenario: Tier 2 receives complete failure context
- WHEN the supervisor injects handoff context into the Tier 2 system prompt
- THEN the injected text MUST include the full check_results array, services_affected list, and cooldown_state from the handoff file
- THEN the text MUST be clearly labeled (e.g., wrapped in a ## Escalation Context section) so the agent can parse it
Scenario: Tier 3 receives complete investigation context
- WHEN the supervisor injects handoff context into the Tier 3 system prompt
- THEN the injected text MUST include all fields from the Tier 2 handoff: check_results, services_affected, investigation_findings, remediation_attempted, and cooldown_state
Scenario: Handoff context size limit
- WHEN the serialized handoff context exceeds 50,000 characters
- THEN the supervisor SHOULD truncate the check_results array to include only results with status not equal to healthy, preserving the most critical information
- THEN the supervisor MUST log a warning that handoff context was truncated
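A possible shape for the serializer is sketched below: render the handoff as indented JSON under an Escalation Context heading, and if the result exceeds 50,000 characters, drop healthy check results and render again. The function names and the choice of JSON as the injected representation are assumptions; the spec fixes only the required fields, the labeling, and the size rule. What should happen if the text is still too long after filtering is not specified, so the sketch returns it as is.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

const maxContextChars = 50000

type CheckResult struct {
	Service        string `json:"service"`
	CheckType      string `json:"check_type"`
	Status         string `json:"status"`
	Error          string `json:"error"`
	ResponseTimeMs *int   `json:"response_time_ms,omitempty"`
}

type Handoff struct {
	SchemaVersion         int                    `json:"schema_version"`
	RecommendedTier       int                    `json:"recommended_tier"`
	ServicesAffected      []string               `json:"services_affected"`
	CheckResults          []CheckResult          `json:"check_results"`
	InvestigationFindings string                 `json:"investigation_findings,omitempty"`
	RemediationAttempted  string                 `json:"remediation_attempted,omitempty"`
	CooldownState         map[string]interface{} `json:"cooldown_state"`
}

// render wraps the JSON form of the handoff in a labeled section.
func render(h Handoff) (string, error) {
	body, err := json.MarshalIndent(h, "", "  ")
	if err != nil {
		return "", fmt.Errorf("serialize handoff: %w", err)
	}
	return "## Escalation Context\n\n" + string(body) + "\n", nil
}

// SerializeHandoff returns the text to inject and whether check_results was
// truncated. The caller is expected to log a warning when truncated is true.
func SerializeHandoff(h Handoff) (text string, truncated bool, err error) {
	text, err = render(h)
	if err != nil || utf8.RuneCountInString(text) <= maxContextChars {
		return text, false, err
	}
	kept := make([]CheckResult, 0, len(h.CheckResults))
	for _, c := range h.CheckResults {
		if c.Status != "healthy" {
			kept = append(kept, c)
		}
	}
	h.CheckResults = kept // h is a copy, so the caller's value is unchanged
	text, err = render(h)
	return text, true, err
}

func main() {
	h := Handoff{
		SchemaVersion:    1,
		RecommendedTier:  2,
		ServicesAffected: []string{"api"},
		CheckResults:     []CheckResult{{Service: "api", CheckType: "http", Status: "down", Error: "timeout"}},
		CooldownState:    map[string]interface{}{},
	}
	text, truncated, err := SerializeHandoff(h)
	if err != nil {
		fmt.Println("serialization failed:", err)
		return
	}
	if truncated {
		fmt.Println("warning: handoff context was truncated")
	}
	fmt.Print(text)
}
```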