Session State Debugging

Diagnose and recover from session resume failures, stale locks, and corrupted state.

Every workflow run maintains its progress in agent-output/{project}/00-session-state.json. This file tracks:

  • Which steps are complete, in progress, or pending
  • Sub-step checkpoints within each step
  • Decisions made during the workflow
  • Lock/claim ownership for concurrent session safety

The schema version is declared in the schema_version field. The current claim-based model (v2.0) adds atomic locking to prevent concurrent sessions from overwriting each other. New state files should use schema_version: "2.0". The authoritative schema definition is in .github/skills/session-resume/references/state-file-schema.md.

A human-readable companion file 00-handoff.md summarises the same state for manual inspection.
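As a first look at a state file, the fields named above can be condensed into a few lines. The following is a minimal sketch, not part of the project: the function name `summarize` is hypothetical, and it assumes only the fields this page mentions (`schema_version`, `lock.owner_id`, and `steps.N.status` with string step keys).

```python
def summarize(state: dict) -> list[str]:
    """Return one line per fact worth checking first when resume misbehaves."""
    lines = [f"schema_version: {state.get('schema_version', '<missing>')}"]
    lock = state.get("lock") or {}
    lines.append(f"lock owner: {lock.get('owner_id', '<none>')}")
    steps = state.get("steps", {})
    # Step keys are strings such as "1" or "4"; sort numeric keys by value
    # and put any non-numeric keys last.
    ordered = sorted(steps, key=lambda k: (not k.isdigit(), int(k) if k.isdigit() else 0, k))
    for key in ordered:
        lines.append(f"step {key}: {steps[key].get('status', '<missing>')}")
    return lines
```

Passing the result of `json.load` on 00-session-state.json to `summarize` gives roughly the view that 00-handoff.md is described as providing, limited to version, lock owner, and step status.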

Use this decision tree when session resume is not working:

flowchart TD
    A["Session resume not working?"] --> B{"Does 00-session-state.json exist?"}
    B -- No --> C["Fresh start — file will be created automatically"]
    B -- Yes --> D{"Is the file valid JSON?"}
    D -- No --> E["Corrupted state — see Manual Recovery below"]
    D -- Yes --> F{"Check lock.heartbeat — is it stale?"}
    F -- Stale --> G["Stale lock — clear it manually or wait for auto-recovery"]
    F -- Active --> H{"Is lock.owner_id your session?"}
    H -- No --> I["Another session holds the lock — wait or clear manually"]
    H -- Yes --> J{"Check steps.N.status"}
    J --> K["pending → normal start"]
    J --> L["in_progress → resume from sub_step checkpoint"]
    J --> M["complete → step already done, move to next"]
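The decision tree above can be sketched as a single function. This is an illustration under stated assumptions, not the Conductor's actual logic: the name `diagnose` and its return labels are hypothetical, `lock.heartbeat` is assumed to be an ISO 8601 timestamp with an explicit UTC offset, and a file with no heartbeat is assumed to be unlocked (the tree does not cover that case).

```python
import json
from datetime import datetime, timezone

STALE_THRESHOLD_MS = 5 * 60 * 1000  # the 5-minute default quoted in this guide

def diagnose(raw, session_id, step, now=None):
    """Walk the decision tree. `raw` is the file text, or None if the file is absent."""
    if raw is None:
        return "fresh start"
    try:
        state = json.loads(raw)
    except json.JSONDecodeError:
        return "corrupted state"
    if not isinstance(state, dict):
        return "corrupted state"

    lock = state.get("lock") or {}
    heartbeat = lock.get("heartbeat")
    if heartbeat is not None:
        now = now or datetime.now(timezone.utc)
        # Assumption: heartbeat looks like "2025-01-01T12:00:00+00:00".
        age_ms = (now - datetime.fromisoformat(heartbeat)).total_seconds() * 1000
        if age_ms > STALE_THRESHOLD_MS:
            return "stale lock"
        if lock.get("owner_id") != session_id:
            return "lock held by another session"

    status = state.get("steps", {}).get(str(step), {}).get("status")
    outcomes = {
        "pending": "normal start",
        "in_progress": "resume from sub_step checkpoint",
        "complete": "step already done, move to next",
    }
    return outcomes.get(status, f"unexpected status: {status!r}")
```

The checks run in the same order as the flowchart, so the first failing condition is the one reported.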

Symptoms: JSON parse errors, missing required fields, validator failures.

Fix:

  1. Run the validator to identify the issue:

    npm run validate:session-state
  2. If the file is unrecoverable, check for a backup:

    ls agent-output/{project}/00-session-state.json.bak

    If a .bak file exists, restore it:

    cp agent-output/{project}/00-session-state.json.bak agent-output/{project}/00-session-state.json
  3. If no backup exists, rename the corrupt file and restart:

    cd agent-output/{project}
    mv 00-session-state.json 00-session-state.json.corrupt

    The Conductor creates a fresh v2.0 state file on the next run. All steps reset to pending.
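Steps 2 and 3 above can be scripted. The sketch below is one possible wrapper, not a project tool: the function name `recover` and its return labels are hypothetical. It only covers files that fail to parse as JSON; validator failures on well-formed JSON still need `npm run validate:session-state`.

```python
import json
import shutil
from pathlib import Path

def recover(state_path: Path) -> str:
    """Restore from .bak if present, otherwise set the corrupt file aside."""
    try:
        json.loads(state_path.read_text(encoding="utf-8"))
        return "ok"  # parses as JSON; nothing to recover here
    except ValueError:  # covers JSONDecodeError and undecodable bytes
        pass

    backup = state_path.with_name(state_path.name + ".bak")
    if backup.exists():
        shutil.copyfile(backup, state_path)  # same effect as the cp command
        return "restored from .bak"

    # No backup: keep the broken file for inspection and let the next run
    # create a fresh state file.
    state_path.replace(state_path.with_name(state_path.name + ".corrupt"))
    return "renamed to .corrupt"
```

Calling `recover(Path("agent-output/{project}/00-session-state.json"))` with the project name filled in performs the same file operations as the commands above.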

Symptoms: “Lock held by another session” error, but no other session is running.

A lock is considered stale when lock.heartbeat has not been updated within the stale_threshold_ms window (default: 5 minutes, i.e. 300000 ms).

Fix:

  1. Check the heartbeat timestamp:

    jq '.lock.heartbeat' agent-output/{project}/00-session-state.json
  2. If the timestamp is older than 5 minutes and no other session is active, clear the lock:

    jq '.lock = {}' agent-output/{project}/00-session-state.json > tmp.json \
    && mv tmp.json agent-output/{project}/00-session-state.json
  3. Resume the workflow — the Conductor will re-acquire the lock.
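The same lock reset can be done without leaving a half-written file behind if the process is interrupted. This is a minimal sketch, assuming the lock is cleared by setting the top-level `lock` object to `{}` as the jq command does; the function name `clear_lock` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

def clear_lock(state_path: Path) -> None:
    """Set .lock to {} and swap the file into place in one step."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    state["lock"] = {}
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, state_path)
```

Because the JSON is parsed before anything is written, a corrupt state file raises an error instead of being overwritten.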

Symptoms: Conductor skips a step or reports it as already complete when it was never run.

Fix:

  1. Inspect the step status:

    jq '.steps' agent-output/{project}/00-session-state.json
  2. Reset the step to pending (step 4 in this example; substitute the affected step number):

    jq '.steps."4".status = "pending" | .steps."4".sub_step = null' \
    agent-output/{project}/00-session-state.json > tmp.json \
    && mv tmp.json agent-output/{project}/00-session-state.json

Symptoms: Validator warns about unknown fields or missing lock/claim structure.

The v2.0 schema added the lock, claim, and attempt_token fields. If you encounter a v1.0 state file, the Conductor attempts to upgrade it automatically. If the upgrade fails, add the missing fields manually or create a fresh state file.
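Before editing by hand, it helps to know which of the v2.0 field names are absent. The sketch below is a rough check, not a schema validator: the function names are hypothetical, and because this page does not say where `claim` and `attempt_token` sit in the document, it searches for each name at any nesting depth. The authoritative layout is in state-file-schema.md.

```python
V2_FIELDS = ("lock", "claim", "attempt_token")

def contains_key(node, name: str) -> bool:
    """True if `name` is a key anywhere inside nested dicts and lists."""
    if isinstance(node, dict):
        return name in node or any(contains_key(v, name) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(v, name) for v in node)
    return False

def missing_v2_fields(state: dict) -> list[str]:
    """List the v2.0 field names that never appear in the state file."""
    return [name for name in V2_FIELDS if not contains_key(state, name)]
```

An empty result only means the names are present somewhere; run the validators to confirm the structure is correct.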

The decisions object in the session state tracks key choices made during the workflow:

{
  "decisions": {
    "iac_tool": "bicep",
    "primary_region": "swedencentral",
    "complexity": "standard"
  }
}

Write decisions at the moment they are made (Step 1 for iac_tool, Step 2 for architecture choices). The Conductor and downstream agents read these to route workflow steps correctly.

The decision_log array provides an append-only audit trail:

{
  "decision_log": [
    {
      "step": 1,
      "key": "iac_tool",
      "value": "bicep",
      "reason": "Team preference and existing Bicep expertise"
    }
  ]
}
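Keeping the two structures in sync is easiest when both are written in one place. A minimal sketch, assuming the entry shape shown above; the helper name `record_decision` is hypothetical and not part of the project.

```python
def record_decision(state: dict, step: int, key: str, value: str, reason: str) -> dict:
    """Set the current value in `decisions` and append an entry to `decision_log`."""
    state.setdefault("decisions", {})[key] = value
    # The log is append-only: earlier entries are never edited or removed,
    # so a changed decision shows up as a second entry for the same key.
    state.setdefault("decision_log", []).append(
        {"step": step, "key": key, "value": value, "reason": reason}
    )
    return state
```

For example, `record_decision(state, 1, "iac_tool", "bicep", "Team preference and existing Bicep expertise")` produces the two fragments shown above.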

Each step has a file load budget — a hard limit on how many files the agent loads at startup. This prevents context window exhaustion:

Step               Budget      Files Loaded
1 (Requirements)   1-2 files   Session state only
2 (Architecture)   2-3 files   Requirements + session state
4 (Plan)           2-3 files   Architecture + governance constraints
5 (Code)           1-2 files   Implementation plan

Excess files are loaded on demand via progressive disclosure. If resume is slow, check whether context_files_used in the session state lists more files than the step’s budget allows.
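The budget check can be sketched as follows. The upper bounds come from the table above; the names `MAX_FILES` and `files_over_budget` are hypothetical, and the caller is assumed to pass the `context_files_used` list for the step in question, since this page does not say where that field sits in the state file.

```python
# Upper bound of each step's file load budget, taken from the table above.
MAX_FILES = {1: 2, 2: 3, 4: 3, 5: 2}

def files_over_budget(step: int, context_files_used: list[str]) -> int:
    """Return how many files exceed the step's budget (0 if within budget)."""
    limit = MAX_FILES.get(step)
    if limit is None:
        return 0  # no budget listed for this step
    return max(0, len(context_files_used) - limit)
```

A non-zero result for the step being resumed points to the slow-resume cause described above.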

Two validators check session state integrity:

# Validate JSON schema compliance
npm run validate:session-state
# Validate lock/claim model integrity
npm run validate:session-lock

Run these after manual edits to the state file to ensure consistency.