Skip to content

Refactor boulder and ralph-loop state to per-plan/per-session storage for multi-agent concurrency#2233

Draft
mathew-cf wants to merge 1 commit intocode-yeongyu:devfrom
mathew-cf:feat/boulder-state-multi-agent
Draft

Refactor boulder and ralph-loop state to per-plan/per-session storage for multi-agent concurrency#2233
mathew-cf wants to merge 1 commit intocode-yeongyu:devfrom
mathew-cf:feat/boulder-state-multi-agent

Conversation

@mathew-cf
Copy link

@mathew-cf mathew-cf commented Mar 1, 2026

Summary

  • Migrate boulder state from a singleton boulder.json to per-plan files under .sisyphus/boulder/{plan-name}.json, preventing concurrent agents from overwriting each other's state
  • Migrate ralph-loop state to per-session files under .sisyphus/ralph-loop/{session-id}.json for the same multi-agent isolation
  • Add atomic file write utility, JSON parser, and legacy migration to preserve backward compatibility with existing .sisyphus/boulder.json and .sisyphus/ralph-loop.json

Changes

New files

  • src/features/boulder-state/per-plan-storage.ts — per-plan read/write/list/clear operations
  • src/features/boulder-state/atomic-file-ops.ts — write-to-temp-then-rename for crash safety
  • src/features/boulder-state/boulder-json-parser.ts — extracted JSON parsing with validation
  • src/features/boulder-state/legacy-migration.ts — auto-migrates singleton boulder.json → per-plan

Modified files

  • src/features/boulder-state/storage.ts — legacy-compatible wrappers delegating to per-plan storage
  • src/hooks/ralph-loop/storage.ts — per-session state with readStateForSession, findAnyActiveRalphLoopState, migrateLegacyRalphLoopState
  • src/cli/run/continuation-state.ts — session-aware boulder and ralph-loop lookups
  • src/hooks/atlas/ — updated to use findBoulderStateBySession
  • src/hooks/prometheus-md-only/ — updated boulder state imports
  • src/hooks/ralph-loop/ — updated loop controller, event handler, and hook to use per-session storage
  • src/hooks/start-work/ — updated to per-plan boulder writes
  • Tests updated across atlas, ralph-loop, prometheus-md-only, and continuation-state

Motivation

When multiple agents (e.g., two Atlas sessions) run concurrently against the same project, they share a single boulder.json and ralph-loop.json. This causes state corruption — one agent's writes clobber the other's. Per-plan and per-session file isolation eliminates the race condition entirely.

Checklist

  • Code follows project conventions
  • bun run typecheck passes
  • bun run build succeeds
  • No version changes in package.json

Summary by cubic

Refactored boulder and ralph-loop state to per-plan and per-session files for safe multi-agent concurrency. Added atomic writes with file locking and legacy migrations; hooks now use session-aware helpers across the stack.

  • New Features

    • Per-plan boulder state at .sisyphus/boulder/{plan}.json with list, find-by-session, and append/remove session IDs.
    • Per-session ralph-loop state at .sisyphus/ralph-loop/{session}.json with read/write/increment, readStateForSession, findAnyActiveRalphLoopState, and legacy migration.
    • Atomic JSON/text writes via temp-rename and lightweight file lock.
  • Refactors

    • Atlas, start-work, prometheus-md-only, and continuation-state now use session-aware APIs; start-work appends/removes session IDs per plan; atlas agent templates create per-task todos and read .sisyphus/plans/{plan}.md; atlas event handler falls back to any active boulder for background tasks.
    • Ralph-loop controller/event handler isolate state per session, prevent hijacking, clean up orphaned session files, and scope clear/increment operations per session; added automatic legacy-to-session migration.
    • Tests updated across modules; added a cache dir test override for connected providers.

Written for commit 986b157. Summary will update on new commits.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 1, 2026

All contributors have signed the CLA. Thank you! ✅
Posted by the CLA Assistant Lite bot.

@mathew-cf
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Mar 1, 2026
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 22 files

Confidence score: 2/5

  • High-impact concurrency risk: setSessionID in src/hooks/ralph-loop/loop-state-controller.ts can hijack another agent’s active state when stateDir is undefined, overwriting its session ID.
  • Clearing per-plan state in src/hooks/start-work/start-work-hook.ts can destroy shared state for other concurrent agents when switching plans, which is a likely regression in multi-agent workflows.
  • Overall risk is high because multiple severity 7–8 issues can cause cross-agent state corruption or loss, directly affecting user workflows.
  • Pay close attention to src/hooks/ralph-loop/loop-state-controller.ts, src/hooks/start-work/start-work-hook.ts, src/features/boulder-state/legacy-migration.ts - concurrent state isolation and migration safety.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/features/boulder-state/legacy-migration.ts">

<violation number="1" location="src/features/boulder-state/legacy-migration.ts:24">
P1: Legacy state migration incorrectly skipped if per-plan storage directory exists, causing potential loss of continuation state for older plans</violation>
</file>

<file name="src/hooks/ralph-loop/loop-state-controller.ts">

<violation number="1" location="src/hooks/ralph-loop/loop-state-controller.ts:90">
P1: State hijacking vulnerability in setSessionID: When stateDir is undefined, findAnyActiveRalphLoopState(directory) may return the active state of another concurrent agent. The function then overwrites that state's session_id, writes it to the new session's file, and deletes the original agent's state file, completely breaking multi-agent isolation.</violation>
</file>

<file name="src/hooks/start-work/start-work-hook.ts">

<violation number="1" location="src/hooks/start-work/start-work-hook.ts:107">
P1: Clearing the entire plan state file when switching plans destroys shared state for other concurrent agents. Instead of `clearBoulderStateForPlan(ctx.directory, existingState.plan_name)`, the code should remove only the current session's ID from the old plan's state to preserve progress for other agents.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@mathew-cf mathew-cf force-pushed the feat/boulder-state-multi-agent branch from f186cfb to 1281c48 Compare March 1, 2026 21:49
@mathew-cf
Copy link
Author

@cubic-dev-ai
Copy link

cubic-dev-ai bot commented Mar 1, 2026

@cubic-dev-ai

@mathew-cf I have started the AI code review. It will take a few minutes to complete.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 issues found across 30 files

Confidence score: 2/5

  • High-risk changes: src/hooks/atlas/event-handler.ts can drop continuation injection for subagent sessions when findBoulderStateBySession returns null, which is a concrete user-impacting regression.
  • Multiple concurrency issues (temp file collisions in src/features/boulder-state/atomic-file-ops.ts, missing lock in src/features/boulder-state/per-plan-storage.ts, TOCTOU update in src/hooks/start-work/start-work-hook.ts) increase the likelihood of state corruption under parallel agents.
  • Several src/hooks/ralph-loop/loop-state-controller.ts methods behave inconsistently when sessionID is omitted, which can cause active loop state to be missed or mutated in the wrong location.
  • Pay close attention to src/hooks/atlas/event-handler.ts, src/features/boulder-state/atomic-file-ops.ts, src/features/boulder-state/per-plan-storage.ts, src/hooks/start-work/start-work-hook.ts, src/hooks/ralph-loop/loop-state-controller.ts - regression risk and concurrency hazards.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/hooks/atlas/event-handler.ts">

<violation number="1" location="src/hooks/atlas/event-handler.ts:45">
P1: Background tasks will lose continuation injection because `findBoulderStateBySession` returns null for subagent sessions not in session_ids, causing early exit at line 83</violation>
</file>

<file name="src/hooks/start-work/start-work-hook.ts">

<violation number="1" location="src/hooks/start-work/start-work-hook.ts:160">
P2: TOCTOU race condition: When updating worktree_path, the code spreads stale `existingState` which may have been modified by concurrent agents. Use atomic update pattern like `appendSessionIdForPlan` or re-read state inside the lock before writing.</violation>
</file>

<file name="src/features/boulder-state/atomic-file-ops.ts">

<violation number="1" location="src/features/boulder-state/atomic-file-ops.ts:13">
P1: Race condition in temp file naming: `Date.now()` provides only millisecond precision, causing collisions when multiple agents write to the same file path within the same millisecond. Use `randomUUID()` (already imported) to ensure unique temp file paths.</violation>
</file>

<file name="src/hooks/ralph-loop/storage.ts">

<violation number="1" location="src/hooks/ralph-loop/storage.ts:119">
P2: Duplicate logic in `findAnyActiveRalphLoopState` and `findSecondActiveRalphLoopState` functions. Both functions implement identical directory traversal, readdirSync filtering, and state parsing logic. The only difference is an additional session_id check in the second function. These should be refactored into a single shared utility with an optional `excludeSessionId` parameter to reduce maintenance burden and follow DRY principles.</violation>
</file>

<file name="src/hooks/ralph-loop/loop-state-controller.ts">

<violation number="1" location="src/hooks/ralph-loop/loop-state-controller.ts:76">
P1: `getState()` does not fall back to checking legacy singleton state when `sessionID` is omitted, returning `null` even if a legacy loop is active. The `setSessionID()` method explicitly tries `readState(directory)` (legacy singleton) before falling back to `findAnyActiveRalphLoopState()`, but `getState()` skips the legacy check. This inconsistency may cause callers to erroneously start duplicate loops or fail to resume existing ones in legacy format.</violation>

<violation number="2" location="src/hooks/ralph-loop/loop-state-controller.ts:79">
P1: clear() and incrementIteration() fail to target the active per-session state when sessionID is omitted. They pass undefined to storage functions which default to the legacy singleton path, while getState() correctly uses findAnyActiveRalphLoopState(). This inconsistency means callers omitting sessionID will silently modify the wrong file.</violation>
</file>

<file name="src/features/boulder-state/per-plan-storage.ts">

<violation number="1" location="src/features/boulder-state/per-plan-storage.ts:89">
P1: Missing file lock in `clearBoulderStateForPlan` introduces race conditions with concurrent operations</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

… for multi-agent concurrency

Migrate boulder state from a singleton boulder.json to per-plan files
under .sisyphus/boulder/{plan-name}.json, and ralph-loop state to
per-session files under .sisyphus/ralph-loop/{session-id}.json. This
prevents concurrent agents from overwriting each other's state.

Add atomic file writes, JSON parser, and legacy migration utilities.
Update atlas, prometheus-md-only, start-work hooks and continuation
state to use session-aware lookups.
@mathew-cf mathew-cf force-pushed the feat/boulder-state-multi-agent branch from 1281c48 to 986b157 Compare March 1, 2026 22:28
@mathew-cf mathew-cf marked this pull request as draft March 1, 2026 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant