Swallow sandbox container races in the stream consumer#552
Conversation
Greptile SummaryThis PR addresses teardown races introduced by PR #548's SIGKILL sandbox cleanup by catching
Confidence Score: 5/5Safe to merge — the change is narrowly scoped to teardown races and the swallowing is properly gated on the shutdown flag. Both concerns raised in earlier review rounds have been addressed: the coordinator.is_shutting_down guard ensures genuine container crashes outside teardown are still surfaced, and exc_info=True preserves tracebacks. The flag is set before the SIGKILL fires, the exception paths are correctly covered, and non-TUI shutdown paths are unaffected. No files require special attention. Important Files Changed
Reviews (3): Last reviewed commit: "Gate sandbox-error swallow on coordinato..." | Re-trigger Greptile |
The unconditional except clause would swallow genuine container failures (OOM, eviction, daemon hiccup) outside of teardown and report partial streams as successful runs. AgentCoordinator gains an is_shutting_down flag plus a mark_shutting_down() method. The TUI quit path flips it before scheduling the cleanup coroutine. The stream consumer only swallows ExecTransportError / docker.errors.NotFound when the flag is set; otherwise it re-raises so the existing error path surfaces. The warning log adds exc_info=True for diagnosis.
When the TUI quit path SIGKILL's the sandbox container (PR #548), the agent's in-flight SDK calls can race the cleanup and surface as either:
agents.sandbox.errors.ExecTransportError— fromexeccalls into the now-gone containerdocker.errors.NotFound— fromcontainer.reload()/inspect_container/ any container lookupBoth are expected during teardown. Catch them in
_run_cycle's stream consumer alongside the existing LiteLLM shutdown-race catch.Also moves the
raise stream.run_loop_exceptionline inside the try, so the stored exception path is covered too (the SDK stashes the exception on the stream object rather than re-raising it throughstream_events()).