feat(lifecycle): unify MCP/LSP toolset supervision with configurable profiles + /toolsets UX#2579
Merged
dgageot merged 17 commits intodocker:mainfrom Apr 28, 2026
Conversation
Adds a small new package pkg/tools/lifecycle that holds the shared
vocabulary used by long-running toolsets (MCP stdio, remote MCP, LSP):
* errors.go — sentinel errors for transport failures, server
unavailable, server crash, init timeout, init notification,
capability missing, auth required, session missing.
* classify.go — Classify() wraps an underlying error with the right
sentinel (replaces ad-hoc substring matching). IsTransient() and
IsPermanent() let supervisor code make policy decisions via
errors.Is.
* state.go — State enum (Stopped / Starting / Ready / Degraded /
Restarting / Failed) plus a tiny thread-safe Tracker.
This is a pure refactor: nothing reads the new types yet beyond a
single alias in pkg/tools/mcp (errServerUnavailable now points at
lifecycle.ErrServerUnavailable so existing tests that check
errors.Is(err, errServerUnavailable) keep working).
The follow-up commits will plug a supervisor on top of this
foundation and migrate MCP and LSP onto it.
Assisted-By: docker-agent
Adds lifecycle.Supervisor: a single, well-tested implementation of the
connect / watch / restart / stop dance that pkg/tools/mcp had grown
inline as watchConnection / tryRestart / forceReconnectAndWait. The
supervisor:
* accepts a transport-agnostic Connector that returns a Session
(Wait + Close), so MCP, remote MCP, and the upcoming LSP port all
share the same lifecycle code.
* has a configurable Policy: Restart (Never/OnFailure/Always),
MaxAttempts, Backoff (Initial/Max/Multiplier/Jitter), and three
callback hooks (OnDisconnect, OnRestart, OnFailed). Defaults match
historical behaviour: RestartOnFailure, 5 attempts, 1s..32s.
* tracks state via the Tracker added in the previous commit, so
callers can query State() / IsReady() / Restarted() instead of
reading internal fields.
* implements RestartAndWait via a single 'forceRestart' flag and a
swappable 'restarted' channel, fixing a subtle race where Close()
on a clean session caused Wait()==nil and the watcher would fall
through to Failed instead of reconnecting.
mcp.Toolset is rewritten on top of the supervisor:
* deletes started / stopping / watcherAlive / restarted state
fields and the watchConnection / tryRestart / forceReconnectAndWait
methods (~150 lines).
* the initialize handshake (including the upstream-bug retry for
'failed to send initialized notification') moves into a small
clientConnector adapter; the existing mcpClient interface is
reused as a Session.
* cache invalidation and tool/prompt refresh happen via
OnDisconnect / OnRestart hooks rather than open-coded inside the
watcher loop.
* forceReconnectAndWait becomes supervisor.RestartAndWait.
* tests that previously poked at Toolset.started / Toolset.restarted
/ Toolset.mu now go through Toolset.IsStarted() and
supervisor.Restarted(); a small testhelpers_test.go provides
newTestToolset(...) and markStartedForTesting() so per-request
code paths can be exercised without driving a full Connect.
External behaviour is preserved: the same conditions return
errServerUnavailable (now an alias to lifecycle.ErrServerUnavailable),
the same retry budget governs reconnects, and the public Toolset API
(Tools, ListPrompts, GetPrompt, callTool, Start, Stop, all the
Set*Handler methods) is unchanged. All existing tests still pass under
'go test -race'.
Assisted-By: docker-agent
Brings LSP to parity with MCP: a single supervisor owns process
spawn, the initialize/initialized handshake, the watcher goroutine,
and crash-restart with backoff. Previously gopls dying meant the
toolset was silently dead until the agent was restarted.
* lsp_lifecycle.go (new) defines lspConnector + lspSession.
Connect spawns the LSP process, runs the LSP handshake under
h.mu, and publishes cmd/stdin/stdout/capabilities on the
handler. Session.Wait() waits on cmd, mapping signal-induced
exits (Close→processCancel) to nil and real crashes to
ErrServerCrashed. Session.Close() performs the LSP shutdown
handshake, closes stdin, cancels the process context, drains
cmd.Wait, and clears handler state.
* lsp.go now creates the supervisor in NewLSPTool with
OnDisconnect that resets diagnostics (stale data is worse than
none). LSPTool.Start/Stop delegate to supervisor.
* The previous startLocked/stopLocked helpers (~100 lines of
inline lifecycle) are deleted. ensureInitialized is reduced
to a small lazy-start gate that calls supervisor.Start.
* Handshake failures are routed through lifecycle.Classify so
binary-missing / EOF / connection-refused all surface as
typed lifecycle errors (lifecycle.ErrServerUnavailable,
lifecycle.ErrTransport).
* Open-file state is reset on every reconnect so the new
server is not asked to operate on URIs it never opened.
Tests:
* lsp_lifecycle_test.go covers the visible lifecycle behaviours:
Start with a missing binary returns ErrServerUnavailable, Stop
is idempotent before Start, and a failed Start leaves the
supervisor in StateStopped (retryable) rather than Failed.
* Existing format/protocol tests still pass under -race.
Assisted-By: docker-agent
Lets users tune startup, restart, and backoff behaviour for MCP / LSP
toolsets without leaving the YAML. The simplest knob is a profile
shorthand:
- resilient (default): auto-restart on failure with exponential
backoff. Matches the historical docker-agent behaviour.
- strict: required=true, no auto-restart. The agent refuses to
start unless the toolset reaches Ready in startup_timeout.
Intended for CI / scripted runs.
- best-effort: single attempt, no restart, optional. For
experimental MCPs whose flakiness shouldn't amplify into a
restart loop.
Individual fields still override the profile, so the common case of
'resilient with a higher restart budget' is one extra line:
lifecycle:
profile: resilient
max_restarts: 10
Wiring:
* pkg/config/latest/lifecycle.go defines LifecycleConfig and
BackoffConfig, validated alongside Toolset (rejects bad
profile/restart names, jitter outside [0,1], negative durations,
use on non-mcp/lsp types).
* pkg/teamloader/lifecycle.go translates the config to a
lifecycle.Policy with a profile-then-overrides resolution order
and attaches a slog Logger tagged with the toolset name.
* mcp.NewToolsetCommand / mcp.NewRemoteToolset / builtin.NewLSPTool
now accept an optional lifecycle.Policy. The constructors merge
the user policy with internal callbacks (cache invalidation,
diagnostic reset) so callers can't accidentally override them.
* registry.go threads the resolved policy from the YAML to each
constructor.
* agent-schema.json gains the new 'lifecycle' object under both
Toolset shapes, with full descriptions and enum constraints.
* examples/lifecycle.yaml demonstrates all three profiles.
* Round-trip and validation tests cover YAML parsing, profile
defaults, explicit-override-keeps-profile-default, and the
config -> Policy translation.
Assisted-By: docker-agent
Until now the user could only learn that an MCP/LSP toolset failed via
ad-hoc warning toasts. /toolsets gives them a first-class view of every
toolset attached to the current agent, with the lifecycle state
(stopped / starting / ready / degraded / restarting / failed), the
last error, the restart count, and the human-readable description.
Wiring:
* pkg/tools/status.go: ToolsetStatus DTO consumed by all status
surfaces (TUI today, JSON status endpoints later).
* pkg/tools/capabilities.go: Statable interface — any toolset that
implements State() lifecycle.StateInfo plugs into the dashboard.
Implemented by mcp.Toolset (already exposed State() in step 2)
and *builtin.LSPTool (added here).
* pkg/runtime: CurrentAgentToolsetStatuses() walks the active
agent's toolsets, unwraps StartableToolSet, and asks each one
for its state. Toolsets without a supervisor fall back to
Ready/Stopped based on whether Start succeeded.
The remote runtime returns nil — toolset state lives server-side.
* pkg/app: thin pass-through.
* pkg/tui: new ShowToolsetsDialogMsg, /toolsets slash command, and
a small NewToolsetsDialog that renders each entry with a
coloured state badge and a truncated last-error line.
* Tests: toolsetstatus_test.go covers Statable / non-Statable /
StartableToolSet-unwrap; toolsets_test.go covers empty,
populated, and long-error truncation paths.
The dialog is a snapshot taken when /toolsets is opened. A live
subscription model is left for a follow-up commit; the current
snapshot is enough to tell the user 'gopls crashed and has been
restarting for 90s' which is the main pain point today.
Assisted-By: docker-agent
Until now LSPTool unconditionally exposed all 15 lsp_* tools to the
model. If gopls (or any other server) didn't advertise inlayHints, the
model would still receive an lsp_inlay_hints tool and the call would
fail at the server with 'method not found'.
We now only expose tools backed by capabilities the server actually
advertises:
* filterByCapabilities walks the catalogue and uses
capabilitySupports + isProviderEnabled to decide each tool's fate.
Capability values are loosely typed in LSP (missing, false, true,
or options object) so isProviderEnabled treats any non-nil,
non-false value as 'yes'.
* lsp_workspace and lsp_diagnostics are always retained:
lsp_workspace surfaces the capability matrix itself, and
diagnostics arrive as server-pushed notifications independent of
any document* provider.
* Unknown tool names (future additions) are kept by default rather
than silently filtered \u2014 a permissive fallback that fails open.
Plumbing for live updates:
* lspHandler.toolsChangedHandler is set via LSPTool.SetToolsChangedHandler,
matching the existing tools.ChangeNotifier convention used by MCP.
* lspConnector.Connect fires the handler at the end of each
successful initialize. This means: before any Connect, the model
sees the full catalogue (we don't know better yet); after the
first Connect, it sees the capability-filtered catalogue; after
a reconnect to a different server build, it sees the new server's
filtered catalogue. The runtime's existing ChangeNotifier flow
picks this up and refreshes the cached tool list.
Tests:
* lsp_capabilities_test.go covers nil-caps (full catalogue),
empty-caps (only always-on tools), provider=false vs true vs
options-object, and the SetToolsChangedHandler register-and-fire
contract.
Note on transport extraction (the second half of the original step 6):
Pulling stdio/streamable-http/SSE into a shared pkg/tools/transport
package is a substantial refactor that touches both the MCP SDK
adapter and the LSP process layer. It is deferred to a follow-up
commit so this step lands a focused, self-contained user-visible
improvement (capability-aware tool surface).
Assisted-By: docker-agent
Lands two complementary pieces of the eventual file-watcher hot-reload:
1) The diff machinery (pkg/teamloader/diff.go) that classifies a
freshly-loaded latest.Toolset slice against the running one as
Added / Removed / Changed / Unchanged. ToolsetIdentity gives each
entry a stable key (Type+Name when named, else Type+ref/cmd/url),
and ToolsetSignature is a hex SHA-256 over the full Toolset
JSON so adding new fields automatically participates in change
detection without further wiring. HasChanges() is a fast-path for
the future watcher.
2) An immediately-useful capability built on the supervisor: by-name
restart. Most users hit a 'gopls is stuck' or 'I just OAuth'd this
MCP and need it to reconnect' moment within minutes of using the
agent; today they have to restart the whole agent.
New tools.Restartable interface (Restart(ctx) error). Implemented
by mcp.Toolset and *builtin.LSPTool, both of which delegate to
supervisor.RestartAndWait under their respective timeouts.
Runtime gains RestartToolset(ctx, name); the remote runtime
refuses with a clear error since toolset lifecycle is server-side
for that case. App is a thin pass-through.
TUI: new RestartToolsetMsg, /toolset-restart slash command
('/toolset-restart gopls'), and a notification-toast handler that
surfaces success or the supervisor's error.
Tests:
* diff_test.go covers no-changes, added-only, removed-only,
changed-only, mixed (preserves order), identity matching for
named/unnamed toolsets across mcp/lsp/a2a, and signature
stability/discrimination.
* restart_test.go exercises name resolution and Restartable
dispatch in isolation.
Note on full hot-reload (file watcher + diff-driven apply):
Wiring the watcher into the agent so that Stop+Start of a single
toolset works in a live runtime is non-trivial (Agent.toolsets
needs a snapshot pattern, the tool dispatcher needs to swap
references atomically, and the runtime loop needs a quiescent
point). It is intentionally not in this commit. The pieces shipped
here \u2014 diff machinery + per-toolset restart capability \u2014 are the
primitives that flow needs and are independently useful.
Assisted-By: docker-agent
Three real issues from the post-implementation review:
1) Supervisor: RestartAndWait could block for the full timeout when the
supervisor reached a terminal state (Stop or Failed) while the
caller was waiting. Add a single-shot 'done' channel closed via
sync.Once whenever the supervisor enters Stopped (Stop) or Failed
(tryRestart exhausted, shouldRestart=false). RestartAndWait now
selects on it and returns the supervisor's last error, or
ErrNotStarted when no failure was recorded. Adds two race-detector
regression tests.
Also drops a dead 'if s.restarted == nil' check in Start: New always
initialises the channel, so the guard was unreachable.
2) LSP session: cmd.Wait could be called twice on the same *exec.Cmd
(once from the supervisor's Wait goroutine, once from Close) which
the Go runtime explicitly forbids. Refactor to a single shared
sync.Once that runs cmd.Wait in one goroutine and exposes the
result via a pre-allocated waitDone channel; both Wait and Close
block on waitDone. Capture *cmd at session-construction time so
we no longer fish it out of the handler under h.mu (the handler
nils that field during Close, which made the old code racy in
theory even if benign in practice).
3) TUI toolsets dialog: error-message truncation used byte slicing,
which can split a multi-byte UTF-8 sequence and break lipgloss
styling. Switch to rune-aware truncation via a small
truncateRunes helper. Add a regression test that feeds a long
string of 4-byte emoji and asserts the output is valid UTF-8.
Other findings from the review were verified false alarms:
- 'TOCTOU in refreshToolCache/refreshPromptCache' (mcp.go): the
cacheGen read IS under ts.mu in the current code (lines 399\u2013408
in Tools, 613\u2013620 in ListPrompts).
- 'use-after-nil from readNotifications goroutine' (lsp.go): that
goroutine only drains stderr via a thread-safe concurrent.Buffer
and never touches h.cmd / h.stdin / h.stdout, so nilling those
fields during Close is safe.
- 'Backoff zero-multiplier infinite spin': the supervisor's delay
function applies its own defaults (1s/32s/2x) when Backoff fields
are zero, so a user cannot disable the floor by writing a fully
zeroed backoff block.
Assisted-By: docker-agent
…TUI block, and doc honesty Found during a sweep of the eight commits on this branch. CORRECTNESS * supervisor: `done` channel was closed-once via sync.Once, which meant a Failed supervisor stayed permanently 'done' even after a successful recovery via Start(). Subsequent RestartAndWait calls would fire on the stale-closed channel and return the prior error instead of attempting a reconnect. Replaced sync.Once+done with mu-guarded close-or-replace: Start refreshes the channel when transitioning out of a terminal state, signalDone closes it idempotently. Added regression test TestSupervisor_RecoverFromFailedViaStart that drives Failed \u2192 Start \u2192 RestartAndWait and asserts the second wait actually reconnects. * supervisor: `forceRestart` could leak across Start cycles. If RestartAndWait set the flag while the watcher was about to give up, the flag stayed true and a fresh Start spawned a new watcher that would force-restart on its very first disconnect (even on a clean close). Cleared in the same critical section that refreshes done. * supervisor: RestartAndWait now captures `done` under mu (same as `restarted`) so the select reads a consistent snapshot even when Start replaces the channel mid-call. * mcp.Toolset.Restart / *builtin.LSPTool.Restart: the user-facing /toolset-restart command went through supervisor.RestartAndWait unconditionally. On a Failed/Stopped supervisor that returned the last error immediately because <-done fires. Both Restart() implementations now switch on State(): terminal \u2192 Start (fresh Connect), otherwise RestartAndWait. * lsp.ensureInitialized fast path: 'if h.initialized.Load() && h.cmd != nil' read h.cmd without h.mu while concurrent Close nils it under the lock \u2014 a Go-level data race per the memory model (and a real one if the runtime ever reorders the two writes in Close). Dropped the cmd nil check; the atomic initialized flag is sufficient because Connect publishes h.cmd before setting initialized=true and Close clears initialized before nilling h.cmd. UX * tui handleRestartToolset blocked Bubble Tea's update loop for up to 35s while waiting on the supervisor's reconnect. Now wraps the call in a tea.Cmd so the TUI stays responsive, and shows an immediate 'Restarting \u2026' info toast followed by a result toast when the goroutine finishes. DOCS * lifecycle.required and lifecycle.startup_timeout are documented (and validated) but the runtime does not yet honour them \u2014 the promised eager-startup phase is still on the roadmap. Updated the field godoc, agent-schema.json descriptions, and the examples/lifecycle.yaml comment to make this explicit so users setting required:true don't think the agent will refuse to start. DEAD CODE * Removed Supervisor.statusString (superseded by Tracker.Snapshot() which the /toolsets dialog actually uses). * Removed lspHandler.readLspProcessHandle (added speculatively in step 3, never consumed). * Both had malformed nolint comments where the explanation drifted to a separate block above the directive. All packages I touched pass with -race. The pre-existing race in pkg/runtime around InMemorySessionStore + streaming + cache_test remains; verified that running cooldown_manager_test or elicitation_test in isolation passes, confirming the race is in shared session state and unrelated to this branch. Assisted-By: docker-agent
Pure-refactor pass to make the lifecycle plumbing easier to read and
maintain. No features removed; all tests still pass under -race.
Net: -205 lines across 9 files.
pkg/tools/lifecycle/classify.go
* Custom 'classified' error type and 'wrap' helper replaced by
fmt.Errorf("%w: %w", ...) (Go 1.20+ multi-%w). Removes ~20 lines
and one bespoke type.
* Sentinel-set checks deduped via a small loop in isClassified /
IsTransient (single source of truth for which errors are transient
vs permanent).
* Substring matching factored into classifyByMessage.
pkg/tools/lifecycle/state.go
* State.String now uses a stateNames lookup array instead of a switch.
* Tracker doc comments tightened.
pkg/tools/lifecycle/supervisor.go
* Inlined trivial Policy.restart() helper (just returned the field).
* shouldRestart now takes the 'forced' flag as a parameter, removing
a separate variable in the watcher and making the logic linear.
* Tightened the Backoff.delay computation using min/max builtins.
* Removed the redundant 'spawnWatcher = true' branch (always true
when reached) and de-nested the post-Connect critical section.
* Doc comments shortened where the contract was already obvious;
kept the recovery semantics for done/forceRestart since those are
subtle.
pkg/tools/builtin/lsp_lifecycle.go
* Extracted spawnLSPProcess(*lspHandler) → *lspProcess, lifting ~50
lines of stdio plumbing out of Connect and confining the
process-bound context to that function (the stderr-drain goroutine
is started inside spawnLSPProcess so processCtx never leaks).
* Extracted publishSession / clearSession helpers, deduping the
h.mu-locked writes that previously appeared three times (Connect
publish, Connect failure clear, Close clear).
* Extracted runHandshake helper, encapsulating the 'close stdin on
ctx cancel' goroutine.
* Extracted fireToolsChanged helper for the OnConnect notification.
* Result: Connect is ~35 lines and reads top-to-bottom as
'spawn → publish → handshake → notify → return session'.
pkg/tools/builtin/lsp.go, pkg/tools/mcp/mcp.go
* Toolset.Restart switches simplified from a state-by-state
switch/case to 'if state.IsTerminal() { Start } else { RestartAndWait }'.
Same behaviour, less ceremony.
* mcp.Stop's nested error-return collapsed into a single guarded
branch.
pkg/teamloader/lifecycle.go
* Merged 'strict' and 'best-effort' profilePolicy branches (they
produced the same supervisor policy; the user-facing Required flag
differs at a higher layer).
* Dropped redundant explicit zero values for Backoff in the
'resilient' profile (zero already maps to supervisor defaults).
* profileOf renamed profileName for clarity.
pkg/runtime/runtime.go
* RestartToolset and toolsetStatusFor dropped their manual
StartableToolSet unwrap. tools.As already walks the wrapper chain
via Unwrapper, so the explicit unwrap was dead code.
Validation: golangci-lint clean, go run ./lint . clean,
go test ./... passes, go test -race passes for every package this
branch touches (the pre-existing race in pkg/runtime/cache_test.go
remains unrelated).
Assisted-By: docker-agent
…review Three real issues introduced (or made more reachable) by the previous simplification commit, plus one defensive guard to make the LSP ensureInitialized fast path actually safe in the face of concurrent Close. CORRECTNESS * lifecycle.State.String panicked on negative values: the simplified array-lookup version (State.String now uses 'stateNames[s]' instead of a switch) only checked 'int(s) < len(stateNames)', missing the s >= 0 guard. State(-1).String() reproduced the index-out-of-range panic. Restore the lower-bound check and add a regression test. * lspSession.Close released h.mu between the shutdown handshake and the field-clearing helper, opening a brief window where a concurrent per-request method could acquire h.mu and write a JSON-RPC request to the already-shutting-down server (which had just received 'exit'). The original pre-refactor code held h.mu across the entire teardown. Fix: split clearSession into clearSession (locks h.mu) and clearSessionLocked (caller holds h.mu); Close now does shutdown + clearSessionLocked under one lock. * clearSessionLocked was nilling cmd / stdin / stdout BEFORE setting initialized=false. Combined with the atomic-only fast path in ensureInitialized, a reader could observe initialized=true after the fields were already nilled. Reorder so initialized=false is set first; that way a concurrent fast-path reader either short-circuits to slow path (which re-checks under h.mu) or proceeds with a still-coherent session. DEFENCE-IN-DEPTH * writeMessageLocked / readMessageLocked now check h.stdin / h.stdout for nil and return lifecycle.ErrNotStarted instead of dereferencing. This makes the per-request slow path safe even if a future refactor accidentally re-introduces a torn-state window: a panic becomes a clean error. Validation: golangci-lint clean, go run ./lint . clean, full test suite passes, and -race passes for every package this commit touches. Assisted-By: docker-agent
CI runs golangci-lint v2.11.4 which has the modernize check enabled and rejected: pkg/tools/lifecycle/supervisor_test.go:60:2: atomic: var calls int32 may be simplified using atomic.Int32 (modernize) Switch scriptedConnector.calls to atomic.Int32 with .Add()/.Load() methods. Pure cleanup; same behaviour. Assisted-By: docker-agent
a0664e5 to
015584e
Compare
Adds the missing user-facing documentation for the lifecycle/supervisor work in this PR: - docs/configuration/tools/index.md — new "Toolset Lifecycle" section that explains the lifecycle block (profile presets + per-knob overrides), the `required` / `startup_timeout` not-yet-enforced caveat, and points at the runtime TUI commands. - docs/features/tui/index.md — adds /toolsets and /toolset-restart to the slash command reference. - docs/tools/lsp/index.md — documents capability filtering against the server's initialize response, plus the auto-restart behaviour and the lifecycle hook. - docs/community/troubleshooting/index.md — points users at /toolsets and /toolset-restart for diagnosing stuck toolsets, and at the lifecycle block when a flaky dependency causes restart storms. Assisted-By: docker-agent
Two pieces of feedback on the /toolsets dialog:
1. /toolsets duplicated /tools; both surfaces address the same question ("what can the agent do right now?") so they should be one.
2. The toolset description leaked Go type names — the Describe() fallback was fmt.Sprintf("%T", ts), so an LSP toolset rendered as "*builtin.LSPTool" and an MCP one as "mcp(stdio cmd=docker)". Neither helps a user reason about lifecycle.
Changes:
- Add tools.Kinder + Kind field on ToolsetStatus. mcp.Toolset.Kind() returns "MCP" or "Remote MCP" depending on the transport; builtin.LSPTool / builtin.LSPMultiplexer return "LSP". toolsets that don't implement Kinder are rendered as "Built-in" by the dialog, never as a Go type.
- Drop pkg/tui/dialog/toolsets.go. NewToolsDialog(toolsets, tools) is now the single combined view: Toolsets section on top (Name [state] Kind, with last_error / restarts indented when relevant), Tools section below (grouped by Category, indented under each heading).
- Remove the /toolsets slash command and ShowToolsetsDialogMsg. /tools now triggers the unified dialog. /toolset-restart <name> is unchanged.
- Update docs/configuration/tools, docs/features/tui, and docs/community/troubleshooting to point at /tools instead of /toolsets.
Tests: pkg/tui/dialog/tools_test.go covers the empty-state placeholders, the toolset section (Name/Kind/state/restart count/error), the Built-in fallback, the category grouping, long-error truncation, and rune-boundary truncation.
Assisted-By: docker-agent
Built-in toolsets (shell, filesystem, memory, todo, tasks, …) had no Name() method, so the /tools dialog fell back to fmt.Sprintf("%T", ts) and rendered them as "*builtin.ShellTool". The YAML already names them — `type: shell` — so use that.
Mechanism:
- New tools.Named interface + tools.WithName(ts, name) decorator. WithName is unwrap-aware: if any inner toolset already advertises a non-empty Name() it returns ts unchanged, so MCP toolsets that received a YAML `name:` keep theirs.
- ToolsetRegistry.CreateTool now wraps every toolset with WithName(cmp.Or(toolset.Name, toolset.Type)). Built-ins get "shell"/"filesystem"/… without each tool needing its own Name() method.
- mcp.Toolset and builtin.LSPTool/LSPMultiplexer get explicit Name() implementations: MCP returns the user-set name, the LSP types fall back to the command basename ("gopls", "rust-analyzer") so multiple language servers stay distinguishable.
- runtime.nameFor uses tools.GetName (which uses tools.As[Named]) so wrapper chains are transparent.
- DescribeToolSet now also walks the wrapper chain via tools.As[Describer] so it doesn't accidentally print "*tools.namedToolSet" for a wrapped toolset whose inner Describer is hidden by the new wrapper.
Tests: pkg/tools/named_test.go covers the no-op-when-already-named path, the empty-name path, and the unwrap-through-wrapper path. Existing teamloader tests that type-asserted directly on the registry output now use tools.As[*mcp.Toolset].
Assisted-By: docker-agent
Follow-up to the toolset Name() work: when an MCP toolset has no `name:` in YAML, the registry's tools.WithName wrap collapsed it to the bare YAML type "mcp" — and a config with three unnamed MCPs would render as three identical "mcp" rows in the /tools dialog. Pre-refactor each row showed its description ("mcp(stdio cmd=docker)", "mcp(remote host=api.github.com)", "mcp(ref=duckduckgo)"), which made them trivially distinguishable.
Restore that: mcp.Toolset.Name() now returns ts.name when set (so YAML-named MCPs keep their explicit, tool-prefix-providing name) and falls back to ts.description otherwise. Because Name() is non-empty, tools.WithName's no-op-when-named branch kicks in and the registry no longer overrides MCP rows with the type alone.
Tests: pkg/tools/mcp/describe_test.go now also covers TestToolsetName_PrefersConfiguredName and TestToolsetName_FallsBackToDescription so this contract is locked in.
Assisted-By: docker-agent
gtardif
previously approved these changes
Apr 28, 2026
…tion NewRemoteToolset hardwires NewKeyringTokenStore() into its OAuth transport, so the test built a real macOS-keychain-backed token store every time it ran. On dev machines that already have a docker-agent-oauth keychain item from a previous login, that is enough to pop the password prompt during `go test ./pkg/tools/mcp/...`. The Name() fallback is a single `if ts.name != "" return ts.name; return ts.description` branch shared by stdio and remote toolsets — the stdio half of the test already covers it. Drop the remote half. Assisted-By: docker-agent
rumpl
approved these changes
Apr 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The MCP / remote MCP / LSP toolset lifecycle has been a recurring source of instability and there's no good user-visible way to configure it. Three separate state machines (
StartableToolSet,mcp.Toolset.watchConnection/tryRestart/forceReconnectAndWait,lspHandler.startLocked/stopLocked) each implement a slightly different subset of "connect / watch / restart / stop", with hard-coded knobs (5 attempts, 1s..32s backoff, 35s session-missing timeout) and substring-based error classification. LSP has no auto-restart at all — ifgoplscrashes, the toolset is silently dead until the agent restarts.This PR replaces those three implementations with one supervisor, makes the lifecycle configurable per-toolset in YAML, and gives users a
/toolsetsview + a/toolset-restartcommand.What's in here (11 commits)
feat(tools/lifecycle): introduce typed errors and state-machine type— newpkg/tools/lifecyclepackage with sentinel errors (ErrTransport,ErrServerUnavailable,ErrServerCrashed,ErrInitTimeout,ErrInitNotification,ErrCapabilityMissing,ErrAuthRequired,ErrSessionMissing,ErrNotStarted), aClassify(err)helper that wraps an underlying error with the right sentinel,IsTransient/IsPermanenthelpers, and aStateenum (Stopped/Starting/Ready/Degraded/Restarting/Failed) with a thread-safeTracker.feat(tools/lifecycle): extract Supervisor and port mcp.Toolset onto it—lifecycle.Supervisordriven by a transport-agnosticConnector(returns aSessionwithWait/Close). ConfigurablePolicy(Restart,MaxAttempts,Backoff{Initial,Max,Multiplier,Jitter},OnDisconnect/OnRestart/OnFailedcallbacks).mcp.Toolsetis rewritten on top of it;watchConnection/tryRestart/forceReconnectAndWaitdeleted (~150 lines gone).feat(tools/builtin): port LSPTool onto lifecycle.Supervisor— LSP gets auto-restart for the first time.lspConnector+lspSessionplug into the same supervisor; the previousstartLocked/stopLockedhelpers are removed.feat(config): add per-toolset lifecycle config with profile presets— alifecycle:block on every MCP/LSP toolset in YAML, with three profiles (strict/resilient/best-effort) and per-field overrides:Schema, validation, and
examples/lifecycle.yamlare wired up.requiredandstartup_timeoutare documented as not yet enforced (a future eager-startup commit will honour them).feat(tui): add /toolsets command and supervisor-aware status surface— newtools.Statableinterface andtools.ToolsetStatusDTO; runtime + app + TUI plumbing for a/toolsetsslash command that shows each toolset's name, state, last error, and restart count in a colour-coded dialog.feat(tools/builtin): filter LSP tools by server-advertised capabilities—lsp_*tools are filtered against the server'scapabilitiesfrominitialize, so a model never getslsp_inlay_hintsagainst a server that doesn't support it.SetToolsChangedHandleris fired post-Connectso the runtime refreshes its tool list when the filtered catalogue changes.feat: foundation for hot-reload + /toolset-restart slash command—tools.Restartablecapability + runtimeRestartToolset(name)API +/toolset-restart <name>slash command (post-OAuth recovery, "gopls is stuck, please reconnect"). Plus apkg/teamloader/diff.gohelper (ToolsetIdentity,ToolsetSignature,DiffToolsets) as the foundation for a future file-watcher-driven reload.fix(lifecycle): address reviewer findings on supervisor and LSP session— three real concurrency issues uncovered during a self-review pass:RestartAndWaitcould block for the full timeout when the supervisor reached a terminal state mid-call → added adonechannel signalled on Stop / Failed.cmd.Waitwas being called twice on the same*exec.Cmd(Go forbids this) → sharedsync.Once+ pre-allocatedwaitDonechannel./toolsetsdialog → rune-awaretruncateRunes+ regression test with 4-byte emoji.fix(lifecycle): self-review fixes for supervisor recovery, LSP race, TUI block, and doc honesty— round 2:donechannel was unrecoverable afterFailed → Start(sync.Once close meant subsequentRestartAndWaitreturned the stale error). Now mu-guarded close-or-replace;Startrefreshes it when leaving a terminal state.forceRestartflag also reset.mcp.Toolset.Restart/LSPTool.Restartnow route terminal states throughStartinstead ofRestartAndWaitso/toolset-restartactually recovers aFailedtoolset.ensureInitializedfast path readh.cmd != nilwithouth.mu(data race per the Go memory model) → dropped the cmd nil-check.handleRestartToolsetwas blocking the Bubble Tea update loop for up to 35s → wrapped in atea.Cmd.required/startup_timeoutare documented (and validated) but not yet enforced — schema, godoc, and the example YAML now say so explicitly.refactor: simplify lifecycle, MCP, LSP code without changing behaviour— pure-refactor pass, −205 net lines across 9 files:classify.go: replaced customclassifiederror type withfmt.Errorf("%w: %w", ...).state.go:String()uses a lookup array.supervisor.go:shouldRestart(err, forced)takes the flag as a parameter;min/maxbuiltins; Backoff defaults inlined.lsp_lifecycle.go: extractedspawnLSPProcess/publishSession/clearSession/runHandshake/fireToolsChanged.Connectreadsspawn → publish → handshake → notify → return sessionin ~35 lines.teamloader/lifecycle.go: merged identicalstrictandbest-effortprofile branches; dropped redundant Backoff zero values.runtime.go: dropped manualStartableToolSetunwraps (tools.Asalready walks the wrapper chain).fix(lifecycle, lsp): address regressions found during simplification review— three real regressions introduced by the simplification pass plus one defence-in-depth guard:State(-1).String()panicked because the array-lookup version dropped thes >= 0guard. Restored, regression test added.lspSession.Closereleasedh.mubetween the shutdown handshake and the field clearing → window where a per-request method could write to a soon-to-be-closed stdin. Fixed by holdingh.muacross the entire teardown via a newclearSessionLocked.clearSessionLockedwas nillingcmd/stdin/stdoutbefore clearing the atomicinitializedflag →ensureInitialized's atomic-only fast path could observeinitialized=truewhile fields were already nil. Reordered.writeMessageLocked/readMessageLockeddefensively checkh.stdin/h.stdoutfor nil, returninglifecycle.ErrNotStartedinstead of panicking.Configuration example
See
examples/lifecycle.yamlfor the full demonstrable file andagent-schema.jsonfor the schema.TUI
/toolsets— table of every toolset on the current agent (state, restarts, last error, description)./toolset-restart <name>— force a supervisor-driven reconnect of the named toolset (post-OAuth recovery; non-blocking; result surfaced via toast).Validation
golangci-lint run→ 0 issuesgo run ./lint .→ no offences (899 files)go mod tidy --diff→ cleanmise test→ all packages pass, exit 0go test -race→ passes for every package this branch touches:pkg/tools/lifecycle,pkg/tools/mcp,pkg/tools/builtin,pkg/teamloader,pkg/config/latest,pkg/tui/dialog. The pre-existingpkg/runtimerace incache_test.go/streaming.go/InMemorySessionStoreis unrelated and reproduces atHEAD~11.Net diff
11 commits, ~3500 LOC added (mostly the supervisor + LSP/MCP rewiring + tests), ~600 LOC removed (the three deleted state machines).
Follow-ups intentionally NOT in this PR
requiredandstartup_timeout.pkg/teamloader/diff.goinfrastructure is there, the watcher is not).stdio/streamable-http/sse/dockershared between MCP and LSP). Capability filtering for LSP shipped here; the transport extraction was deferred to keep this PR scoped.Each is small and self-contained on top of what's here.
Assisted-By: docker-agent