Skip to content

feat(lifecycle): unify MCP/LSP toolset supervision with configurable profiles + /toolsets UX#2579

Merged
dgageot merged 17 commits intodocker:mainfrom
dgageot:board/improving-mcp-and-lsp-server-lifecycle-r-9295e6e8
Apr 28, 2026
Merged

feat(lifecycle): unify MCP/LSP toolset supervision with configurable profiles + /toolsets UX#2579
dgageot merged 17 commits intodocker:mainfrom
dgageot:board/improving-mcp-and-lsp-server-lifecycle-r-9295e6e8

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented Apr 28, 2026

Motivation

The MCP / remote MCP / LSP toolset lifecycle has been a recurring source of instability and there's no good user-visible way to configure it. Three separate state machines (StartableToolSet, mcp.Toolset.watchConnection/tryRestart/forceReconnectAndWait, lspHandler.startLocked/stopLocked) each implement a slightly different subset of "connect / watch / restart / stop", with hard-coded knobs (5 attempts, 1s..32s backoff, 35s session-missing timeout) and substring-based error classification. LSP has no auto-restart at all — if gopls crashes, the toolset is silently dead until the agent restarts.

This PR replaces those three implementations with one supervisor, makes the lifecycle configurable per-toolset in YAML, and gives users a /toolsets view + a /toolset-restart command.

What's in here (11 commits)

  1. feat(tools/lifecycle): introduce typed errors and state-machine type — new pkg/tools/lifecycle package with sentinel errors (ErrTransport, ErrServerUnavailable, ErrServerCrashed, ErrInitTimeout, ErrInitNotification, ErrCapabilityMissing, ErrAuthRequired, ErrSessionMissing, ErrNotStarted), a Classify(err) helper that wraps an underlying error with the right sentinel, IsTransient/IsPermanent helpers, and a State enum (Stopped / Starting / Ready / Degraded / Restarting / Failed) with a thread-safe Tracker.

  2. feat(tools/lifecycle): extract Supervisor and port mcp.Toolset onto itlifecycle.Supervisor driven by a transport-agnostic Connector (returns a Session with Wait / Close). Configurable Policy (Restart, MaxAttempts, Backoff{Initial,Max,Multiplier,Jitter}, OnDisconnect/OnRestart/OnFailed callbacks). mcp.Toolset is rewritten on top of it; watchConnection / tryRestart / forceReconnectAndWait deleted (~150 lines gone).

  3. feat(tools/builtin): port LSPTool onto lifecycle.Supervisor — LSP gets auto-restart for the first time. lspConnector + lspSession plug into the same supervisor; the previous startLocked / stopLocked helpers are removed.

  4. feat(config): add per-toolset lifecycle config with profile presets — a lifecycle: block on every MCP/LSP toolset in YAML, with three profiles (strict / resilient / best-effort) and per-field overrides:

    toolsets:
      - type: mcp
        command: ["docker", "mcp", "gateway"]
        lifecycle:
          profile: resilient
          max_restarts: 10
          backoff:
            initial: 500ms
            max: 1m
            multiplier: 2
            jitter: 0.2

    Schema, validation, and examples/lifecycle.yaml are wired up. required and startup_timeout are documented as not yet enforced (a future eager-startup commit will honour them).

  5. feat(tui): add /toolsets command and supervisor-aware status surface — new tools.Statable interface and tools.ToolsetStatus DTO; runtime + app + TUI plumbing for a /toolsets slash command that shows each toolset's name, state, last error, and restart count in a colour-coded dialog.

  6. feat(tools/builtin): filter LSP tools by server-advertised capabilitieslsp_* tools are filtered against the server's capabilities from initialize, so a model never gets lsp_inlay_hints against a server that doesn't support it. SetToolsChangedHandler is fired post-Connect so the runtime refreshes its tool list when the filtered catalogue changes.

  7. feat: foundation for hot-reload + /toolset-restart slash commandtools.Restartable capability + runtime RestartToolset(name) API + /toolset-restart <name> slash command (post-OAuth recovery, "gopls is stuck, please reconnect"). Plus a pkg/teamloader/diff.go helper (ToolsetIdentity, ToolsetSignature, DiffToolsets) as the foundation for a future file-watcher-driven reload.

  8. fix(lifecycle): address reviewer findings on supervisor and LSP session — three real concurrency issues uncovered during a self-review pass:

    • RestartAndWait could block for the full timeout when the supervisor reached a terminal state mid-call → added a done channel signalled on Stop / Failed.
    • cmd.Wait was being called twice on the same *exec.Cmd (Go forbids this) → shared sync.Once + pre-allocated waitDone channel.
    • Byte-based truncation of multi-byte UTF-8 in the /toolsets dialog → rune-aware truncateRunes + regression test with 4-byte emoji.
  9. fix(lifecycle): self-review fixes for supervisor recovery, LSP race, TUI block, and doc honesty — round 2:

    • done channel was unrecoverable after Failed → Start (sync.Once close meant subsequent RestartAndWait returned the stale error). Now mu-guarded close-or-replace; Start refreshes it when leaving a terminal state. forceRestart flag also reset.
    • mcp.Toolset.Restart / LSPTool.Restart now route terminal states through Start instead of RestartAndWait so /toolset-restart actually recovers a Failed toolset.
    • LSP ensureInitialized fast path read h.cmd != nil without h.mu (data race per the Go memory model) → dropped the cmd nil-check.
    • TUI handleRestartToolset was blocking the Bubble Tea update loop for up to 35s → wrapped in a tea.Cmd.
    • required / startup_timeout are documented (and validated) but not yet enforced — schema, godoc, and the example YAML now say so explicitly.
  10. refactor: simplify lifecycle, MCP, LSP code without changing behaviour — pure-refactor pass, −205 net lines across 9 files:

    • classify.go: replaced custom classified error type with fmt.Errorf("%w: %w", ...).
    • state.go: String() uses a lookup array.
    • supervisor.go: shouldRestart(err, forced) takes the flag as a parameter; min/max builtins; Backoff defaults inlined.
    • lsp_lifecycle.go: extracted spawnLSPProcess / publishSession / clearSession / runHandshake / fireToolsChanged. Connect reads spawn → publish → handshake → notify → return session in ~35 lines.
    • teamloader/lifecycle.go: merged identical strict and best-effort profile branches; dropped redundant Backoff zero values.
    • runtime.go: dropped manual StartableToolSet unwraps (tools.As already walks the wrapper chain).
  11. fix(lifecycle, lsp): address regressions found during simplification review — three real regressions introduced by the simplification pass plus one defence-in-depth guard:

    • State(-1).String() panicked because the array-lookup version dropped the s >= 0 guard. Restored, regression test added.
    • lspSession.Close released h.mu between the shutdown handshake and the field clearing → window where a per-request method could write to a soon-to-be-closed stdin. Fixed by holding h.mu across the entire teardown via a new clearSessionLocked.
    • clearSessionLocked was nilling cmd/stdin/stdout before clearing the atomic initialized flag → ensureInitialized's atomic-only fast path could observe initialized=true while fields were already nil. Reordered.
    • writeMessageLocked / readMessageLocked defensively check h.stdin/h.stdout for nil, returning lifecycle.ErrNotStarted instead of panicking.

Configuration example

agents:
  root:
    toolsets:
      - type: lsp
        command: gopls
        file_types: [".go"]
        lifecycle:
          profile: strict          # required=true (informational), no auto-restart
          startup_timeout: 30s     # informational; not yet enforced

      - type: mcp
        command: ["docker", "mcp", "gateway"]
        lifecycle:
          profile: resilient       # the default, kept here for clarity
          max_restarts: 10
          backoff: { initial: 500ms, max: 1m, multiplier: 2, jitter: 0.2 }

      - type: mcp
        ref: docker:openbnb-airbnb
        lifecycle: { profile: best-effort }   # no auto-restart, optional

See examples/lifecycle.yaml for the full demonstrable file and agent-schema.json for the schema.

TUI

  • /toolsets — table of every toolset on the current agent (state, restarts, last error, description).
  • /toolset-restart <name> — force a supervisor-driven reconnect of the named toolset (post-OAuth recovery; non-blocking; result surfaced via toast).

Validation

  • golangci-lint run → 0 issues
  • go run ./lint . → no offences (899 files)
  • go mod tidy --diff → clean
  • mise test → all packages pass, exit 0
  • go test -race → passes for every package this branch touches: pkg/tools/lifecycle, pkg/tools/mcp, pkg/tools/builtin, pkg/teamloader, pkg/config/latest, pkg/tui/dialog. The pre-existing pkg/runtime race in cache_test.go / streaming.go / InMemorySessionStore is unrelated and reproduces at HEAD~11.

Net diff

11 commits, ~3500 LOC added (mostly the supervisor + LSP/MCP rewiring + tests), ~600 LOC removed (the three deleted state machines).

Follow-ups intentionally NOT in this PR

  • Eager-startup phase that actually enforces required and startup_timeout.
  • File-watcher-driven hot reload (the pkg/teamloader/diff.go infrastructure is there, the watcher is not).
  • Pluggable transports (stdio / streamable-http / sse / docker shared between MCP and LSP). Capability filtering for LSP shipped here; the transport extraction was deferred to keep this PR scoped.

Each is small and self-contained on top of what's here.

Assisted-By: docker-agent

@dgageot dgageot requested a review from a team as a code owner April 28, 2026 15:59
@dgageot dgageot marked this pull request as draft April 28, 2026 16:01
Copy link
Copy Markdown
Contributor

@aheritier aheritier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs docos

dgageot added 12 commits April 28, 2026 18:23
Adds a small new package pkg/tools/lifecycle that holds the shared
vocabulary used by long-running toolsets (MCP stdio, remote MCP, LSP):

  * errors.go — sentinel errors for transport failures, server
    unavailable, server crash, init timeout, init notification,
    capability missing, auth required, session missing.
  * classify.go — Classify() wraps an underlying error with the right
    sentinel (replaces ad-hoc substring matching). IsTransient() and
    IsPermanent() let supervisor code make policy decisions via
    errors.Is.
  * state.go — State enum (Stopped / Starting / Ready / Degraded /
    Restarting / Failed) plus a tiny thread-safe Tracker.

This is a pure refactor: nothing reads the new types yet beyond a
single alias in pkg/tools/mcp (errServerUnavailable now points at
lifecycle.ErrServerUnavailable so existing tests that check
errors.Is(err, errServerUnavailable) keep working).

The follow-up commits will plug a supervisor on top of this
foundation and migrate MCP and LSP onto it.

Assisted-By: docker-agent
Adds lifecycle.Supervisor: a single, well-tested implementation of the
connect / watch / restart / stop dance that pkg/tools/mcp had grown
inline as watchConnection / tryRestart / forceReconnectAndWait. The
supervisor:

  * accepts a transport-agnostic Connector that returns a Session
    (Wait + Close), so MCP, remote MCP, and the upcoming LSP port all
    share the same lifecycle code.
  * has a configurable Policy: Restart (Never/OnFailure/Always),
    MaxAttempts, Backoff (Initial/Max/Multiplier/Jitter), and three
    callback hooks (OnDisconnect, OnRestart, OnFailed). Defaults match
    historical behaviour: RestartOnFailure, 5 attempts, 1s..32s.
  * tracks state via the Tracker added in the previous commit, so
    callers can query State() / IsReady() / Restarted() instead of
    reading internal fields.
  * implements RestartAndWait via a single 'forceRestart' flag and a
    swappable 'restarted' channel, fixing a subtle race where Close()
    on a clean session caused Wait()==nil and the watcher would fall
    through to Failed instead of reconnecting.

mcp.Toolset is rewritten on top of the supervisor:

  * deletes started / stopping / watcherAlive / restarted state
    fields and the watchConnection / tryRestart / forceReconnectAndWait
    methods (~150 lines).
  * the initialize handshake (including the upstream-bug retry for
    'failed to send initialized notification') moves into a small
    clientConnector adapter; the existing mcpClient interface is
    reused as a Session.
  * cache invalidation and tool/prompt refresh happen via
    OnDisconnect / OnRestart hooks rather than open-coded inside the
    watcher loop.
  * forceReconnectAndWait becomes supervisor.RestartAndWait.
  * tests that previously poked at Toolset.started / Toolset.restarted
    / Toolset.mu now go through Toolset.IsStarted() and
    supervisor.Restarted(); a small testhelpers_test.go provides
    newTestToolset(...) and markStartedForTesting() so per-request
    code paths can be exercised without driving a full Connect.

External behaviour is preserved: the same conditions return
errServerUnavailable (now an alias to lifecycle.ErrServerUnavailable),
the same retry budget governs reconnects, and the public Toolset API
(Tools, ListPrompts, GetPrompt, callTool, Start, Stop, all the
Set*Handler methods) is unchanged. All existing tests still pass under
'go test -race'.

Assisted-By: docker-agent
Brings LSP to parity with MCP: a single supervisor owns process
spawn, the initialize/initialized handshake, the watcher goroutine,
and crash-restart with backoff. Previously gopls dying meant the
toolset was silently dead until the agent was restarted.

  * lsp_lifecycle.go (new) defines lspConnector + lspSession.
    Connect spawns the LSP process, runs the LSP handshake under
    h.mu, and publishes cmd/stdin/stdout/capabilities on the
    handler. Session.Wait() waits on cmd, mapping signal-induced
    exits (Close→processCancel) to nil and real crashes to
    ErrServerCrashed. Session.Close() performs the LSP shutdown
    handshake, closes stdin, cancels the process context, drains
    cmd.Wait, and clears handler state.
  * lsp.go now creates the supervisor in NewLSPTool with
    OnDisconnect that resets diagnostics (stale data is worse than
    none). LSPTool.Start/Stop delegate to supervisor.
  * The previous startLocked/stopLocked helpers (~100 lines of
    inline lifecycle) are deleted. ensureInitialized is reduced
    to a small lazy-start gate that calls supervisor.Start.
  * Handshake failures are routed through lifecycle.Classify so
    binary-missing / EOF / connection-refused all surface as
    typed lifecycle errors (lifecycle.ErrServerUnavailable,
    lifecycle.ErrTransport).
  * Open-file state is reset on every reconnect so the new
    server is not asked to operate on URIs it never opened.

Tests:
  * lsp_lifecycle_test.go covers the visible lifecycle behaviours:
    Start with a missing binary returns ErrServerUnavailable, Stop
    is idempotent before Start, and a failed Start leaves the
    supervisor in StateStopped (retryable) rather than Failed.
  * Existing format/protocol tests still pass under -race.

Assisted-By: docker-agent
Lets users tune startup, restart, and backoff behaviour for MCP / LSP
toolsets without leaving the YAML. The simplest knob is a profile
shorthand:

  - resilient (default): auto-restart on failure with exponential
    backoff. Matches the historical docker-agent behaviour.
  - strict: required=true, no auto-restart. The agent refuses to
    start unless the toolset reaches Ready in startup_timeout.
    Intended for CI / scripted runs.
  - best-effort: single attempt, no restart, optional. For
    experimental MCPs whose flakiness shouldn't amplify into a
    restart loop.

Individual fields still override the profile, so the common case of
'resilient with a higher restart budget' is one extra line:

  lifecycle:
    profile: resilient
    max_restarts: 10

Wiring:

  * pkg/config/latest/lifecycle.go defines LifecycleConfig and
    BackoffConfig, validated alongside Toolset (rejects bad
    profile/restart names, jitter outside [0,1], negative durations,
    use on non-mcp/lsp types).
  * pkg/teamloader/lifecycle.go translates the config to a
    lifecycle.Policy with a profile-then-overrides resolution order
    and attaches a slog Logger tagged with the toolset name.
  * mcp.NewToolsetCommand / mcp.NewRemoteToolset / builtin.NewLSPTool
    now accept an optional lifecycle.Policy. The constructors merge
    the user policy with internal callbacks (cache invalidation,
    diagnostic reset) so callers can't accidentally override them.
  * registry.go threads the resolved policy from the YAML to each
    constructor.
  * agent-schema.json gains the new 'lifecycle' object under both
    Toolset shapes, with full descriptions and enum constraints.
  * examples/lifecycle.yaml demonstrates all three profiles.
  * Round-trip and validation tests cover YAML parsing, profile
    defaults, explicit-override-keeps-profile-default, and the
    config -> Policy translation.

Assisted-By: docker-agent
Until now the user could only learn that an MCP/LSP toolset failed via
ad-hoc warning toasts. /toolsets gives them a first-class view of every
toolset attached to the current agent, with the lifecycle state
(stopped / starting / ready / degraded / restarting / failed), the
last error, the restart count, and the human-readable description.

Wiring:

  * pkg/tools/status.go: ToolsetStatus DTO consumed by all status
    surfaces (TUI today, JSON status endpoints later).
  * pkg/tools/capabilities.go: Statable interface — any toolset that
    implements State() lifecycle.StateInfo plugs into the dashboard.
    Implemented by mcp.Toolset (already exposed State() in step 2)
    and *builtin.LSPTool (added here).
  * pkg/runtime: CurrentAgentToolsetStatuses() walks the active
    agent's toolsets, unwraps StartableToolSet, and asks each one
    for its state. Toolsets without a supervisor fall back to
    Ready/Stopped based on whether Start succeeded.
    The remote runtime returns nil — toolset state lives server-side.
  * pkg/app: thin pass-through.
  * pkg/tui: new ShowToolsetsDialogMsg, /toolsets slash command, and
    a small NewToolsetsDialog that renders each entry with a
    coloured state badge and a truncated last-error line.
  * Tests: toolsetstatus_test.go covers Statable / non-Statable /
    StartableToolSet-unwrap; toolsets_test.go covers empty,
    populated, and long-error truncation paths.

The dialog is a snapshot taken when /toolsets is opened. A live
subscription model is left for a follow-up commit; the current
snapshot is enough to tell the user 'gopls crashed and has been
restarting for 90s' which is the main pain point today.

Assisted-By: docker-agent
Until now LSPTool unconditionally exposed all 15 lsp_* tools to the
model. If gopls (or any other server) didn't advertise inlayHints, the
model would still receive an lsp_inlay_hints tool and the call would
fail at the server with 'method not found'.

We now only expose tools backed by capabilities the server actually
advertises:

  * filterByCapabilities walks the catalogue and uses
    capabilitySupports + isProviderEnabled to decide each tool's fate.
    Capability values are loosely typed in LSP (missing, false, true,
    or options object) so isProviderEnabled treats any non-nil,
    non-false value as 'yes'.
  * lsp_workspace and lsp_diagnostics are always retained:
    lsp_workspace surfaces the capability matrix itself, and
    diagnostics arrive as server-pushed notifications independent of
    any document* provider.
  * Unknown tool names (future additions) are kept by default rather
    than silently filtered \u2014 a permissive fallback that fails open.

Plumbing for live updates:

  * lspHandler.toolsChangedHandler is set via LSPTool.SetToolsChangedHandler,
    matching the existing tools.ChangeNotifier convention used by MCP.
  * lspConnector.Connect fires the handler at the end of each
    successful initialize. This means: before any Connect, the model
    sees the full catalogue (we don't know better yet); after the
    first Connect, it sees the capability-filtered catalogue; after
    a reconnect to a different server build, it sees the new server's
    filtered catalogue. The runtime's existing ChangeNotifier flow
    picks this up and refreshes the cached tool list.

Tests:
  * lsp_capabilities_test.go covers nil-caps (full catalogue),
    empty-caps (only always-on tools), provider=false vs true vs
    options-object, and the SetToolsChangedHandler register-and-fire
    contract.

Note on transport extraction (the second half of the original step 6):
  Pulling stdio/streamable-http/SSE into a shared pkg/tools/transport
  package is a substantial refactor that touches both the MCP SDK
  adapter and the LSP process layer. It is deferred to a follow-up
  commit so this step lands a focused, self-contained user-visible
  improvement (capability-aware tool surface).

Assisted-By: docker-agent
Lands two complementary pieces of the eventual file-watcher hot-reload:

1) The diff machinery (pkg/teamloader/diff.go) that classifies a
   freshly-loaded latest.Toolset slice against the running one as
   Added / Removed / Changed / Unchanged. ToolsetIdentity gives each
   entry a stable key (Type+Name when named, else Type+ref/cmd/url),
   and ToolsetSignature is a hex SHA-256 over the full Toolset
   JSON so adding new fields automatically participates in change
   detection without further wiring. HasChanges() is a fast-path for
   the future watcher.

2) An immediately-useful capability built on the supervisor: by-name
   restart. Most users hit a 'gopls is stuck' or 'I just OAuth'd this
   MCP and need it to reconnect' moment within minutes of using the
   agent; today they have to restart the whole agent.

   New tools.Restartable interface (Restart(ctx) error). Implemented
   by mcp.Toolset and *builtin.LSPTool, both of which delegate to
   supervisor.RestartAndWait under their respective timeouts.
   Runtime gains RestartToolset(ctx, name); the remote runtime
   refuses with a clear error since toolset lifecycle is server-side
   for that case. App is a thin pass-through.

   TUI: new RestartToolsetMsg, /toolset-restart slash command
   ('/toolset-restart gopls'), and a notification-toast handler that
   surfaces success or the supervisor's error.

Tests:
  * diff_test.go covers no-changes, added-only, removed-only,
    changed-only, mixed (preserves order), identity matching for
    named/unnamed toolsets across mcp/lsp/a2a, and signature
    stability/discrimination.
  * restart_test.go exercises name resolution and Restartable
    dispatch in isolation.

Note on full hot-reload (file watcher + diff-driven apply):
  Wiring the watcher into the agent so that Stop+Start of a single
  toolset works in a live runtime is non-trivial (Agent.toolsets
  needs a snapshot pattern, the tool dispatcher needs to swap
  references atomically, and the runtime loop needs a quiescent
  point). It is intentionally not in this commit. The pieces shipped
  here \u2014 diff machinery + per-toolset restart capability \u2014 are the
  primitives that flow needs and are independently useful.

Assisted-By: docker-agent
Three real issues from the post-implementation review:

1) Supervisor: RestartAndWait could block for the full timeout when the
   supervisor reached a terminal state (Stop or Failed) while the
   caller was waiting. Add a single-shot 'done' channel closed via
   sync.Once whenever the supervisor enters Stopped (Stop) or Failed
   (tryRestart exhausted, shouldRestart=false). RestartAndWait now
   selects on it and returns the supervisor's last error, or
   ErrNotStarted when no failure was recorded. Adds two race-detector
   regression tests.

   Also drops a dead 'if s.restarted == nil' check in Start: New always
   initialises the channel, so the guard was unreachable.

2) LSP session: cmd.Wait could be called twice on the same *exec.Cmd
   (once from the supervisor's Wait goroutine, once from Close) which
   the Go runtime explicitly forbids. Refactor to a single shared
   sync.Once that runs cmd.Wait in one goroutine and exposes the
   result via a pre-allocated waitDone channel; both Wait and Close
   block on waitDone. Capture *cmd at session-construction time so
   we no longer fish it out of the handler under h.mu (the handler
   nils that field during Close, which made the old code racy in
   theory even if benign in practice).

3) TUI toolsets dialog: error-message truncation used byte slicing,
   which can split a multi-byte UTF-8 sequence and break lipgloss
   styling. Switch to rune-aware truncation via a small
   truncateRunes helper. Add a regression test that feeds a long
   string of 4-byte emoji and asserts the output is valid UTF-8.

Other findings from the review were verified false alarms:
  - 'TOCTOU in refreshToolCache/refreshPromptCache' (mcp.go): the
    cacheGen read IS under ts.mu in the current code (lines 399\u2013408
    in Tools, 613\u2013620 in ListPrompts).
  - 'use-after-nil from readNotifications goroutine' (lsp.go): that
    goroutine only drains stderr via a thread-safe concurrent.Buffer
    and never touches h.cmd / h.stdin / h.stdout, so nilling those
    fields during Close is safe.
  - 'Backoff zero-multiplier infinite spin': the supervisor's delay
    function applies its own defaults (1s/32s/2x) when Backoff fields
    are zero, so a user cannot disable the floor by writing a fully
    zeroed backoff block.

Assisted-By: docker-agent
…TUI block, and doc honesty

Found during a sweep of the eight commits on this branch.

CORRECTNESS

* supervisor: `done` channel was closed-once via sync.Once, which
  meant a Failed supervisor stayed permanently 'done' even after a
  successful recovery via Start(). Subsequent RestartAndWait calls
  would fire on the stale-closed channel and return the prior error
  instead of attempting a reconnect.
  Replaced sync.Once+done with mu-guarded close-or-replace: Start
  refreshes the channel when transitioning out of a terminal state,
  signalDone closes it idempotently. Added regression test
  TestSupervisor_RecoverFromFailedViaStart that drives Failed \u2192
  Start \u2192 RestartAndWait and asserts the second wait actually
  reconnects.

* supervisor: `forceRestart` could leak across Start cycles. If
  RestartAndWait set the flag while the watcher was about to give
  up, the flag stayed true and a fresh Start spawned a new watcher
  that would force-restart on its very first disconnect (even on a
  clean close). Cleared in the same critical section that refreshes
  done.

* supervisor: RestartAndWait now captures `done` under mu (same as
  `restarted`) so the select reads a consistent snapshot even when
  Start replaces the channel mid-call.

* mcp.Toolset.Restart / *builtin.LSPTool.Restart: the user-facing
  /toolset-restart command went through supervisor.RestartAndWait
  unconditionally. On a Failed/Stopped supervisor that returned the
  last error immediately because <-done fires. Both Restart()
  implementations now switch on State(): terminal \u2192 Start (fresh
  Connect), otherwise RestartAndWait.

* lsp.ensureInitialized fast path: 'if h.initialized.Load() &&
  h.cmd != nil' read h.cmd without h.mu while concurrent Close
  nils it under the lock \u2014 a Go-level data race per the memory
  model (and a real one if the runtime ever reorders the two
  writes in Close). Dropped the cmd nil check; the atomic
  initialized flag is sufficient because Connect publishes h.cmd
  before setting initialized=true and Close clears initialized
  before nilling h.cmd.

UX

* tui handleRestartToolset blocked Bubble Tea's update loop for
  up to 35s while waiting on the supervisor's reconnect. Now wraps
  the call in a tea.Cmd so the TUI stays responsive, and shows an
  immediate 'Restarting \u2026' info toast followed by a result toast
  when the goroutine finishes.

DOCS

* lifecycle.required and lifecycle.startup_timeout are documented
  (and validated) but the runtime does not yet honour them \u2014 the
  promised eager-startup phase is still on the roadmap. Updated
  the field godoc, agent-schema.json descriptions, and the
  examples/lifecycle.yaml comment to make this explicit so users
  setting required:true don't think the agent will refuse to start.

DEAD CODE

* Removed Supervisor.statusString (superseded by Tracker.Snapshot()
  which the /toolsets dialog actually uses).
* Removed lspHandler.readLspProcessHandle (added speculatively in
  step 3, never consumed).
* Both had malformed nolint comments where the explanation drifted
  to a separate block above the directive.

All packages I touched pass with -race. The pre-existing race in
pkg/runtime around InMemorySessionStore + streaming + cache_test
remains; verified that running cooldown_manager_test or
elicitation_test in isolation passes, confirming the race is in
shared session state and unrelated to this branch.

Assisted-By: docker-agent
Pure-refactor pass to make the lifecycle plumbing easier to read and
maintain. No features removed; all tests still pass under -race.

Net: -205 lines across 9 files.

pkg/tools/lifecycle/classify.go
  * Custom 'classified' error type and 'wrap' helper replaced by
    fmt.Errorf("%w: %w", ...) (Go 1.20+ multi-%w). Removes ~20 lines
    and one bespoke type.
  * Sentinel-set checks deduped via a small loop in isClassified /
    IsTransient (single source of truth for which errors are transient
    vs permanent).
  * Substring matching factored into classifyByMessage.

pkg/tools/lifecycle/state.go
  * State.String now uses a stateNames lookup array instead of a switch.
  * Tracker doc comments tightened.

pkg/tools/lifecycle/supervisor.go
  * Inlined trivial Policy.restart() helper (just returned the field).
  * shouldRestart now takes the 'forced' flag as a parameter, removing
    a separate variable in the watcher and making the logic linear.
  * Tightened the Backoff.delay computation using min/max builtins.
  * Removed the redundant 'spawnWatcher = true' branch (always true
    when reached) and de-nested the post-Connect critical section.
  * Doc comments shortened where the contract was already obvious;
    kept the recovery semantics for done/forceRestart since those are
    subtle.

pkg/tools/builtin/lsp_lifecycle.go
  * Extracted spawnLSPProcess(*lspHandler) → *lspProcess, lifting ~50
    lines of stdio plumbing out of Connect and confining the
    process-bound context to that function (the stderr-drain goroutine
    is started inside spawnLSPProcess so processCtx never leaks).
  * Extracted publishSession / clearSession helpers, deduping the
    h.mu-locked writes that previously appeared three times (Connect
    publish, Connect failure clear, Close clear).
  * Extracted runHandshake helper, encapsulating the 'close stdin on
    ctx cancel' goroutine.
  * Extracted fireToolsChanged helper for the OnConnect notification.
  * Result: Connect is ~35 lines and reads top-to-bottom as
    'spawn → publish → handshake → notify → return session'.

pkg/tools/builtin/lsp.go, pkg/tools/mcp/mcp.go
  * Toolset.Restart switches simplified from a state-by-state
    switch/case to 'if state.IsTerminal() { Start } else { RestartAndWait }'.
    Same behaviour, less ceremony.
  * mcp.Stop's nested error-return collapsed into a single guarded
    branch.

pkg/teamloader/lifecycle.go
  * Merged 'strict' and 'best-effort' profilePolicy branches (they
    produced the same supervisor policy; the user-facing Required flag
    differs at a higher layer).
  * Dropped redundant explicit zero values for Backoff in the
    'resilient' profile (zero already maps to supervisor defaults).
  * profileOf renamed profileName for clarity.

pkg/runtime/runtime.go
  * RestartToolset and toolsetStatusFor dropped their manual
    StartableToolSet unwrap. tools.As already walks the wrapper chain
    via Unwrapper, so the explicit unwrap was dead code.

Validation: golangci-lint clean, go run ./lint . clean,
go test ./... passes, go test -race passes for every package this
branch touches (the pre-existing race in pkg/runtime/cache_test.go
remains unrelated).

Assisted-By: docker-agent
…review

Three real issues introduced (or made more reachable) by the previous
simplification commit, plus one defensive guard to make the LSP
ensureInitialized fast path actually safe in the face of concurrent
Close.

CORRECTNESS

* lifecycle.State.String panicked on negative values: the simplified
  array-lookup version (State.String now uses 'stateNames[s]' instead
  of a switch) only checked 'int(s) < len(stateNames)', missing the
  s >= 0 guard. State(-1).String() reproduced the index-out-of-range
  panic. Restore the lower-bound check and add a regression test.

* lspSession.Close released h.mu between the shutdown handshake and
  the field-clearing helper, opening a brief window where a concurrent
  per-request method could acquire h.mu and write a JSON-RPC request
  to the already-shutting-down server (which had just received 'exit').
  The original pre-refactor code held h.mu across the entire teardown.
  Fix: split clearSession into clearSession (locks h.mu) and
  clearSessionLocked (caller holds h.mu); Close now does shutdown +
  clearSessionLocked under one lock.

* clearSessionLocked was nilling cmd / stdin / stdout BEFORE setting
  initialized=false. Combined with the atomic-only fast path in
  ensureInitialized, a reader could observe initialized=true after
  the fields were already nilled. Reorder so initialized=false is
  set first; that way a concurrent fast-path reader either
  short-circuits to slow path (which re-checks under h.mu) or
  proceeds with a still-coherent session.

DEFENCE-IN-DEPTH

* writeMessageLocked / readMessageLocked now check h.stdin / h.stdout
  for nil and return lifecycle.ErrNotStarted instead of dereferencing.
  This makes the per-request slow path safe even if a future
  refactor accidentally re-introduces a torn-state window: a panic
  becomes a clean error.

Validation: golangci-lint clean, go run ./lint . clean, full test
suite passes, and -race passes for every package this commit touches.

Assisted-By: docker-agent
CI runs golangci-lint v2.11.4 which has the modernize check enabled
and rejected:

  pkg/tools/lifecycle/supervisor_test.go:60:2: atomic: var calls int32
  may be simplified using atomic.Int32 (modernize)

Switch scriptedConnector.calls to atomic.Int32 with .Add()/.Load()
methods. Pure cleanup; same behaviour.

Assisted-By: docker-agent
@dgageot dgageot force-pushed the board/improving-mcp-and-lsp-server-lifecycle-r-9295e6e8 branch from a0664e5 to 015584e Compare April 28, 2026 16:23
@dgageot dgageot marked this pull request as ready for review April 28, 2026 16:37
dgageot added 4 commits April 28, 2026 18:43
Adds the missing user-facing documentation for the lifecycle/supervisor work in this PR:

- docs/configuration/tools/index.md — new "Toolset Lifecycle" section that explains the lifecycle block (profile presets + per-knob overrides), the `required` / `startup_timeout` not-yet-enforced caveat, and points at the runtime TUI commands.

- docs/features/tui/index.md — adds /toolsets and /toolset-restart to the slash command reference.

- docs/tools/lsp/index.md — documents capability filtering against the server's initialize response, plus the auto-restart behaviour and the lifecycle hook.

- docs/community/troubleshooting/index.md — points users at /toolsets and /toolset-restart for diagnosing stuck toolsets, and at the lifecycle block when a flaky dependency causes restart storms.

Assisted-By: docker-agent
Two pieces of feedback on the /toolsets dialog:

1. /toolsets duplicated /tools; both surfaces address the same question ("what can the agent do right now?") so they should be one.

2. The toolset description leaked Go type names — the Describe() fallback was fmt.Sprintf("%T", ts), so an LSP toolset rendered as "*builtin.LSPTool" and an MCP one as "mcp(stdio cmd=docker)". Neither helps a user reason about lifecycle.

Changes:

- Add tools.Kinder + Kind field on ToolsetStatus. mcp.Toolset.Kind() returns "MCP" or "Remote MCP" depending on the transport; builtin.LSPTool / builtin.LSPMultiplexer return "LSP". toolsets that don't implement Kinder are rendered as "Built-in" by the dialog, never as a Go type.

- Drop pkg/tui/dialog/toolsets.go. NewToolsDialog(toolsets, tools) is now the single combined view: Toolsets section on top (Name [state] Kind, with last_error / restarts indented when relevant), Tools section below (grouped by Category, indented under each heading).

- Remove the /toolsets slash command and ShowToolsetsDialogMsg. /tools now triggers the unified dialog. /toolset-restart <name> is unchanged.

- Update docs/configuration/tools, docs/features/tui, and docs/community/troubleshooting to point at /tools instead of /toolsets.

Tests: pkg/tui/dialog/tools_test.go covers the empty-state placeholders, the toolset section (Name/Kind/state/restart count/error), the Built-in fallback, the category grouping, long-error truncation, and rune-boundary truncation.

Assisted-By: docker-agent
Built-in toolsets (shell, filesystem, memory, todo, tasks, …) had no Name() method, so the /tools dialog fell back to fmt.Sprintf("%T", ts) and rendered them as "*builtin.ShellTool". The YAML already names them — `type: shell` — so use that.

Mechanism:

- New tools.Named interface + tools.WithName(ts, name) decorator. WithName is unwrap-aware: if any inner toolset already advertises a non-empty Name() it returns ts unchanged, so MCP toolsets that received a YAML `name:` keep theirs.

- ToolsetRegistry.CreateTool now wraps every toolset with WithName(cmp.Or(toolset.Name, toolset.Type)). Built-ins get "shell"/"filesystem"/… without each tool needing its own Name() method.

- mcp.Toolset and builtin.LSPTool/LSPMultiplexer get explicit Name() implementations: MCP returns the user-set name, the LSP types fall back to the command basename ("gopls", "rust-analyzer") so multiple language servers stay distinguishable.

- runtime.nameFor uses tools.GetName (which uses tools.As[Named]) so wrapper chains are transparent.

- DescribeToolSet now also walks the wrapper chain via tools.As[Describer] so it doesn't accidentally print "*tools.namedToolSet" for a wrapped toolset whose inner Describer is hidden by the new wrapper.

Tests: pkg/tools/named_test.go covers the no-op-when-already-named path, the empty-name path, and the unwrap-through-wrapper path. Existing teamloader tests that type-asserted directly on the registry output now use tools.As[*mcp.Toolset].

Assisted-By: docker-agent
Follow-up to the toolset Name() work: when an MCP toolset has no `name:` in YAML, the registry's tools.WithName wrap collapsed it to the bare YAML type "mcp" — and a config with three unnamed MCPs would render as three identical "mcp" rows in the /tools dialog. Pre-refactor each row showed its description ("mcp(stdio cmd=docker)", "mcp(remote host=api.github.com)", "mcp(ref=duckduckgo)"), which made them trivially distinguishable.

Restore that: mcp.Toolset.Name() now returns ts.name when set (so YAML-named MCPs keep their explicit, tool-prefix-providing name) and falls back to ts.description otherwise. Because Name() is non-empty, tools.WithName's no-op-when-named branch kicks in and the registry no longer overrides MCP rows with the type alone.

Tests: pkg/tools/mcp/describe_test.go now also covers TestToolsetName_PrefersConfiguredName and TestToolsetName_FallsBackToDescription so this contract is locked in.

Assisted-By: docker-agent
gtardif
gtardif previously approved these changes Apr 28, 2026
…tion

NewRemoteToolset hardwires NewKeyringTokenStore() into its OAuth transport, so the test built a real macOS-keychain-backed token store every time it ran. On dev machines that already have a docker-agent-oauth keychain item from a previous login, that is enough to pop the password prompt during `go test ./pkg/tools/mcp/...`.

The Name() fallback is a single `if ts.name != "" return ts.name; return ts.description` branch shared by stdio and remote toolsets — the stdio half of the test already covers it. Drop the remote half.

Assisted-By: docker-agent
@dgageot dgageot merged commit 3502b78 into docker:main Apr 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants