Improve scaling of bridge warm pool under load #786
Open
aron-cf wants to merge 7 commits into
🦋 Changeset detected. Latest commit: fede1f8. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
scuffi
approved these changes
Jun 29, 2026
scuffi
left a comment
nice 🔥 just the one devin bug reported worth looking at imo
The pool only learned its instance ceiling reactively, by failing a container start and parsing the platform error. Until that first 502 its capacity math ran blind, attempting starts doomed to fail. Accept the operator-known limit via WARM_POOL_MAX_INSTANCES so capacity decisions are correct from the first request, while keeping reactive learning as a backstop for lower platform-imposed limits.
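A minimal sketch of how a seeded ceiling and a reactively learned one could combine, assuming the lower of the two wins. Only WARM_POOL_MAX_INSTANCES comes from this PR; `CapacityState`, `parseMaxInstances`, `effectiveCeiling`, and `remainingCapacity` are hypothetical names used for illustration, not the pool's actual code.

```typescript
// Hypothetical capacity state: one value seeded by the operator, one learned
// from a platform error. Neither field name is taken from the PR.
interface CapacityState {
  configuredMax: number | null; // parsed from WARM_POOL_MAX_INSTANCES
  learnedMax: number | null;    // inferred after a failed container start
}

// Parse the raw variable; anything that is not a positive integer is ignored.
function parseMaxInstances(raw: string | undefined): number | null {
  if (raw === undefined || raw.trim() === "") return null;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : null;
}

// The effective ceiling is the lowest known limit, so a lower
// platform-imposed limit still acts as a backstop.
function effectiveCeiling(state: CapacityState): number {
  const known = [state.configuredMax, state.learnedMax].filter(
    (v): v is number => v !== null,
  );
  return known.length === 0 ? Infinity : Math.min(...known);
}

// Slots that may still be started without exceeding the ceiling.
function remainingCapacity(state: CapacityState, running: number): number {
  return Math.max(0, effectiveCeiling(state) - running);
}
```

With this shape, a pool seeded with 50 reports 50 before any start is attempted, and drops to 30 only if the platform later rejects a start at that count.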
Replenishment only happened on the periodic alarm, so a burst that drained the warm pool left freed slots empty until the next tick. The overflow then raced cold starts all at once, producing a long latency tail. Trigger a debounced, non-blocking refill when a pop consumes a warm container so capacity recovers immediately. A single in-flight guard prevents concurrent pops from launching overlapping sweeps, and the refill stays bounded by remaining capacity.
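The single in-flight guard described above can be sketched as follows. `RefillGuard` and its methods are hypothetical names, and the sketch models only the guard: it does not include a timer-based debounce or the capacity bound, both of which the real refill would need.

```typescript
// Hypothetical guard: at most one refill sweep runs at a time, and the
// caller (the pop path) never waits for it.
class RefillGuard {
  private inFlight: Promise<void> | null = null;
  private readonly refill: () => Promise<void>;

  constructor(refill: () => Promise<void>) {
    this.refill = refill;
  }

  // Called after a pop consumes a warm container. Returns true if this call
  // launched a sweep, false if one was already running.
  trigger(): boolean {
    if (this.inFlight !== null) return false;
    const clear = (): void => {
      this.inFlight = null;
    };
    // Clear the guard on success and on failure so a failed sweep does not
    // block later refills; the periodic alarm remains the fallback.
    this.inFlight = this.refill().then(clear, clear);
    return true;
  }

  // Resolves once no sweep is running; useful in tests.
  idle(): Promise<void> {
    return this.inFlight ?? Promise.resolve();
  }
}
```

Because `trigger()` returns synchronously, a burst of concurrent pops pays nothing for the refill, and only the first pop in the burst starts a sweep.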
The pool started warm containers strictly one at a time, awaiting a full cold boot before beginning the next. Fill rate tracked per-container boot latency, leaving the pool underfilled for long windows after a cold start or burst. Start containers in bounded-parallel batches so fill time drops roughly by the batch factor, while re-checking capacity between batches keeps overshoot past the ceiling bounded. The batch size is tunable via WARM_POOL_SCALE_BATCH_SIZE and clamped to avoid a cold-start stampede.
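A rough sketch of bounded-parallel scale-up with a capacity re-check between batches. The default of five comes from the PR description; the upper clamp of ten, and the names `clampBatchSize` and `scaleUp`, are assumptions made for illustration.

```typescript
const DEFAULT_BATCH_SIZE = 5; // default stated in the PR description
const MAX_BATCH_SIZE = 10;    // assumed clamp; the real bound is not given

// Parse WARM_POOL_SCALE_BATCH_SIZE, falling back to the default on bad input
// and clamping large values to avoid a cold-start stampede.
function clampBatchSize(raw: string | undefined): number {
  const n = Number(raw);
  if (raw === undefined || !Number.isInteger(n) || n < 1) return DEFAULT_BATCH_SIZE;
  return Math.min(n, MAX_BATCH_SIZE);
}

// Start up to `needed` containers in parallel batches. Capacity is re-read
// before every batch, so overshoot is limited to one batch at most.
async function scaleUp(
  needed: number,
  batchSize: number,
  remainingCapacity: () => number,
  startContainer: () => Promise<boolean>,
): Promise<number> {
  let started = 0;
  while (started < needed) {
    const size = Math.min(batchSize, needed - started, remainingCapacity());
    if (size <= 0) break; // ceiling reached
    const results = await Promise.all(
      Array.from({ length: size }, () => startContainer()),
    );
    const ok = results.filter(Boolean).length;
    started += ok;
    if (ok === 0) break; // whole batch failed; leave retries to the next sweep
  }
  return started;
}
```

With a ceiling of 8, a request for 12, and a batch size of 5, this starts a batch of 5, then a batch of 3, then stops, so fill time is two boot latencies instead of eight.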
Bridge requests were invisible in traces beyond the automatic platform instrumentation, making it hard to attribute latency to a specific sandbox or operation. Wrap each route handler in a bridge.<operation> span seeded with the sandbox ID, container UUID, and HTTP method, and annotate operation-specific metadata such as the command, file path, tunnel port, and session ID. Tracing degrades to a no-op when the runtime does not expose the custom-span API.
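One way to picture the span wrapper and its no-op fallback, assuming a callback-scoped span interface. `Span`, `EnterSpan`, and `withBridgeSpan` are hypothetical types and names defined here for the sketch; they are not the platform's custom-span API, whose exact shape the PR does not show.

```typescript
// Hypothetical span surface: just enough to attach attributes.
interface Span {
  setAttribute(key: string, value: string | number): void;
}

// Hypothetical callback-scoped tracer: the span closes when `fn` returns.
type EnterSpan = <T>(name: string, fn: (span: Span) => T) => T;

// Used when the runtime exposes no custom-span API.
const NOOP_SPAN: Span = { setAttribute() {} };

// Run a route handler inside a bridge.<operation> span, seeding it with
// attributes such as the sandbox ID, container UUID, and HTTP method.
function withBridgeSpan<T>(
  enterSpan: EnterSpan | undefined,
  operation: string,
  attrs: Record<string, string | number>,
  handler: (span: Span) => T,
): T {
  const run = (span: Span): T => {
    for (const [key, value] of Object.entries(attrs)) {
      span.setAttribute(key, value);
    }
    return handler(span);
  };
  return enterSpan ? enterSpan(`bridge.${operation}`, run) : run(NOOP_SPAN);
}
```

Because the span is scoped to the callback, a handler that returns a stream would close its span when the handler returns, not when the stream finishes.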
Add a trace head-sampling rate alongside the enabled custom spans and bump the sandbox container to standard-1 (4 vCPU / 8 GiB) for production-grade capacity.
Parallel batch starts called recordCapacityLimit() before the batch's successes were tracked, so the inferred ceiling fell below the real container count and triggered spurious capacity rejections. Recompute it from the accurate total once a batch exhausts capacity.
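A small worked example of the arithmetic behind this fix, using hypothetical names (`BatchOutcome`, `inferCeiling`); recordCapacityLimit() itself is not reproduced here. Suppose 8 containers are running and a batch of 5 is launched, of which 3 succeed and 2 hit the platform limit. Recording the ceiling from the pre-batch count gives 8, while the real container count is 11.

```typescript
// Hypothetical summary of one parallel batch.
interface BatchOutcome {
  succeeded: number;    // starts in the batch that completed
  hitCapacity: boolean; // whether any start was rejected for capacity
}

// Infer the ceiling only after the batch's successes are counted, so it
// never falls below the number of containers actually running.
function inferCeiling(
  runningBeforeBatch: number,
  outcome: BatchOutcome,
): number | null {
  if (!outcome.hitCapacity) return null; // nothing learned from this batch
  return runningBeforeBatch + outcome.succeeded;
}
```

For the case above, `inferCeiling(8, { succeeded: 3, hitCapacity: true })` evaluates to 11, matching the real total.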
The tracing module was missing from the bridge key-files index and the worker README omitted WARM_POOL_MAX_INSTANCES and WARM_POOL_SCALE_BATCH_SIZE, leaving the authoritative references stale.
293c143 to fede1f8
Benchmarking the bridge warm pool surfaced two reliability problems and one speed problem. A cold burst of one hundred sandboxes failed roughly a third of the time, a primed pool still developed a latency tail beyond ten seconds for requests that overflowed the warm target, and priming an empty pool took over a minute because containers started one at a time. Underneath all of this, bridge requests were largely invisible in traces.
The pool now accepts its capacity ceiling up front through the WARM_POOL_MAX_INSTANCES variable instead of discovering it by failing a start and parsing the error, so its capacity math is correct from the first request. It refills the instant a warm container is taken rather than waiting for the next background sweep, with a guard so concurrent requests trigger at most one refill at a time and never exceed the ceiling. It also fills in small parallel batches instead of one container at a time, with capacity re-checked between batches; the batch size defaults to five and is tunable through WARM_POOL_SCALE_BATCH_SIZE. Together these took the same hundred-sandbox burst from a dozen slow sandboxes to none, with wall-clock time roughly halved.
Every bridge request is now wrapped in a custom span named for its operation and annotated with the sandbox identifier, the container identifier, and call-specific metadata such as the command, file path, tunnel port, or session. Tracing is enabled in the worker configuration with a tunable sampling rate, the container instance size is raised to standard-1, and the span instrumentation falls back to doing nothing where the runtime does not support it.
To verify, set WARM_POOL_TARGET and WARM_POOL_MAX_INSTANCES, deploy, and watch GET /v1/pool/stats during a burst: the reported ceiling is correct before any start is attempted and the warm count recovers immediately rather than after the refresh interval. Spans for each operation appear in the dashboard carrying the sandbox identifier and metadata. The change is covered by unit tests for the seeded ceiling, eager refill and its concurrency guard, batched scale-up, and the tracing helper, with the existing bridge and software development kit suites unchanged.
The streaming command and terminal endpoints record their setup metadata but close their spans when the handler returns rather than when the stream completes, a limitation of the callback-scoped span interface.