
Improve scaling of bridge warm pool under load #786

Open
aron-cf wants to merge 7 commits into main from bridge-warm-pool



@aron-cf aron-cf commented Jun 26, 2026

Contributor

Benchmarking the bridge warm pool surfaced two reliability problems and one speed problem. A cold burst of one hundred sandboxes failed roughly a third of the time, a primed pool still developed a latency tail beyond ten seconds for requests that overflowed the warm target, and priming an empty pool took over a minute because containers started one at a time. Underneath all of this, bridge requests were largely invisible in traces.

The pool now accepts its capacity ceiling up front through the WARM_POOL_MAX_INSTANCES variable instead of discovering it by failing a start and parsing the error, so its capacity math is correct from the first request. It refills the instant a warm container is taken rather than waiting for the next background sweep, with a guard so concurrent requests trigger at most one refill at a time and never exceed the ceiling. It also fills in small parallel batches instead of one container at a time, with capacity re-checked between batches; the batch size defaults to five and is tunable through WARM_POOL_SCALE_BATCH_SIZE. Together these took the same hundred-sandbox burst from a dozen slow sandboxes to none, with wall-clock time roughly halved.
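
As a minimal sketch of the seeded ceiling described above, the capacity math could look roughly like the following. Only the three variable names and the default batch size of five come from this description; the `PoolEnv` and `PoolConfig` shapes, the helper names, and the upper clamp of 20 are assumptions made for illustration, not the code in this PR.

```typescript
// Hypothetical environment shape; only the variable names are from the PR.
interface PoolEnv {
  WARM_POOL_TARGET?: string;
  WARM_POOL_MAX_INSTANCES?: string;
  WARM_POOL_SCALE_BATCH_SIZE?: string;
}

interface PoolConfig {
  warmTarget: number;
  maxInstances: number | null; // null: ceiling unknown until learned reactively
  batchSize: number;
}

const DEFAULT_BATCH_SIZE = 5; // default stated in the PR description
const MAX_BATCH_SIZE = 20; // assumed clamp; the real bound is not stated

// Returns a positive integer, or null when the value is missing or invalid.
function parsePositiveInt(raw: string | undefined): number | null {
  if (raw === undefined) return null;
  const n = Number.parseInt(raw, 10);
  return Number.isFinite(n) && n > 0 ? n : null;
}

function readPoolConfig(env: PoolEnv): PoolConfig {
  const batch =
    parsePositiveInt(env.WARM_POOL_SCALE_BATCH_SIZE) ?? DEFAULT_BATCH_SIZE;
  return {
    warmTarget: parsePositiveInt(env.WARM_POOL_TARGET) ?? 0,
    maxInstances: parsePositiveInt(env.WARM_POOL_MAX_INSTANCES),
    batchSize: Math.min(batch, MAX_BATCH_SIZE),
  };
}

// How many more warm containers may be started right now: the shortfall
// against the warm target, capped by the headroom under the ceiling.
function startsAllowed(cfg: PoolConfig, warm: number, assigned: number): number {
  const wanted = Math.max(0, cfg.warmTarget - warm);
  if (cfg.maxInstances === null) return wanted;
  const headroom = Math.max(0, cfg.maxInstances - (warm + assigned));
  return Math.min(wanted, headroom);
}
```

With a target of 10, a ceiling of 12, 2 warm containers and 9 assigned, `startsAllowed` yields 1: the pool wants 8 more but only one slot remains under the ceiling, so no doomed starts are attempted.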

Every bridge request is now wrapped in a custom span named for its operation and annotated with the sandbox identifier, the container identifier, and call-specific metadata such as the command, file path, tunnel port, or session. Tracing is enabled in the worker configuration with a tunable sampling rate, the container instance size is raised to standard-1, and the span instrumentation falls back to doing nothing where the runtime does not support it.
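
The wrap-and-fall-back behaviour can be illustrated with a small sketch. The `TracerLike` and `SpanLike` interfaces below are hypothetical stand-ins for a callback-scoped span interface and are not the runtime's actual API; the attribute key in the usage note is likewise an assumption. Only the `bridge.<operation>` naming comes from this PR.

```typescript
type AttrValue = string | number | boolean;
type Attrs = Record<string, AttrValue>;

// Hypothetical callback-scoped span interface; the real runtime API may differ.
interface SpanLike {
  setAttribute(key: string, value: AttrValue): void;
}
interface TracerLike {
  span<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Span used when tracing is unavailable: accepts attributes and drops them.
const NOOP_SPAN: SpanLike = { setAttribute() {} };

// Runs `fn` inside a span named bridge.<operation>, or directly when the
// runtime exposes no tracer, so handlers behave the same either way.
function withBridgeSpan<T>(
  tracer: TracerLike | undefined,
  operation: string,
  attrs: Attrs,
  fn: (span: SpanLike) => T,
): T {
  if (!tracer || typeof tracer.span !== "function") return fn(NOOP_SPAN);
  return tracer.span(`bridge.${operation}`, (span) => {
    for (const [key, value] of Object.entries(attrs)) {
      span.setAttribute(key, value);
    }
    return fn(span);
  });
}
```

A handler would call something like `withBridgeSpan(tracer, "exec", { "sandbox.id": id }, (span) => run(span))`. Because the span is scoped to the callback, a handler that returns a stream hands control back before the stream finishes, which is where the limitation noted for the streaming endpoints comes from.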

To verify, set WARM_POOL_TARGET and WARM_POOL_MAX_INSTANCES, deploy, and watch GET /v1/pool/stats during a burst: the reported ceiling is correct before any start is attempted and the warm count recovers immediately rather than after the refresh interval. Spans for each operation appear in the dashboard carrying the sandbox identifier and metadata. The change is covered by unit tests for the seeded ceiling, eager refill and its concurrency guard, batched scale-up, and the tracing helper, with the existing bridge and software development kit suites unchanged.

The streaming command and terminal endpoints record their setup metadata but close their spans when the handler returns rather than when the stream completes, a limitation of the callback-scoped span interface.


changeset-bot Bot commented Jun 26, 2026


🦋 Changeset detected

Latest commit: fede1f8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch



pkg-pr-new Bot commented Jun 26, 2026


npm i https://pkg.pr.new/@cloudflare/sandbox@786

commit: fede1f8


github-actions Bot commented Jun 26, 2026

Contributor

📦 Preview Build

Version: 0.0.0-pr-786-fede1f82

Install the SDK preview:

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@786

🐳 Docker images were not rebuilt — no container changes detected. Use the latest release images from Docker Hub.

@aron-cf aron-cf marked this pull request as ready for review June 26, 2026 13:41

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor


Devin Review found 2 potential issues.

Comment thread packages/sandbox/src/bridge/warm-pool.ts
Comment thread packages/sandbox/src/bridge/tracing.ts

@scuffi scuffi left a comment

Contributor


nice 🔥 just the one devin bug reported worth looking at imo

aron-cf added 7 commits June 30, 2026 16:36
The pool only learned its instance ceiling reactively, by failing a
container start and parsing the platform error. Until that first 502 its
capacity math ran blind, attempting starts doomed to fail. Accept the
operator-known limit via WARM_POOL_MAX_INSTANCES so capacity decisions
are correct from the first request, while keeping reactive learning as a
backstop for lower platform-imposed limits.

Replenishment only happened on the periodic alarm, so a burst that
drained the warm pool left freed slots empty until the next tick. The
overflow then raced cold starts all at once, producing a long latency
tail. Trigger a debounced, non-blocking refill when a pop consumes a
warm container so capacity recovers immediately. A single in-flight
guard prevents concurrent pops from launching overlapping sweeps, and
the refill stays bounded by remaining capacity.
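
The single in-flight guard from the commit above could be sketched along these lines. The `RefillGuard` class and its method names are hypothetical and show only the guard itself, not any time-based debouncing or the pool's actual sweep; swallowing refill errors is an assumption that the periodic alarm will retry.

```typescript
// Ensures at most one refill sweep runs at a time, without blocking callers.
class RefillGuard {
  private running = false;
  private current: Promise<void> = Promise.resolve();
  private readonly refill: () => Promise<void>;

  constructor(refill: () => Promise<void>) {
    this.refill = refill;
  }

  // Called after a pop consumes a warm container. Returns immediately;
  // calls made while a sweep is in flight are ignored.
  trigger(): void {
    if (this.running) return;
    this.running = true;
    this.current = this.run();
  }

  // Resolves once the most recent sweep (if any) has finished.
  idle(): Promise<void> {
    return this.current;
  }

  private async run(): Promise<void> {
    try {
      await this.refill();
    } catch {
      // Assumed policy: a failed sweep is left for the periodic alarm.
    } finally {
      this.running = false;
    }
  }
}
```

Three pops arriving in the same tick call `trigger()` three times but launch one sweep; the next pop after that sweep settles launches another.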

The pool started warm containers strictly one at a time, awaiting a full
cold boot before beginning the next. Fill rate tracked per-container boot
latency, leaving the pool underfilled for long windows after a cold start
or burst. Start containers in bounded-parallel batches so fill time drops
roughly by the batch factor, while re-checking capacity between batches
keeps overshoot past the ceiling bounded. The batch size is tunable via
WARM_POOL_SCALE_BATCH_SIZE and clamped to avoid a cold-start stampede.
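
A bounded-parallel fill with a capacity re-check between batches might be sketched as follows. The function name, the `start` callback and the `remainingCapacity` callback are hypothetical, and stopping after a partially failed batch is an assumption about how a platform limit would be handled.

```typescript
// Starts up to `needed` containers in batches of at most `batchSize`,
// re-reading the remaining capacity before each batch.
async function fillInBatches(
  needed: number,
  batchSize: number,
  remainingCapacity: () => number,
  start: () => Promise<boolean>, // resolves true when a container started
): Promise<number> {
  let started = 0;
  while (started < needed) {
    const room = Math.min(batchSize, needed - started, remainingCapacity());
    if (room <= 0) break; // ceiling reached, stop before overshooting

    // Launch the whole batch in parallel; a rejected start counts as failed.
    const results = await Promise.all(
      Array.from({ length: room }, () => start().catch(() => false)),
    );
    const succeeded = results.filter(Boolean).length;
    started += succeeded;

    // Assumed policy: any failure in a batch ends the sweep early.
    if (succeeded < room) break;
  }
  return started;
}
```

With 12 containers wanted, a batch size of 5 and 7 free slots, the sweep runs a batch of 5, then a batch of 2, then stops because no capacity remains.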

Bridge requests were invisible in traces beyond the automatic platform
instrumentation, making it hard to attribute latency to a specific
sandbox or operation. Wrap each route handler in a bridge.<operation>
span seeded with the sandbox ID, container UUID, and HTTP method, and
annotate operation-specific metadata such as the command, file path,
tunnel port, and session ID. Tracing degrades to a no-op when the runtime
does not expose the custom-span API.

Add a trace head-sampling rate alongside the enabled custom spans and bump
the sandbox container to standard-1 (4 vCPU / 8 GiB) for production-grade
capacity.

Parallel batch starts call recordCapacityLimit() before the batch's
successes are tracked, so the inferred ceiling fell below the real
container count and triggered spurious capacity rejections. Recompute it
from the accurate total once a batch exhausts capacity.
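
The ordering fix above can be illustrated with a small pure function. `BatchOutcome` and `inferCeiling` are hypothetical names, and taking the minimum with an existing ceiling is an assumption based on reactive learning being described as a backstop for lower platform limits.

```typescript
interface BatchOutcome {
  started: number; // starts in this batch that succeeded
  hitCapacity: boolean; // at least one start was rejected for capacity
}

// Infers the ceiling only after the batch's successes are counted, so the
// result reflects the real container total instead of the pre-batch total.
function inferCeiling(
  trackedBefore: number,
  outcome: BatchOutcome,
  currentCeiling: number | null,
): number | null {
  if (!outcome.hitCapacity) return currentCeiling;
  const observedTotal = trackedBefore + outcome.started;
  return currentCeiling === null
    ? observedTotal
    : Math.min(currentCeiling, observedTotal);
}
```

If 10 containers were tracked before a batch in which 3 starts succeeded and one hit the limit, the inferred ceiling is 13. Recording it mid-batch would have produced 10, below the real count, which is the source of the spurious rejections described in the commit.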

The tracing module was missing from the bridge key-files index and the
worker README omitted WARM_POOL_MAX_INSTANCES and WARM_POOL_SCALE_BATCH_SIZE,
leaving the authoritative references stale.
@aron-cf aron-cf force-pushed the bridge-warm-pool branch from 293c143 to fede1f8 Compare June 30, 2026 15:36

2 participants