Improve scaling of bridge warm pool under load by aron-cf · Pull Request #786 · cloudflare/sandbox-sdk

aron-cf · 2026-06-26T09:39:04Z

Benchmarking the bridge warm pool surfaced two reliability problems and one speed problem. A cold burst of one hundred sandboxes failed roughly a third of the time, a primed pool still developed a latency tail beyond ten seconds for requests that overflowed the warm target, and priming an empty pool took over a minute because containers started one at a time. Underneath all of this, bridge requests were largely invisible in traces.

The pool now accepts its capacity ceiling up front through the WARM_POOL_MAX_INSTANCES variable instead of discovering it by failing a start and parsing the error, so its capacity math is correct from the first request. It refills the instant a warm container is taken rather than waiting for the next background sweep, with a guard so concurrent requests trigger at most one refill at a time and never exceed the ceiling. It also fills in small parallel batches instead of one container at a time, with capacity re-checked between batches; the batch size defaults to five and is tunable through WARM_POOL_SCALE_BATCH_SIZE. Together these took the same hundred-sandbox burst from a dozen slow sandboxes to none, with wall-clock time roughly halved.

Every bridge request is now wrapped in a custom span named for its operation and annotated with the sandbox identifier, the container identifier, and call-specific metadata such as the command, file path, tunnel port, or session. Tracing is enabled in the worker configuration with a tunable sampling rate, the container instance size is raised to standard-1, and the span instrumentation falls back to doing nothing where the runtime does not support it.

To verify, set WARM_POOL_TARGET and WARM_POOL_MAX_INSTANCES, deploy, and watch GET /v1/pool/stats during a burst: the reported ceiling is correct before any start is attempted, and the warm count recovers immediately rather than after the refresh interval. Spans for each operation appear in the dashboard carrying the sandbox identifier and metadata. The change is covered by unit tests for the seeded ceiling, eager refill and its concurrency guard, batched scale-up, and the tracing helper, with the existing bridge and SDK suites unchanged.

The streaming command and terminal endpoints record their setup metadata but close their spans when the handler returns rather than when the stream completes, a limitation of the callback-scoped span interface.

changeset-bot · 2026-06-26T09:39:13Z

🦋 Changeset detected

Latest commit: fede1f8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/sandbox	Patch

pkg-pr-new · 2026-06-26T09:41:14Z

npm i https://pkg.pr.new/@cloudflare/sandbox@786

commit: fede1f8

github-actions · 2026-06-26T09:41:16Z

📦 Preview Build

Version: 0.0.0-pr-786-fede1f82

Install the SDK preview:

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@786

🐳 Docker images were not rebuilt — no container changes detected. Use the latest release images from Docker Hub.

devin-ai-integration

Devin Review found 2 potential issues.

scuffi

nice 🔥 just the one devin bug reported worth looking at imo

The pool only learned its instance ceiling reactively, by failing a container start and parsing the platform error. Until that first 502 its capacity math ran blind, attempting starts doomed to fail. Accept the operator-known limit via WARM_POOL_MAX_INSTANCES so capacity decisions are correct from the first request, while keeping reactive learning as a backstop for lower platform-imposed limits.
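A minimal sketch of the seeded ceiling with a reactive backstop, assuming the limit arrives as a string environment variable. Only WARM_POOL_MAX_INSTANCES comes from this PR; `CapacityTracker`, `recordObservedLimit`, and `remaining` are hypothetical names for illustration and are not the actual warm-pool code.

```typescript
// Hypothetical environment shape; only the variable name is taken from the PR.
interface PoolEnv {
  WARM_POOL_MAX_INSTANCES?: string;
}

class CapacityTracker {
  private ceiling: number | undefined;

  constructor(env: PoolEnv) {
    const parsed = Number.parseInt(env.WARM_POOL_MAX_INSTANCES ?? "", 10);
    // Missing, non-numeric, or non-positive values leave the ceiling unseeded.
    this.ceiling = Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
  }

  // Backstop: a failed start shows the platform limit is at most `observedTotal`.
  // The ceiling is only ever lowered, never raised past the operator's value.
  recordObservedLimit(observedTotal: number): void {
    if (this.ceiling === undefined || observedTotal < this.ceiling) {
      this.ceiling = observedTotal;
    }
  }

  // Slots still available given the number of containers currently tracked.
  remaining(currentTotal: number): number {
    if (this.ceiling === undefined) return Number.POSITIVE_INFINITY;
    return Math.max(0, this.ceiling - currentTotal);
  }
}
```

With the ceiling seeded, `remaining()` gives a usable answer before any start has been attempted, and a lower platform-imposed limit still takes effect once it is observed.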

Replenishment only happened on the periodic alarm, so a burst that drained the warm pool left freed slots empty until the next tick. The overflow then raced cold starts all at once, producing a long latency tail. Trigger a debounced, non-blocking refill when a pop consumes a warm container so capacity recovers immediately. A single in-flight guard prevents concurrent pops from launching overlapping sweeps, and the refill stays bounded by remaining capacity.
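The single in-flight guard could look roughly like the sketch below. `RefillGuard`, `trigger`, and `idle` are hypothetical names, and the sketch only coalesces overlapping triggers; it does not implement a time-based debounce or the capacity bound, both of which the real refill would need.

```typescript
class RefillGuard {
  private inFlight: Promise<void> | null = null;
  private readonly refill: () => Promise<void>;

  constructor(refill: () => Promise<void>) {
    this.refill = refill;
  }

  // Called after a pop consumes a warm container. Returns immediately so the
  // request that triggered it is never blocked on the refill.
  trigger(): void {
    if (this.inFlight) return; // a sweep is already running; do not start another
    this.inFlight = this.refill()
      .catch(() => {
        // Errors are swallowed here; the periodic alarm remains the fallback.
      })
      .then(() => {
        this.inFlight = null;
      });
  }

  // Resolves when the current sweep (if any) has finished. Useful in tests.
  idle(): Promise<void> {
    return this.inFlight ?? Promise.resolve();
  }
}
```

Any number of concurrent pops during a burst collapse into one running sweep, and the next pop after that sweep finishes starts a fresh one.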

The pool started warm containers strictly one at a time, awaiting a full cold boot before beginning the next. Fill rate tracked per-container boot latency, leaving the pool underfilled for long windows after a cold start or burst. Start containers in bounded-parallel batches so fill time drops roughly by the batch factor, while re-checking capacity between batches keeps overshoot past the ceiling bounded. The batch size is tunable via WARM_POOL_SCALE_BATCH_SIZE and clamped to avoid a cold-start stampede.
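As an illustration of bounded-parallel filling, here is a sketch that starts containers in batches and re-checks capacity before each one. The default of five and the WARM_POOL_SCALE_BATCH_SIZE variable come from this PR; `MAX_BATCH`, `clampBatchSize`, `fillPool`, and the callback signatures are assumptions, and the actual clamp bound is not stated in the PR.

```typescript
const MAX_BATCH = 10; // hypothetical upper clamp; the real bound is not given in the PR

// Parse the raw WARM_POOL_SCALE_BATCH_SIZE value and clamp it to [1, MAX_BATCH].
function clampBatchSize(raw: string | undefined, fallback = 5): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed)) return fallback;
  return Math.min(MAX_BATCH, Math.max(1, parsed));
}

async function fillPool(
  needed: number,
  batchSize: number,
  remainingCapacity: () => number,
  startOne: () => Promise<boolean>
): Promise<number> {
  let started = 0;
  while (started < needed) {
    // Capacity is re-read before every batch, so a batch never asks for more
    // than what is left at that moment.
    const size = Math.min(batchSize, needed - started, remainingCapacity());
    if (size <= 0) break;
    const results = await Promise.all(
      Array.from({ length: size }, () => startOne().catch(() => false))
    );
    const ok = results.filter(Boolean).length;
    started += ok;
    if (ok === 0) break; // nothing booted; leave the rest to the next sweep
  }
  return started;
}
```

If each cold boot takes roughly the same time, filling N containers takes about N divided by the batch size boots' worth of wall-clock time instead of N.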

Bridge requests were invisible in traces beyond the automatic platform instrumentation, making it hard to attribute latency to a specific sandbox or operation. Wrap each route handler in a bridge.<operation> span seeded with the sandbox ID, container UUID, and HTTP method, and annotate operation-specific metadata such as the command, file path, tunnel port, and session ID. Tracing degrades to a no-op when the runtime does not expose the custom-span API.
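A sketch of the wrapper pattern with a no-op fallback. `TracerLike` and `SpanLike` are hypothetical stand-ins for a callback-scoped span interface and do not reproduce the runtime's actual custom-span API; `withBridgeSpan` and the attribute keys in the usage note are likewise illustrative. Only the `bridge.<operation>` naming comes from this PR.

```typescript
type AttrValue = string | number | boolean;
type Attrs = Record<string, AttrValue>;

// Hypothetical callback-scoped span interface; the real runtime API may differ.
interface SpanLike {
  setAttribute(key: string, value: AttrValue): void;
}
interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Span that accepts attributes and discards them.
const NOOP_SPAN: SpanLike = { setAttribute() {} };

function withBridgeSpan<T>(
  tracer: TracerLike | undefined,
  operation: string,
  base: Attrs,
  fn: (span: SpanLike) => T
): T {
  // No usable tracer: run the handler untraced instead of failing the request.
  if (!tracer || typeof tracer.startActiveSpan !== "function") {
    return fn(NOOP_SPAN);
  }
  return tracer.startActiveSpan(`bridge.${operation}`, (span) => {
    // Seed attributes shared by every bridge operation.
    for (const key of Object.keys(base)) span.setAttribute(key, base[key]);
    // The handler adds operation-specific metadata through the same span.
    return fn(span);
  });
}
```

A handler would then call something like `withBridgeSpan(tracer, "exec", { "sandbox.id": id }, (span) => { span.setAttribute("command", cmd); ... })`. Because the span is scoped to the callback, a handler that returns a stream ends its span at return rather than at stream completion.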

Add a trace head-sampling rate alongside the enabled custom spans and bump the sandbox container to standard-1 (4 vCPU / 8 GiB) for production-grade capacity.

Parallel batch starts called recordCapacityLimit() before the batch's successes were tracked, so the inferred ceiling fell below the real container count and triggered spurious capacity rejections. Recompute it from the accurate total once a batch exhausts capacity.
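The arithmetic behind that fix can be sketched as follows, assuming each start in a batch resolves to either a success or a capacity rejection. `inferCeiling` and the `"ok" | "limit"` outcome type are hypothetical and are not the signature of recordCapacityLimit().

```typescript
type StartOutcome = "ok" | "limit";

// Infer the ceiling only after the whole batch has settled, so that every
// success in the batch is counted. Returns undefined if no start hit the limit.
function inferCeiling(
  trackedBeforeBatch: number,
  outcomes: StartOutcome[]
): number | undefined {
  const succeeded = outcomes.filter((o) => o === "ok").length;
  const hitLimit = outcomes.some((o) => o === "limit");
  return hitLimit ? trackedBeforeBatch + succeeded : undefined;
}

// Example: 8 containers tracked, a batch of 5 where 3 boot and 2 are rejected.
// Recording at the first rejection would have stored 8; the settled total is 11.
const inferred = inferCeiling(8, ["ok", "limit", "ok", "limit", "ok"]);
```

Recording the limit mid-batch stores a ceiling below the number of containers that actually exist, which is what made later capacity checks reject requests that should have succeeded.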

The tracing module was missing from the bridge key-files index and the worker README omitted WARM_POOL_MAX_INSTANCES and WARM_POOL_SCALE_BATCH_SIZE, leaving the authoritative references stale.

aron-cf marked this pull request as ready for review June 26, 2026 13:41

aron-cf requested review from ghostwriternr, scuffi and whoiskatrin as code owners June 26, 2026 13:41

devin-ai-integration Bot reviewed Jun 26, 2026

Comment thread packages/sandbox/src/bridge/warm-pool.ts

Comment thread packages/sandbox/src/bridge/tracing.ts

scuffi approved these changes Jun 29, 2026

aron-cf added 7 commits June 30, 2026 16:36

Tune bridge config for tracing and capacity

d60e7ef

Document tracing module and new warm pool vars

fede1f8

aron-cf force-pushed the bridge-warm-pool branch from 293c143 to fede1f8 June 30, 2026 15:36

aron-cf wants to merge 7 commits into main from bridge-warm-pool