The Cloudflare Blog

How we found a bug in the hyper HTTP library

Deanna Lam — Mon, 22 Jun 2026 18:00:00 GMT

The Images service, built in Rust on Workers, runs on every machine in Cloudflare’s edge network. To handle client connections, we use hyper, an open-source HTTP library for Rust.

Last year, we introduced the Images binding to enable custom, programmatic workflows for processing remote images in Workers. At the end of 2025, we rearchitected the binding to provide a more direct, local connection between the Workers runtime and the Images service.

Shortly after rollout, we received reports that transformation requests from the binding were failing — but only intermittently and only for larger images. Even stranger, the responses for these requests returned a 200 status without any errors logged. The image data was simply cut short: A response that should have been two megabytes might arrive with a few hundred kilobytes instead.

We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.

Hops, handoffs, and hyper

When developers build on Cloudflare, they compose full-stack applications from a set of platform services that are accessible to Workers through bindings. Bindings provide direct APIs to resources on the Developer Platform like compute, storage, AI inference, and media processing.

The Images binding decouples image optimization from delivery; you can transcode, composite, or manipulate images without needing to return the output as an HTTP response. It also lets you apply optimization parameters in any order, rather than following the fixed sequence imposed by the URL interface. Here, a worker can pass image data directly to the Images API, chain operations together, and get the processed result back as a stream:

const result = await env.IMAGES
  .input(image)
  .transform({ width: 800, rotate: 90 })
  .output({ format: "image/avif" });
return result.response();

At a high level, this is how image data moves through our various services:

The pipe represents a socket connection between the intermediary and Images, where data is handed off from one process to the next through the kernel’s buffer.

The binding communicates with Images through a socket connection managed by the Workers runtime. A socket connection is a communication channel between two processes. Each end of the socket has buffers that are managed by the operating system’s kernel; these buffers are temporary holding areas where data sits after one side writes it but before the other side reads it.

Hyper manages the connection on the Images service’s side, reading incoming requests from the socket and writing responses back to it.

When a request uses the Images binding, the Images service reads the input, performs the requested optimization operations, and encodes the result. It then passes the entire encoded image to hyper as a single in-memory block.

Hyper writes this response data into its own internal buffer. At this point, hyper considers the encoding work as complete, since it has all the bytes that it needs to send. The next step is to flush its internal buffer to the socket’s outbound buffer, moving the data from the Images service to the intermediary on the other end.

If the reader on the other end is fast, then hyper can flush everything in one pass — the outbound buffer will have room because the reader is consuming data as quickly as it arrives. Once all data is sent, hyper issues a shutdown on the socket, signaling that the connection is finished and no more data will be written. But if the reader is slower (even by a few milliseconds), then the outbound buffer fills up, and hyper needs to wait until there’s room to continue writing.
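As a rough illustration of the flush behavior described above, here is a toy in-process model. It does not use hyper or a real socket; `BoundedSocket`, `flush_once`, and the byte counts are hypothetical names and numbers chosen for the sketch.

```rust
/// Toy stand-in for a socket's outbound kernel buffer (hypothetical, not a real socket).
struct BoundedSocket {
    capacity: usize, // bytes the buffer can hold before writes stall
    queued: usize,   // bytes written but not yet consumed by the reader
}

impl BoundedSocket {
    /// Accepts as many bytes as currently fit and returns how many were taken.
    fn write(&mut self, data: &[u8]) -> usize {
        let n = data.len().min(self.capacity - self.queued);
        self.queued += n;
        n
    }

    /// The reader on the other end consumes up to `n` queued bytes.
    fn drain(&mut self, n: usize) {
        self.queued -= n.min(self.queued);
    }
}

/// One flush pass: push pending bytes into the socket and
/// return how many bytes are still waiting in the internal buffer.
fn flush_once(pending: &mut Vec<u8>, sock: &mut BoundedSocket) -> usize {
    let n = sock.write(pending.as_slice());
    pending.drain(..n);
    pending.len()
}
```

With a 200-byte socket and a 500-byte response, the first pass leaves 300 bytes behind, and a second pass makes no progress until the reader drains the socket. A fast reader drains between passes, so the writer never observes a full buffer.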

Taking the local

All incoming traffic on Cloudflare's network passes through FL, an internal intermediary service that runs security and performance features and routes requests to the appropriate backend. When we first launched the binding, image data flowed from the Workers runtime, through FL, to the Images service.

This path was a natural fit for our initial release and follows the same architecture as our URL interface. Over time, though, this coupling with FL became a constraint: Every change to the binding had to follow FL’s release cycle.

In December 2025, the Images team replaced FL with a new intermediary service, an internal worker binding that runs on the same machine. In the original architecture, data moved through FL over network sockets; this path carried the overhead of FL’s full processing pipeline, such as DNS lookups and routing.

The internal binding replaced these with Unix sockets to directly connect the services on the same machine, bypassing FL and the overhead of the network stack. This made the request path to Images faster and gave the team independent control over binding releases.

Within days of the rollout, we received our first customer report.

200 OK (not OK)

The first sign of trouble came from a customer with a non-standard setup: two layers of image processing, where one pipeline was nested inside another.

First, their worker used the Images binding to composite multiple large source images from R2 — a JPEG background plus PNG overlay layers — into a single combined JPEG. Second, they further compressed, transcoded, and resized the result through the URL interface.

The bug originated in the inner pipeline’s return path, where the response was truncated before reaching the outer pipeline.

The inner pipeline (transformation binding) handled compositing. The outer pipeline (transformation URL) handled delivery optimizations like scaling and format conversion. This layered approach meant that when the inner pipeline silently returned a truncated response, the only visible error appeared one level up:

error reading a body from connection: end of file before message length reached

The outer pipeline received HTTP 200 from the inner one, with a Content-Length header that promised several megabytes. The actual body was only a fraction of that: In one request, only ~200 KB arrived out of an expected 3.3 MB. The error surfaced in the outer pipeline, but the truncation could have originated in the binding, the intermediary service, the Images service, or somewhere in between.

When a browser receives a truncated image, the result is visible. Depending on the format, the image either renders partially (e.g., with the bottom half missing or gray) or fails to decode entirely, instead displaying a broken image.

Debugging in the dark

From here, we worked inward through the request path, testing each layer to isolate where the truncation was happening. Some of these efforts hit dead ends; others left breadcrumbs that narrowed the search:

Building a reproduction. We built a worker that mimicked the customer’s nested setup, then stripped away layers until we could trigger the bug with the binding alone. A small script let us fire requests in batches. In one early run, 19 out of 25 requests failed. The amount of data that did arrive — roughly 200 KB — was suspiciously close to the size of the socket buffer in production. This confirmed that the problem wasn’t tied to the customer’s configuration and gave us a reliable way to trigger the bug on demand.
Investigating timeouts. Early on, we suspected the truncation might be related to timeout behavior (i.e., the connection was being closed after a time limit). This theory didn’t hold, as the truncation wasn’t correlated with request duration.
Updating hyper version. When the bug was first reported, we were running 0.14.x, while the latest hyper version was around 1.8.x. We tested across hyper versions 0.14, 1.7, and 1.8, just in case the most obvious answer was the correct (and easiest) one. But the bug appeared in each version, which meant that there wasn’t an upstream fix.
Reproducing locally. We ran local integration tests on macOS and a Debian VM. Even under considerable load, our local requests never triggered any failure. Making direct curl requests to the binding socket and replaying captured requests always seemed to work. The bug only appeared on the full production path when there was real concurrency and a real Workers runtime client on the other end of the socket. This led us to suspect the runtime itself.
Ruling out the Workers runtime. We examined the HTTP client that the Workers runtime uses to communicate with Images through the binding socket. None of the traces from either side of the connection showed any syscalls that indicated an unexpected close or early termination. We observed that the client behaved correctly and multiple other services used the same client without issues.
Distributed tracing. By inspecting request traces end-to-end, we confirmed that the truncated body was already present before it reached the outer transformation layer in the customer’s setup. That narrowed the problem to the inner pipeline — the binding path through the Images service.
Instrumenting the intermediary service. We added instrumentation to the intermediary service to measure body sizes before forwarding the response data. The bodies were already truncated by the time they left the Images service, so the intermediary was ruled out.
Deeper tracing within the Images service. At the service level, the request was processed, the image was properly encoded, and the response was sent with HTTP 200.

The only consistent signal was that the bug was timing-dependent: It appeared only on the production path, with real concurrency, and only for larger images.

A kernel of truth

Application-level debugging tools told us only what the system thought it was doing. And according to the system, everything was fine: Tracing said the response was sent, logging reported no errors, and the Images service returned 200 on every request.

To see what the system was actually doing, we attached strace to the Images service. strace records the syscalls that a process makes to the kernel, which could show us exactly which bytes were written, when a shutdown was called, and whether the client sent any termination signal.

Setting up the trace was delicate. strace works by intercepting syscalls as they happen, which adds a small amount of timing overhead to each one. Filtering for a narrow set of syscalls kept that overhead minimal. Broadening the filter, however, slowed the process just enough to shift the timing between the flush and the shutdown check — and make the bug disappear entirely. That alone reinforced our theory that the issue was timing-sensitive.

Using a reproduction worker, we triggered the bug and compared the syscall output between successful and failing requests.

In a successful request, the response is written in chunks as the socket buffer allows, with shutdown called only after all the data is sent. For example, this may look like:

sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
sendto(42, "\xff\xd8\xff\xe0...", 292352) = 292352
// ... keeps writing until buffer drains ...
sendto(42, "...", 292352) = 292352
shutdown(42, SHUT_WR) = 0

When we reproduced the bug, a failing request looked like:

sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
shutdown(42, SHUT_WR) = 0

Here, there is only one write — just enough for the headers and a sliver of the body — before the shutdown is immediately called. Out of a 14.9 MB response, only about 219 KB was sent. The remaining ~14.8 MB of image data never left hyper’s internal buffer, nor was there any termination signal from the client between the write and the shutdown. Instead, the Images service prematurely shut down the connection on its own, genuinely believing it was finished.

The failing requests confirmed that the bug was a race condition that triggered intermittently. Whether a request succeeded or failed depended on whether the flush and shutdown operations overlapped, which changed from request to request. When the buffer was still full at the exact moment that hyper decided the connection was finished, data was lost.

When the reader consumes slower than hyper writes, the outbound buffer fills up. If hyper shuts down the connection before the buffer drains, then only a fraction of the response makes it to the intermediary; this incomplete data gets forwarded back to the Workers runtime and the client.

The December rearchitecture didn't introduce this bug, which had been present in hyper for years across multiple major versions. But the new intermediary changed who was reading on the response side of the socket. Our working theory is that FL, the previous intermediary, consumed data fast enough that the socket buffer rarely filled during a response. The new reader read at a pace that occasionally let the buffer fill during larger responses.

These few milliseconds of backpressure, introduced by an improvement that made everything else faster, were all it took to surface a flaw that had been hiding in plain sight.

Inside the dispatch loop

Hyper's HTTP/1 connection lifecycle is driven by a state machine in a file called dispatch.rs. It runs a loop that reads requests, writes responses, flushes the write buffer to the socket, and decides when to shut down. In simplified form:

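fn poll_loop(&mut self, cx: &mut Context<'_>) -> Poll<crate::Result<()>> {
    loop {
        let _ = self.poll_read(cx)?;
        let _ = self.poll_write(cx)?;
        // The flush result, including Poll::Pending, is discarded here.
        let _ = self.poll_flush(cx)?;

        // Nothing left to read: report the loop as finished.
        if !self.conn.wants_read_again() {
            return Poll::Ready(Ok(()));
        }
    }
}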

More precisely, the let _ before poll_flush is where the bug lives.

In Rust, let _ = expr discards the expression's result, including Poll::Pending, the signal that the flush isn’t done yet. The flush might still have megabytes sitting in its buffer, but the loop never finds out.
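A minimal sketch of that pattern, using only `std::task::Poll`. The functions `flush`, `finish_ignoring_flush`, and `finish_checking_flush` are hypothetical stand-ins for illustration and are not part of hyper.

```rust
use std::task::Poll;

/// Hypothetical flush that reports `Pending` while bytes remain unsent.
fn flush(remaining: usize) -> Poll<()> {
    if remaining == 0 {
        Poll::Ready(())
    } else {
        Poll::Pending
    }
}

/// Mirrors the discarding pattern: the result is dropped, so the caller
/// always concludes that it is safe to finish.
fn finish_ignoring_flush(remaining: usize) -> bool {
    let _ = flush(remaining);
    true // "done", regardless of what the flush reported
}

/// Inspects the result instead of discarding it.
fn finish_checking_flush(remaining: usize) -> bool {
    flush(remaining).is_ready()
}
```

`Poll` is marked `#[must_use]`, so the compiler would normally warn about an ignored result. Binding it to `_` silences that warning, which is why the pattern compiles cleanly.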

When a request fails, this is the exact sequence of events:

The Images service finishes encoding the image and hands the entire response to hyper as a single in-memory block.
Hyper writes the block into its internal buffer and marks its write state as Writing::Closed. From an encoding standpoint, the work is done — there is nothing left to encode.
Hyper calls poll_flush to move the buffered data to the socket. In our previous example, the socket accepted about 219 KB. The remaining ~14.8 MB stays in hyper's buffer. The socket buffer is full, so the next write would block and poll_flush returns Poll::Pending.
poll_loop discards the Poll::Pending with let _.
It checks wants_read_again(). The full request was already received, so this returns false.
poll_loop returns Poll::Ready(Ok(())), signaling that the loop is finished, even though the flush is not.
poll_shutdown() fires. The SHUT_WR syscall is issued.
The client receives 219 KB and an EOF (end-of-file) indicating that the connection is closed, even though it expects 14.9 MB.
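The sequence above can be replayed with a toy model. This is a sketch under simplifying assumptions, not hyper's real types: `Conn`, its fields, and `run_buggy` are hypothetical, there is no real socket or waker, and the numbers are taken from the strace output earlier in the post.

```rust
use std::io;
use std::task::Poll;

/// Toy connection (hypothetical model, not hyper's real types).
struct Conn {
    buffered: usize,     // bytes still in the library's write buffer
    socket_space: usize, // bytes the socket will accept right now
    sent: usize,         // bytes handed to the socket so far
    shut_down: bool,     // whether shutdown has been issued
}

impl Conn {
    /// Moves what fits into the socket; `Pending` if bytes are left over.
    fn poll_flush(&mut self) -> Poll<io::Result<()>> {
        let n = self.buffered.min(self.socket_space);
        self.buffered -= n;
        self.socket_space -= n;
        self.sent += n;
        if self.buffered == 0 {
            Poll::Ready(Ok(()))
        } else {
            Poll::Pending
        }
    }

    /// The request was fully read, so there is nothing more to read.
    fn wants_read_again(&self) -> bool {
        false
    }

    /// Simplified loop with the flaw: the flush result is discarded.
    fn poll_loop(&mut self) -> Poll<io::Result<()>> {
        loop {
            let _ = self.poll_flush()?;
            if !self.wants_read_again() {
                return Poll::Ready(Ok(()));
            }
        }
    }
}

/// Buffers `total` bytes, runs the loop once, and shuts down if it reports Ready.
fn run_buggy(total: usize, socket_space: usize) -> Conn {
    let mut conn = Conn { buffered: total, socket_space, sent: 0, shut_down: false };
    if conn.poll_loop().is_ready() {
        conn.shut_down = true; // stands in for the SHUT_WR syscall
    }
    conn
}
```

Calling `run_buggy(14_991_808, 219_264)` ends with the connection shut down, 219,264 bytes sent, and 14,772,544 bytes stranded in the buffer. When the socket has room for the whole response, the same code sends everything, which matches how the flaw stayed invisible behind a fast reader.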

In the second step, hyper marks the write operation as complete as soon as the response body is buffered (i.e., when encoding is finished), rather than when it has actually been flushed. Most of the time, the flush completes in a single pass and this distinction is invisible. On the rare occasions when the socket buffer is full, the flush has to wait — even though hyper doesn't. The bytes are still sitting in hyper’s buffer, waiting to be flushed to the socket. Hyper proceeds to shut down the connection with this data still in the buffer.

This also explains why curl never triggered the bug. Curl reads data as fast as it arrives: The socket buffer never fills, the flush always completes immediately, and the discarded return value is harmless. The production path, with a reader that occasionally paused for a few milliseconds, was the only configuration where the buffer filled at exactly the wrong moment.

Don’t forget to flush

After weeks of investigation, the fix itself was conceptually simple. Hyper needed to check whether the flush was actually done before moving on.

Our reproduction worker confirmed that the bug existed, but it couldn't tell us why a given request failed. Before writing the fix, we needed a test that could trigger the exact socket conditions inside hyper.

We knew the conditions that triggered the bug: a socket that accepts one chunk of data and then blocks. To test with a controlled scenario, we built a custom wrapper around a TCP stream that simulated a full socket buffer. The wrapper accepted 8 KB on the first write, then returned Poll::Pending on every subsequent write, mimicking a reader that stopped draining the buffer.

The test sent a 500 KB response through this constrained socket and checked whether hyper called shutdown while 492 KB was still buffered. Without a fix, it did. With the fix, it waited.
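The idea behind that constrained socket can be sketched without any I/O. The wrapper described above sits around a TCP stream; the version below is a std-only approximation, and `OneChunkWriter`, `bytes_left_after_stall`, and the use of 1 KB = 1,024 bytes are assumptions made for illustration.

```rust
use std::io;
use std::task::Poll;

const KB: usize = 1024;

/// Hypothetical writer that accepts one chunk, then behaves like a full socket.
struct OneChunkWriter {
    first_chunk: usize, // bytes accepted by the first write
    accepted: usize,    // total bytes accepted so far
    wrote_once: bool,   // set after the first write
}

impl OneChunkWriter {
    fn poll_write(&mut self, buf: &[u8]) -> Poll<io::Result<usize>> {
        if self.wrote_once {
            return Poll::Pending; // every later write stalls
        }
        self.wrote_once = true;
        let n = buf.len().min(self.first_chunk);
        self.accepted += n;
        Poll::Ready(Ok(n))
    }
}

/// Writes until the writer stalls and returns the bytes still waiting to be sent.
fn bytes_left_after_stall(response: &[u8], w: &mut OneChunkWriter) -> usize {
    let mut offset = 0;
    while offset < response.len() {
        match w.poll_write(&response[offset..]) {
            Poll::Ready(Ok(n)) => offset += n,
            Poll::Ready(Err(_)) | Poll::Pending => break,
        }
    }
    response.len() - offset
}
```

Feeding a 500 KB body through a writer whose first chunk is 8 KB leaves 492 KB unsent. A deterministic test can then assert that shutdown is not issued while that remainder is nonzero.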

Initially, we applied the fix in hyper’s dispatch loop. Instead of discarding the result of poll_flush, we checked to see whether the flush was actually done:

let flush_result = self.poll_flush(cx)?;

if flush_result.is_pending() {
    return Poll::Pending;
}

if !self.conn.wants_read_again() {
    return Poll::Ready(Ok(()));
}

If the flush hasn't completed, then the loop returns Poll::Pending to the asynchronous runtime. The runtime waits for the socket to become writable, then wakes the task back up to continue the flush. The connection shuts down only after all data has been sent.

When we deployed this fix, we observed that every byte was written and the shutdown was called only after the buffer was actually empty. The customer who made the first report also confirmed that the issue disappeared.

While our initial solution worked, the dispatch loop wasn’t the right place for the fix. Returning Poll::Pending early could slow down other operations on the same connection by reducing how frequently reads are polled, causing unintended backpressure. It also doesn't correctly handle keepalive connections, where a single connection handles multiple requests in sequence — these should remain reusable even while the previous response is still being flushed. Neither issue affected our particular service (where keepalive is disabled), but both could affect other hyper users if the fix were contributed upstream.

We traced through hyper's connection lifecycle and found a more targeted approach. Rather than changing how the dispatch loop behaves, we applied the fix at the point where shutdown is actually called. Before shutting down the socket, hyper should first flush any remaining data in its buffer:

pub(crate) fn poll_shutdown(
    &mut self,
    cx: &mut Context<'_>,
) -> Poll<io::Result<()>> {
    ready!(self.poll_flush(cx)?);
    Pin::new(&mut self.io).poll_shutdown(cx)
}

This leaves the dispatch loop unchanged. It adds a flush only at the exact point where data loss would otherwise occur — the moment before shutdown.
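To see how a flush-before-shutdown guard behaves, here is a small std-only model. It is a sketch rather than hyper's implementation: `Conn` and its fields are hypothetical, and wakers are omitted, so the caller re-polls by hand where a real runtime would wake the task once the socket is writable.

```rust
use std::io;
use std::task::{ready, Poll};

/// Toy connection (hypothetical model, not hyper's real types).
struct Conn {
    buffered: usize,     // bytes still in the write buffer
    socket_space: usize, // bytes the socket will accept right now
    shut_down: bool,     // whether shutdown has been issued
}

impl Conn {
    /// Moves what fits into the socket; `Pending` if bytes are left over.
    fn poll_flush(&mut self) -> Poll<io::Result<()>> {
        let n = self.buffered.min(self.socket_space);
        self.buffered -= n;
        self.socket_space -= n;
        if self.buffered == 0 {
            Poll::Ready(Ok(()))
        } else {
            Poll::Pending
        }
    }

    /// Flush first. `ready!` returns `Poll::Pending` from this function
    /// until the flush reports that the buffer is empty.
    fn poll_shutdown(&mut self) -> Poll<io::Result<()>> {
        ready!(self.poll_flush()?);
        self.shut_down = true;
        Poll::Ready(Ok(()))
    }
}
```

With 500 bytes buffered and room for only 8, the first `poll_shutdown` call returns `Pending` and leaves the connection open with 492 bytes waiting. Once the socket has room again, a second call completes the flush and only then marks the connection as shut down.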

What stayed with us

None of the tools at the application level surfaced any errors, crashes, or log entries that provided useful clues. Application-level observability can have a blind spot for bugs that live below its awareness.

The failure occurred intermittently, scaled with response size, couldn’t be reproduced with simple tools like curl, and disappeared when we observed the system more closely. These signals pointed to a timing-dependent bug in the connection layer, not in the application logic.

Our breakthrough came from using kernel-level tooling with strace, the one layer that records what actually happened on the socket. The underlying bug lived in the few milliseconds between a partial flush and a premature shutdown — a window that opened only after we made the system faster.

We merged our fix and the deterministic test into hyperium/hyper via PR #4018. It will be available in a future hyper release, ensuring that any service using hyper’s HTTP/1 implementation won’t lose response data to the same race condition.

In the meantime, we’re running an internal fork with the patch applied. This fix stabilized the binding’s architecture, creating a reliable foundation to expand its functionality.

The Images binding initially covered only transformations of remote images. Earlier this month, we announced that the Images binding now supports operations for hosted images, giving developers a unified way to build media-rich applications on Cloudflare.

Read more about how the binding works in our documentation.

Evaluating image segmentation models for background removal for Images

Deanna Lam — Thu, 28 Aug 2025 14:00:00 GMT

Last week, we wrote about face cropping for Images, which runs an open-source face detection model in Workers AI to automatically crop images of people at scale.

It wasn’t too long ago that deploying AI workloads was prohibitively complex. Real-time inference previously required specialized (and costly) hardware, and we didn’t always have standard abstractions for deployment. We also didn’t always have Workers AI to enable developers, including ourselves, to ship AI features without this additional overhead.

And whether you’re skeptical or celebratory of AI, you’ve likely seen its explosive progression. New benchmark-breaking computational models are released every week. We now expect a fairly high degree of accuracy — the more important differentiators are how well a model fits within a product’s infrastructure and what developers do with its predictions.

This week, we’re introducing background removal for Images. This feature runs a dichotomous image segmentation model on Workers AI to isolate subjects in an image from their backgrounds. We took a controlled, deliberate approach to testing models for efficiency and accuracy.

Here’s how we evaluated various image segmentation models to develop background removal.

A primer on image segmentation

In computer vision, image segmentation is the process of splitting an image into meaningful parts.

Segmentation models produce a mask that assigns each pixel to a specific category. This differs from detection models, which don’t classify every pixel but instead mark regions of interest. A face detection model, such as the one that informs face cropping, draws bounding boxes based on where it thinks there are faces. (If you’re curious, our post on face cropping discusses how we use these bounding boxes to perform crop and zoom operations.)

Salient object detection is a type of segmentation that highlights the parts of an image that most stand out. Most salient object detection models create a binary mask that categorizes the most prominent (or salient) pixels as the “foreground” and all other pixels as the “background”. In contrast, a multi-class mask considers the broader context and labels each pixel as one of several possible classes, like “dog” or “chair”. These multi-class masks are the basis of content analysis models, which distinguish which pixels belong to specific objects or types of objects.

In this photograph of my dog, a detection model predicts that a bounding box contains a dog; a segmentation model predicts that some pixels belong to a dog, while all other pixels don’t.

For our use case, we needed a model that could produce a soft saliency mask, which predicts how strongly each pixel belongs to either the foreground (objects of interest) or the background. That is, each pixel is assigned a value on a scale of 0–255, where 0 is completely transparent and 255 is fully opaque. Most background pixels are labeled at (or near) 0; foreground pixels may vary in opacity, depending on their degree of saliency.
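One way to picture a soft saliency mask is as an alpha channel. The sketch below assumes the mask value is used directly as the alpha of each output pixel; `apply_mask` is a hypothetical helper and does not describe how the Images service applies its masks.

```rust
/// Combines RGB pixels with a soft saliency mask into RGBA pixels.
/// Assumption: each mask value (0-255) is used directly as alpha,
/// so 0 yields a fully transparent pixel and 255 a fully opaque one.
fn apply_mask(rgb: &[[u8; 3]], mask: &[u8]) -> Vec<[u8; 4]> {
    assert_eq!(rgb.len(), mask.len(), "one mask value per pixel");
    rgb.iter()
        .zip(mask)
        .map(|(p, &a)| [p[0], p[1], p[2], a])
        .collect()
}
```

A background pixel with mask value 0 keeps its color data but becomes invisible, while a pixel at 128 stays partially see-through, which is how soft edges such as hair or fur can blend into a new background.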

In principle, a background removal feature must be able to accurately predict saliency across a broad range of contexts. For example, e-commerce and retail vendors want to display all products on a uniform, white background; in creative and image editing applications, developers want to enable users to create stickers and cutouts from uploaded content, including images of people or avatars.

In our research, we focused primarily on the following four image segmentation models:

U²-Net (U Square Net): Trained on the largest saliency dataset (DUTS-TR) of 10,553 images, which were then horizontally flipped to reach a total of 21,106 training images.
IS-Net (Intermediate Supervision Network): A novel, two-step approach from the same authors as U²-Net; this model produces cleaner boundaries for images with noisy, cluttered backgrounds.
BiRefNet (Bilateral Reference Network): Specifically designed to segment complex and high-resolution images with accuracy by checking that the small details match the big picture.
SAM (Segment Anything Model): Developed by Meta to allow segmentation by providing prompts and input points.

Different scales of information allow computational models to build a holistic view of an image. Global context considers the overall shape of objects and how areas of pixels relate to the entire image, while local context traces fine details like edges, corners, and textures. If local context focuses on the trees and their leaves, then global context represents the entire forest.

U²-Net extracts information using a multi-scale approach, where it analyzes an image at different zoom levels, then combines its predictions in a single step. The model analyzes global and local context at the same time, so it works well on images with multiple objects of varying sizes.

IS-Net introduces a new, two-step strategy called intermediate supervision. First, the model separates the foreground from the background, identifying potential areas that likely belong to objects of interest — all other pixels are labeled as the background. Second, it refines the boundaries of the highlighted objects to produce a final pixel-level mask.

The initial suppression of the background results in cleaner, more precise edges, as the segmentation focuses only on the highlighted objects of interest and is less likely to mistakenly include background pixels in the final mask. This model especially excels when dealing with complex images with cluttered backgrounds.

Both models move through scale in a single direction when producing their predictions. U²-Net interprets the global and local context in one pass, while IS-Net begins with the global context, then focuses on the local context.

In contrast, BiRefNet refines its predictions over multiple passes, moving in both contextual directions. Like IS-Net, it initially creates a map that roughly highlights the salient object, then traces the finer details. However, BiRefNet moves from global to local context, then from local context back to global. In other words, after refining the edges of the object, it feeds the output back to the large-scale view. This way, the model can check that the small-scale details align with the broader image structure, providing higher accuracy on high-resolution images.

U²-Net, IS-Net, and BiRefNet are exclusively saliency detection models, producing masks that distinguish foreground pixels from background pixels. However, SAM was designed to be more extensible and general; its primary goal is to segment any object based on specified inputs, not only salient objects. This means that the model can also be used to create multi-class masks that label various objects within an image, even if they aren’t the primary focus of an image.

How we measure segmentation accuracy

In most saliency datasets, the actual location of the object is known as the ground-truth area. These regions are typically defined by human annotators, who manually trace objects of interest in each image. This provides a reliable reference to evaluate model predictions.

Photograph by Allen Fang

Each model outputs a predicted area (where it thinks the foreground pixels are), which can be compared against the ground-truth area (where the foreground pixels actually are).

Models are evaluated for segmentation accuracy based on common metrics like Intersection over Union, Dice coefficient, and pixel accuracy. Each score takes a slightly different approach to quantify the alignment between the predicted and ground-truth areas (“P” and “G”, respectively, in the formulas below).

Intersection over Union

Intersection over Union (IoU), also called the Jaccard index, measures how well the predicted area matches the true object. That is, it compares the number of foreground pixels shared by the predicted and ground-truth masks with the number of foreground pixels that appear in either mask. Mathematically, IoU is written as:

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

The formula divides the intersection (P∩G), or the pixels where the predicted and ground-truth areas overlap, by the union (P∪G), or the total area of pixels that belong to either area, counting the overlapping pixels only once.

IoU produces a score between 0 and 1. A higher value indicates a closer overlap between the predicted and ground-truth areas. A perfect match, although rare, would score 1, while a smaller overlapping area brings the score closer to 0.
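A small sketch of the IoU calculation on binary masks, where `true` marks a foreground pixel. The function name `iou` and the choice to return 1.0 when both masks are empty are assumptions made for this example.

```rust
/// IoU for two binary masks of equal length (true = foreground).
/// Returns 1.0 when both masks are empty, a convention assumed here.
fn iou(pred: &[bool], truth: &[bool]) -> f64 {
    assert_eq!(pred.len(), truth.len(), "masks must cover the same pixels");
    // Pixels marked foreground in both masks.
    let inter = pred.iter().zip(truth).filter(|(p, g)| **p && **g).count();
    // Pixels marked foreground in at least one mask.
    let union = pred.iter().zip(truth).filter(|(p, g)| **p || **g).count();
    if union == 0 {
        1.0
    } else {
        inter as f64 / union as f64
    }
}
```

For a prediction `[true, true, false, false]` against a ground truth of `[true, false, true, false]`, one pixel is shared and three are foreground in either mask, so the score is 1/3.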

Dice coefficient

The Dice coefficient, also called the Sørensen–Dice index, similarly compares how well the model’s prediction matches reality, but is much more forgiving than the IoU score. It gives more weight to the shared pixels between the predicted and actual foreground, even if the areas differ in size. Mathematically, the Dice coefficient is written as:

$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

The formula divides twice the intersection (P∩G) by the sum of pixels in both predicted and ground-truth areas (P+G), counting any overlapping pixels twice.

Like IoU, the Dice coefficient also produces a value between 0 and 1, indicating a more accurate match as it approaches 1.
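The same kind of sketch works for the Dice coefficient. As before, `dice` is a hypothetical helper operating on binary masks, and returning 1.0 for two empty masks is an assumed convention.

```rust
/// Dice coefficient for two binary masks of equal length (true = foreground).
/// Returns 1.0 when both masks are empty, a convention assumed here.
fn dice(pred: &[bool], truth: &[bool]) -> f64 {
    assert_eq!(pred.len(), truth.len(), "masks must cover the same pixels");
    // Pixels marked foreground in both masks.
    let inter = pred.iter().zip(truth).filter(|(p, g)| **p && **g).count();
    // Foreground pixel counts of each mask, overlap included in both.
    let p = pred.iter().filter(|&&x| x).count();
    let g = truth.iter().filter(|&&x| x).count();
    if p + g == 0 {
        1.0
    } else {
        2.0 * inter as f64 / (p + g) as f64
    }
}
```

For a prediction `[true, true, false, false]` against a ground truth of `[true, false, true, false]`, the score is 2 × 1 / (2 + 2) = 0.5. The IoU for the same pair is 1/3, which shows the Dice coefficient being more forgiving of the same mistakes.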

Pixel accuracy

Pixel accuracy measures the percentage of pixels that were correctly labeled as either the foreground or the background. Mathematically, pixel accuracy is written as:

$$\text{Pixel accuracy} = \frac{|P \cap G| + |P' \cap G'|}{\text{total pixels}}$$

The formula divides the number of correctly predicted pixels by the total number of pixels in the image.

The total area of correctly predicted pixels is the sum of foreground and background pixels that accurately match the ground-truth areas.

The correctly predicted foreground is the intersection of the predicted and ground-truth areas (P∩G). The inverse of the predicted area (P’, or 1–P) represents the pixels that the model identifies as the background; the inverse of the ground-truth area (G’, or 1–G) represents the actual boundaries of the background. When these two inverted areas overlap (P’∩G’, or (1–P)∩(1–G)), this intersection is the correctly predicted background.

Interpreting the metrics

Of the three metrics, IoU is the most conservative measure of segmentation accuracy. Small mistakes, such as including extra background pixels in the predicted foreground, reduce the score noticeably. This metric is most valuable for applications that require precise boundaries, such as autonomous driving systems.

Meanwhile, the Dice coefficient rewards the overlapping pixels more heavily, and consequently tends to be higher than the IoU score for the same prediction. In model evaluations, this metric is favored over IoU when it’s more important to capture the object than to penalize mistakes. For example, in medical imaging, the risk of missing a true positive substantially outweighs the inconvenience of flagging a false positive.

In the context of background removal, we biased toward the IoU score and Dice coefficient over pixel accuracy. Pixel accuracy can be misleading, especially when processing an image where background pixels comprise the majority of pixels.

For example, consider an image with 900 background pixels and 100 foreground pixels. A model that correctly predicts only 5 foreground pixels — 5% of all foreground pixels — will score deceptively high in pixel accuracy. Intuitively, we’d likely say that this model performed poorly. However, assuming all 900 background pixels were correctly predicted, the model maintains 90.5% pixel accuracy, despite missing the subject almost entirely.
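The example above can be checked by computing all three metrics from pixel counts. The `Counts` struct and function names are hypothetical. The sketch relies on the identities |P∩G| = true positives, |P∪G| = true positives + false positives + false negatives, and |P| + |G| = 2 × true positives + false positives + false negatives.

```rust
/// Pixel counts for one prediction (hypothetical helper type).
struct Counts {
    tp: u32,  // foreground predicted as foreground
    fp: u32,  // background predicted as foreground
    fn_: u32, // foreground predicted as background
    tn: u32,  // background predicted as background
}

/// Share of all pixels that were labeled correctly.
fn pixel_accuracy(c: &Counts) -> f64 {
    f64::from(c.tp + c.tn) / f64::from(c.tp + c.fp + c.fn_ + c.tn)
}

/// Intersection over union, expressed in counts.
fn iou(c: &Counts) -> f64 {
    f64::from(c.tp) / f64::from(c.tp + c.fp + c.fn_)
}

/// Dice coefficient, expressed in counts.
fn dice(c: &Counts) -> f64 {
    f64::from(2 * c.tp) / f64::from(2 * c.tp + c.fp + c.fn_)
}

/// The scenario from the text: 5 of 100 foreground pixels found,
/// all 900 background pixels labeled correctly.
fn example() -> Counts {
    Counts { tp: 5, fp: 0, fn_: 95, tn: 900 }
}
```

For these counts, pixel accuracy is 905/1000 = 0.905, while IoU is 5/100 = 0.05 and the Dice coefficient is 10/105, or roughly 0.095. Both overlap-based metrics expose the miss that pixel accuracy hides.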

Pixels, predictions, and patterns

To determine the most suitable model for the Images API, we performed a series of tests using the open-source rembg library, which combines all relevant models in a single interface.

Each model was tasked with outputting a prediction mask to label foreground versus background pixels. We pulled images from two saliency datasets: Humans contains over 7,000 images of people with varying skin tones, clothing, and hairstyles, while DIS5K (version 1.5) spans a vast range of objects and scenes. If a model contained variants that were pre-trained on specific types of segmentation (e.g. clothes, humans), then we repeated the tests for the generalized model and each variant.

Our experiments were executed on a GPU with 23 GB VRAM to mirror realistic hardware constraints, similar to the environment where we already run a face detection model. We also replicated the same tests on a larger GPU instance with 94 GB VRAM; this served as an upper-bound reference point to benchmark potential speed gains if additional compute were available. Cloudflare typically reserves larger GPUs for more compute-intensive AI workloads — we viewed these tests more as an exploration for comparison than as a production scenario.

During our analysis, we started to see key trends emerge:

On the smaller GPU, inference times were generally faster for lightweight models like U²-Net (176 MB) and Is-Net (179 MB). Average inference times across both datasets were 307 milliseconds for U²-Net and 351 milliseconds for Is-Net. On the opposite end, BiRefNet (973 MB) had noticeably slower output times, averaging 821 milliseconds across its two generalized variants.

BiRefNet ran 2.4 times faster on the larger GPU, reducing its average inference time to 351 milliseconds — comparable to the other models, despite its larger size. In contrast, the lighter models did not show any notable speed gain with additional compute, suggesting that scaling hardware configurations primarily benefits heavier models. In Appendix 1 (“Inference Time in Milliseconds”), we compare speed across models and GPU instances.

We also observed distinct patterns when comparing model performance across the two saliency datasets. Most notably, all models ran faster on the Humans dataset, where images of people tend to be single-subject and relatively uniform. The DIS5K dataset, in contrast, includes images with higher complexity — that is, images with more objects, cluttered backgrounds, or multiple objects of varying scales.

Slower predictions suggest a relationship between visual complexity and the computation needed to identify the important parts of an image. In other words, datasets with simpler, well-separated objects can be analyzed more quickly, while complex scenes require more computation to generate accurate masks.

Similarly, complexity challenges accuracy as much as it does efficiency. In our tests, all models demonstrated higher segmentation accuracy with the Humans dataset. In Appendix 2 (“Measures of Model Accuracy”), we present our results for segmentation accuracy across both datasets.

Specialized variants scored slightly higher in accuracy compared to their generalized counterparts. But in broad, practical applications, selecting a specialized model for every input isn’t realistic, at least for our initial beta version. We favored general-purpose models that can produce accurate predictions without prior classification. For this reason, we excluded SAM: while powerful in its intended use cases, SAM is designed to work with additional inputs. On unprompted segmentation tasks, it produced lower accuracy scores (and much higher inference times) than the other models we tested.

All BiRefNet variants showed greater accuracy compared to other models. The generalized variants (-general and -dis) were just as accurate as the more specialized variants like -portrait. The birefnet-general variant, in particular, achieved a high IoU score of 0.87 and Dice coefficient of 0.92, averaged across both datasets.

In contrast, the generalized U²-Net model showed high accuracy on the Humans dataset, reaching an IoU score of 0.89 and a Dice coefficient of 0.94, but received a low IoU score of 0.39 and Dice coefficient of 0.52 on the DIS5K dataset. The isnet-general-use model performed substantially better, obtaining an average IoU score of 0.82 and Dice coefficient of 0.89 across both datasets.

We also examined whether models could interpret both the global and local context of an image. In some scenarios, the U²-Net and Is-Net models captured the overall gist of an image, but couldn’t accurately trace fine edges. We designed one test around measuring how well each model could isolate bicycle wheels; for variety, we included images across both interior and exterior backgrounds. Lower-scoring models, while correctly labeling the area surrounding the wheel, struggled with the pixels between the thin spokes and produced prediction masks that included these background pixels.

Photograph by Yomex Owo on Unsplash

In other scenarios, the models showed the opposite limitation: they produced masks with clean edges, but failed to identify the focus of the image. We ran another test using a photograph of a gray T-shirt against black gym flooring. Both generalized U²-Net and Is-Net models labeled only the logo as the salient object, creating a mask that omitted the rest of the shirt entirely.

Meanwhile, the BiRefNet model achieved high accuracy across both types of tests. Its architecture passes information bidirectionally, allowing details at the pixel level to be informed by the larger scene (and vice versa). In practice, this means that BiRefNet interprets how fine-grained edges fit into the broader object. For our beta version, we opted to use the BiRefNet model to drive decisions for background removal.

Unlike lower-scoring models, the BiRefNet model understood that the entire shirt is the true subject of the image.

Applying background removal with the Images API

The Images API now supports automatic background removal for hosted and remote images. This feature is available in open beta to all Cloudflare users on Free and Paid plans.

Use the segment parameter when optimizing an image through a specially-formatted Images URL or a worker, and Cloudflare will isolate the subject of your image and convert the background into transparent pixels. This can be combined with other optimization operations, as shown in the transformation URL below:

example.com/cdn-cgi/image/gravity=face,zoom=0.5,segment=foreground,background=white/image.png

This request will:

Crop the image toward the detected face.
Isolate the subject in the image, replacing the background with transparent pixels.
Fill the transparent pixels with a solid white color (#FFFFFF).
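If you generate these URLs in code, a small helper can keep the option list readable. The sketch below assumes the URL layout shown above (options joined by commas after /cdn-cgi/image/); `buildTransformURL` is a hypothetical helper and not part of any Cloudflare SDK, and the https://example.com origin is a placeholder.

```javascript
// Build a transformation URL of the form
//   <origin>/cdn-cgi/image/<key=value,...>/<source path>
// buildTransformURL is a hypothetical helper for illustration only.
function buildTransformURL(origin, options, sourcePath) {
  const params = Object.entries(options)
    .map(([key, value]) => `${key}=${value}`) // one key=value pair per option
    .join(",");
  return `${origin}/cdn-cgi/image/${params}/${sourcePath}`;
}

const url = buildTransformURL(
  "https://example.com",
  { gravity: "face", zoom: 0.5, segment: "foreground", background: "white" },
  "image.png"
);
console.log(url);
// https://example.com/cdn-cgi/image/gravity=face,zoom=0.5,segment=foreground,background=white/image.png
```

The helper does no escaping, so option values that contain commas or slashes would need to be encoded before they are passed in.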

You can also bind the Images API to your worker to build programmatic workflows that give more fine-grained control over how images will be optimized. To demonstrate how this works, I made a simple image editing app for creating cutouts and overlays, built entirely on Images and Workers. This can be used to create images like the one below. Here, we apply background removal to isolate the dog and ice cream cone, then overlay them on a landscape image.

Photographs by Guy Hurst (landscape), Oskar Gackowski (ice cream), and me (dog)

Here is a snippet that you can use to overlay images in a worker:

export default {
  async fetch(request, env) {
    // Replace these placeholders with the URLs of your own images
    const baseURL = "{image-url}";
    const overlayURL = "{image-url}";

    // Fetch both images in parallel
    const [base, overlay] = await Promise.all([fetch(baseURL), fetch(overlayURL)]);

    return (
      await env.IMAGES
        .input(base.body)
        .draw(
          // Remove the background from the overlay image
          env.IMAGES.input(overlay.body).transform({ segment: "foreground" }),
          { top: 0 } // Position the overlay at the top edge of the base image
        )
        .output({ format: "image/webp" })
    ).response();
  }
};

Background removal is another step in our ongoing effort to enable developers to build interactive and imaginative products. These features are an iterative process, and we’ll continue to refine our approach even further. We’re looking forward to sharing our progress with you.

Read more about applying background removal in our documentation.

Appendix 1: Inference Time in Milliseconds

23 GB VRAM GPU

94 GB VRAM GPU

Appendix 2: Measures of Model Accuracy

How we built AI face cropping for Images

Deanna Lam — Wed, 20 Aug 2025 14:00:00 GMT

During Developer Week 2024, we introduced AI face cropping in private beta. This feature automatically crops images around detected faces, and marks the first release in our upcoming suite of AI image manipulation capabilities.

AI face cropping is now available in Images for everyone. To bring this feature to general availability, we moved our CPU-based prototype to a GPU-based implementation in Workers AI, enabling us to address a number of technical challenges, including memory leaks that could hamper large-scale use.

Photograph by Suad Kamardeen (@suadkamardeen) on Unsplash

Turning raw images into production-ready assets

We developed face cropping with two particular use cases in mind:

Social media platforms and AI chatbots. We observed a lot of traffic from customers who use Images to turn unedited images of people into smaller profile pictures in neat, fixed shapes.

E-commerce platforms. The same product photo might appear in a grid of thumbnails on a gallery page, then again on an individual product page with a larger view. The following example illustrates how cropping can change the emphasis from the model’s shirt to their sunglasses.

Photograph by Media Modifier (@mediamodifier) on Unsplash

When handling high volumes of media content, preparing images for production can be tedious. With Images, you don’t need to manually generate and store multiple versions of the same image. Instead, we serve copies of each image, each optimized to your specifications, while you continue to store only the original image.

Crop everything, everywhere, all at once

Cloudflare provides a library of parameters to manipulate how an image is served to the end user. For example, you can crop an image to a square by setting its width and height dimensions to 100x100.

By default, images are cropped toward the center coordinates of the original image. The gravity parameter can affect how an image gets cropped by changing its focal point. You can specify coordinates to use as the focal point of an image or allow Cloudflare to automatically determine a new focal point.

The gravity parameter is useful when cropping images with off-centered subjects. Photograph by Andrew Small (@andsmall) on Unsplash

The gravity=auto option uses a saliency algorithm to pick the optimal focal point of an image. Saliency detection identifies the parts of an image that are most visually important; the cropping operation is then applied toward this region of interest. Our algorithm analyzes images using visual cues such as color, luminance, and texture, but doesn’t consider context within an image. While this setting works well on images with inanimate objects like plants and skyscrapers, it doesn’t reliably account for subjects as contextually meaningful as people’s faces.

And yet, images of people comprise the majority of bandwidth usage for many applications, such as an AI chatbot platform that uses Images to serve over 45 million unique transformations each month. This presented an opportunity for us to improve how developers can optimize images of people.

AI face cropping can be performed by using the gravity=face option, which automatically detects which pixels represent the face (or faces) and uses this information to crop the image. You can also affect how closely the image is cropped toward the face; the zoom parameter controls the threshold for how much of the surrounding area around the face will be included in the image.

We carefully designed our model pipeline with privacy and confidentiality top of mind. This feature doesn’t support facial identification or recognition. In other words, when you optimize with Cloudflare, we’ll never know that two different images depict the same person, or identify the specific people in a given image. Instead, AI face cropping with Images is intentionally limited to face detection, or identifying the pixels that represent a human face.

From pixels to people

Our first step was to select an open-source model that met our requirements. Behind the scenes, our AI face cropping uses RetinaFace, a convolutional neural network model that classifies images with human faces.

A neural network is a type of machine learning process that loosely resembles how the human brain works. A basic neural network has three parts: an input layer, one or more hidden layers, and an output layer. Nodes in each layer form an interconnected network to transmit and process data, where each input node is connected to nodes in the next layer.

A fully connected layer passes data from one layer to the next.

Data enters through the input layer and is passed to the first hidden layer. The computation is done in the hidden layers, and the result is eventually delivered through the output layer.
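As a rough illustration of this flow, the sketch below passes two input values through one hidden layer and one output layer. The weights and biases are hand-picked for readability rather than learned, and all names (`denseLayer`, `forward`, `relu`) are illustrative.

```javascript
// One fully connected layer: each output node sums every input,
// scaled by a weight, adds a bias, then applies an activation function.
function denseLayer(inputs, weights, biases, activation) {
  return weights.map((row, i) => {
    const sum = row.reduce((acc, w, j) => acc + w * inputs[j], biases[i]);
    return activation(sum);
  });
}

const relu = (x) => Math.max(0, x); // common hidden-layer activation
const identity = (x) => x;

// Hand-picked values: 2 inputs -> 3 hidden nodes -> 1 output.
const hiddenWeights = [[1, -1], [0.5, 0.5], [-1, 1]];
const hiddenBiases = [0, 0, 0];
const outputWeights = [[1, 2, 1]];
const outputBiases = [0.5];

function forward(input) {
  const hidden = denseLayer(input, hiddenWeights, hiddenBiases, relu);
  return denseLayer(hidden, outputWeights, outputBiases, identity);
}

console.log(forward([2, 1])); // [ 4.5 ]
```

In a trained network, the weights and biases are learned from labeled examples rather than written by hand, and there are usually many more nodes and layers.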

A convolutional neural network (CNN) mirrors how humans look at things. When we look at other people, we start with abstract features, like the outline of their body, before we process specific features, like the color of their eyes or the shape of their lips.

Similarly, a CNN processes an image piece-by-piece before delivering the final result. Earlier layers look for abstract features like edges, colors, and lines; subsequent layers become more complex and are each responsible for identifying the various features that comprise a human face. The last fully connected layer combines all categorized features to produce one final classification of the entire image. In other words, if an image contains all of the individual features that define a human face (e.g. eyes, nose), then the CNN concludes that the image contains a human face.
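The piece-by-piece processing in early layers can be sketched as a small kernel sliding across the image. The example below uses a hand-written vertical-edge kernel on a tiny grayscale image; a real CNN learns its kernels during training, and `convolve2d` is an illustrative name, not a function from RetinaFace.

```javascript
// Slide a kernel over a grayscale image (2D array) and sum the element-wise
// products at each position. Like most CNN layers, this does not flip the
// kernel; it uses stride 1 and no padding.
function convolve2d(image, kernel) {
  const kh = kernel.length;
  const kw = kernel[0].length;
  const out = [];
  for (let y = 0; y + kh <= image.length; y++) {
    const row = [];
    for (let x = 0; x + kw <= image[0].length; x++) {
      let sum = 0;
      for (let ky = 0; ky < kh; ky++) {
        for (let kx = 0; kx < kw; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// 4x4 image: dark left half (0), bright right half (1).
const image = [
  [0, 0, 1, 1],
  [0, 0, 1, 1],
  [0, 0, 1, 1],
  [0, 0, 1, 1],
];
// Responds where brightness increases from left to right.
const verticalEdge = [
  [-1, 1],
  [-1, 1],
];
const featureMap = convolve2d(image, verticalEdge);
console.log(featureMap); // [ [ 0, 2, 0 ], [ 0, 2, 0 ], [ 0, 2, 0 ] ]
```

The resulting feature map is nonzero only in the middle column, where the dark and bright halves meet. Later layers combine many such maps to detect larger structures.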

We needed a model that could determine whether an image depicts a person (image classification), as well as exactly where they are in the image (object detection). When selecting a model, some factors we considered were:

Performance on the WIDERFACE dataset. This is the state-of-the-art face detection benchmark dataset, which contains 32,203 images of 393,703 labeled faces with a high degree of variability in scale, pose, and occlusion.
Speed (in frames per second). Most of our image optimization requests occur on delivery (rather than before an image gets uploaded to storage), so we prioritized performance for end-user delivery.
Model size. Smaller model sizes run more efficiently.
Quality. The performance boost from smaller models often comes at the cost of quality; the key is balancing speed with results.

Our initial test sample contained 500 images with varying factors like the number of faces in the image, face size, lighting, sharpness, and angle. We tested various models, including BlazeFast, R-CNN (and its successors Fast R-CNN and Faster R-CNN), RetinaFace, and YOLO (You Only Look Once).

Two-stage detectors like BlazeFast and R-CNN propose potential object locations in an image, then identify objects in those regions of interest. One-stage detectors like RetinaFace and YOLO predict object locations and classes in a single pass. In our research, we observed that two-stage detector methods provided higher accuracy, but performed too slowly to be practical for real traffic. On the other hand, one-stage detector methods were efficient and performant while still highly accurate.

Ultimately, we selected RetinaFace, which showed the highest precision of 99.4% and performed faster than other models with comparable values. We found that RetinaFace delivered strong results even with images containing multiple blurry faces:

Photograph by Anne Nygård (@polarmermaid) on Unsplash

Inference, the process of using trained models to make decisions, can be computationally demanding, especially with very large images. To maintain efficiency, we set a maximum size limit of 1024x1024 pixels when sending images to the model.

We pass images within these dimensions directly to the model for analysis. But if either width or height dimension exceeds 1024 pixels, then we instead create an inference image to send to the model; this is a smaller copy that retains the same aspect ratio as the original image and does not exceed 1024 pixels in either dimension. For example, a 125x2000 image will be downscaled to 64x1024. Creating this resized, temporary version reduces the amount of data that the model needs to analyze, enabling faster processing.
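The resizing rule can be sketched as follows. This is an approximation under stated assumptions: the function name `inferenceSize` is hypothetical, and rounding to the nearest pixel is a guess, since the text doesn't say how fractional dimensions are handled.

```javascript
const MAX_INFERENCE_SIDE = 1024; // size limit described in the text

// Returns the dimensions of the image sent to the model, plus the scale
// factor (inference size divided by original size) needed to map results back.
function inferenceSize(width, height, maxSide = MAX_INFERENCE_SIDE) {
  const longest = Math.max(width, height);
  if (longest <= maxSide) {
    return { width, height, scale: 1 }; // small enough: send as-is
  }
  const scale = maxSide / longest; // same factor on both axes keeps the aspect ratio
  return {
    // Rounding is an assumption; clamp to at least 1 pixel.
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
    scale,
  };
}

console.log(inferenceSize(125, 2000)); // width 64, height 1024
console.log(inferenceSize(2048, 2048)); // width 1024, height 1024
console.log(inferenceSize(800, 600)); // unchanged, scale 1
```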

The model draws all of the bounding boxes, or the regions within an image that define the detected faces. From there, we construct a new, outer bounding box that encompasses all of the individual boxes, calculating its top-left and bottom-right points based on the boxes that are closest to the top, left, bottom, and right edges of the image.

The top-left point uses the x coordinate from the left-most box and the y coordinate from the top-most box. Similarly, the bottom-right point uses the x coordinate from the right-most box and the y coordinate from the bottom-most box. These coordinates can be taken from the same bounding boxes; if a single box is closest to both the top and left edges, then we would use its top-left corner as the top-left point of the outer bounding box.

AI face cropping identifies regions that represent faces, then determines an outer bounding box and focal point based on the top-most, left-most, right-most, and bottom-most bounding boxes.

Once we define the outer bounding box, we use its center coordinates as the focal point when cropping the image. From our experiments, we found that this produced better and more balanced results for images with multiple faces compared to other methods, like establishing the new focal point around the largest detected face.
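A minimal sketch of these two steps, assuming each detected face is reported as a box with left, top, right, and bottom pixel coordinates measured from the image’s top-left corner. The box format and the function names are illustrative rather than the model’s actual output schema.

```javascript
// Smallest box that contains every detected face box.
function outerBoundingBox(boxes) {
  if (boxes.length === 0) return null; // no faces detected
  return {
    left: Math.min(...boxes.map((b) => b.left)), // x from the left-most box
    top: Math.min(...boxes.map((b) => b.top)), // y from the top-most box
    right: Math.max(...boxes.map((b) => b.right)), // x from the right-most box
    bottom: Math.max(...boxes.map((b) => b.bottom)), // y from the bottom-most box
  };
}

// Center of the outer box, used as the focal point for cropping.
function focalPoint(box) {
  return { x: (box.left + box.right) / 2, y: (box.top + box.bottom) / 2 };
}

// Three made-up face boxes for illustration.
const faces = [
  { left: 100, top: 200, right: 180, bottom: 300 },
  { left: 400, top: 150, right: 470, bottom: 240 },
  { left: 250, top: 320, right: 330, bottom: 420 },
];
const outer = outerBoundingBox(faces);
console.log(outer); // { left: 100, top: 150, right: 470, bottom: 420 }
console.log(focalPoint(outer)); // { x: 285, y: 285 }
```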

The cropped image area is calculated based on the dimensions of the outer bounding box (“d”) and a specified zoom level (“z”) in the formula (1 ÷ z) × d. The zoom parameter accepts floating points between 0 and 1, where we crop the image to the bounding box when zoom=1 and include more of the area around the box as zoom trends toward 0.

Consider an original image that is 2048x2048. First, we create an inference image that is 1024x1024 to meet our size limits for face detection. Second, we define the outer bounding box using the model’s predictions; we’ll use 100x500 for this example. At zoom=0.5, our formula generates a crop area that is twice as large as the bounding box, with new width (“w”) and height (“h”) dimensions of 200x1000: w = (1 ÷ 0.5) × 100 = 200 and h = (1 ÷ 0.5) × 500 = 1000.

We also apply a min function that chooses the smaller number between the input dimensions and the calculated dimensions, ensuring that the new width and height never exceed the dimensions of the image itself. In other words, if you try to zoom out too much, then we use the full width or height of the image instead of defining a crop area that will extend beyond the edge of the image. For example, at zoom=0.25, our formula yields an initial crop area of 400x2000. Here, since the calculated height (2000) is larger than the input height (1024), we use the input height to set the crop area to 400x1024.

Finally, we need to scale the crop area back to the size of the original image. This applies only when a smaller inference image is created.

We initially downscaled the original 2048x2048 image by a factor of 2 to create the 1024x1024 inference image. This means that we need to multiply the dimensions of the crop area—400x1024 in our latest example—by 2 to produce our final result: a cropped image that is 800x2048.
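Putting the formula, the min function, and the final rescaling together gives roughly the following. This is a sketch of the arithmetic described above, not the production code; `cropArea` and its parameter names are hypothetical, `scale` is the inference size divided by the original size, and rounding the result is an assumption.

```javascript
// box: outer bounding box dimensions, measured on the inference image
// zoom: value greater than 0 and at most 1
// inference: dimensions of the inference image
// scale: inference size / original size (1 if no inference image was created)
function cropArea({ box, zoom, inference, scale }) {
  // (1 ÷ z) × d, capped so the crop never extends past the image edge
  const width = Math.min(inference.width, box.width / zoom);
  const height = Math.min(inference.height, box.height / zoom);
  // Map the crop area back to the coordinate space of the original image
  return {
    width: Math.round(width / scale),
    height: Math.round(height / scale),
  };
}

const shared = {
  box: { width: 100, height: 500 },
  inference: { width: 1024, height: 1024 },
  scale: 0.5, // the 2048x2048 original was halved to 1024x1024
};

console.log(cropArea({ ...shared, zoom: 0.5 })); // { width: 400, height: 2000 }
console.log(cropArea({ ...shared, zoom: 0.25 })); // { width: 800, height: 2048 }
```

At zoom=0.5, the 200x1000 crop area on the inference image becomes 400x2000 on the original. At zoom=0.25, the height is capped at 1024 before scaling, which yields the 800x2048 result from the example.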

The architecture behind the earliest build

In the beta version, we rewrote the model using TensorFlow Rust to make it compatible with our existing Rust-based stack. All of the computations for inference—where the model classifies and locates human faces—were executed on CPUs within our network.

Initially, this worked well and we saw near-realtime results.

However, the limitations of our implementation became apparent when we started receiving consistent alerts that our underlying Images service was nearing its limits for memory usage. The increased memory usage didn’t line up with any recent deployments around this time, but a hunch led us to discover that the face cropping compute time graph had an uptick that matched the uptick in memory usage. Further tracing confirmed that AI face cropping was at the root of the problem.

When a service runs out of memory, it terminates its processes to free up memory and prevent the system from crashing. Since CPU-based implementations share RAM with other processes, this can potentially cause errors for other image optimization operations. In response, we switched our memory allocator from glibc malloc to jemalloc. This allowed us to use less memory at runtime, saving about 20 TiB of RAM globally. We also started culling the number of face cropping requests to limit CPU usage.

At this point, AI face cropping was already limited to our own internal uses and a small number of beta customers. These steps only temporarily reduced our memory consumption. They weren’t sufficient for handling global traffic, so we looked toward a more scalable design for long-term use.

Doing more with less (memory)

With memory usage alerts looming in the distance, it became clear that we needed to move to a GPU-based approach.

Unlike with CPUs, a GPU-based implementation avoids contention with other processes because memory access is typically dedicated and managed more tightly. We partnered with the Workers AI team, who created a framework for internal teams to integrate payloads into their model catalog for GPU access.

Some Workers AI models have their own standalone containers; this isn’t practical for every model, as routing traffic to multiple containers can be expensive. When using a GPU through Workers AI, the data needs to travel over the network, which can introduce latency. This is where model size is especially relevant, as network transport overhead becomes more noticeable with larger models.

To address this, Workers AI wraps smaller models in a single container and utilizes a latency-sensitive routing algorithm to identify the best instance to serve each payload. This means that models can be offloaded when there is no traffic.

A scheduler is used to optimize how, and when, models in the same container interact with GPUs.

RetinaFace runs on 1 GB of VRAM on the smallest GPU; it’s small enough that it can be hot swapped at runtime alongside similarly sized models. If there is a call for the RetinaFace model, then the Python code will be loaded into the environment and executed.

As expected, we saw a significant drop in memory usage after we moved the feature to Workers AI. Now, each instance of our Images service consumes about 150 MiB of memory.

With this new approach, memory leaks pose less concern to the overall availability of our service. Workers AI executes models within containers, so they can be terminated and restarted as needed without impacting other processes. Since face cropping runs separately from our Images service, restarting it won’t halt our other image optimization operations.

Applying AI face cropping to our blog

As part of our beta launch, we updated the Cloudflare blog to apply AI face cropping on author images.

Authors can submit their own images, which appear as circular profile pictures in both the main blog feed and individual blog posts. By default, CSS centers images within their containers, making off-centered head positions more obvious. When two profile pictures include different amounts of negative space, this can also lead to a visual imbalance where authors’ faces appear at different scales:

AI face cropping makes posts with multiple authors appear more balanced.

In the example above, Austin’s original image is cropped tightly around his face. On the other hand, Taylor’s original image includes his torso and a larger margin of the background. As a result, Austin’s face appears larger and closer to the center than Taylor’s does. After we applied AI face cropping to profile pictures on the blog, their faces appear more similar in size, creating more balance and cohesion on their co-authored post.

A new era of image editing, now in Images

Many developers already use Images to build scalable media pipelines. Our goal is to accelerate image workflows by automating rote, manual tasks.

For the Images team, this is only the beginning. We plan to release new AI capabilities, including features like background removal and generative upscale. You can try AI face cropping for free by enabling transformations in the Images dashboard.