Fix race condition in CancellableRateLimitedFluxIterator by nicktindall · Pull Request #141323 · elastic/elasticsearch

nicktindall · 2026-01-27T02:11:21Z

When an error occurs with the download, the reactive framework calls org.elasticsearch.repositories.azure.CancellableRateLimitedFluxIterator#onError which sets done=true, then clears the queue, then sets error=t.

It's possible that hasNext sees done=true but does not yet see error=t, in which case hasNext returns false. We take that to mean we've reached the end of the sequence of chunks, when in fact the iteration terminated due to an error. hasNext should throw the error in this case so the consumer knows they've read an incomplete sequence.

This iterator terminates with an error rather infrequently because currently we rely on retries in the Azure client (i.e. we'd only see it terminate with an error once the configured retries are exhausted). With the shift to using the common retry infrastructure, this code path will be executed for every individual failure, so it becomes more likely that we will hit this race condition.

This change makes the update to done and error atomic, which removes the above edge case.
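As a rough illustration of the idea (not the actual patch), the two flags can be folded into one immutable value published through a single field, so a reader can never observe done=true without the matching error. Only the DoneState(boolean done, Throwable error) record shape and the doneState name come from the diff excerpts in this PR. The class name, the constants, the volatile field and the simplified hasNext (which ignores the queue) are assumptions made for this sketch.

```java
// Minimal sketch, assuming a single volatile field is used to publish the state.
final class DoneStateSketch {
    /** Immutable snapshot: 'done' and 'error' are always read together. */
    record DoneState(boolean done, Throwable error) {
        static final DoneState NOT_DONE = new DoneState(false, null);
        static final DoneState COMPLETED = new DoneState(true, null);

        static DoneState failed(Throwable t) {
            return new DoneState(true, t);
        }
    }

    // One write publishes both values; there is no window with done=true and error unset.
    private volatile DoneState doneState = DoneState.NOT_DONE;

    void onError(Throwable t) {
        doneState = DoneState.failed(t);
    }

    void onComplete() {
        doneState = DoneState.COMPLETED;
    }

    /** Simplified: true while running, false on clean completion, throws on failure. */
    boolean hasNext() {
        DoneState state = doneState; // single read, so both fields are consistent
        if (state.done() == false) {
            return true;
        }
        if (state.error() != null) {
            throw new RuntimeException(state.error());
        }
        return false;
    }
}
```

With two separate fields, the same hasNext could read done=true and a still-null error and report a clean end of stream. Reading one reference rules that interleaving out.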

elasticsearchmachine · 2026-01-27T02:11:58Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2026-01-27T02:12:21Z

Hi @nicktindall, I've created a changelog YAML for you.

nicktindall · 2026-01-27T02:16:55Z

...e/src/main/java/org/elasticsearch/repositories/azure/CancellableRateLimitedFluxIterator.java

-            cancelSubscription();
-            signalConsumer();
+            if (doneState.done() && doneState.error() != null) {
+                throw new RuntimeException(doneState.error());


It's possible we responded true to hasNext() then an error occurs between then and the call to next(). We'd have cleared the queue in the meantime, meaning we end up here. If that is the case, just throw the error that occurred.

Note that in onError we set the doneState THEN call clearQueue, so in the above scenario, if we see an empty queue here, we'll see the error in doneState. It may mean that we allow consumption of an item after the error occurs, but that's not a big deal because the caller will receive the error on their next call to hasNext.
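The ordering argument above can be sketched as follows. This is a simplified stand-in, not the real iterator: the class name, the ConcurrentLinkedQueue, the onNext method and the NoSuchElementException fallback are assumptions for illustration, and rate limiting and subscription handling are left out.

```java
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch, assuming a concurrent queue and a volatile state field.
final class NextAfterErrorSketch<T> {
    record DoneState(boolean done, Throwable error) {}

    private final Queue<T> queue = new ConcurrentLinkedQueue<>();
    private volatile DoneState doneState = new DoneState(false, null);

    void onNext(T item) {
        queue.add(item);
    }

    void onError(Throwable t) {
        doneState = new DoneState(true, t); // publish the error first...
        queue.clear();                      // ...then clear the queue
    }

    T next() {
        T item = queue.poll();
        if (item != null) {
            // May still hand out an item after the error was set; the caller
            // then sees the error on a later call.
            return item;
        }
        // The queue is empty. If that is because onError cleared it, the state
        // was written before the clear, so the error is expected to be visible here.
        DoneState state = doneState;
        if (state.done() && state.error() != null) {
            throw new RuntimeException(state.error());
        }
        throw new NoSuchElementException();
    }
}
```

Reversing the two statements in onError would reopen the gap: a consumer could find the queue empty while doneState still reports no error.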

nicktindall · 2026-01-27T02:22:13Z

.../test/java/org/elasticsearch/repositories/azure/CancellableRateLimitedFluxIteratorTests.java

+            running.set(false);
+            safeAwait(endBarrier);
+        }
+    }


This test reproduces the issue fairly quickly if you run with

while ./gradlew :modules:repository-azure:test --tests "org.elasticsearch.repositories.azure.CancellableRateLimitedFluxIteratorTests.testConcurrentErrorAndHasNext" -Dtests.iters=1000 --rerun --fail-fast; do echo again; done

ywangd

LGTM

ywangd · 2026-01-27T04:53:04Z

...e/src/main/java/org/elasticsearch/repositories/azure/CancellableRateLimitedFluxIterator.java

+    /**
+     * This is used to set 'done' and 'error' atomically
+     */
+    private record DoneState(boolean done, Throwable error) {


Can we assert something like assert done == true || error == null?

Added in 3744b53
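The contents of 3744b53 are not shown here. One way such an invariant could be expressed, assuming a compact canonical constructor on the record (the enclosing class name and the message are hypothetical), is:

```java
// Minimal sketch; the real assertion in 3744b53 may be written differently.
final class DoneStateAssertionSketch {
    record DoneState(boolean done, Throwable error) {
        DoneState {
            // An error may only be recorded together with done=true.
            // Only checked when assertions are enabled (java -ea).
            assert done || error == null : "error must not be set unless done";
        }
    }
}
```

The three reachable states (not done, done without error, done with error) all satisfy the condition. Only done=false with a non-null error trips it.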

ywangd · 2026-01-27T05:12:02Z

.../test/java/org/elasticsearch/repositories/azure/CancellableRateLimitedFluxIteratorTests.java

+        final var endBarrier = new CyclicBarrier(3);
+        final var running = new AtomicBoolean(true);
+        final var outstandingRequests = new AtomicInteger(0);
+        for (int i = 0; i < 20; i++) {


Is this loop necessary? It is basically here to run the test 20 times sequentially so that it is more likely to reproduce the failure?

Yeah it was just to make it more likely to reproduce. It still only takes ~15ms so I think it's good just to make it more likely to surface the issue?

That's ok. Mostly making sure my understanding is correct.

elasticsearchmachine · 2026-01-27T06:46:54Z

💚 Backport successful

Status	Branch	Result
✅	9.3
✅	9.2


Fix race condition in CancellableRateLimitedFluxIterator

b0776e7

nicktindall added the >bug and :Distributed/Snapshot/Restore labels Jan 27, 2026

elasticsearchmachine added the Team:Distributed Coordination (obsolete) and v9.4.0 labels Jan 27, 2026

Update docs/changelog/141323.yaml

62d7b1e


nicktindall requested a review from ywangd January 27, 2026 02:22

ywangd approved these changes Jan 27, 2026


Add assertion on done/error

3744b53

nicktindall added the auto-backport, v9.3.1 and v9.2.5 labels Jan 27, 2026

nicktindall merged commit 7c244ee into elastic:main Jan 27, 2026
35 of 36 checks passed

This was referenced Jan 27, 2026

[9.3] Fix race condition in CancellableRateLimitedFluxIterator (#141323) #141327

Merged

[9.2] Fix race condition in CancellableRateLimitedFluxIterator (#141323) #141328

Merged

nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request Jan 27, 2026

Fix race condition in CancellableRateLimitedFluxIterator (elastic#141323)

371036d

nicktindall deleted the fix_race_condition_flux_iterator branch January 27, 2026 06:48

nicktindall added a commit to nicktindall/elasticsearch that referenced this pull request Jan 27, 2026

Fix race condition in CancellableRateLimitedFluxIterator (elastic#141323)

0897018

nicktindall mentioned this pull request Jan 27, 2026

Reapply "Use common retry logic for Azure (#139422)" #141329

Merged

elasticsearchmachine pushed a commit that referenced this pull request Jan 27, 2026

Fix race condition in CancellableRateLimitedFluxIterator (#141323) (#141328)

f3563a4

elasticsearchmachine pushed a commit that referenced this pull request Jan 27, 2026

Fix race condition in CancellableRateLimitedFluxIterator (#141323) (#141327)

296c01a

schase-es pushed a commit to schase-es/elasticsearch that referenced this pull request Jan 28, 2026

Fix race condition in CancellableRateLimitedFluxIterator (elastic#141323)

db54de7

nicktindall mentioned this pull request Feb 5, 2026

Fix CancellableRateLimitedFluxIteratorTests, fix potential race condition #141896

Merged

nicktindall merged 3 commits into elastic:main from nicktindall:fix_race_condition_flux_iterator
