Drain responses on completion for TransportNodesAction by ywangd · Pull Request #130303 · elastic/elasticsearch

ywangd · 2025-06-30T06:26:51Z

This PR ensures the node responses are copied and drained exclusively in onCompletion so that they do not get concurrently modified by cancellation.

Resolves: #128852
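
For readers who want the shape of the fix without opening the diff, here is a minimal sketch of the "drain on completion" idea. It is a toy model under stated assumptions, not the code in `TransportNodesAction`: the class name, the `String` responses, and the method signatures are all hypothetical, and only the `responsesHandled` flag name is borrowed from the discussion in this PR. The point it illustrates is that a single `compareAndSet` decides whether completion or cancellation owns the response list, so the list is never cleared while the completion path is reading it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: every name here is hypothetical and this is not Elasticsearch code.
public class DrainOnCompletionSketch {
    private final List<String> responses = new ArrayList<>();
    // Whichever path flips this flag first (completion or cancellation) owns the list.
    private final AtomicBoolean responsesHandled = new AtomicBoolean();

    // Called once per node response.
    public void onResponse(String response) {
        synchronized (responses) {
            responses.add(response);
        }
    }

    // Completion path: returns a private copy, or null if cancellation won the race.
    public List<String> onCompletion() {
        if (responsesHandled.compareAndSet(false, true) == false) {
            return null;
        }
        synchronized (responses) {
            List<String> copy = new ArrayList<>(responses);
            responses.clear(); // drain so nothing else can observe or modify these entries
            return copy;
        }
    }

    // Cancellation path: discards responses only if completion has not claimed them.
    public boolean onCancel() {
        if (responsesHandled.compareAndSet(false, true) == false) {
            return false;
        }
        synchronized (responses) {
            responses.clear();
        }
        return true;
    }

    public static void main(String[] args) {
        DrainOnCompletionSketch sketch = new DrainOnCompletionSketch();
        sketch.onResponse("node-1");
        sketch.onResponse("node-2");
        List<String> result = sketch.onCompletion();
        boolean cancellationWon = sketch.onCancel();
        System.out.println(result + " cancellationWon=" + cancellationWon);
    }
}
```

In this toy the caller of `onCompletion` iterates over its own copy, so a late `onCancel` has nothing left to interfere with.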

elasticsearchmachine · 2025-06-30T06:27:17Z

Hi @ywangd, I've created a changelog YAML for you.

elasticsearchmachine · 2025-06-30T06:27:18Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

DaveCTurner · 2025-06-30T07:05:08Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+            ) {
+                final var waited = new AtomicBoolean();
+                for (var response : testNodeResponses) {
+                    if (waited.compareAndSet(false, true)) {


This is kind of a convoluted way to wait on a nonempty list. There's no concurrency here so the compareAndSet is a bit of a sledgehammer. Can we just check testNodeResponses.isEmpty()?

This is to wait for only the first response. You are right there is no need for AtomicBoolean. I changed it to a primitive boolean variable.

DaveCTurner · 2025-06-30T07:48:34Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+                boolean waited = false;
+                for (var response : testNodeResponses) {
+                    if (waited == false) {
+                        waited = true;
+                        safeAwait(barrier);
+                        safeAwait(barrier);
+                    }
+                }


Can we not just do this?

Suggested change

-                boolean waited = false;
-                for (var response : testNodeResponses) {
-                    if (waited == false) {
-                        waited = true;
-                        safeAwait(barrier);
-                        safeAwait(barrier);
-                    }
-                }
+                if (testNodeResponses.isEmpty() == false) {
+                    safeAwait(barrier);
+                    safeAwait(barrier);
+                }

Indeed can we not assert that testNodeResponses is nonempty in this test?

The for-loop is to reproduce the ConcurrentModificationException reported in #128852. The test always passes without it.

I see, could you add a comment to that effect or else this'll get "tidied up"

Comment added in fdf0b22
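
To make it concrete why the for-loop matters, here is a standalone sketch of the failure mode. It assumes a plain `ArrayList` of strings and a `CyclicBarrier` standing in for the test's `safeAwait(barrier)` calls; the class and method names are hypothetical. The reader pauses after the first element, a second thread clears the list, and the reader's next iteration step fails with `ConcurrentModificationException`. An `isEmpty()` check would not touch the iterator and so could not hit this.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CyclicBarrier;

// Toy reproduction only: names are hypothetical and this is not the Elasticsearch test.
public class IterateWhileClearedSketch {

    // Returns true if iterating the list hit a ConcurrentModificationException.
    public static boolean iterateWhileCleared(List<String> shared) {
        if (shared.isEmpty()) {
            throw new IllegalArgumentException("need at least one element to pause mid-iteration");
        }
        CyclicBarrier barrier = new CyclicBarrier(2);
        Thread canceller = new Thread(() -> {
            awaitUnchecked(barrier); // wait until the reader is mid-iteration
            shared.clear();          // stands in for cancellation discarding responses
            awaitUnchecked(barrier); // let the reader continue
        });
        canceller.start();

        boolean sawCme = false;
        boolean waited = false;
        try {
            for (String response : shared) {
                if (waited == false) {
                    waited = true;
                    awaitUnchecked(barrier); // tell the canceller we hold an iterator
                    awaitUnchecked(barrier); // wait for the clear to finish
                }
            }
        } catch (ConcurrentModificationException e) {
            sawCme = true;
        }
        try {
            canceller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return sawCme;
    }

    private static void awaitUnchecked(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        List<String> responses = new ArrayList<>(List.of("node-1", "node-2"));
        System.out.println("saw CME: " + iterateWhileCleared(responses));
    }
}
```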

DaveCTurner · 2025-06-30T07:49:46Z

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

+                        assert task instanceof CancellableTask : "expect CancellableTask, but got: " + task;
+                        final var cancellableTask = (CancellableTask) task;
+                        assert cancellableTask.isCancelled();
+                        throw new TaskCancelledException("task cancelled [" + cancellableTask.getReasonCancelled() + "]");


getReasonCancelled is racy according to its Javadocs: "May also be null if the task was just cancelled since we don't set the reason and the cancellation flag atomically." You need to use notifyIfCancelled to get the right behaviour here.

Thanks. Pushed 3d07261. Please let me know if it has used the right listener.
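
The race described above can be illustrated with a small toy model. This is a sketch under assumptions, not `CancellableTask`: it borrows the `notifyIfCancelled` name from the comment, but the class, the `Consumer<String>` listener, and the locking scheme are hypothetical. `markCancelled` and `recordReason` exist only to expose the window between the two writes of a non-atomic cancel; `cancel` and `notifyIfCancelled` show one way to close that window by doing both the writes and the read under the same lock.

```java
import java.util.function.Consumer;

// Toy model only: hypothetical names, not the Elasticsearch CancellableTask.
public class CancelReasonSketch {
    private volatile boolean cancelled;
    private volatile String reason;

    // First half of a non-atomic cancel: the flag flips before the reason is stored.
    public void markCancelled() {
        cancelled = true;
    }

    // Second half: the reason arrives later.
    public void recordReason(String newReason) {
        reason = newReason;
    }

    public boolean isCancelled() {
        return cancelled;
    }

    // Racy read: may return null even though isCancelled() is already true.
    public String getReasonRacy() {
        return reason;
    }

    // Atomic alternative: both writes happen while holding the monitor.
    public synchronized void cancel(String newReason) {
        cancelled = true;
        reason = newReason;
    }

    // Reads flag and reason under the same monitor, then notifies outside the lock.
    public boolean notifyIfCancelled(Consumer<String> onCancelled) {
        String message;
        synchronized (this) {
            if (cancelled == false) {
                return false;
            }
            message = "task cancelled [" + reason + "]";
        }
        onCancelled.accept(message);
        return true;
    }

    public static void main(String[] args) {
        CancelReasonSketch racy = new CancelReasonSketch();
        racy.markCancelled(); // simulate a cancel paused between its two writes
        System.out.println("racy: task cancelled [" + racy.getReasonRacy() + "]");

        CancelReasonSketch safe = new CancelReasonSketch();
        safe.cancel("by user request");
        safe.notifyIfCancelled(System.out::println);
    }
}
```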

ywangd · 2025-06-30T07:56:54Z

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

+                        logger.debug("task cancelled after all responses were collected");
+                        assert task instanceof CancellableTask : "expect CancellableTask, but got: " + task;
+                        final var cancellableTask = (CancellableTask) task;
+                        assert cancellableTask.isCancelled();
+                        throw new TaskCancelledException("task cancelled [" + cancellableTask.getReasonCancelled() + "]");


This change is to address the edge case commented here. But I struggle to write a test for it. Essentially we need the cancel to come in after all node responses are collected but before the AtomicBoolean responsesHandled is checked. One option is to extract the creation of CancellableFanOut into its own protected method plus wrapping the returned value with a delegating CancellableFanOut. But this requires making the 4 protected methods in CancellableFanOut package private. I am a bit suspicious about whether this is the right path to go down. I am open to suggestions.

I'd be content with a test which concurrently completes the action and cancels it, and asserts that we always either get an exception or we get a successful response. I expect such a test would find the bug here pretty reliably.

Cool I added such a test, see fb71e89
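
For illustration, the shape of such a test might look like the sketch below. It is not the test added in fb71e89: it uses only JDK classes, and `ToyCancelledException`, the `CompletableFuture` standing in for the action's listener, and the `responsesHandled` hand-off are all assumptions made for the example. Two threads are released together, one completing and one cancelling, and the only accepted outcomes are the full response or a cancellation failure. It deliberately makes no assertion about a cancellation flag after a successful result.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy race test only: hypothetical names, JDK classes in place of Elasticsearch ones.
public class CompleteVersusCancelSketch {

    // Hypothetical stand-in for a cancellation failure.
    static class ToyCancelledException extends RuntimeException {
        ToyCancelledException(String message) {
            super(message);
        }
    }

    // Runs one race; returns "success" or "cancelled" and throws on any other outcome.
    public static String raceOnce() {
        List<String> expected = List.of("node-1", "node-2");
        AtomicBoolean responsesHandled = new AtomicBoolean();
        CompletableFuture<List<String>> future = new CompletableFuture<>();
        CyclicBarrier start = new CyclicBarrier(2);

        Thread completer = new Thread(() -> {
            awaitUnchecked(start);
            if (responsesHandled.compareAndSet(false, true)) {
                future.complete(expected);
            }
        });
        Thread canceller = new Thread(() -> {
            awaitUnchecked(start);
            if (responsesHandled.compareAndSet(false, true)) {
                future.completeExceptionally(new ToyCancelledException("task cancelled"));
            }
        });
        completer.start();
        canceller.start();

        try {
            List<String> result = future.get(10, TimeUnit.SECONDS);
            if (result.equals(expected) == false) {
                throw new AssertionError("partial response: " + result);
            }
            return "success";
        } catch (ExecutionException e) {
            if (e.getCause() instanceof ToyCancelledException) {
                return "cancelled";
            }
            throw new AssertionError("unexpected failure", e);
        } catch (Exception e) {
            throw new AssertionError("race did not finish", e);
        }
    }

    private static void awaitUnchecked(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            raceOnce();
        }
        System.out.println("every run ended in success or cancellation");
    }
}
```

Repeating the race many times, as `main` does, is what gives a test like this a reasonable chance of hitting the narrow interleaving.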

DaveCTurner · 2025-06-30T09:04:46Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+
+        try {
+            final var testNodesResponse = future.actionGet(SAFE_AWAIT_TIMEOUT);
+            assertFalse(cancellableTask.isCancelled());


I don't think this'll hold in general, we could cancel the task after the completion has already passed the point of no return and then the task's cancellation flag will be set even though it completed successfully.

Yeah good point, Thanks. I removed that in b38783d which also contains a few other tweaks.

DaveCTurner

LGTM

ywangd · 2025-07-01T23:28:38Z

@elasticmachine update branch

ywangd · 2025-07-02T02:31:41Z

@elasticmachine update branch

ywangd · 2025-07-02T22:49:02Z

@elasticmachine update branch

elasticsearchmachine · 2025-07-03T00:27:19Z

💚 Backport successful

Status	Branch	Result
✅	8.19
✅	9.1
✅	9.0

Drain responses on completion for TransportNodesAction

a7daa50

ywangd requested review from DaveCTurner and nicktindall June 30, 2025 06:26

ywangd added >bug v9.0.0 v8.19.0 v9.1.0 :Distributed Coordination/Distributed v9.2.0 labels Jun 30, 2025

elasticsearchmachine added the Team:Distributed Coordination (obsolete) label Jun 30, 2025

Update docs/changelog/130303.yaml

6e8bfbe

ywangd added v9.0.4 and removed v9.0.0 labels Jun 30, 2025

unwanted change

670a175

DaveCTurner reviewed Jun 30, 2025

Use atomicBoolean

4adb4a6

DaveCTurner reviewed Jun 30, 2025

DaveCTurner reviewed Jun 30, 2025

move comment

f990e43

DaveCTurner reviewed Jun 30, 2025

more edge case

3d6140e

DaveCTurner reviewed Jun 30, 2025

ywangd commented Jun 30, 2025

ywangd and others added 4 commits June 30, 2025 18:03

notify cancel

3d07261

[CI] Auto commit changes from spotless

e06c981

test concurrently completing and cancelling

fb71e89

tweak name

9dcbbd0

ywangd requested a review from DaveCTurner June 30, 2025 08:40

ywangd added 2 commits June 30, 2025 19:02

comment for loop

fdf0b22

Merge remote-tracking branch 'origin/main' into es-128852-fix

b521449

DaveCTurner reviewed Jun 30, 2025

ywangd added 2 commits July 1, 2025 12:48

remove assertion for task cancellation

b38783d

Merge remote-tracking branch 'origin/main' into es-128852-fix

d40d44b

ywangd requested a review from DaveCTurner July 1, 2025 03:00

wording

5a0f186

DaveCTurner approved these changes Jul 1, 2025

Merge branch 'main' into es-128852-fix

976203b

ywangd added auto-merge-without-approval and auto-backport labels Jul 1, 2025

Merge branch 'main' into es-128852-fix

6b938fa

Merge branch 'main' into es-128852-fix

7c2c42d

elasticsearchmachine merged commit 74fd66c into elastic:main Jul 3, 2025
33 checks passed

ywangd deleted the es-128852-fix branch July 3, 2025 00:26

This was referenced Jul 3, 2025

[8.19] Drain responses on completion for TransportNodesAction (#130303) #130513

Merged

[9.1] Drain responses on completion for TransportNodesAction (#130303) #130514

Merged

ywangd mentioned this pull request Jul 3, 2025

[9.0] Drain responses on completion for TransportNodesAction (#130303) #130515

Merged

repantis added :Distributed/Distributed and removed :Distributed Coordination/Distributed labels Jan 28, 2026

elasticsearchmachine merged 20 commits into elastic:main from ywangd:es-128852-fix
