Skip to content

Fix race condition in RemoteClusterService.collectNodes()#131937

Merged
JeremyDahlgren merged 9 commits intoelastic:mainfrom
JeremyDahlgren:fix/rcs-collect-nodes-race-condition
Jul 30, 2025
Merged

Fix race condition in RemoteClusterService.collectNodes()#131937
JeremyDahlgren merged 9 commits intoelastic:mainfrom
JeremyDahlgren:fix/rcs-collect-nodes-race-condition

Conversation

@JeremyDahlgren
Copy link
Contributor

It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException and adds a fix.

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
@JeremyDahlgren JeremyDahlgren added >bug Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. :Distributed Coordination/Distributed v9.2.0 labels Jul 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


@Override
public void onFailure(Exception e) {
assertNotNull(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can verify the exception is NoSuchRemoteClusterException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the test in a loop and we can hit a few different exceptions here. I refactored to test them as done in testCollectNodes() above.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comment about early-completion. Also could we rework this to just do the lookup with get rather than the pre-flight check? You'd have to store the looked-up connections in some intermediate collection but that's ok, this shouldn't be on a hot path and the collection won't be unmanageably massive.

RemoteClusterConnection connection = this.remoteClusters.get(cluster);
// Ensure the connection is not null, it could have been removed since the containsKey() call above.
if (connection == null) {
if (countDown.fastForward()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of using fastForward like this - it means the listener completes early, potentially while other collectNodes calls are still running, which can cause duplicated work and other confusions. The same concern applies to the fastForward call that already existed.

Instead, could we adjust this all to use a RefCountingListener? That would (a) delay the completion properly and (b) collect multiple failures (up to 10) as well as (c) being more idiomatic of modern Elasticsearch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David, much cleaner!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JeremyDahlgren JeremyDahlgren merged commit ccf9893 into elastic:main Jul 30, 2025
33 checks passed
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Jul 31, 2025
…131937)

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Jul 31, 2025
…131937)

It is possible for a linked remote to get unlinked in
between the containsKey() and get() calls in collectNodes().
This change adds a test that produces the NullPointerException
and adds a fix.
@repantis repantis added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Distributed Coordination/Distributed labels Jan 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.2.0

5 participants