Fix race condition in RemoteClusterService.collectNodes()#131937
Fix race condition in RemoteClusterService.collectNodes()#131937JeremyDahlgren merged 9 commits intoelastic:mainfrom
RemoteClusterService.collectNodes()#131937Conversation
It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
|
Hi @JeremyDahlgren, I've created a changelog YAML for you. |
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
|
||
| @Override | ||
| public void onFailure(Exception e) { | ||
| assertNotNull(e); |
There was a problem hiding this comment.
I think we can verify the exception is NoSuchRemoteClusterException.
There was a problem hiding this comment.
I put the test in a loop and we can hit a few different exceptions here. I refactored to test them as done in testCollectNodes() above.
server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java
Outdated
Show resolved
Hide resolved
DaveCTurner
left a comment
There was a problem hiding this comment.
See inline comment about early-completion. Also could we rework this to just do the lookup with get rather than the pre-flight check? You'd have to store the looked-up connections in some intermediate collection but that's ok, this shouldn't be on a hot path and the collection won't be unmanageably massive.
| RemoteClusterConnection connection = this.remoteClusters.get(cluster); | ||
| // Ensure the connection is not null, it could have been removed since the containsKey() call above. | ||
| if (connection == null) { | ||
| if (countDown.fastForward()) { |
There was a problem hiding this comment.
I'm not a fan of using fastForward like this - it means the listener completes early, potentially while other collectNodes calls are still running, which can cause duplicated work and other confusions. The same concern applies to the fastForward call that already existed.
Instead, could we adjust this all to use a RefCountingListener? That would (a) delay the completion properly and (b) collect multiple failures (up to 10) as well as (c) being more idiomatic of modern Elasticsearch.
There was a problem hiding this comment.
Thanks David, much cleaner!
…131937) It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
…131937) It is possible for a linked remote to get unlinked in between the containsKey() and get() calls in collectNodes(). This change adds a test that produces the NullPointerException and adds a fix.
It is possible for a linked remote to get unlinked in between the
containsKey()andget()calls incollectNodes().This change adds a test that produces the
NullPointerExceptionand adds a fix.