Avoid stack overflow in IndicesClusterStateService applyClusterState#132536

Merged
fcofdez merged 16 commits into elastic:main from albertzaharovits:fix-3855
Aug 27, 2025

@albertzaharovits
Contributor

@albertzaharovits albertzaharovits commented Aug 7, 2025

Every cluster state applied in IndicesClusterStateService can chain a new RefCountingListener onto a chain of such listeners. If the chain grows too long, the unlucky thread that decrements the ref count of the listener at the head of the chain to 0 ends up calling each listener in turn and, assuming each call drops the next ref count to 0 as well, traverses the whole chain on its own stack, possibly resulting in a StackOverflowError.

This fix chains at most 8 RefCountingListener instances; the 9th one is forked onto a generic thread when it comes to execute.
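
To illustrate the mechanism described above, here is a minimal sketch in plain Java. It is not the code from this PR: plain `Runnable`s stand in for the chained `RefCountingListener`s, a single-thread executor stands in for the generic thread pool, and the class and method names (`BoundedListenerChain`, `maxObservedDepth`) are hypothetical. It only measures how deeply the callbacks nest on one thread when the chain is fully inline versus when every 8th link forks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: Runnables stand in for chained RefCountingListeners.
final class BoundedListenerChain {
    // Number of listener frames currently nested on the calling thread's stack.
    private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

    /**
     * Builds a chain of {@code links} callbacks where each one triggers the next,
     * completes the head, and returns the deepest nesting seen on any single thread.
     * Every {@code maxInline}-th link hands its successor to another thread
     * instead of calling it directly.
     */
    static int maxObservedDepth(int links, int maxInline) {
        // Assumption: stands in for Elasticsearch's generic thread pool.
        ExecutorService generic = Executors.newSingleThreadExecutor();
        try {
            AtomicInteger maxDepth = new AtomicInteger();
            CountDownLatch finished = new CountDownLatch(1);
            Runnable next = finished::countDown; // tail of the chain
            for (int i = links; i >= 1; i--) {
                final Runnable downstream = next;
                final boolean fork = i % maxInline == 0;
                next = () -> {
                    int depth = DEPTH.get() + 1;
                    DEPTH.set(depth);
                    maxDepth.accumulateAndGet(depth, Math::max);
                    try {
                        if (fork) {
                            generic.execute(downstream); // successor starts on a fresh stack
                        } else {
                            downstream.run();            // successor nests on this stack
                        }
                    } finally {
                        DEPTH.set(depth - 1);
                    }
                };
            }
            next.run(); // "decrement the head's ref count to 0"
            if (!finished.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("chain did not complete");
            }
            return maxDepth.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            generic.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline depth:  " + maxObservedDepth(1000, Integer.MAX_VALUE));
        System.out.println("bounded depth: " + maxObservedDepth(1000, 8));
    }
}
```

In this sketch the fully inline chain nests once per link, so its depth grows with the chain length, while the bounded variant never nests more than 8 listener frames on one thread.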

@albertzaharovits albertzaharovits self-assigned this Aug 7, 2025
@albertzaharovits albertzaharovits added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v9.2.0 v8.19.2 v9.1.2 labels Aug 7, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Aug 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @albertzaharovits, I've created a changelog YAML for you.

@albertzaharovits
Contributor Author

Honestly, I think I prefer that every chained listener be executed on a generic thread, for code simplicity's sake.

Contributor

@DaveCTurner DaveCTurner left a comment

I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Also can you cover this in a test?

lastClusterStateShardsClosedListener = new SubscribableListener<>();
currentClusterStateShardsClosedListeners = new RefCountingListener(lastClusterStateShardsClosedListener);
try {
    previousShardsClosedListener.addListener(currentClusterStateShardsClosedListeners.acquire());
Contributor

Hmm are you sure we should move all this listener stuff below doApplyClusterState()?

Contributor Author

I can't think of any impact to execution.

Contributor Author

But I've put it back at the original place.

@albertzaharovits
Contributor Author

> I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Pushed 3a00599

@albertzaharovits
Contributor Author

@DaveCTurner can you take another look please?

I've changed the code to avoid linking listeners when the applied cluster state doesn't close any shards.
I've also added a test asserting that every runnable registered before the oldest incomplete shard-close listener has run, while the later ones have not.
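
The two points above can be sketched in plain Java. This is an assumption-laden illustration rather than the PR's code or its test: `CompletableFuture` stands in for the `SubscribableListener`/`RefCountingListener` pair, and `ShardsClosedTracker`, `applyState`, `onShardsClosed`, and `chainLength` are hypothetical names. The tracker only links a new stage when the applied state actually closes shards, and a callback runs once every close it was registered behind has completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: CompletableFuture stands in for the listener types used in the PR.
final class ShardsClosedTracker {
    // Completes once every shard close requested by the states applied so far has finished.
    private CompletableFuture<Void> lastShardsClosed = CompletableFuture.completedFuture(null);
    private int chainLength;

    /** Applies one simulated cluster state; each future completes when one shard has closed. */
    synchronized void applyState(List<CompletableFuture<Void>> shardCloses) {
        if (shardCloses.isEmpty()) {
            return; // common case: nothing new to wait for, so do not extend the chain
        }
        List<CompletableFuture<?>> deps = new ArrayList<>(shardCloses);
        deps.add(lastShardsClosed); // still wait for closes from earlier states
        lastShardsClosed = CompletableFuture.allOf(deps.toArray(new CompletableFuture<?>[0]));
        chainLength++;
    }

    /** Runs the action once all closes known at registration time are done (immediately if they already are). */
    synchronized void onShardsClosed(Runnable action) {
        lastShardsClosed.thenRun(action);
    }

    synchronized int chainLength() {
        return chainLength;
    }

    public static void main(String[] args) {
        ShardsClosedTracker tracker = new ShardsClosedTracker();
        List<String> ran = new ArrayList<>();
        CompletableFuture<Void> closeA = new CompletableFuture<>();
        CompletableFuture<Void> closeB = new CompletableFuture<>();

        tracker.applyState(List.of(closeA));
        tracker.onShardsClosed(() -> ran.add("afterA"));
        tracker.applyState(List.of());            // closes nothing: chain is not extended
        tracker.applyState(List.of(closeB));
        tracker.onShardsClosed(() -> ran.add("afterB"));

        closeA.complete(null);                    // only the oldest close finishes
        System.out.println("chain length: " + tracker.chainLength());
        System.out.println("afterA ran: " + ran.contains("afterA"));
        System.out.println("afterB ran: " + ran.contains("afterB"));
    }
}
```

Callbacks registered on the same stage may fire in any order in this sketch, so the checks are on which callbacks have run rather than on their sequence.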

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@fcofdez
Contributor

fcofdez commented Aug 26, 2025

@elasticmachine update branch

@fcofdez
Contributor

fcofdez commented Aug 26, 2025

@elasticmachine test this

@fcofdez
Contributor

fcofdez commented Aug 27, 2025

@elasticmachine update branch

@fcofdez fcofdez merged commit eb75ba3 into elastic:main Aug 27, 2025
33 checks passed
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Dec 14, 2025
…lastic#132536)

elasticsearchmachine pushed a commit that referenced this pull request Dec 14, 2025
…rState (#139499)

* Avoid stack overflow in IndicesClusterStateService applyClusterState (#132536)


* MockTransportService.createNewService
elasticsearchmachine pushed a commit that referenced this pull request Dec 14, 2025
…State (#139498)

* Avoid stack overflow in IndicesClusterStateService applyClusterState (#132536)


* MockTransportService.createNewService
Labels

>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v8.19.10 v9.1.10 v9.2.0

5 participants