Ensure paused shard snapshot can be deleted #141408

Merged
elasticsearchmachine merged 3 commits into elastic:main from ywangd:ensure-paused-shard-snapshot-deletable
Jan 28, 2026

Conversation

@ywangd
Member

@ywangd ywangd commented Jan 28, 2026

When a shard snapshot is paused due to node shutdown, the associated snapshot can be deleted before the shard snapshot transitions to another state. When this happens, we ensure that the shard snapshot is deleted directly, without going back to the data node, where the abort would be incorrectly ignored.

@ywangd ywangd requested a review from DaveCTurner January 28, 2026 06:59
@ywangd ywangd added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs auto-backport Automatically create backport pull requests when merged v9.3.1 v9.4.0 v9.2.6 labels Jan 28, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 28, 2026
@ywangd
Member Author

ywangd commented Jan 28, 2026

See this buildscan for the stress test failure caused by this bug. When a shard snapshot is already PAUSED on the data node, an abort signal from the master node is simply ignored. I think there is no need to publish the abort change to the data node; instead, the master node can mark the shard snapshot as FAILED directly. This PR does that.
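
To illustrate the idea, here is a minimal sketch of the master-side decision when a snapshot is deleted. It is not the actual Elasticsearch code: the class `AbortOnDeleteSketch`, the method `onSnapshotDeleted`, and the reduced `ShardState` enum are hypothetical stand-ins, and the real logic handles more states and details.

```java
public class AbortOnDeleteSketch {
    // Simplified, hypothetical subset of shard snapshot states (not the real enum).
    enum ShardState { INIT, PAUSED_FOR_NODE_REMOVAL, ABORTED, FAILED, SUCCESS }

    /** Hypothetical master-side decision for one shard when its snapshot is deleted. */
    static ShardState onSnapshotDeleted(ShardState current) {
        switch (current) {
            case INIT:
                // Work is still running on a data node: publish ABORTED so the node stops it.
                return ShardState.ABORTED;
            case PAUSED_FOR_NODE_REMOVAL:
                // Nothing is running and the data node would ignore an abort,
                // so the master marks the shard as FAILED directly.
                return ShardState.FAILED;
            default:
                // Already completed, failed, or aborting: leave the state unchanged.
                return current;
        }
    }

    public static void main(String[] args) {
        for (ShardState s : ShardState.values()) {
            System.out.println(s + " -> " + onSnapshotDeleted(s));
        }
    }
}
```

The point of the sketch is only that the paused case never round-trips through the data node.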

@ywangd
Member Author

ywangd commented Jan 28, 2026

Related: if the snapshot is deleted while the shard snapshot is PAUSING, the data node also ignores the abort signal. But when it completes the pause and publishes the PAUSED_FOR_NODE_REMOVAL state change, the master correctly moves the ABORTED shard snapshot to FAILED:

if (existing.state() == ShardState.ABORTED
    && shardSnapshotStatusUpdate.updatedState.state() == ShardState.PAUSED_FOR_NODE_REMOVAL) {
    // concurrently pausing the shard snapshot due to node shutdown and aborting the snapshot - this shard is no longer
    // actively snapshotting but we don't want it to resume, so mark it as FAILED since it didn't complete
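
The quoted snippet is truncated, so as a rough, self-contained illustration of that race, the sketch below models how the master might resolve a late PAUSED_FOR_NODE_REMOVAL update against an already ABORTED shard. The class `PauseAbortRaceSketch`, the method `applyUpdate`, and the reduced `ShardState` enum are hypothetical, and accepting every other update unchanged is a simplifying assumption rather than the real behaviour.

```java
public class PauseAbortRaceSketch {
    // Simplified, hypothetical subset of shard snapshot states (not the real enum).
    enum ShardState { INIT, PAUSED_FOR_NODE_REMOVAL, ABORTED, FAILED, SUCCESS }

    /** Hypothetical resolution of a data-node status update against the master's current state. */
    static ShardState applyUpdate(ShardState existing, ShardState updated) {
        if (existing == ShardState.ABORTED && updated == ShardState.PAUSED_FOR_NODE_REMOVAL) {
            // The pause finished after the abort was published: the shard is no longer
            // snapshotting and must not resume, so it ends up FAILED.
            return ShardState.FAILED;
        }
        // Assumption for this sketch only: any other update is accepted as-is.
        return updated;
    }

    public static void main(String[] args) {
        System.out.println(applyUpdate(ShardState.ABORTED, ShardState.PAUSED_FOR_NODE_REMOVAL));
        System.out.println(applyUpdate(ShardState.INIT, ShardState.PAUSED_FOR_NODE_REMOVAL));
    }
}
```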

@elasticsearchmachine
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

Contributor

@DaveCTurner DaveCTurner left a comment

Nice catch, LGTM.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 28, 2026
@elasticsearchmachine elasticsearchmachine merged commit a2b9254 into elastic:main Jan 28, 2026
35 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.3 Commit could not be cherrypicked due to conflicts
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 141408

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jan 28, 2026
When a shard snapshot is paused due to node shutdown, the associated
snapshot can be deleted before the shard snapshot transition to another
state. When this happens, we ensure such shard snapshot is deleted
directly without going back to the data node where it gets incorrectly
ignored.

(cherry picked from commit a2b9254)

# Conflicts:
#	server/src/internalClusterTest/java/org/elasticsearch/snapshots/SnapshotShutdownIT.java
@ywangd
Member Author

ywangd commented Jan 28, 2026

💚 All backports created successfully

Status Branch Result
9.3
9.2

Questions ?

Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Jan 28, 2026
* Ensure paused shard snapshot can be deleted (#141408)

When a shard snapshot is paused due to node shutdown, the associated
snapshot can be deleted before the shard snapshot transition to another
state. When this happens, we ensure such shard snapshot is deleted
directly without going back to the data node where it gets incorrectly
ignored.

(cherry picked from commit a2b9254)

# Conflicts:
#	server/src/internalClusterTest/java/org/elasticsearch/snapshots/SnapshotShutdownIT.java

* import
Labels

auto-backport: Automatically create backport pull requests when merged
auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
backport pending
>bug
:Distributed/Snapshot/Restore: Anything directly related to the `_snapshot/*` APIs
Team:Distributed Coordination (obsolete): Meta label for Distributed Coordination team. Obsolete. Please do not use.
v9.2.6
v9.3.1
v9.4.0

3 participants