Do not apply further shard snapshot status updates after shard snapshot is complete#127250
Conversation
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @DiannaHohensee, I've created a changelog YAML for you. |
| // For example, a delayed/retried PAUSED update should not override a completed shard snapshot. | ||
| iterator.remove(); | ||
| return; | ||
| } |
There was a problem hiding this comment.
actual bug fix
| final ShardSnapshotStatus updatedState; | ||
| final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get(); | ||
| final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId); | ||
| if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) { |
There was a problem hiding this comment.
This should always be non-null - the builder starts with a copy of existingShardSnapshotStatuses and just overwrites the bits that get updated.
There was a problem hiding this comment.
Ohh, that took an unexpected turn. Thanks, updated.
| ), | ||
| ActionTestUtils.assertNoFailureListener(t -> {}) | ||
| ), | ||
| // Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS. |
There was a problem hiding this comment.
This is not realistic as a single batch of updates for a single shard - snapshot2 won't start snapshotting this shard until snapshot1 has applied a state update which completes it.
Can we instead randomize the order of the updates? Or randomly add a PAUSED_FOR_NODE_REMOVAL before and/or after the SUCCESS one and verify that the outcome is SUCCESS in every case (and the next snapshot moves from QUEUED to INIT)?
There was a problem hiding this comment.
I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.
server/src/test/java/org/elasticsearch/snapshots/SnapshotsServiceTests.java
Outdated
Show resolved
Hide resolved
| snapshot1, | ||
| shardId, | ||
| null, | ||
| SnapshotsInProgress.ShardSnapshotStatus.success( |
There was a problem hiding this comment.
This should also apply to PAUSED_FOR_NODE_REMOVAL either side of a FAILED right? Can we randomly choose between SUCCESS and FAILED?
There was a problem hiding this comment.
Good idea, done 👍
…iceTests.java Co-authored-by: David Turner <david.turner@elastic.co>
DiannaHohensee
left a comment
There was a problem hiding this comment.
Thanks for the review, applied the changes 👍
| final ShardSnapshotStatus updatedState; | ||
| final var newShardSnapshotStatusesBuilder = newShardSnapshotStatusesSupplier.get(); | ||
| final var newShardSnapshotStatus = newShardSnapshotStatusesBuilder.get(shardSnapshotId); | ||
| if (newShardSnapshotStatus != null && newShardSnapshotStatus.state().completed()) { |
There was a problem hiding this comment.
Ohh, that took an unexpected turn. Thanks, updated.
| snapshot1, | ||
| shardId, | ||
| null, | ||
| SnapshotsInProgress.ShardSnapshotStatus.success( |
There was a problem hiding this comment.
Good idea, done 👍
| ), | ||
| ActionTestUtils.assertNoFailureListener(t -> {}) | ||
| ), | ||
| // Snapshot 2 will apply PAUSED and then SUCCESS, and the final status will be SUCCESS. |
There was a problem hiding this comment.
I've randomized the order of PAUSED and COMPLETE updates, and randomly add a third PAUSED/COMPLETE update (with a different nodeId to verify it's ignored). And checked the QUEUED -> INIT transition.
DaveCTurner
left a comment
There was a problem hiding this comment.
LGTM (one optional request on the tests)
| // Randomly add another update that will be ignored because the shard snapshot is complete. | ||
| // Note: the originalNodeId is used for this update, so we can verify afterward that the update is not applied. | ||
| randomBoolean() ? completedUpdateOnOriginalNode : pausedUpdateOnOriginalNode |
There was a problem hiding this comment.
Ideally I'd like us to only sometimes have this third update.
There was a problem hiding this comment.
Sure. I've if-else'd it, for lack of a better notion.
DaveCTurner
left a comment
There was a problem hiding this comment.
sorry forgot to press the LGTM button
|
I think this failure matches pre-existing failures. I can't fathom how the other failure is related: |
Relates ES-11375