Avoid counting snapshot failures twice in SLM (#136759)
nielsbauman merged 2 commits into elastic:main
Conversation
We came across a scenario where 3 snapshot failures were counted as 5 "invocations since last success", resulting in a premature yellow SLM health indicator. The three snapshot failures completed at virtually the same time. Our theory is that the listener of the first snapshot failure already processed the other two snapshot failures (incrementing the `invocationsSinceLastSuccess`), but the listeners of those other two snapshots then incremented that field too. There were two warning logs indicating that the snapshots weren't found in the registered set, confirming our hypothesis. We simply avoid incrementing `invocationsSinceLastSuccess` if the listener failed with an exception and the snapshot isn't registered anymore, assuming that another listener has already incremented the field.
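To illustrate the idea behind the fix, here is a minimal, hypothetical sketch (not the actual Elasticsearch code; the class, set, and counter names are stand-ins): a failure only counts toward the stat if the listener that handles it is the one that removes the snapshot from the registered set, so a second listener that fires for an already-deregistered snapshot leaves the counter alone.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SlmDoubleCountSketch {
    // Hypothetical stand-ins for SLM's registered-snapshot set and stats counter.
    static final Set<String> registered = ConcurrentHashMap.newKeySet();
    static final AtomicInteger invocationsSinceLastSuccess = new AtomicInteger();

    // A failure listener only counts the failure if it is the one that
    // deregisters the snapshot; a concurrent listener that already handled
    // the same snapshot leaves the counter alone.
    static void onSnapshotFailure(String snapshotId) {
        boolean snapshotWasRegistered = registered.remove(snapshotId);
        if (snapshotWasRegistered) {
            invocationsSinceLastSuccess.incrementAndGet();
        }
        // else: another listener already counted this failure
    }

    public static void main(String[] args) {
        registered.add("snap-1");
        registered.add("snap-2");

        // The listener for snap-1 also sweeps up snap-2 (mirroring the first
        // listener processing the other failures it observes).
        onSnapshotFailure("snap-1");
        onSnapshotFailure("snap-2");

        // snap-2's own listener fires afterwards: no double count.
        onSnapshotFailure("snap-2");

        System.out.println(invocationsSinceLastSuccess.get()); // prints 2, not 3
    }
}
```

Without the `snapshotWasRegistered` guard, the third callback would bump the counter again, which is exactly the 3-failures-counted-as-5 symptom described above.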
Pinging @elastic/es-data-management (Team:Data Management)

Hi @nielsbauman, I've created a changelog YAML for you.
Pull Request Overview
Prevents double-counting snapshot failures toward `invocationsSinceLastSuccess` in SLM when multiple snapshot failure listeners process the same failures concurrently.
- Introduces a `snapshotIsRegistered` flag to gate incrementing `invocationsSinceLastSuccess`
- Adds `initiatingSnapshot` to the test setup for the failure cleanup scenario
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java` | Adds a `snapshotIsRegistered` boolean and guards the failure invocation increment to avoid double counting. |
| `x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java` | Updates the test input list to include `initiatingSnapshot` for scenario coverage. |
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java
```diff
  snapshotInfoFailure1.snapshotId(),
- snapshotInfoFailure2.snapshotId()
+ snapshotInfoFailure2.snapshotId(),
+ initiatingSnapshot
```
This test was failing with my other changes. I think this test was originally wrong; since this `WriteJobStatus` runs for `initiatingSnapshot`, it should also be present in the registered snapshots. Please correct me if my reasoning is wrong!
I think you are right; the test was passing before only because of the exact bug you are fixing.
samxbr left a comment
Thanks Niels for looking into this! The fix looks good; my only comment is that it's probably worth adding a test to cover this scenario, to prevent regression (and to absolutely prove our theory!). You can check out `SLMSnapshotBlockingIntegTests` and `SLMStatDisruptionIT` for some similar test examples.
Thanks for having a look, @samxbr. I think that's a great suggestion. However, I've been trying to create an integration test for some time now, but haven't managed to do so... I tried in both the test classes you suggested, and I came closest in … The timing is pretty tricky, as the …
Have you tried using …? I am approving this PR for now; feel free to merge if the test is taking too long (I can give it a try later when I have time too). I don't want to block this fix because it can potentially prevent some OBS alerts.
Yeah, I tried using … `cleanupOldMetadata` is where we call `FinalizeSnapshotContext#onDone`.
So, I modified `SLMStatDisruptionIT`:

```diff
diff --git a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
index 7c9eb5652e9..6e3333a1910 100644
--- a/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
+++ b/x-pack/plugin/slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMStatDisruptionIT.java
@@ -260,13 +260,12 @@ public class SLMStatDisruptionIT extends AbstractSnapshotIntegTestCase {
                 fsc.clusterMetadata(),
                 fsc.snapshotInfo(),
                 fsc.repositoryMetaVersion(),
-                fsc,
-                () -> {
+                ActionListener.runBefore(fsc, () -> {
                     // run the passed lambda before calling the usual callback
                     // this is where the cluster can be restarted before SLM is called back with the snapshotInfo
                     beforeResponseRunnable.run();
-                    fsc.onDone();
-                }
+                }),
+                fsc::onDone
             );
             super.finalizeSnapshot(newFinalizeContext);
         }
```

which does what I want; it actually blocks the SLM listener from being run. However, that results in another issue, because due to this line: … we can only "finalize" one snapshot at a time per repository. That means that if we block the SLM listener until the second snapshot is also registered as failed, we get a deadlock, as the second snapshot will never be registered as failed if the first one isn't fully "finalized" yet. In short, what we need is the ability to block somewhere before we submit this cluster state task: … but I wasn't able to find a way to do that. If you have any suggestions, I'd be more than happy to hear them! I'll go ahead and merge this PR tomorrow. We can always add a test retrospectively if we come up with one.
💚 Backport successful