Fetch the tracked alerts without depending on the task state#235253
ersin-erdal merged 14 commits into elastic:main
Conversation
Pinging @elastic/response-ops (Team:ResponseOps)
    const result = await this.search({
      size: (opts.maxAlerts || DEFAULT_MAX_ALERTS) * 2,
      seq_no_primary_term: true,
      size: opts.flappingSettings.lookBackWindow,
Do you think we need the last 20 executions worth of alerts? Alerts from the most recent execution should each carry along their own flapping history and we update "ongoing recovered" alerts for flapping with the latest execution UUID. I'm worried in the worst case, we'll be returning 20 x 1000 alerts whereas previously we'd be returning 2 * 1000. Or am I misunderstanding the query?
Yeah, I also realized that, but a smaller size may cause us to miss some of the alerts of the last execution.
Yeah, in the worst-case scenario, if the rule generates 1000 new alerts on each execution, the query returns 20,000 alerts. Under normal circumstances, even if it generates 1000 alerts, they remain ongoing and only the last execution would carry them.
Actually, this is the main difference between this query and the old getTrackedAlertsByExecutionUuids. Both return the alerts of the last 20 executions, but the old one has a limit of 2000 for all the alerts.
As discussed offline, we should try to find a way to avoid the worst-case scenario, where we return 1000 alerts from each of the 20 previous executions.
Possible options:
- splitting the query in two: first a collapse query (without the inner_hits clause) to get the last 20 execution UUIDs, and then a second query that uses those execution UUIDs to fetch alerts (limiting the number of alerts that can be returned)
- seeing if there's a way to limit the total size of inner_hits returned within the single query
- setting a flag on the ongoing recovered alerts to indicate that they shouldn't be returned for summary alert queries, while still updating the execution UUID to the latest
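The first option could be sketched roughly like this. This is only an illustration of the split-query idea under discussion: the builder functions, field names, and result shapes are assumptions, not the actual alerts client code.

```typescript
// Hypothetical sketch of the split-query approach discussed above.
// Field names and function names are assumptions for illustration.
const EXECUTION_UUID_FIELD = 'kibana.alert.rule.execution.uuid';

// Step 1: collapse query (no inner_hits) that only returns the last
// `lookBackWindow` execution UUIDs for the rule.
function buildExecutionUuidsQuery(ruleId: string, lookBackWindow: number) {
  return {
    size: lookBackWindow,
    _source: false,
    fields: [EXECUTION_UUID_FIELD],
    query: { bool: { must: [{ term: { 'kibana.alert.rule.uuid': ruleId } }] } },
    collapse: { field: EXECUTION_UUID_FIELD },
    sort: [{ '@timestamp': { order: 'desc' } }], // newest executions first
  };
}

// Step 2: fetch the alerts for those executions with a hard size cap,
// so the worst case is bounded by maxAlerts rather than 20 x 1000.
function buildAlertsByExecutionQuery(executionUuids: string[], maxAlerts: number) {
  return {
    size: maxAlerts,
    query: {
      bool: { filter: [{ terms: { [EXECUTION_UUID_FIELD]: executionUuids } }] },
    },
  };
}
```

The second query's `size` is what bounds the total number of returned alerts, which the single collapse query with per-group `inner_hits` cannot do.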
x-pack/platform/plugins/shared/alerting/server/alerts_client/alerts_client.ts
          { terms: { [ALERT_UUID]: uuidsToFetch } },
        ],
        must: [{ term: { [ALERT_RULE_UUID]: this.options.rule.id } }],
        filter: [{ terms: { [ALERT_RULE_EXECUTION_UUID]: executionUuids } }],
Do you think we need to exclude status: untracked in this query? We didn't before, but I think that might have been an oversight.
Oh! Thanks for pointing that out, I overlooked it.
Actually, the filter should be here; having it in the other one may cause us to skip an execution.
Done.
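The exclusion agreed on in this thread could look something like the sketch below. The status field name and value are assumptions based on the discussion, not the exact filter that was committed.

```typescript
// Hypothetical helper adding an "exclude untracked alerts" clause to a
// bool query. The field name 'kibana.alert.status' and the value
// 'untracked' are assumptions for illustration.
function withUntrackedExcluded(filter: Array<Record<string, unknown>>) {
  return {
    bool: {
      filter, // existing clauses (rule id, execution UUIDs, ...)
      must_not: [{ term: { 'kibana.alert.status': 'untracked' } }],
    },
  };
}
```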
    const result = await this.search({
      size: uuidsToFetch.length,
    const alerts = await this.search({
      size: (opts.maxAlerts || DEFAULT_MAX_ALERTS) * 2,
Should we sort by @timestamp here too?
It would be useless IMO; we just need all the alerts from the last 20 executions, and order doesn't matter.
ymao1
left a comment
LGTM. Left a small nit.
Verified creating a rule on main that creates active alerts, and switching to this branch. Rule continues running and getting alerts correctly. Verified throwing an error as described in verification instructions. Rule runs in next execution with no error. Verified downgrading back to main and running rule, rule runs correctly, using alert UUIDs from task state to continue getting alerts.
Approving but would love to get @doakalexi to take a look at the flapping logic to ensure the "ongoing recovered" alerts are queried correctly for flapping purposes.
    if (uuidsToFetch.length <= 0) {
      return [];
    }
    const executionUuids = executions.hits
Not sure if anything can go wrong with the query results, but to be safe, could we add some optional accessors here and default to []? Like (executions?.hits ?? [])
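The suggested defensive access might look like the following sketch; the result shape and field name are assumptions for illustration, not the real search response types.

```typescript
// Sketch of defensive access over possibly-missing query results.
// The ExecutionsResult shape and the execution-UUID field name are
// assumptions, not the actual alerts client types.
interface ExecutionsResult {
  hits?: Array<{ fields?: Record<string, string[]> }>;
}

function extractExecutionUuids(executions?: ExecutionsResult): string[] {
  // Optional accessors with a [] default make a missing or malformed
  // result yield an empty list instead of throwing.
  return (executions?.hits ?? [])
    .map((hit) => hit.fields?.['kibana.alert.rule.execution.uuid']?.[0])
    .filter((uuid): uuid is string => uuid != null);
}
```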
doakalexi
left a comment
Tested locally with flapping alerts, and LGTM! The alerts behaved as expected.
💛 Build succeeded, but was flaky
Resolves: elastic#190376

Rule execution fails after persisting alerts; therefore the alerts and the execution UUIDs in the task state cannot be updated. On the next execution, the same alert is reported, but since the last execution UUID wasn't added to the task state, the alert doc doesn't come back in the tracked alerts. It is therefore considered a new alert, but as it was already persisted in the previous execution, it gets a conflict error.

This PR solves this problem by fetching the tracked alerts without depending on the task state.

The new query groups the alerts of the running rule by execution UUID and fetches as many executions as the flapping lookback window. Each execution-UUID group returns all the alerts that belong to it under `inner_hits`.

### To verify:
1. Create an always-firing Elasticsearch Query rule with a `1 hour` run interval.
2. Let the rule run and create an alert.
3. Apply the diff below:
```
diff --git a/x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts b/x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts
index a5dcfa0..bb60c761740 100644
--- a/x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts
+++ b/x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts
@@ -438,6 +438,8 @@ export class TaskRunner<
     recoveredAlertsToReturn = alerts.rawRecoveredAlerts;
   }

+  throw new Error('fail');
+
   return {
     metrics: ruleRunMetricsStore.getMetrics(),
     state: {
```
4. Wait for Kibana to restart.
5. Run the rule on the UI by using "Run rule".
6. Observe the error message in the terminal.
7. Remove the above change and wait for Kibana to restart.
8. Run the rule on the UI by using "Run rule".

The rule should run without any error and update the alert and the task state.

The same scenario should fail on main.
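The grouping described in this PR can be sketched as an Elasticsearch query body. The builder function, field names, and sizes below are illustrative assumptions based on the description, not the exact alerts client implementation.

```typescript
// Hypothetical sketch of the tracked-alerts query: collapse alert docs
// by execution UUID so each recent execution returns its own alerts
// under inner_hits. Names and sizes are assumptions for illustration.
const ALERT_RULE_UUID = 'kibana.alert.rule.uuid';
const ALERT_RULE_EXECUTION_UUID = 'kibana.alert.rule.execution.uuid';

interface TrackedAlertsQueryOpts {
  ruleId: string;
  lookBackWindow: number; // number of recent executions to return
  maxAlertsPerExecution: number;
}

function buildTrackedAlertsQuery(opts: TrackedAlertsQueryOpts) {
  return {
    size: opts.lookBackWindow, // one collapsed hit per execution
    query: {
      bool: {
        must: [{ term: { [ALERT_RULE_UUID]: opts.ruleId } }],
      },
    },
    collapse: {
      field: ALERT_RULE_EXECUTION_UUID,
      inner_hits: {
        name: 'alerts_by_execution',
        size: opts.maxAlertsPerExecution,
      },
    },
    // Sorting newest-first makes the collapsed groups correspond to the
    // most recent executions within the lookback window.
    sort: [{ '@timestamp': { order: 'desc' } }],
  };
}
```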
(cherry picked from commit 753d1cd)

# Conflicts:
# x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.test.ts
# x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
(cherry picked from commit 753d1cd)

# Conflicts:
# x-pack/platform/plugins/shared/alerting/server/alerts_client/alerts_client.test.ts
# x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.test.ts
# x-pack/platform/plugins/shared/alerting/server/task_runner/task_runner.ts
…235253) (#242967)

# Backport

This will backport the following commits from `main` to `8.19`:
- [Fetch the tracked alerts without depending on the task state (#235253)](#235253)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport)
…235253) (#242965)

# Backport

This will backport the following commits from `main` to `9.1`:
- [Fetch the tracked alerts without depending on the task state (#235253)](#235253)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport)