
Retry transient errors on agents search#243105

Merged
MichelLosier merged 3 commits into elastic:main from MichelLosier:retry-transient-errors-on-agents-search
Nov 17, 2025

Conversation

@MichelLosier MichelLosier commented Nov 14, 2025

Summary

Partially resolves: #243098

Searching agents on /api/fleet/agents can sometimes return a no_shard_available_action_exception while we wait for the first agent to be enrolled in Fleet and for the .fleet-agents index to exist.

This PR mitigates this issue by adding retries on transient ES errors.
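
As a rough illustration of the idea (not the code in this PR), a retry on transient errors can be sketched as below. The helper names `isTransientError` and `retryOnTransientError`, the error shape, the list of retryable status codes, and the backoff values are all assumptions made for this example.

```typescript
// Hypothetical sketch, not the Fleet implementation.
// Error types treated as transient; this list is an assumption for illustration.
const TRANSIENT_ERROR_TYPES = new Set(['no_shard_available_action_exception']);

// Assumed minimal shape of an Elasticsearch-style error.
interface EsLikeError {
  statusCode?: number;
  body?: { error?: { type?: string } };
}

// Returns true when the failure looks temporary and the call is worth retrying.
function isTransientError(err: unknown): boolean {
  const e = err as EsLikeError | null;
  if (!e || typeof e !== 'object') return false;
  const type = e.body?.error?.type;
  if (type !== undefined && TRANSIENT_ERROR_TYPES.has(type)) return true;
  // Assumed set of retryable HTTP statuses.
  return e.statusCode === 429 || e.statusCode === 503 || e.statusCode === 504;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `operation`, retrying with exponential backoff while the failure is transient.
// Non-transient errors, and the error from the final attempt, are rethrown unchanged.
async function retryOnTransientError<T>(
  operation: () => Promise<T>,
  { maxAttempts = 5, baseDelayMs = 100 }: { maxAttempts?: number; baseDelayMs?: number } = {}
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxAttempts || !isTransientError(err)) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Bounding the number of attempts keeps a persistent outage from turning into an unbounded wait; the caller still sees the original error once the attempts run out.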

Release notes

  • Adds retry behavior for /api/fleet/agents when transient issues with ES are encountered.

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regressions, or potential data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@MichelLosier MichelLosier requested a review from a team as a code owner November 14, 2025 22:01
@botelastic botelastic bot added the Team:Fleet (Team label for Observability Data Collection Fleet team) label Nov 14, 2025
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic juliaElastic left a comment

LGTM

@nchaulet

Should we wrap all the calls to es.search in that crud file with retryTransientEsErrors?
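
One way to apply retries uniformly, rather than call by call, is to wrap the client once so every search goes through the same retry function. The following is only a sketch of that pattern: `SearchClient`, `Retry`, `withRetriedSearch`, and `retryThreeTimes` are hypothetical names for this example and do not reflect the signature of retryTransientEsErrors or the real Elasticsearch client.

```typescript
// Hypothetical minimal shape of a search client; the real Elasticsearch client is richer.
interface SearchClient<Req, Res> {
  search(request: Req): Promise<Res>;
}

// Generic retry function type; any implementation with this shape can be plugged in.
type Retry = <T>(operation: () => Promise<T>) => Promise<T>;

// Returns a client whose `search` always goes through `retry`,
// so individual call sites cannot forget to wrap the call.
function withRetriedSearch<Req, Res>(
  client: SearchClient<Req, Res>,
  retry: Retry
): SearchClient<Req, Res> {
  return {
    search: (request: Req) => retry(() => client.search(request)),
  };
}

// Illustrative retry only: up to three attempts, no backoff, retries on any error.
// A real helper would retry only on errors it considers transient.
async function retryThreeTimes<T>(operation: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

The trade-off is that a blanket wrapper also retries searches where a fast failure might be preferable, so wrapping selected calls explicitly remains a reasonable alternative.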

@nchaulet nchaulet left a comment

🚀

@MichelLosier MichelLosier force-pushed the retry-transient-errors-on-agents-search branch from e2ff26b to 3a6cfa4 Compare November 17, 2025 17:40
@MichelLosier MichelLosier enabled auto-merge (squash) November 17, 2025 17:49
@MichelLosier MichelLosier merged commit 2d89709 into elastic:main Nov 17, 2025
12 checks passed
@elasticmachine

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @MichelLosier

@kibanamachine

Starting backport for target branches: 8.19, 9.1, 9.2

https://github.com/elastic/kibana/actions/runs/19441976998

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 17, 2025
Partially resolves: elastic#243098

We can sometimes get a `no_shard_available_action_exception` when
searching agents on `/api/fleet/agents` while awaiting the first agent
to be enrolled in fleet, and for the `.fleet-agents` index to exist.

This PR mitigates this issue by adding retries on transient ES errors.

## Release notes

* Adds retry behavior for `/api/fleet/agents` endpoints when transient issues with
ES are encountered.

(cherry picked from commit 2d89709)
@kibanamachine

💔 Some backports could not be created

Status Branch Result
8.19 Backport failed because of merge conflicts
9.1
9.2

Note: Successful backport PRs will be merged automatically after passing CI.

Manual backport

To create the backport manually run:

node scripts/backport --pr 243105

Questions ?

Please refer to the Backport tool documentation

@MichelLosier

💚 All backports created successfully

Status Branch Result
8.19

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

MichelLosier added a commit to MichelLosier/kibana that referenced this pull request Nov 17, 2025
Partially resolves: elastic#243098

We can sometimes get a `no_shard_available_action_exception` when
searching agents on `/api/fleet/agents` while awaiting the first agent
to be enrolled in fleet, and for the `.fleet-agents` index to exist.

This PR mitigates this issue by adding retries on transient ES errors.

## Release notes

* Adds retry behavior for `/api/fleet/agents` endpoints when transient issues with
ES are encountered.

(cherry picked from commit 2d89709)

# Conflicts:
#	x-pack/platform/plugins/shared/fleet/server/services/agents/crud.ts
kibanamachine added a commit that referenced this pull request Nov 17, 2025
# Backport

This will backport the following commits from `main` to `9.2`:
- [Retry transient errors on agents search
(#243105)](#243105)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)

Co-authored-by: Michel Losier <michel.losier@elastic.co>
kibanamachine added a commit that referenced this pull request Nov 17, 2025
# Backport

This will backport the following commits from `main` to `9.1`:
- [Retry transient errors on agents search
(#243105)](#243105)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)

Co-authored-by: Michel Losier <michel.losier@elastic.co>
MichelLosier added a commit that referenced this pull request Nov 18, 2025
# Backport

This will backport the following commits from `main` to `8.19`:
- [Retry transient errors on agents search
(#243105)](#243105)

<!--- Backport version: 10.1.0 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)

eokoneyo pushed a commit to eokoneyo/kibana that referenced this pull request Dec 2, 2025
Partially resolves: elastic#243098

We can sometimes get a `no_shard_available_action_exception` when
searching agents on `/api/fleet/agents` while awaiting the first agent
to be enrolled in fleet, and for the `.fleet-agents` index to exist.

This PR mitigates this issue by adding retries on transient ES errors.

## Release notes

* Adds retry behavior for `/api/fleet/agents` endpoints when transient issues with
ES are encountered.

Labels

  • backport:version (Backport to applied version labels)
  • release_note:fix
  • Team:Fleet (Team label for Observability Data Collection Fleet team)
  • v8.19.8
  • v9.1.8
  • v9.2.2
  • v9.3.0

5 participants