Ensure Agent configuration and state persist across restarts in Fleet mode by naemono · Pull Request #8856 · elastic/cloud-on-k8s

naemono · 2025-10-16T19:03:55Z

Resolves #8819
Related: elastic/elastic-agent#5185

Testing procedure

Brought Elasticsearch, Kibana, and the agents online. Verified the Fleet page in Kibana in Fleet mode, and verified the logs of the agents themselves in non-Fleet mode. Restarted/killed agent pods and confirmed that the agents listed in the Fleet UI in Kibana did not change names and that no additional agents were added.

For advanced configuration:
Add the following to the Agent manifest:

  config:
    fleet:
      enabled: true
    providers.kubernetes:
      add_resource_metadata:
        deployment: true

Exec into a running agent pod, execute elastic-agent inspect, and verify that the following is present in the output:

providers:
  kubernetes:
    add_resource_metadata:
      deployment: true
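
The check described above could be scripted roughly as follows. This is a minimal sketch and not part of the PR: the function name `has_deployment_metadata`, the namespace, and the pod name are hypothetical, and it only does a textual match on the inspect output rather than parsing the YAML. Note that a later comment in this thread questions whether elastic-agent inspect reliably reflects the configured providers.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the text on stdin contains
# "add_resource_metadata:" immediately followed by "deployment: true".
has_deployment_metadata() {
  grep -A1 'add_resource_metadata:' | grep -q 'deployment: true'
}

# Usage (namespace and pod name are placeholders):
#   kubectl exec -n <namespace> <agent-pod> -- elastic-agent inspect | has_deployment_metadata \
#     && echo "advanced config present" || echo "advanced config missing"
```

The function reads from stdin so the same check can be run against saved output as well as a live pod.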

Tested


prodsecmachine · 2025-10-16T19:04:09Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scanner	Critical	High	Medium	Low	Total (0)
✅	Licenses	0	0	0	0	0 issues
✅	Open Source Security	0	0	0	0	0 issues


naemono · 2025-10-17T00:59:32Z

buildkite test this

pebrc · 2025-10-17T07:27:57Z

On 8.13.0 I ran into:

 Error: fail to read action store '/usr/share/elastic-agent/state/data/action_store.yml': yaml: input error: fail to decode bytes: cipher: message authentication failed

Also, we need to verify that advanced configuration still works.

pebrc · 2025-10-17T08:18:09Z

Another issue on 9.1.2 when upgrading from ECK 3.1 to this PR:

Error: fail to read state store '/usr/share/elastic-agent/state/data/state.enc': failed migrating YAML store JSON store: could not parse YAML
fail to decode bytes: cipher: message authentication failed

pebrc · 2025-10-17T09:23:49Z

The errors I highlighted above all come from the migration from the status quo in 3.1, where CONFIG_PATH is /etc/agent, to the new CONFIG_PATH. Because the agent stores the information needed to decrypt its state in CONFIG_PATH, that information is lost on migration (or on any container restart, I guess).

The guidance from @pkoutsovasilis is to either force a re-enroll with FLEET_FORCE=true or develop a migration process that, on the first attempt, deletes fleet.enc and fleet.enc.lock from the state path. We need tracking for the migration (it should be applied only once, when migrating from 3.1 to 3.2 or greater). I am wondering if we could use a marker file in the state path to track this and rely on a shell script. An alternative would be an annotation on the pod itself.

if [[ ! -f "/usr/share/elastic-agent/state/eck.config_migrated" ]]; then
  echo "Attempting to remove fleet.enc and fleet.enc.lock from state path (ignore if not present)"
  rm -f "/usr/share/elastic-agent/state/fleet.enc" "/usr/share/elastic-agent/state/fleet.enc.lock" 2>/dev/null || true
  echo "Creating eck.config_migrated marker"
  touch "/usr/share/elastic-agent/state/eck.config_migrated"
fi

~~What I don't know yet is how we detect the condition, that we actually need to delete these files. Ideally this should only happen when migrating from 3.1. to >=3.2~~

Update: I suspect the migration is only necessary if a user went back and forth between CONFIG_PATH=/etc/agent and CONFIG_PATH=STATE_PATH, as I did during testing. So I think we could provide this script proactively to support in case users switch back and forth between ECK versions and run into this. But we do not need to add this to the ECK orchestration logic.
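
If the script is handed to support, one option is to wrap the marker-file logic in a function that takes the state path as an argument, so it can be tried against a scratch directory before touching a real agent. This is only a sketch under that assumption: the function name `migrate_state_once` is hypothetical, and the file names are the ones from the script above.

```shell
#!/bin/sh
# Hypothetical wrapper around the marker-file logic. The state path is passed
# in as $1 so the function can be exercised on a throwaway directory.
migrate_state_once() {
  mso_state_path="$1"
  mso_marker="${mso_state_path}/eck.config_migrated"
  if [ -f "${mso_marker}" ]; then
    # Marker present: the cleanup already ran once, so do nothing.
    echo "already migrated"
    return 0
  fi
  # Remove the encrypted Fleet files if present (rm -f ignores missing files).
  rm -f "${mso_state_path}/fleet.enc" "${mso_state_path}/fleet.enc.lock"
  touch "${mso_marker}"
  echo "migrated"
}

# Usage against a throwaway directory:
#   scratch="$(mktemp -d)"; touch "${scratch}/fleet.enc"
#   migrate_state_once "${scratch}"   # removes fleet.enc, creates the marker
#   migrate_state_once "${scratch}"   # second run leaves files untouched
```

The second call is a no-op, which is the "apply only once" property the marker file is meant to provide.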


naemono · 2025-10-19T20:52:54Z

8.13 fleet-mode is throwing errors with this setup, specifically with the fleet-server agent:

Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied

I haven't investigated the issue yet.

Fleet mode on 7.17, 8.18, and 9.x doesn't have the issue.

naemono · 2025-10-19T21:28:47Z

Testing advanced (9.2.0-snap) doesn't seem to work:

[root@gke-mmontgomery-dev-clus-default-pool-dbf006f3-xyov elastic-agent]# elastic-agent inspect | grep add_resource_metadata
[root@gke-mmontgomery-dev-clus-default-pool-dbf006f3-xyov elastic-agent]#

Is this what you wanted applied, @pebrc? I'm not terribly familiar with this advanced config.

❯ kc get agent -n elastic-stack  eck-stack-eck-agent -o yaml | yq '.spec.config'
providers.kubernetes:
  add_resource_metadata:
    deployment: true

pebrc · 2025-10-20T11:35:28Z

> Testing advanced (9.2.0-snap) doesn't seem to work:

I think you are right. The implementation from this PR seems to break advanced config mode.
Only if I manually delete state/elastic-agent.yml (rm state/elastic-agent.yml) and restart the agent does it pick up the advanced config specified with the -c flag. This sounds like a bug to me. cc @pkoutsovasilis

The previous implementation shipped with 3.1 declared /etc/agent to be the config directory, so anything we put there was automatically picked up.

pebrc · 2025-10-20T13:09:28Z

After speaking with @pkoutsovasilis: the relevant code on the agent side is at https://github.com/elastic/elastic-agent/blob/6186951dfbad2a7e0e1a37c26097d5b4d9d38dba/internal/pkg/agent/cmd/container.go#L853-L858 and there seems to be a bug in that this code only checks whether the file is there but does not take its contents into account.

We could look into an init container that removes the file if it exists to force agent to copy it again.

pebrc · 2025-10-20T13:52:53Z

> We could look into an init container that removes the file if it exists to force agent to copy it again.

This might not be necessary after all. It seems that elastic-agent inspect is not a reliable way to check the configured providers. We probably need to test against the actually ingested data to see if it works.

pebrc

LGTM. I did extensive testing on this one. @rhr323, we need a known issue entry for this one: if users downgrade to 3.1 or earlier and then upgrade again to 3.2, they might run into errors like

Error: fail to read state store '/usr/share/elastic-agent/state/data/state.enc': failed migrating YAML store JSON store: could not parse YAML
fail to decode bytes: cipher: message authentication failed

or

 Error: fail to read action store '/usr/share/elastic-agent/state/data/action_store.yml': yaml: input error: fail to decode bytes: cipher: message authentication failed

In these cases they should add the FLEET_FORCE=true environment variable to their manifest to force the agent to enroll anew (it can be removed once the agent has re-enrolled).
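
Both errors quoted above share the same decryption failure, so a quick way to spot affected agents is to look for that string in the pod logs. The following is a minimal sketch, not something from the PR: the function name `needs_force_reenroll`, the namespace, and the pod name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the log text on stdin contains the
# decryption failure that appears in both errors quoted above.
needs_force_reenroll() {
  grep -q 'cipher: message authentication failed'
}

# Usage (namespace and pod name are placeholders):
#   kubectl logs -n <namespace> <agent-pod> | needs_force_reenroll \
#     && echo "consider setting FLEET_FORCE=true temporarily"
```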

…eet mode (elastic#8856)

* Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode.
* Add comment
* Maybe add init container.
* Fix to be deployment specific
* remove unneeded call
* Add some logging
* Fix bug
* Add check for existing init container
* Add volume mount
* Remove test code to add init container.
* Adjust test expecations to new command

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Co-authored-by: Peter Brachwitz <peter.brachwitz@elastic.co>
(cherry picked from commit 63bff02)

rhr323 · 2025-10-21T08:03:20Z

💚 All backports created successfully

Status	Branch	Result
✅	3.2

Questions ?

Please refer to the Backport tool documentation


We ran into issues with Fleet server no longer enrolling with the changes from #8856. This proposes to version-gate the functionality to Elastic Agent versions that actually support advanced configuration.



Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode.

86d97d7

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

botelastic Bot added the triage label Oct 16, 2025

Add comment

29875d1

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

pebrc reviewed Oct 17, 2025


Comment thread pkg/controller/agent/pod.go

naemono added the >bug Something isn't working label Oct 17, 2025

botelastic Bot removed the triage label Oct 17, 2025

naemono added 8 commits October 17, 2025 13:49

Maybe add init container.

55c32f2

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Fix to be deployment specific

d1058df

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

remove unneeded call

784a032

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Add some logging

ead15c7

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Fix bug

ec688df

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Add check for existing init container

8000c57

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Add volume mount

2de2338

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Remove test code to add init container.

f25b325

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

Adjust test expecations to new command

dd0e4b6

naemono marked this pull request as ready for review October 20, 2025 15:27

naemono changed the title ~~WIP: Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode~~ Oct 20, 2025

pebrc approved these changes Oct 20, 2025


rhr323 merged commit 63bff02 into elastic:main Oct 21, 2025
9 checks passed

rhr323 mentioned this pull request Oct 21, 2025

[3.2] Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode (#8856) #8859

Merged

pebrc mentioned this pull request Oct 22, 2025

Gate advanced Fleet config logic to Agent v8.13 and later #8869

Merged

rhr323 added the v3.2.0 label Oct 27, 2025

rhr323 changed the title ~~Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode~~ Oct 27, 2025

rhr323 mentioned this pull request Oct 27, 2025

[3.2] Release notes #8844

Merged

rhr323 changed the title ~~Set CONFIG_PATH and STATE_PATH to the same directory in Fleet mode~~ Oct 30, 2025

pebrc mentioned this pull request Dec 5, 2025

fleet-server "failed to fetch elasticsearch version" - ECK install on OpenShift isn't working #8111

Closed

rhr323 merged 11 commits into elastic:main from naemono:adjust-agent-config-path
