
Ensure Agent configuration and state persist across restarts in Fleet mode#8856

Merged
rhr323 merged 11 commits into elastic:main from naemono:adjust-agent-config-path
Oct 21, 2025

@naemono
Contributor

@naemono naemono commented Oct 16, 2025

Resolves #8819
Related: elastic/elastic-agent#5185

Testing procedure

Brought Elasticsearch, Kibana, and the Agents online. Verified the Fleet page in Kibana for Fleet mode, and the Agent logs for standalone (non-Fleet) mode. Restarted and killed Agent pods, then confirmed that the Agents listed in the Fleet UI in Kibana kept their names and that no additional Agents were added.

For advanced configuration:
Add the following to the Agent manifest:

  config:
    fleet:
      enabled: true
    providers.kubernetes:
      add_resource_metadata:
        deployment: true

Exec into a running Agent pod, run elastic-agent inspect, and verify that the following is present in the output:

providers:
  kubernetes:
    add_resource_metadata:
      deployment: true
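
The manual check above can be scripted. The following is a minimal sketch, assuming the inspect output is plain YAML text on stdin; the function name has_deployment_metadata is hypothetical, and the match is a loose textual one (any "deployment: true" after an "add_resource_metadata:" line), not a YAML parse.

```shell
# Hypothetical helper: succeeds when the text on stdin contains an
# "add_resource_metadata:" line followed later by "deployment: true".
has_deployment_metadata() {
  awk '
    /add_resource_metadata:/ { seen = 1; next }
    seen && /deployment:[[:space:]]*true/ { found = 1 }
    END { if (found) exit 0; exit 1 }
  '
}

# Possible usage (pod name is a placeholder):
#   kubectl exec <agent-pod> -- elastic-agent inspect | has_deployment_metadata \
#     && echo "advanced config present"
```

A textual match like this only shows that the setting appears in the inspect output; it says nothing about whether the provider is actually applied to ingested data.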

Tested

  • 9.2.0-snapshot (fleet mode)
  • 9.2.0-snapshot (standalone mode)
  • 9.2.0-snapshot (fleet mode + advanced configuration)
  • 8.19 (fleet mode)
  • 8.19 (standalone mode)
  • 8.19 (fleet mode + advanced configuration)
  • 8.13 (fleet mode)
  • 8.13 (standalone mode)
  • 8.13 (fleet mode + advanced configuration)
  • 7.17 (fleet mode)
  • 7.17 (standalone mode)

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@prodsecmachine
Collaborator

prodsecmachine commented Oct 16, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@botelastic botelastic Bot added the triage label Oct 16, 2025
@naemono
Contributor Author

naemono commented Oct 17, 2025

buildkite test this

@pebrc
Collaborator

pebrc commented Oct 17, 2025

On 8.13.0 I ran into:

 Error: fail to read action store '/usr/share/elastic-agent/state/data/action_store.yml': yaml: input error: fail to decode bytes: cipher: message authentication failed         

Also, we need to verify that advanced configuration still works.

@pebrc
Collaborator

pebrc commented Oct 17, 2025

Another issue, on 9.1.2, when upgrading from ECK 3.1 to this PR:

Error: fail to read state store '/usr/share/elastic-agent/state/data/state.enc': failed migrating YAML store JSON store: could not parse YAML
fail to decode bytes: cipher: message authentication failed
Comment thread pkg/controller/agent/pod.go
@pebrc
Collaborator

pebrc commented Oct 17, 2025

The errors I highlighted above all come from the migration from the status quo in 3.1, where CONFIG_PATH is /etc/agent, to the new CONFIG_PATH. Because Agent stores the information needed to decrypt its state in CONFIG_PATH, that information is lost on the migration (or, I guess, on any container restart).

The guidance from @pkoutsovasilis is to force re-enrollment with FLEET_FORCE=true or to develop a migration process that, on the first attempt, deletes fleet.enc and fleet.enc.lock from the state path. We need tracking for the migration (it should only be applied once, when migrating from 3.1 to 3.2 or greater). I am wondering if we could use a marker file in the state path to track this and rely on a shell script. An alternative would be an annotation on the pod itself.

if [[ ! -f "/usr/share/elastic-agent/state/eck.config_migrated" ]]; then
  echo "Attempting to remove fleet.enc and fleet.enc.lock from state path (ignore if not present)"
  rm -f "/usr/share/elastic-agent/state/fleet.enc" "/usr/share/elastic-agent/state/fleet.enc.lock" 2>/dev/null || true
  echo "Creating eck.config_migrated marker"
  touch "/usr/share/elastic-agent/state/eck.config_migrated"
fi

What I don't know yet is how to detect the condition in which we actually need to delete these files. Ideally this should only happen when migrating from 3.1 to >=3.2.

Update: I suspect the migration is only necessary if a user went back and forth between CONFIG_PATH=/etc/agent and CONFIG_PATH=STATE_PATH, as I did during testing. So I think we could provide this script proactively to support, in case users switch back and forth between ECK versions and run into this. But we do not need to add it to the ECK orchestration logic.
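
For anyone who needs to run the cleanup by hand, the marker-file idea can be wrapped in a function that is safe to call repeatedly. This is a sketch only and not part of ECK: the function name migrate_fleet_state and the optional directory argument are hypothetical, while the file names and the default path are the ones quoted in the script above.

```shell
# Hypothetical one-shot cleanup. Removes the Fleet enrollment files once,
# then records a marker so later runs leave the state path untouched.
migrate_fleet_state() {
  state_dir=${1:-/usr/share/elastic-agent/state}
  marker="$state_dir/eck.config_migrated"

  if [ -f "$marker" ]; then
    echo "Marker $marker found, skipping cleanup"
    return 0
  fi

  echo "Removing fleet.enc and fleet.enc.lock from $state_dir (ignored if absent)"
  rm -f "$state_dir/fleet.enc" "$state_dir/fleet.enc.lock"

  echo "Creating eck.config_migrated marker"
  touch "$marker"
}
```

Taking the directory as an argument makes it possible to try the function against a scratch directory before pointing it at a real state path.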

@naemono naemono added the >bug Something isn't working label Oct 17, 2025
@botelastic botelastic Bot removed the triage label Oct 17, 2025
@naemono
Contributor Author

naemono commented Oct 19, 2025

8.13 fleet-mode is throwing errors with this setup, specifically with the fleet-server agent:

Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied

I haven't investigated the issue yet.

Fleet mode on 7.17, 8.18, and 9.x doesn't have the issue.

@naemono
Contributor Author

naemono commented Oct 19, 2025

Testing advanced (9.2.0-snap) doesn't seem to work:

[root@gke-mmontgomery-dev-clus-default-pool-dbf006f3-xyov elastic-agent]# elastic-agent inspect | grep add_resource_metadata
[root@gke-mmontgomery-dev-clus-default-pool-dbf006f3-xyov elastic-agent]#

Is this what you wanted applied, @pebrc? I'm not terribly familiar with this advanced config.

❯ kc get agent -n elastic-stack  eck-stack-eck-agent -o yaml | yq '.spec.config'
providers.kubernetes:
  add_resource_metadata:
    deployment: true
@pebrc
Collaborator

pebrc commented Oct 20, 2025

> Testing advanced (9.2.0-snap) doesn't seem to work:

I think you are right. The implementation from this PR seems to break advanced config mode.
Only if I manually delete state/elastic-agent.yml (rm state/elastic-agent.yml) and restart the agent does it pick up the advanced config specified with the -c flag. This sounds like a bug to me. cc @pkoutsovasilis

The previous implementation shipped with 3.1 declared /etc/agent to be the config directory, so anything we put there was automatically picked up.

@pebrc
Collaborator

pebrc commented Oct 20, 2025

After speaking with @pkoutsovasilis: the relevant code in agent is https://github.com/elastic/elastic-agent/blob/6186951dfbad2a7e0e1a37c26097d5b4d9d38dba/internal/pkg/agent/cmd/container.go#L853-L858

It seems there is a bug in that this code only checks whether the file exists and does not take its contents into account.

We could look into an init container that removes the file if it exists to force agent to copy it again.
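
To make the difference concrete, here is a small shell sketch contrasting a presence-only check with a content-aware one. It illustrates the behaviour described above and is not a translation of the agent's Go code; both function names are hypothetical.

```shell
# Presence-only: copies the config only when the destination is missing,
# so later edits to the source are never propagated.
copy_if_missing() {
  src=$1
  dst=$2
  [ -f "$dst" ] || cp "$src" "$dst"
}

# Content-aware: also copies when the destination exists but differs.
# cmp -s is silent and exits non-zero when the two files differ.
copy_if_changed() {
  src=$1
  dst=$2
  if [ ! -f "$dst" ] || ! cmp -s "$src" "$dst"; then
    cp "$src" "$dst"
  fi
}
```

With the first variant, deleting the destination file is the only way to force a fresh copy, which matches the manual workaround of removing state/elastic-agent.yml before a restart.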

@pebrc
Collaborator

pebrc commented Oct 20, 2025

> We could look into an init container that removes the file if it exists to force agent to copy it again.

This might not be necessary after all. It seems that elastic-agent inspect is not a reliable way to check the configured providers. We probably need to test against the data that is actually ingested to see if it works.

@naemono naemono marked this pull request as ready for review October 20, 2025 15:27
@naemono naemono changed the title WIP: Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode Oct 20, 2025
Collaborator

@pebrc pebrc left a comment


LGTM. I did extensive testing on this one. @rhr323 we need a known issue entry for this one. If users downgrade to 3.1 or earlier and then upgrade again to 3.2, they might run into errors like

Error: fail to read state store '/usr/share/elastic-agent/state/data/state.enc': failed migrating YAML store JSON store: could not parse YAML
fail to decode bytes: cipher: message authentication failed

or

 Error: fail to read action store '/usr/share/elastic-agent/state/data/action_store.yml': yaml: input error: fail to decode bytes: cipher: message authentication failed         

In these cases they should add the FLEET_FORCE=true environment variable to their manifest to force agent to enrol anew (it can be removed once the agent has re-enrolled).
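
A quick way to spot this condition is to look for the decryption failure in the Agent output. The sketch below assumes the error text is the one quoted above and that it shows up in the pod logs; the function name needs_reenroll is hypothetical.

```shell
# Hypothetical check: succeeds when the log text on stdin contains the
# decryption failure seen after switching between ECK 3.1 and 3.2.
needs_reenroll() {
  grep -q 'cipher: message authentication failed'
}

# Possible usage (pod name is a placeholder):
#   kubectl logs <agent-pod> | needs_reenroll \
#     && echo "Consider setting FLEET_FORCE=true until the Agent has re-enrolled"
```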

@rhr323 rhr323 merged commit 63bff02 into elastic:main Oct 21, 2025
9 checks passed
rhr323 pushed a commit to rhr323/cloud-on-k8s that referenced this pull request Oct 21, 2025
…eet mode (elastic#8856)

* Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet
mode.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Add comment

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Maybe add init container.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Fix to be deployment specific

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* remove unneeded call

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Add some logging

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Fix bug

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Add check for existing init container

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Add volume mount

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Remove test code to add init container.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>

* Adjust test expectations to new command

---------

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Co-authored-by: Peter Brachwitz <peter.brachwitz@elastic.co>
(cherry picked from commit 63bff02)
@rhr323
Contributor

rhr323 commented Oct 21, 2025

💚 All backports created successfully

Status Branch Result
3.2

Questions ?

Please refer to the Backport tool documentation

rhr323 added a commit that referenced this pull request Oct 21, 2025
…eet mode (#8856) (#8859)

* Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet
mode.

* Add comment

* Maybe add init container.

* Fix to be deployment specific

* remove unneeded call

* Add some logging

* Fix bug

* Add check for existing init container

* Add volume mount

* Remove test code to add init container.

* Adjust test expectations to new command

---------

(cherry picked from commit 63bff02)

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
Co-authored-by: Peter Brachwitz <peter.brachwitz@elastic.co>
pebrc added a commit that referenced this pull request Oct 23, 2025
)

We ran into issues with Fleet server no longer enrolling with the changes from #8856.
This proposes to version gate the functionality to Elastic agent versions that actually support advanced configuration.
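
The version gate can be pictured as a plain version comparison. The following is a hypothetical shell illustration, not the operator's implementation: the function name version_ge is invented, and the 8.13.0 threshold is taken from the 3.2.0 release note entry for #8869. Pre-release suffixes such as -SNAPSHOT are simply dropped, which is a simplifying assumption.

```shell
# Hypothetical comparison: succeeds when version $1 >= version $2.
# Compares major, minor, and patch numerically.
version_ge() {
  # Drop suffixes such as -SNAPSHOT, then pad so three fields always exist.
  a="${1%%-*}.0.0"
  b="${2%%-*}.0.0"
  for i in 1 2 3; do
    x=$(printf '%s\n' "$a" | cut -d. -f"$i")
    y=$(printf '%s\n' "$b" | cut -d. -f"$i")
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
  done
  return 0
}

# Possible usage:
#   if version_ge "$AGENT_VERSION" 8.13.0; then
#     echo "apply the shared CONFIG_PATH/STATE_PATH logic"
#   fi
```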
pebrc added a commit to pebrc/cloud-on-k8s that referenced this pull request Oct 23, 2025
…astic#8869)

We ran into issues with Fleet server no longer enrolling with the changes from elastic#8856.
This proposes to version gate the functionality to Elastic agent versions that actually support advanced configuration.

(cherry picked from commit 70d91eb)
rhr323 pushed a commit that referenced this pull request Oct 23, 2025
) (#8873)

We ran into issues with Fleet server no longer enrolling with the changes from #8856.
This proposes to version gate the functionality to Elastic agent versions that actually support advanced configuration.

(cherry picked from commit 70d91eb)
@rhr323 rhr323 added the v3.2.0 label Oct 27, 2025
@rhr323 rhr323 changed the title Consistently mount CONFIG_PATH and STATE_PATH to same directory in Fleet mode Oct 27, 2025
@rhr323 rhr323 mentioned this pull request Oct 27, 2025
@rhr323 rhr323 changed the title Set CONFIG_PATH and STATE_PATH to the same directory in Fleet mode Oct 30, 2025
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Oct 31, 2025
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [eck-operator](https://github.com/elastic/cloud-on-k8s) | minor | `3.1.0` -> `3.2.0` |

---

### Release Notes

<details>
<summary>elastic/cloud-on-k8s (eck-operator)</summary>

### [`v3.2.0`](https://github.com/elastic/cloud-on-k8s/releases/tag/v3.2.0)

[Compare Source](elastic/cloud-on-k8s@v3.1.0...v3.2.0)

### Elastic Cloud on Kubernetes 3.2.0

- [Quickstart guide](https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s#eck-quickstart)

##### Release Highlights

##### Automatic pod disruption budget (Enterprise feature)

ECK now offers better out-of-the-box PodDisruptionBudgets that automatically keep your cluster available as Pods move across nodes. The new policy calculates the number of Pods per tier that can sustain replacement and automatically generates a PodDisruptionBudget for each tier, enabling the Elasticsearch cluster to vacate Kubernetes nodes more quickly, while considering cluster health, without interruption.

##### User Password Generation (Enterprise feature)

ECK will now generate longer passwords by default for the administrative user of each Elasticsearch cluster. The password is 24 characters in length by default (can be configured to a maximum of 72 characters), incorporating alphabetic and numeric characters, to make password complexity stronger.

##### Features and enhancements

- Enable certificate reloading for stack monitoring Beats [#8833](elastic/cloud-on-k8s#8833) (issue: [#5448](elastic/cloud-on-k8s#5448))
- Allow configuration of file-based password character set and length [#8817](elastic/cloud-on-k8s#8817) (issues: [#2795](elastic/cloud-on-k8s#2795), [#8693](elastic/cloud-on-k8s#8693))
- Automatically set GOMEMLIMIT based on cgroups memory limits [#8814](elastic/cloud-on-k8s#8814) (issue: [#8790](elastic/cloud-on-k8s#8790))
- Introduce granular PodDisruptionBudgets based on node roles [#8780](elastic/cloud-on-k8s#8780) (issue: [#2936](elastic/cloud-on-k8s#2936))

##### Fixes

- Gate advanced Fleet config logic to Agent v8.13 and later [#8869](elastic/cloud-on-k8s#8869)
- Ensure Agent configuration and state persist across restarts in Fleet mode [#8856](elastic/cloud-on-k8s#8856) (issue: [#8819](elastic/cloud-on-k8s#8819))
- Do not set credentials label on Kibana config secret [#8852](elastic/cloud-on-k8s#8852) (issue: [#8839](elastic/cloud-on-k8s#8839))
- Allow elasticsearchRef.secretName in Kibana helm validation [#8822](elastic/cloud-on-k8s#8822) (issue: [#8816](elastic/cloud-on-k8s#8816))

##### Documentation improvements

- Update Logstash recipes from to filestream input [#8801](elastic/cloud-on-k8s#8801)
- Recipe for exposing Fleet server to outside of the Kubernetes cluster [#8788](elastic/cloud-on-k8s#8788)
- Clarify secretName restrictions [#8782](elastic/cloud-on-k8s#8782)
- Update ES\_JAVA\_OPTS comments and explain auto-heap behavior [#8753](elastic/cloud-on-k8s#8753)

##### Dependency updates

- github.com/gkampitakis/go-snaps v0.5.13 => v0.5.15
- github.com/hashicorp/vault/api v1.20.0 => v1.22.0
- github.com/KimMachineGun/automemlimit => v0.7.4
- github.com/prometheus/client\_golang v1.22.0 => v1.23.2
- github.com/prometheus/common v0.65.0 => v0.67.1
- github.com/sethvargo/go-password v0.3.1 => REMOVED
- github.com/spf13/cobra v1.9.1 => v1.10.1
- github.com/spf13/pflag v1.0.6 => v1.0.10
- github.com/spf13/viper v1.20.1 => v1.21.0
- github.com/stretchr/testify v1.10.0 => v1.11.1
- golang.org/x/crypto v0.40.0 => v0.43.0
- k8s.io/api v0.33.2 => v0.34.1
- k8s.io/apimachinery v0.33.2 => v0.34.1
- k8s.io/client-go v0.33.2 => v0.34.1
- k8s.io/utils v0.0.0-20241104100929-3ea5e8cea738 => v0.0.0-20250604170112-4c0f3b243397
- sigs.k8s.io/controller-runtime v0.21.0 => v0.22.2
- sigs.k8s.io/controller-tools v0.18.0 => v0.19.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/1911
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>

Labels

>bug Something isn't working v3.2.0

4 participants