Clear agent.upgrade_attempts on upgrade complete#4528
Conversation
|
This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
|
pchila
left a comment
There was a problem hiding this comment.
Code looks sensible, holding off the approval until we have a green CI run (it seems we are getting some errors from ECH when creating a stack and 503s when trying to clean it up (not sure if it's related to the failed creation)
michel-laterman
left a comment
There was a problem hiding this comment.
lgtm, merge when CI is green
|
There is a test you should update to check that this field is properly reset: fleet-server/internal/pkg/api/handleCheckin_test.go Lines 334 to 354 in ba68a24 |
changelog/fragments/1740760469-Clear-agent-upgrade-attempts-when-upgrade-complete.yaml
Outdated
Show resolved
Hide resolved
|
Upon more testing with elastic/kibana#212744, this is not behaving as expected as |
|
OK, my original approach didn't do what is expected, i.e. only reset I have changed the approach to reset @michel-laterman Would you please be able to review this approach? If it looks OK, I will update tests. Edit: |
changelog/fragments/1740760469-Clear-agent-upgrade-attempts-when-upgrade-complete.yaml
Outdated
Show resolved
Hide resolved
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
|
|
@michel-laterman could you approve again? had to fix the sonarqube failure |
## Summary Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 ### Approach - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. ### Testing The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Relates elastic/ingest-dev#4720 This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in elastic#211019. Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete.: elastic/fleet-server#4528 - A new `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in elastic/elasticsearch#123256). - When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' `upgrade_attempts`. - The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting. - On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. elastic/ingest-dev#4720 (comment)). - Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried. The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23). In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416)
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
* Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>
* Clear upgrade_attempts on handleAck (#4762) * clear upgrade_attempts on handleAck * clear upgrade_attempts if upgrade_details is missing * added unit test (cherry picked from commit fb093cc) * Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777) * Clear agent.upgrade_attemps on upgrade complete * This actually works * Silence nolintlint error in handleCheckin.go * Remove nolint comment altogether * Add changelog * Update handleCheckin unit test * Change approach * Revert unit test change * This seems needed * Run make generate * Remove internal link * add unit test * reduce complexity * return nil if action is nil --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: Julia Bardi <julia.bardi@elastic.co> (cherry picked from commit 2b40416) Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co> --------- Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Jill Guyonnet <jill.guyonnet@elastic.co>




What is the problem this PR solves?
elastic/kibana#212744 adds retry logic to the task that automatically ugprades agents. Agents that were upgraded through this task have their new
upgrade_attemptsproperty populated. It is missing a way to clear this property when the upgrade completes successfully.How does this PR solve the problem?
The change in this PR clears
upgrade_attemptswhen the upgrade details of the agent get intoUPG_WATCHINGstate and are processed inhandleCheckin.How to test this PR locally
This should be tested alongside elastic/kibana#212744 (or after it is merged - this is fine, since automatic upgrades are currently behind the
enableAutomaticAgentUpgradesfeature flag). With this change, agents upgraded through the automatic upgrade task should have theirupgrade_attemptsproperty set tonullwhen the upgrade is successful.Testing should also validate that
upgrade_attemptsstays set if the upgrade failed, e.g. after requesting an upgrade to an invalid version.Design Checklist
Checklist
./changelog/fragmentsusing the changelog toolRelated issues
Relates https://github.com/elastic/ingest-dev/issues/4720