
Avoid uninstalling and re-installing service components on policy change#11740

Merged
ycombinator merged 39 commits into elastic:main from ycombinator:service-component-avoid-stop-start
Jan 6, 2026

Conversation

@ycombinator
Contributor

@ycombinator ycombinator commented Dec 11, 2025

What does this PR do?

This PR identifies Service Runtime components by their input type alone; the output ID is no longer used.

Why is it important?

Service Runtime components are intended to be kept running (via a service) for as long as possible. We should only install/start or uninstall/stop them if they are being explicitly added or removed, respectively, from the component model. If only their configuration is being updated, we should keep the component running.

If a component's ID changes between the last and current component models, Elastic Agent will ask the component's service to uninstall and then reinstall itself. Prior to this PR, service components' IDs were determined by their input type and output ID. Therefore, if a service component's output was changed, the service would be uninstalled and then reinstalled. This is undesirable behavior, as services should be kept running as long as possible.

With the changes in this PR, we no longer consider the output ID when generating service components' IDs. If a service component's output is changed, its ID remains the same between the last and current component models. Elastic Agent does not uninstall and reinstall the component's service but simply passes the configuration change to it (which it was doing prior to this PR anyway).

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None.

How to test this PR locally

Policy reassign does not uninstall/reinstall Endpoint

  1. Using the Fleet UI, create three Agent policies:
    • default: containing only the system integration
    • tp-es: containing the Elastic Defend integration, with tamper protection enabled, and using the Elasticsearch output.
    • tp-ls: containing the Elastic Defend integration, with tamper protection enabled, and using the Logstash output. Note that you will need to create the Logstash output in Fleet > Settings.
  2. Enroll an Elastic Agent in the tp-es policy and verify the agent is healthy and shipping data.
  3. Assign the Agent to the tp-ls policy.
  4. Check the Agent logs and make sure the Endpoint component is not being uninstalled and reinstalled. Concretely, check that there is no log entry for uninstall endpoint service.
  5. Check the Endpoint logs (located under /opt/Elastic/Endpoint/state/log/ on Linux) and make sure that Endpoint has connected to Logstash (or has attempted to and failed if there is no actual Logstash endpoint listening).

Removing Endpoint from policy uninstalls Endpoint

  1. Assign the Agent to the default policy.
  2. Check the Agent logs and make sure the Endpoint component is stopped and uninstalled. Concretely, check that there is a log entry for stopping endpoint service runtime, followed by uninstall endpoint service, followed by Stopped: endpoint service runtime.

Related issues

@mergify
Contributor

mergify bot commented Dec 11, 2025

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.
@ycombinator ycombinator changed the title Avoid stopping and stopping service components on policy change Dec 12, 2025
@ycombinator ycombinator force-pushed the service-component-avoid-stop-start branch from 84a4523 to 1951fec Compare December 12, 2025 14:53
@ycombinator ycombinator marked this pull request as ready for review December 12, 2025 14:54
@ycombinator ycombinator requested a review from a team as a code owner December 12, 2025 14:54
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Dec 12, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Copy link
Member

cmacknz commented Dec 15, 2025

Dropping the output ID makes sense to me, but I think if you do a policy assignment the input ID also changes, which would cause an uninstall and reinstall.

Does the policy reassignment case work properly for tamper protected agents? Or is this only handling the case where the output is changed within the same policy?

@ycombinator
Contributor Author

Dropping the output ID makes sense to me, but I think if you do a policy assignment the input ID also changes, which would cause an uninstall and reinstall.

As discussed in today's meeting, the component ID uses the input type (not input ID) so, with the change in this PR, we should still prevent an uninstall and reinstall.

@ycombinator ycombinator changed the title Avoid starting and stopping service components on policy change Dec 16, 2025
@ycombinator
Contributor Author

ycombinator commented Dec 16, 2025

Does the policy reassignment case work properly for tamper protected agents? Or is this only handling the case where the output is changed within the same policy?

I will test this along with the upgrade scenario.

Yes, it does. I forgot that that's how I'd tested this PR to begin with. I even added manual testing steps to the PR's description. 🤦

I will test the upgrade scenario next.

@ycombinator
Contributor Author

I will test the upgrade scenario next.

Okay, so upgrading works in that nothing is broken after upgrade. Here's the output of elastic-agent status before, during, and after the upgrade:

Before

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: a4d2bed3-5c18-42d0-bc55-cc838f2a43a6
   │  ├─ version: 9.2.0-SNAPSHOT
   │  └─ commit: 07d5c3a34e486f81130d7177678c547944dbb84c
   └─ endpoint-default
      ├─ status: (HEALTHY) Healthy: communicating with endpoint service
      ├─ endpoint-default
      │  ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
      │  └─ type: OUTPUT
      └─ endpoint-default-a92a4f1c-8773-43ce-8e97-062fc2cebb1f
         ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
         └─ type: INPUT

During

┌─ fleet
│  └─ status: (STARTING)
├─ elastic-agent
│  ├─ status: (HEALTHY) Running
│  ├─ info
│  │  ├─ id: 6c6af8b5-7e88-42d0-8d71-70afbcfa39f1
│  │  ├─ version: 9.3.0
│  │  └─ commit: 7026d24f297cdcd6826da98ec3c39ea7f0c59c19
│  └─ endpoint
│     ├─ status: (HEALTHY) Healthy: communicating with endpoint service
│     ├─ endpoint
│     │  ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
│     │  └─ type: OUTPUT
│     └─ endpoint-a92a4f1c-8773-43ce-8e97-062fc2cebb1f
│        ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
│        └─ type: INPUT
└─ upgrade_details
   ├─ target_version: 9.3.0
   ├─ state: UPG_WATCHING
   ├─ action_id: 349b43f0-973b-4b40-ab21-ee082debcfaf
   └─ metadata

After

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 6c6af8b5-7e88-42d0-8d71-70afbcfa39f1
   │  ├─ version: 9.3.0
   │  └─ commit: 7026d24f297cdcd6826da98ec3c39ea7f0c59c19
   └─ endpoint
      ├─ status: (HEALTHY) Healthy: communicating with endpoint service
      ├─ endpoint
      │  ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
      │  └─ type: OUTPUT
      └─ endpoint-a92a4f1c-8773-43ce-8e97-062fc2cebb1f
         ├─ status: (HEALTHY) Applied policy 'defend' (Defend policy rev. 1. Agent policy rev. 21.)
         └─ type: INPUT

Notice that the ID of the Endpoint component after the upgrade changes to `endpoint`.  Specifically, the ID no longer has the output ID, `default`, as the suffix.
@cmacknz cmacknz added the backport-9.3 Automated backport to the 9.3 branch label Dec 17, 2025
@cmacknz
Member

cmacknz commented Dec 17, 2025

What's the right additional testing to add for this? It looks correct by inspection, but I think we should add something proving it does what we think.

If we could get a simulated policy reassignment (input id: key change) and output change as a test of just the component model that's probably the lightest thing we could do.

If policy reassignment always fails without this change, that's a test case that is missing from the endpoint integration tests, so we could add that. I also don't think we have a test that does an upgrade with defend installed; that is even heavier, but we have had lots of problems around this discovered in the field, so it may be worth it.

@ycombinator ycombinator force-pushed the service-component-avoid-stop-start branch 2 times, most recently from e892133 to 6fadd3c Compare December 22, 2025 04:49
@ycombinator
Contributor Author

ycombinator commented Dec 22, 2025

If we could get a simulated policy reassignment (input id: key change) and output change as a test of just the component model that's probably the lightest thing we could do.

Added a unit test case for input ID change to TestComponentUpdateDiff in cf89151.

A unit test that tests just changing the output already exists in TestComponentUpdateDiff:

But I added additional assertions for that test case to make sure no components were added, removed, or updated in the component model as a result of such a change: 4e352a7#diff-21de8fb82a5ebccaa0ca3afb0cb8bfbe10d5b8f920021785c2d485ee35a3556cR225-R227

@ycombinator
Contributor Author

ycombinator commented Dec 22, 2025

If policy reassignment always fails without this change, that's a test case that is missing from the endpoint integration tests, so we could add that.

Added in 8a1a8c3

@ycombinator ycombinator force-pushed the service-component-avoid-stop-start branch from 13f5ff1 to 8a1a8c3 Compare December 22, 2025 07:13
@ycombinator
Contributor Author

I also don't think we have a test that does an upgrade with defend installed...

Looks like we do have some integration tests around upgrading Endpoint:

func TestUpgradeAgentWithTamperProtectedEndpoint_DEB(t *testing.T) {
func TestUpgradeAgentWithTamperProtectedEndpoint_RPM(t *testing.T) {

@ycombinator ycombinator requested a review from cmacknz December 22, 2025 07:15
@ycombinator ycombinator force-pushed the service-component-avoid-stop-start branch from 0bac7f9 to f70ec1f Compare December 30, 2025 09:12
@ycombinator ycombinator force-pushed the service-component-avoid-stop-start branch from f70ec1f to 284dcb8 Compare January 6, 2026 05:38
@ycombinator ycombinator enabled auto-merge (squash) January 6, 2026 05:38
@ycombinator
Contributor Author

@cmacknz Thanks for the review. I see you added the backport-9.3 label on this PR. Being a bug fix, should we backport it to all active branches?

@ycombinator ycombinator disabled auto-merge January 6, 2026 07:21
@ycombinator ycombinator enabled auto-merge (squash) January 6, 2026 07:21
@elasticmachine
Contributor

💛 Build succeeded, but was flaky


cc @ycombinator

@ycombinator ycombinator merged commit c8deb6d into elastic:main Jan 6, 2026
22 checks passed
@ycombinator ycombinator deleted the service-component-avoid-stop-start branch January 6, 2026 08:11
mergify bot pushed a commit that referenced this pull request Jan 6, 2026
…nge (#11740)

* Add UsesCommandRuntime and UsesServiceRuntime methods on Component

* Use new methods

* Add test case for only output being changed on service component

* Implement logic to not remove and add same service component

* Adding CHANGELOG fragment

* Improve comment

* Fix logic location

* Update unit test

* Update service component naming

* Refactor: extract logic into helper method

* Relocate unit test and add lots of cases

* Remove unnecessary code

* Clarify comments

* Remove unnecessary unit test

* Undo unnecessary changes

* Update component ID in integration test

* Add assertions on lengths of components added, removed, updated

* Add test case for only input ID changing

* Add integration test: TestPolicyReassignWithTamperProtectedEndpoint

* Update replace in go.mod

* Bump up context timeout and use for entire test

* Define fixture

* Fix syntax errors

* Fix installOpts

* Only cleanup Endpoint using first policy's uninstall token until successful policy reassignment

* Clarify log message

* Upgrade endpoint package version

* Use exec.CommandContext and separate out args

* Compare Endpoint policy IDs

* Use agentID from enrollment response

* Install Elastic Defend in second policy

* Add endpoint cleanup after reassigning policy

* Fixing log messages

* Give Endpoint time to receive reassigned policy

* Updating dependency version

* Adding log statements

* Remove replace

* Remove duplicate CHANGELOG fragment

* Remove PID checks

(cherry picked from commit c8deb6d)
ycombinator added a commit that referenced this pull request Jan 6, 2026
…nge (#11740) (#12100)


(cherry picked from commit c8deb6d)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
@cmacknz
Member

cmacknz commented Jan 7, 2026

@cmacknz Thanks for the review. I see you added the backport-9.3 label on this PR. Being a bug fix, should we backport it to all active branches?

I think we need to release this in 9.3 first, and if there are no unintended problems once we get through the 9.3 testing cycle we can backport.

The difference between 9.3 and 9.2+9.1 is time to release: if we had backported to the already-released minors, this would have shipped very quickly after merge with no soak time.

@ycombinator
Copy link
Contributor Author

I think we need to release this in 9.3 first, and if there are no unintended problems once we get through the 9.3 testing cycle we can backport.

@cmacknz I think we're good to backport this PR now?

@cmacknz
Copy link
Member

cmacknz commented Feb 17, 2026

Possibly, 9.3.0 has not been available for that long yet. Unless someone asks us to backport this I would leave it in 9.3 only to be conservative.


Labels

backport-9.3 Automated backport to the 9.3 branch Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

3 participants