Batch ILM policy cluster state updates [#122917]#126529
Batch ILM policy cluster state updates [#122917]#126529lukewhiting merged 8 commits intoelastic:mainfrom
Conversation
There was a problem hiding this comment.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java:354
- The 'logger' reference in the IlmLifecycleExecutor inner class is not clearly declared within its scope; please ensure it is properly defined or referenced from the outer class to avoid compilation issues.
logger.trace("Executed lifecycle policy update:\n{}", task.request.getPolicy().toString());
| "put-lifecycle-" + request.getPolicy().getName(), | ||
| new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client) | ||
| new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client), | ||
| null |
There was a problem hiding this comment.
I think an infinite timeout is the right call here but would be good to get options
There was a problem hiding this comment.
The current behavior seems to be using request.masterNodeTimeout(). We submit the unbatched task using the timeout from the task here:
and the
UpdateLifecyclePolicyTask initializes the timeout here:There was a problem hiding this comment.
Right yes that makes sense as as it's triggered from an upstream request we should mirror that. Have switched the code to use the timeout on the task object.
| taskContext.success(() -> taskSucceeded(task, taskResult)); | ||
| Runnable successFunction = () -> taskSucceeded(task, taskResult); | ||
| if (task instanceof ClusterStateAckListener ackListenerTask) { | ||
| taskContext.success(successFunction, ackListenerTask); |
There was a problem hiding this comment.
This is to work around the assertion at
There was a problem hiding this comment.
I didn't check the rest of your PR properly yet, so maybe this is premature, but we also have SimpleBatchedAckListenerTaskExecutor. Wouldn't that work instead - avoiding these changes?
There was a problem hiding this comment.
Didn't spot that class but yes that works nicely :-) Have refactored it to use the new class and the ILM tests seem to be happy about it.
There was a problem hiding this comment.
Apologies for not mentioning that before when I suggested using SimpleBatchedExecutor...
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Hi @lukewhiting, I've created a changelog YAML for you. |
|
Hi @lukewhiting, I've updated the changelog YAML for you. |
| try { | ||
| return Tuple.tuple(task.execute(clusterState), task); | ||
| } catch (Exception e) { | ||
| throw new ElasticsearchException("failed to execute task", e); | ||
| } |
There was a problem hiding this comment.
Is there a reason you added the try-catch block here? The SimpleBatchedAckListenerTaskExecutor already catches all exceptions and calls onFailure here:
It looks to me like that's what we want, but maybe I'm missing something.
There was a problem hiding this comment.
No just my IDE being fussy and not correctly adding the throws block to the method it generated for the interface -.- Have updated it now.
| private static class IlmLifecycleExecutor extends SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask> { | ||
|
|
||
| @Override | ||
| public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState) | ||
| throws Exception { | ||
| return Tuple.tuple(task.execute(clusterState), task); | ||
| } | ||
|
|
||
| } |
There was a problem hiding this comment.
Since this class is so small now, I think we could also make it a lambda in the constructor, but if you prefer this version, I'm also ok with that - no strong opinion here.
There was a problem hiding this comment.
SimpleBatchedAckListenerTaskExecutor is an abstract rather than an @FunctionalInterface so I don't think we can go as a simple as a Lambda here. We could use an inline anonymous class here but my preference is always named classes for stuff like this as it improved readability but could be persuaded otherwise for something this small...
There was a problem hiding this comment.
Ah yeah you're right. No then I definitely prefer this named class as well 👍
|
When this change goes in, I'm going to close #111431, #111632, and #111662. Those failures seemed to be caused by slow cluster startup and this PR should reduce that. I'll close them manually as I don't want them to be linked in the changelog - and this PR is only indirectly addressing them at best. |
nielsbauman
left a comment
There was a problem hiding this comment.
LGTM, thanks for working on this, Luke! An easy but impactful win :) 🚀
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
* Use a task queue to ensure ILM policy change cluster state updates are batched * Update docs/changelog/126529.yaml * Update docs/changelog/126529.yaml * Switch to using SimpleBatchedAckListenerTaskExecutor * Get timeout from request * Ditch the try-catch (cherry picked from commit 7d7fa76) # Conflicts: # x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java
Switches ILM PUT lifecycle action to use a batched task queue for executing cluster state updates.
This isn't an optimal solution as it could use less heap space by not creating a new cluster state for each change but it should greatly speed up execution time when performing many ILM updates such as in tests without being significantly worse than current implementation on heap usage.
Future Enhancements:
Refactor the task to not use the
ClusterStateAckListenerbut instead use a cluster state builder as it's input / output of the execute function. This would allow us to combine multiple updates while only using a single builder object.This would need a fair bit of refactoring, both upstream and downstream of this class (For instance a new equivalent of
SimpleBatchExecutorthat works on builders instead of cluster states.Fixes #122917