Report recent tasks updates when master starved #139518

Merged
DaveCTurner merged 5 commits into elastic:main from
DaveCTurner:2025/12/15/MasterService-execution-history
Dec 15, 2025

Conversation

@DaveCTurner
Contributor

Today, if the elected master is unable to clear its queue for too long, we
log the warning `pending task queue has been nonempty for [${DURATION}]`,
but it can be challenging to determine what is keeping it busy like
this. With this commit we add some simple tracking of recent cluster
state updates and a log message to report the updates executed recently.
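As a rough illustration of the approach the description outlines, a bounded, truncating history of recent update descriptions might look like the following minimal sketch. All names here (`RecentTaskHistory`, `record`, `describe`) are hypothetical, not the PR's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a bounded history of recent cluster state update
// descriptions; names are illustrative, not the PR's actual code.
final class RecentTaskHistory {
    private final int maxEntries;
    private final Deque<String> entries = new ArrayDeque<>();

    RecentTaskHistory(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Record the description of a task that just executed, evicting the
    // oldest entry once the history is full so memory stays bounded.
    synchronized void record(String description) {
        if (entries.size() == maxEntries) {
            entries.removeFirst();
        }
        entries.addLast(description);
    }

    // Render the history oldest-first, truncating once maxChars is reached
    // so a starved-master log line cannot grow without bound.
    synchronized String describe(int maxChars) {
        StringBuilder sb = new StringBuilder();
        for (String entry : entries) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            if (sb.length() + entry.length() > maxChars) {
                sb.append("...");
                break;
            }
            sb.append(entry);
        }
        return sb.toString();
    }
}
```

The eviction on `record` keeps the window to the most recent updates, which is what matters when diagnosing why the queue has stayed nonempty.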
@DaveCTurner added the >enhancement, :Distributed/Cluster Coordination, Supportability, and v9.3.0 labels on Dec 15, 2025
@elasticsearchmachine added the Team:Distributed Coordination (obsolete) label on Dec 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner requested a review from a team as a code owner December 15, 2025 12:18
maxTaskWaitTime.millis()
);

if (logger.isInfoEnabled()) {
Member

Why in a separate log line and not with the warning above?

Contributor Author

I expect we might want to filter this one out separately (it could be quite long), and I believe we have dashboards looking at the warning, so I didn't want to change it too much either.

Contributor

@bcully left a comment

LGTM

Comment on lines +1212 to +1216
Strings.collectionToDelimitedStringWithLimit(
(Iterable<String>) (() -> Iterators.map(executionHistory.iterator(), ExecutionHistoryEntry::getDescription)),
", ",
MAX_TASK_DESCRIPTION_CHARS,
descriptionBuilder
Contributor

This is nice, thanks!

I expect we'll see a bunch of duplicate lines. We might be able to get deeper history if we collected runs of the same record into a single record + count line?

Contributor Author

Hmm, yes, that's true, though then we would lose the ordering, which I think is going to be more informative in many cases.

I'll proceed with this for now, and we can follow up with a change to report counts grouped by queue name if it turns out it's still needed.

Contributor

Just to be clear, I had in mind to collect runs together in order to keep the ordering, rather than producing only a task/count table, e.g.:

1-20: HIGH unbatched task-queue-1,
21: HIGH unbatched task-queue-2,
22-33: HIGH unbatched task-queue-1,
...

But yes, we can see if that would be helpful later.
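For illustration only, run grouping that preserves ordering along the lines of the example above might look like this sketch; `RunGrouper` and `groupRuns` are hypothetical names, not the code eventually added in the follow-up:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of order-preserving run grouping: consecutive equal
// task descriptions collapse into one "start-end: description" entry, as in
// the example above. Not the actual code from the follow-up change.
final class RunGrouper {
    static List<String> groupRuns(List<String> descriptions) {
        List<String> grouped = new ArrayList<>();
        int runStart = 0; // index where the current run of equal entries began
        for (int i = 1; i <= descriptions.size(); i++) {
            boolean runEnded = i == descriptions.size()
                || descriptions.get(i).equals(descriptions.get(runStart)) == false;
            if (runEnded) {
                // Render 1-based positions: a run of length one is "N: desc",
                // a longer run is "M-N: desc".
                String positions = (runStart + 1 == i)
                    ? Integer.toString(i)
                    : (runStart + 1) + "-" + i;
                grouped.add(positions + ": " + descriptions.get(runStart));
                runStart = i;
            }
        }
        return grouped;
    }
}
```

Because only adjacent duplicates are merged, interleaved sequences such as alternating allocate/mark-started tasks would still appear as separate runs, which matches the concern raised below about shard allocation loops.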

Contributor Author

Ah, OK, I see. I opened #139555 to do that. I suspect that in the case of shard allocation it's not that useful, since we'll be going round a loop of different tasks (allocate a shard, then mark the shard as started), but yes, it might be nicer in other cases.

@DaveCTurner DaveCTurner merged commit 082205e into elastic:main Dec 15, 2025
35 checks passed
@DaveCTurner DaveCTurner deleted the 2025/12/15/MasterService-execution-history branch December 15, 2025 17:49
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 15, 2025
Following elastic#139518, this commit groups together consecutive equal entries
in the log to represent the same information more densely.
parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this pull request Dec 17, 2025
DaveCTurner added a commit that referenced this pull request Jan 8, 2026
Following #139518, this commit groups together consecutive equal entries
in the log to represent the same information more densely.
