Report recent tasks updates when master starved #139518
DaveCTurner merged 5 commits into elastic:main
Today if the elected master is unable to clear its queue for too long we
log the warning `pending task queue has been nonempty for [${DURATION}]`
but it can be challenging to determine what is keeping it busy like
this. With this commit we add some simple tracking of recent cluster
state updates and a log message to report the updates executed recently.
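The kind of tracking described above can be sketched as a small bounded history of recently executed updates. This is only an illustration under stated assumptions: the class `RecentTaskHistory`, its capacity, and the `describe` truncation format are hypothetical names and choices made for this example, not the code added by this PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: remembers descriptions of the most recent executed updates. */
class RecentTaskHistory {
    private final int capacity;
    private final Deque<String> entries = new ArrayDeque<>();

    RecentTaskHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one executed update, evicting the oldest entry when the history is full. */
    synchronized void record(String description) {
        if (entries.size() == capacity) {
            entries.removeFirst();
        }
        entries.addLast(description);
    }

    /** Returns a copy of the history in execution order, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(entries);
    }

    /**
     * Joins the history with ", ". Stops appending once the output has reached
     * maxChars and notes how many entries were left out (format is an assumption).
     */
    String describe(int maxChars) {
        List<String> items = snapshot();
        StringBuilder sb = new StringBuilder();
        int appended = 0;
        for (String item : items) {
            if (sb.length() >= maxChars) {
                break;
            }
            if (appended > 0) {
                sb.append(", ");
            }
            sb.append(item);
            appended++;
        }
        if (appended < items.size()) {
            sb.append(", ... (").append(items.size() - appended).append(" omitted)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RecentTaskHistory history = new RecentTaskHistory(3);
        history.record("HIGH unbatched task-queue-1");
        history.record("HIGH unbatched task-queue-2");
        history.record("HIGH unbatched task-queue-1");
        history.record("HIGH unbatched task-queue-1"); // evicts the oldest entry
        System.out.println("recently executed: " + history.describe(200));
    }
}
```

Keeping the history bounded means the tracking cost stays constant however long the master remains busy, and capping the rendered length keeps the resulting log line from growing without limit.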
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @DaveCTurner, I've created a changelog YAML for you.
        maxTaskWaitTime.millis()
    );

    if (logger.isInfoEnabled()) {
Why in a separate log line and not with the warning above?
I expect we might want to filter this one out separately (it could be quite long), and I believe we have dashboards looking at the warning, so I didn't want to change it too much either.
    Strings.collectionToDelimitedStringWithLimit(
        (Iterable<String>) (() -> Iterators.map(executionHistory.iterator(), ExecutionHistoryEntry::getDescription)),
        ", ",
        MAX_TASK_DESCRIPTION_CHARS,
        descriptionBuilder
This is nice, thanks!
I expect we'll see a bunch of duplicate lines. We might be able to get deeper history if we collected runs of the same record into a single record + count line?
Hmm yes, that's true, though then we would lose the ordering, which I think is going to be more informative in many cases.
I'll proceed with this for now, and we can follow up with a change to report counts grouped by queue name if it turns out it's still needed.
Just to be clear, I had in mind to collect runs together in order to keep the ordering, rather than producing only a task/count table, e.g.:
1-20: HIGH unbatched task-queue-1,
21: HIGH unbatched task-queue-2,
22-33: HIGH unbatched task-queue-1,
...
But yes, we can see if that would be helpful later.
Ah ok I see. I opened #139555 to do that. I suspect in the case of shard allocation it's not that useful as we'll be going round a loop of different tasks (allocate a shard and then mark the shard as started) but yes it might be nicer in other cases.
Following elastic#139518, this commit groups together consecutive equal entries in the log to represent the same information more densely.
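For illustration, grouping consecutive equal entries while preserving their order could look roughly like the sketch below. It mirrors the `1-20: ...` format suggested in the review thread; the class `RunGrouping` and the method `groupRuns` are hypothetical names for this example and do not necessarily reflect what #139555 implements.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collapses runs of equal entries into "first-last: entry" lines. */
class RunGrouping {

    /**
     * Walks the entries in order and emits one line per run of consecutive equal
     * entries, using 1-based positions. Entries are assumed to be non-null.
     */
    static List<String> groupRuns(List<String> entries) {
        List<String> grouped = new ArrayList<>();
        int start = 0;
        while (start < entries.size()) {
            int end = start;
            // Extend the run while the next entry equals the first entry of the run.
            while (end + 1 < entries.size() && entries.get(end + 1).equals(entries.get(start))) {
                end++;
            }
            String range = start == end
                ? Integer.toString(start + 1)
                : (start + 1) + "-" + (end + 1);
            grouped.add(range + ": " + entries.get(start));
            start = end + 1;
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> history = List.of(
            "HIGH unbatched task-queue-1",
            "HIGH unbatched task-queue-1",
            "HIGH unbatched task-queue-2",
            "HIGH unbatched task-queue-1"
        );
        for (String line : groupRuns(history)) {
            System.out.println(line);
        }
    }
}
```

Because only adjacent duplicates are merged, the ordering of distinct tasks survives, which is the property the thread above wanted to keep. As noted there, an alternating pattern such as allocate-then-start produces no runs longer than one, so this grouping would not shorten that case.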