queries are not processed if frontend connects to scheduler too early

Describe the bug
We observe that quite often, when the scheduler restarts, querying stops working.
We run Loki in microservice mode with use_scheduler_ring = true on Nomad.

After adding a lot of log statements to Loki, we observed that the frontend reconnects to the scheduler before the scheduler's shouldRun state is set to true. That state is initially false because we use use_scheduler_ring = true. For some timing reason (memberlist state?) the frontend reconnects before the flag flips, which causes the scheduler to skip its response to the frontend's INIT.
As a result, the frontend starts piling up in-progress query requests until they time out.

As I haven't found any existing issue about this yet and it happens quite often, there might be something wonky in our config. Still, I think this can be considered a bug, as the system locks up in a strange, unresponsive way.

To Reproduce
I'm still struggling to reproduce this deterministically, as I have not yet understood the full memberlist/scheduler ring/frontend/scheduler connect protocol. Happy to get some pointers here.

Expected behavior
The scheduler restarts, the frontend reconnects, and the frontend continues scheduling queries.

Environment:

Loki 3.5.7 (also experienced earlier), microservice mode, memberlist, use_scheduler_ring
Infrastructure: Nomad, Consul, Vault
Deployment tool: Nomad Job files :)

queries are not processed if frontend connects to scheduler too early #19528
