Description
Describe the bug
We observe that quite often, when the scheduler restarts, querying stops working.
We run Loki in microservice mode with use_scheduler_ring = true on Nomad.
After adding a lot of log statements to Loki, we observed that the frontend reconnects to the scheduler before the scheduler's shouldRun state is set to true. That state is initially false because we use use_scheduler_ring = true. For some timing reason (memberlist state?) the frontend reconnects before that point, which causes the scheduler to skip its response to the frontend's INIT.
As a result, the frontend starts piling up in-progress query requests until they time out.
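To make the suspected ordering problem easier to discuss, here is a minimal sketch of it in Go. This is a toy model, not Loki's actual code: toyScheduler, handleInit and connect are hypothetical names, and the only thing taken from our observations is that the INIT reply is skipped while shouldRun is still false.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// toyScheduler is a hypothetical stand-in, not Loki's Scheduler type.
type toyScheduler struct {
	// shouldRun starts as false and is assumed to flip to true once the
	// scheduler ring marks this instance as active.
	shouldRun atomic.Bool
}

// handleInit models the observed behaviour: the reply to the frontend's
// INIT is skipped while shouldRun is false.
func (s *toyScheduler) handleInit() (reply string, ok bool) {
	if !s.shouldRun.Load() {
		return "", false
	}
	return "OK", true
}

// connect describes what a frontend would observe when it connects.
func connect(s *toyScheduler) string {
	if reply, ok := s.handleInit(); ok {
		return "frontend got " + reply
	}
	return "frontend got no INIT reply; queries pile up until they time out"
}

func main() {
	s := &toyScheduler{}
	fmt.Println(connect(s)) // frontend reconnects too early
	s.shouldRun.Store(true) // the ring later marks the scheduler as active
	fmt.Println(connect(s)) // a connection made after this point succeeds
}
```

In this model the first connection never recovers on its own, which matches what we see: nothing triggers a second INIT once shouldRun becomes true.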
I haven't found an existing issue for this yet, and it happens quite often for us, so there might be something wonky in our config. Still, I think this can be considered a bug, as the system locks up in a strange, unresponsive way.
To Reproduce
I'm still struggling to reproduce this in a deterministic way, as I haven't yet understood the full memberlist/scheduler ring/frontend/scheduler connect protocol. Happy to get some pointers here.
Expected behavior
The scheduler restarts, the frontend reconnects, and the frontend continues scheduling queries.
Environment:
- Loki 3.5.7 (also experienced earlier), microservice mode, memberlist, use_scheduler_ring
- Infrastructure: Nomad, Consul, Vault
- Deployment tool: Nomad Job files :)