**Describe the bug**
While performing GKE node upgrades, on more than one occasion we've seen the P99 latency on a single query-frontend jump from near zero to 90s. I'm unsure how the frontend works with respect to HA, but all users start to see timeouts, effectively breaking all queries.
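For context, a per-pod P99 along these lines should surface the jump on the single frontend. This is only a sketch: it assumes the standard `loki_request_duration_seconds` histogram plus `pod`/`container` labels from Kubernetes relabeling, and the `route` value may need adjusting for your setup.

```
histogram_quantile(0.99,
  sum by (pod, le) (
    rate(loki_request_duration_seconds_bucket{container="query-frontend", route="loki_api_v1_query_range"}[5m])
  )
)
```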
This is not new AFAIK. We first noticed it in Oct 2024, and since we try to keep our Loki up to date we've hit it across multiple versions; it happened again in Jan 2025.
The only other thing we noticed is that the single bad pod starts to complain about AST mappings for our Loki canary queries. Here's an example log line that only appears on the one bad frontend:
```
ts=2025-06-19T01:07:44.869377037Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="count_over_time({stream=\"stdout\",pod=\"loki-main-canary-pclr4\"}[462s])"
```
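To confirm it really is only the one pod, filtering the frontend logs for that warning is enough; the label names here (`namespace`, `container`) are placeholders for our setup, not taken from the report:

```
{namespace="loki", container="query-frontend"} |= "failed mapping AST" |= "context canceled"
```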
**To Reproduce**
Steps to reproduce the behavior:
- Have queriers set to `pull` mode, have a query-frontend HA setup, and use a separate scheduler. Also use the Loki canary (see the config sketch after this list).
- Perform a Kubernetes node upgrade, or possibly anything else that causes every node to be drained and replaced. I would think this does not matter, but perhaps the behaviour differs enough from a `kubectl rollout restart` that the hash ring or something has an issue.
- All queries have issues and one query-frontend has increased latency.
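For the first step, a minimal sketch of the relevant Loki config: the queriers run in pull mode by pointing `frontend_worker` at the scheduler, and the frontend enqueues work onto the same scheduler. Service names, ports, and values are placeholders (ours come from the Helm chart), not the actual manifests.

```yaml
# Sketch only -- addresses and limits are placeholders, not our real values.
frontend:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095  # frontend hands queries to the scheduler
query_scheduler:
  max_outstanding_requests_per_tenant: 32768  # example value
frontend_worker:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095  # queriers pull work from the scheduler ("pull" mode)
```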
**Expected behavior**
Able to handle a k8s node upgrade without queries timing out or a query-frontend's latency spiking.
**Environment:**
- Infrastructure: kubernetes
- Deployment tool: helm
- Version: currently 3.4.2, but the issue has also occurred on other 3.x versions
**Screenshots, Promtail config, or terminal output**
If applicable, add any output to help explain your problem.