**Describe the bug**
While performing GKE node upgrades, on more than one occasion we've seen the P99 latency on a single query-frontend jump from near zero to 90s. I'm unsure how the frontend works with respect to HA, but all users start to see timeouts, effectively breaking all queries.
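For context, a per-pod P99 along these lines should surface the jump on the single frontend. This is only a sketch: it assumes the standard `loki_request_duration_seconds` histogram plus `pod`/`container` labels from Kubernetes relabeling, and the `route` value may need adjusting for your setup.

```
histogram_quantile(0.99,
  sum by (pod, le) (
    rate(loki_request_duration_seconds_bucket{container="query-frontend", route="loki_api_v1_query_range"}[5m])
  )
)
```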
This is not new AFAIK. We first noticed it in Oct 2024, and since we try to keep our Loki up to date we've hit it across multiple versions; it happened again in Jan 2025.
The only other thing we noticed is that the single bad pod starts to complain about AST mappings for our Loki canary queries. Here's an example log line that only appears on the one bad frontend:
```
ts=2025-06-19T01:07:44.869377037Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="count_over_time({stream=\"stdout\",pod=\"loki-main-canary-pclr4\"}[462s])"
```
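To confirm it really is only the one pod, filtering the frontend logs for that warning is enough; the label names here (`namespace`, `container`) are placeholders for our setup, not taken from the report:

```
{namespace="loki", container="query-frontend"} |= "failed mapping AST" |= "context canceled"
```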
**To Reproduce**
Steps to reproduce the behavior:
- Have queriers set to `pull` mode, have a query-frontend HA setup, and use a separate scheduler. Also use the Loki canary (see the config sketch after this list).
- Perform a Kubernetes node upgrade, or possibly anything else that causes every node to be drained and replaced. I would think this does not matter, but perhaps the behaviour differs enough from a `kubectl rollout restart` that the hash ring or something has an issue.
- All queries have issues and one query-frontend has increased latency.
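For the first step, a minimal sketch of the relevant Loki config: the queriers run in pull mode by pointing `frontend_worker` at the scheduler, and the frontend enqueues work onto the same scheduler. Service names, ports, and values are placeholders (ours come from the Helm chart), not the actual manifests.

```yaml
# Sketch only -- addresses and limits are placeholders, not our real values.
frontend:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095  # frontend hands queries to the scheduler
query_scheduler:
  max_outstanding_requests_per_tenant: 32768  # example value
frontend_worker:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095  # queriers pull work from the scheduler ("pull" mode)
```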
**Expected behavior**
Able to handle a k8s node upgrade without queries timing out or a query-frontend's latency spiking.
**Environment:**
- Infrastructure: kubernetes
- Deployment tool: helm
- Version: currently 3.4.2, but the issue has also occurred on other 3.x versions
**Screenshots, Promtail config, or terminal output**
If applicable, add any output to help explain your problem.