What is the bug?
When a ruler fleet undergoes natural churn (e.g. normal pod evictions), requests for rules made to the ruler during that window return a 500.
For example, for the request GET /prometheus/api/v1/rules?type=alert, the server responds with:
{
  "status": "error",
  "data": null,
  "errorType": "server_error",
  "error": "too many unhealthy instances in the ring"
}
How to reproduce it?
Start a normal ruler fleet, then gracefully terminate one ruler pod while querying the rules endpoint; some of the requests fail with the 500 shown above (see the sketch below).
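A minimal polling sketch that can surface the error during the termination window. The base URL, path prefix, and polling interval are assumptions for illustration; point it at whatever address your ruler (or gateway in front of it) is reachable on.

```go
// rulepoller.go: repeatedly query the rules endpoint and log any
// non-200 response observed while a ruler pod is being terminated.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Hypothetical address; replace with your ruler's listen address.
	const url = "http://localhost:8080/prometheus/api/v1/rules?type=alert"
	client := &http.Client{Timeout: 5 * time.Second}

	for {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("%s request error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				// During the graceful shutdown this prints the 500 with
				// "too many unhealthy instances in the ring".
				fmt.Printf("%s HTTP %d: %s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```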
What did you think would happen?
No 500s; rule queries should keep succeeding while a single ruler terminates gracefully.
What was your environment?
Infrastructure - Kubernetes v1.30.12
Deployment tool - custom
Version - 2.16.1
Any additional context to share?
No response