
Bug: retrieving rule details during ruler fleet churn causes a 500 server response #12412

@aallawala

Description

What is the bug?

When a ruler fleet is undergoing natural churn (e.g. normal pod evictions), any query to the ruler for rules during that time returns a 500.

For example, given the request GET /prometheus/api/v1/rules?type=alert, the server responds with:

{
    "status": "error",
    "data": null,
    "errorType": "server_error",
    "error": "too many unhealthy instances in the ring"
}

How to reproduce it?

Start a normal ruler fleet and gracefully terminate one pod while issuing requests to retrieve rule details; see the polling sketch below for one way to surface the failure.
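
A minimal reproduction sketch (not part of the original report): a small Go client that repeatedly calls the rules endpoint quoted above and prints any non-200 response. The base URL http://localhost:8080 is an assumption; point it at the ruler (or query-frontend) address in your environment.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Assumed address; replace with your ruler endpoint.
	const url = "http://localhost:8080/prometheus/api/v1/rules?type=alert"

	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Printf("%s request error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				// While a pod is terminating, this prints the 500 with
				// "too many unhealthy instances in the ring".
				fmt.Printf("%s HTTP %d: %s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
}

Run the poller, then delete one ruler pod gracefully; any 500s observed during the termination window reproduce the issue.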

What did you think would happen?

No 500s

What was your environment?

Infrastructure: Kubernetes v1.30.12
Deployment tool: custom
Version: 2.16.1

Any additional context to share?

No response
