ILM LifecycleExecutionState#stepInfo adds unbounded data to cluster state #124181

@DaveCTurner

In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a LifecycleExecutionState#stepInfo of the following form:

{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}

The elided list, [...6500 snapshot names elided...], is about 750KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6GiB.
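As a sanity check on those figures, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration, and both inputs are the approximate values quoted above rather than exact measurements.

```java
import java.util.Locale;

public final class StepInfoFootprint {
    // Approximate figures quoted in this report (not exact measurements).
    public static final long MESSAGE_SIZE_KIB = 750;   // one stepInfo message
    public static final long AFFECTED_INDICES = 6500;  // one copy per affected index

    /** Total size, in KiB, of all duplicated copies of the message. */
    public static long totalKib() {
        return MESSAGE_SIZE_KIB * AFFECTED_INDICES;
    }

    /** Converts KiB to GiB (1 GiB = 1024 * 1024 KiB). */
    public static double kibToGib(long kib) {
        return kib / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long kib = totalKib();
        // Prints: 4875000 KiB = 4.65 GiB
        System.out.println(String.format(Locale.ROOT, "%d KiB = %.2f GiB", kib, kibToGib(kib)));
    }
}
```

So roughly 750KiB times 6500 copies comes to about 4.65GiB, consistent with the "little over 4.6GiB" observed.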

Note that this is different from the problem that #84266 fixes: here it's not stack traces taking up all the space, just the top-level exception message itself.

It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?
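To make the length-limit idea concrete, here is a rough sketch of what a cap might look like. Everything in it is hypothetical: `StepInfoLimiter`, `truncate`, and the 1024-character default are illustrative names and values, not existing Elasticsearch code, and this issue does not propose a specific limit. It also operates on a plain string, so a real change would presumably need to shorten the exception's `reason` before serialization so that the stored JSON stays valid.

```java
public final class StepInfoLimiter {
    // Hypothetical default cap; no specific value is proposed in this issue.
    public static final int MAX_STEP_INFO_CHARS = 1024;

    /**
     * Returns the message unchanged if it fits within maxChars; otherwise keeps
     * at most maxChars characters and appends a short note saying how many
     * characters were dropped. The note itself is not counted against the
     * limit, so the result can be slightly longer than maxChars.
     */
    public static String truncate(String stepInfo, int maxChars) {
        if (stepInfo == null || stepInfo.length() <= maxChars) {
            return stepInfo;
        }
        int cut = Math.max(0, maxChars);
        // Avoid cutting between the two halves of a surrogate pair.
        if (cut > 0 && Character.isHighSurrogate(stepInfo.charAt(cut - 1))) {
            cut--;
        }
        int omitted = stepInfo.length() - cut;
        return stepInfo.substring(0, cut) + "... (" + omitted + " chars truncated)";
    }
}
```

With a cap like this, a 750KiB message would shrink to about 1KiB per index, so even 6500 copies would add only a few MiB to the cluster state.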

Relates #124183

Labels: :Data Management/ILM+SLM, >bug, Team:Data Management (obsolete)