ILM LifecycleExecutionState#stepInfo adds unbounded data to cluster state

In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a LifecycleExecutionState#stepInfo of the following form:

{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}

The elided list of snapshot names is about 750KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6GiB.
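As a sanity check on those figures, the arithmetic can be sketched as below. The class and method names are hypothetical and exist only for this illustration; the inputs are the approximate values quoted above, so the result is an estimate rather than a measurement.

```java
import java.util.Locale;

// Hypothetical helper, not part of Elasticsearch: estimates how much
// cluster state is consumed when one large message is stored per index.
public final class StepInfoSizeEstimate {
    static final long KIB = 1024L;
    static final long GIB = 1024L * 1024L * 1024L;

    // Total bytes when the same message is duplicated `copies` times.
    static long totalBytes(long perMessageBytes, long copies) {
        return Math.multiplyExact(perMessageBytes, copies);
    }

    static double toGiB(long bytes) {
        return (double) bytes / GIB;
    }

    public static void main(String[] args) {
        long perMessage = 750 * KIB;               // ~750KiB per stepInfo
        long total = totalBytes(perMessage, 6500); // ~6500 affected indices
        // Prints "4.65 GiB"
        System.out.println(String.format(Locale.ROOT, "%.2f GiB", toGiB(total)));
    }
}
```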

Note that this is different from the problem that #84266 fixes: it's not stack traces taking up the space, it's just the top-level exception message itself.

It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?
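To make the length-limit idea concrete, here is a minimal sketch of capping a message before it is stored. Everything in it is an assumption for illustration: the class name, the `MAX_LENGTH` value, and the suffix format are hypothetical and do not reflect how ILM actually builds `stepInfo`. It also assumes the cap is applied to the exception's reason string before it is serialized, since truncating the serialized JSON would leave it malformed.

```java
// Hypothetical sketch, not Elasticsearch code: bounds the size of an
// error message before it is recorded in cluster state.
public final class StepInfoTruncation {
    // Assumed limit, chosen only for this example.
    static final int MAX_LENGTH = 1024;

    // Returns the message unchanged if it fits; otherwise keeps the first
    // maxLength chars and appends a note saying how much was dropped.
    // The result may exceed maxLength by the length of that short note.
    static String truncate(String message, int maxLength) {
        if (message == null || message.length() <= maxLength) {
            return message;
        }
        int end = maxLength;
        // Avoid cutting a surrogate pair in half.
        if (end > 0 && Character.isHighSurrogate(message.charAt(end - 1))) {
            end--;
        }
        int omitted = message.length() - end;
        return message.substring(0, end) + "... (" + omitted + " chars truncated)";
    }

    public static void main(String[] args) {
        // Prints "abcd... (6 chars truncated)"
        System.out.println(truncate("abcdefghij", 4));
    }
}
```

With a cap like this, 6500 copies of a message bounded near 1KiB would total only a few MiB, at the cost of losing the full list of snapshot names in the stored error.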

Relates #124183

ILM `LifecycleExecutionState#stepInfo` adds unbounded data to cluster state #124181
