Description
In a recent outage in which a cluster state grew too large to fit into a single transport message, we discovered that most of the space was being taken up by a LifecycleExecutionState#stepInfo of the following form:
{"type":"repository_exception","reason":"[found-snapshots] failed to delete snapshots [...6500 snapshot names elided...]","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [null]","caused_by":{"type":"sdk_client_exception","reason":"Unable to execute HTTP request: Read timed out","caused_by":{"type":"socket_timeout_exception","reason":"Read timed out"}}}}
The [...6500 snapshot names elided...] portion is about 750KiB. Because this error affected around 6500 indices, the message was duplicated that many times in the cluster state, which added up to a little over 4.6GiB.
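As a rough check of that figure, here is a small sketch that multiplies the approximate message size by the approximate number of copies. The class and method names are made up for illustration, and both inputs are the rounded values quoted above, so the result is only an estimate.

```java
import java.util.Locale;

public class StepInfoFootprint {
    // Total bytes if every affected index carries its own copy of the same message.
    static long totalBytes(long messageBytes, long copies) {
        return Math.multiplyExact(messageBytes, copies);
    }

    // Convert a byte count to GiB (1 GiB = 1024^3 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long messageBytes = 750L * 1024; // approx. 750KiB per stepInfo (rounded, from the report)
        long copies = 6500;              // approx. number of affected indices (rounded)
        long total = totalBytes(messageBytes, copies);
        // Prints "4.65 GiB", consistent with "a little over 4.6GiB".
        System.out.println(String.format(Locale.ROOT, "%.2f GiB", toGiB(total)));
    }
}
```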
Note that this is different from the problem that #84266 fixes: it's not stack traces taking all the space, just the top-level exception message itself.
It's certainly useful to have information about ILM errors stored somewhere for future investigation, but could it be somewhere else? If it has to be in the cluster state, could we put a length limit on it just to protect against such a pathological case?
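To make the length-limit idea concrete, here is a minimal sketch of truncating a message before it is stored. This is not Elasticsearch's actual code: the class name, the method, the suffix format, and the 1024-character default are all hypothetical, and a real fix would need to decide where in the ILM code path to apply the limit.

```java
public class StepInfoLimiter {
    // Hypothetical default cap; the right value would need discussion.
    static final int DEFAULT_MAX_LENGTH = 1024;

    // Returns the message unchanged if it fits, otherwise a prefix of at most
    // maxLength chars followed by a note saying how many chars were dropped.
    static String truncate(String message, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be non-negative");
        }
        if (message == null || message.length() <= maxLength) {
            return message;
        }
        int end = maxLength;
        // Avoid cutting a surrogate pair in half.
        if (end > 0 && Character.isHighSurrogate(message.charAt(end - 1))) {
            end--;
        }
        int dropped = message.length() - end;
        return message.substring(0, end) + "... (" + dropped + " chars truncated)";
    }

    public static void main(String[] args) {
        String reason = "[found-snapshots] failed to delete snapshots [snap-1, snap-2, snap-3]";
        System.out.println(truncate(reason, 40));
    }
}
```

With a cap like this, the per-index cost of a pathological message is bounded by the limit plus the short suffix, regardless of how many snapshot names the original exception listed.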
Relates #124183