Restrict resource usage during snapshots deletion #131822

@DaveCTurner

Description

Seen in production on a node with a 1GiB heap: deleting a collection of snapshots involving one index with several megabytes of index metadata caused the node to go OOM. The root cause in this case was the way we invoke BlobStoreRepository.SnapshotsDeletion.IndexSnapshotsDeletion#determineShardCount concurrently across all 10 snapshot threads at once. Each thread needed ~50MiB of heap to parse the metadata for this index, and the node couldn't cope. On smaller nodes I guess we shouldn't be doing this work with such high concurrency.
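One way to bound this would be to gate the expensive metadata parses behind a semaphore sized for the node, so that the snapshot pool stays at 10 threads but only a few of them parse metadata at any moment. A minimal sketch, with hypothetical names and a stand-in for the real IndexMetadata parse (this is not the actual Elasticsearch code):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThrottledMetadataLoad {
    // Hypothetical cap: allow only 2 concurrent metadata parses on a small heap,
    // instead of letting all 10 snapshot threads parse multi-MiB blobs at once.
    private static final Semaphore METADATA_PERMITS = new Semaphore(2);

    static int parseShardCount(String indexMetadataBlob) {
        // Stand-in for the expensive parse (~50MiB of heap per thread in the
        // reported incident); here it just returns a dummy shard count.
        return 1;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService snapshotPool = Executors.newFixedThreadPool(10);
        List<String> blobs = List.of("meta-1", "meta-2", "meta-3", "meta-4");
        for (String blob : blobs) {
            snapshotPool.execute(() -> {
                try {
                    METADATA_PERMITS.acquire(); // bound concurrent parses
                    try {
                        parseShardCount(blob);
                    } finally {
                        METADATA_PERMITS.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        snapshotPool.shutdown();
        snapshotPool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("done");
    }
}
```

The permit count could plausibly be derived from the heap size rather than hard-coded.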

There's also a code comment indicating that we could make this metadata-loading process way more efficient:

// NB since 7.9.0 we deduplicate index metadata blobs, and one of the components of the deduplication key is the
// index UUID; the shard count is going to be the same for all metadata with the same index UUID, so it is
// unnecessary to read multiple metadata blobs corresponding to the same index UUID.
// TODO Skip this unnecessary work? Maybe track the shard count in RepositoryData?
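Taken literally, that comment suggests caching the shard count per index UUID so the metadata is parsed at most once per UUID regardless of how many snapshots reference it. A minimal sketch of that deduplication, using hypothetical names and a stand-in parse rather than the real repository code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardCountDedup {
    // Hypothetical cache: index UUID -> shard count. Since 7.9.0 the index
    // metadata blobs are deduplicated by a key that includes the index UUID,
    // so every blob for the same UUID yields the same shard count and only
    // needs to be parsed once.
    private final Map<String, Integer> shardCountByIndexUuid = new ConcurrentHashMap<>();
    final AtomicInteger parses = new AtomicInteger();

    int shardCount(String indexUuid) {
        return shardCountByIndexUuid.computeIfAbsent(indexUuid, uuid -> {
            parses.incrementAndGet();
            return parseIndexMetadata(uuid); // expensive parse runs at most once per UUID
        });
    }

    int parseIndexMetadata(String indexUuid) {
        return 3; // stand-in for reading and parsing the metadata blob
    }

    public static void main(String[] args) {
        ShardCountDedup dedup = new ShardCountDedup();
        dedup.shardCount("uuid-a");
        dedup.shardCount("uuid-a"); // cache hit, no second parse
        dedup.shardCount("uuid-b");
        System.out.println("parses=" + dedup.parses.get()); // prints parses=2
    }
}
```

The TODO's alternative, tracking the shard count in RepositoryData, would avoid even the first parse during deletion.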

Moreover, as noted in #116379, there's a 2GiB limit on the list of blobs to clean up, but a small node would hit an OOME long before reaching that limit. We should impose a stricter limit on this data structure's memory usage, ideally spilling the list to storage when the limit is reached; even just forgetting about some blobs would be better than letting the node fail.
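The "forget some blobs" fallback could look something like the sketch below: account for the estimated footprint of each entry and drop entries once a budget is exhausted, counting the drops so a later cleanup pass knows there are leaked blobs to find. All names and the per-entry cost estimate here are hypothetical, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedBlobList {
    // Hypothetical accounting: stop collecting blob names once the estimated
    // heap footprint crosses a budget, rather than growing without bound.
    private final long maxBytes;
    private long usedBytes;
    private long droppedBlobs;
    private final List<String> blobsToDelete = new ArrayList<>();

    BoundedBlobList(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    void add(String blobName) {
        long cost = 16L + 2L * blobName.length(); // rough per-entry heap estimate
        if (usedBytes + cost > maxBytes) {
            droppedBlobs++; // leaked blob; a later cleanup pass can still delete it
        } else {
            usedBytes += cost;
            blobsToDelete.add(blobName);
        }
    }

    public static void main(String[] args) {
        BoundedBlobList list = new BoundedBlobList(200);
        for (int i = 0; i < 20; i++) {
            list.add("blob-" + i);
        }
        System.out.println("kept=" + list.blobsToDelete.size() + " dropped=" + list.droppedBlobs);
    }
}
```

Spilling to storage instead of dropping would preserve the full list at the cost of extra I/O during the deletion.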

Relates #108278
